We’ve been hitting an intermittent issue where messages suddenly start to pile up in the results and keepalive queues and keep accumulating until I manually go into RabbitMQ and purge those two queues. It’s happened twice in the last week.
We’re running Sensu 0.26.1 with just over 800 clients and 2000 defined checks/metrics. We have two vhosts in RabbitMQ: one for prod and one for nonprod. About 250 of the 800 clients are prod; the rest are nonprod. Checks are scheduled by the sensu-servers, as we make heavy use of subscriptions. When things are working properly we’re collecting about 61,000 metrics per minute and sending them to Elasticsearch via an extension and to Graphite via a TCP pipe. The checks themselves have a few handlers (Slack, OpsGenie, mailer) that are in the process of being converted to extensions.
But here’s the interesting part: we’re running 20 sensu-servers for nonprod and 20 sensu-servers for prod (each set connects to its respective vhost). I’d think that 40 sensu-servers in total would be able to keep up with 800 clients no problem, but alas, messages randomly pile up in the RabbitMQ cluster. While the queues are piling up I can spin up another 10 sensu-servers in the problematic vhost and it doesn’t matter; the queue never returns to normal once the buildup starts. Then comes the keepalive spam, etc. What’s interesting is that despite throwing a bunch more sensu-server processes at it, the deliver/get rate shown for the queue in the RabbitMQ admin panel stays around 30-50/s, whereas incoming messages get as high as 200/s. But purge the queue and suddenly everything goes back to normal.
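To put rough numbers on that gap, here is a quick back-of-the-envelope sketch in Python. It uses only the rates read off the admin panel (200/s peak incoming, 30-50/s deliver/get); the function name `backlog_growth` is made up for illustration, and it assumes both rates hold steady, which they likely don’t in practice.

```python
def backlog_growth(incoming_per_s, delivered_per_s, seconds):
    """Messages added to a queue over `seconds` when publishes outpace deliveries.

    Assumes both rates are constant for the whole interval (a simplification).
    """
    net_rate = max(incoming_per_s - delivered_per_s, 0)
    return net_rate * seconds


# Peak incoming rate of 200/s against the observed 30-50/s deliver/get range.
for delivered in (30, 50):
    per_minute = backlog_growth(200, delivered, 60)
    per_hour = backlog_growth(200, delivered, 3600)
    print(f"deliver/get {delivered}/s: +{per_minute} msgs/min, +{per_hour} msgs/hour")
```

At the peak incoming rate that works out to roughly 9,000 to 10,200 extra messages per minute, so the backlog only grows for as long as the deliver/get rate stays pinned in that range.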
I will mention that we’re running sensu-server in Docker; however, we experienced this exact same thing when running it on plain Linux VMs with the older 0.20.X versions. The idea behind going to containers was that we could scale out (add more sensu-server instances) more easily than by provisioning more VMs. Load on the servers isn’t high, and the logs look normal.
Has anyone run into similar issues where their queues just start to pile up for seemingly no reason? And are there logging flags we can enable to get more verbose output on the processing of results and keepalives?