Sporadic RabbitMQ result/keepalive queue processing issues

uberamd · November 7, 2016, 11:41am

Hi everyone,

We’ve been experiencing random issues where suddenly messages start to pile up in the results and keepalive queues, seemingly without stop until I manually go into RabbitMQ and purge those two queues. It’s happened twice in the last week.

We’re running Sensu 0.26.1 with just over 800 clients and 2,000 defined checks/metrics. We have two vhosts in RabbitMQ – one for prod and one for nonprod. About 250 of the 800 clients are prod, the rest are nonprod. Checks are scheduled by the sensu-servers, as we make heavy use of subscriptions. When things are working properly we’re collecting about 61,000 metrics per minute and sending them to Elasticsearch via an extension and to Graphite via a TCP pipe. The checks themselves have a few handlers (Slack, OpsGenie, mailer) that are in the process of being converted to extensions.

But here’s the interesting part: we’re running 20 sensu-servers for nonprod and 20 sensu-servers for prod (they connect to the respective vhost). I’d think that 40 total sensu-servers would be able to keep up with 800 clients no problem, but alas, queues randomly pile up in the RabbitMQ cluster. Once the queues start to pile up I can spin up another 10 sensu-servers in the problematic vhost and it doesn’t matter; that queue never returns to normal once the buildup starts. Then comes the keepalive spam, etc. What’s interesting is that despite throwing a bunch more sensu-server processes at it, the deliver/get rate shown for the queue in the RabbitMQ admin panel remains around 30-50/s, whereas incoming messages get as high as 200/s. But purge the queue and suddenly everything goes back to normal.

I will mention that we’re running sensu-server in Docker; however, we’ve experienced this exact same thing when running it in straight Linux VMs on the older 0.20.X versions. The idea behind going to containers was that we can scale out more easily (more sensu-server instances) than by provisioning more VMs. Load on the servers isn’t high, and the logs look like things are fine.

Has anyone run into similar issues where their queues just start to pile up for seemingly no reason? And are there log flags that can be enabled that might provide more verbose output on the processing of results and keepalives?
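
To put the rates above in perspective, here is a rough back-of-the-envelope sketch of how quickly the backlog grows when about 200 messages/s arrive and only 30-50/s are delivered. It assumes both rates stay constant, which a real queue will not do exactly, and the function name `backlog_growth` is just an illustrative helper.

```python
def backlog_growth(incoming_per_s: float, delivered_per_s: float, seconds: float) -> float:
    """Messages accumulated in a queue over `seconds`, assuming steady rates."""
    # A queue that is drained faster than it is filled does not go negative.
    net_rate = max(incoming_per_s - delivered_per_s, 0.0)
    return net_rate * seconds


incoming = 200.0  # peak incoming rate reported in the post (messages/s)
for delivered in (30.0, 50.0):  # observed deliver/get range (messages/s)
    per_minute = backlog_growth(incoming, delivered, 60)
    print(f"deliver {delivered:.0f}/s: backlog grows by about {per_minute:.0f} messages per minute")
```

At those rates the queue gains roughly 9,000 to 10,200 messages every minute, which would be consistent with it never draining on its own once the buildup starts.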

Paneng_Worldwide · December 13, 2016, 10:34am

Hello,

Maybe have a look at the prefetch config option for RabbitMQ: https://sensuapp.org/docs/latest/reference/rabbitmq.html

Cheers

Chris
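
For anyone wanting to try this, below is a minimal sketch of what a Sensu `rabbitmq` definition with `prefetch` set might look like, generated from Python so the JSON is easy to adapt. The host, vhost, user, and password are placeholders rather than values from this thread, and the prefetch value of 100 is only an illustration, not a tested recommendation. The output would typically go into a file such as `/etc/sensu/conf.d/rabbitmq.json` (an assumed path).

```python
import json


def rabbitmq_config(prefetch: int) -> dict:
    """Build a Sensu RabbitMQ definition with the given consumer prefetch."""
    if prefetch < 1:
        raise ValueError("prefetch must be a positive integer")
    return {
        "rabbitmq": {
            # Placeholder connection details; replace with your own.
            "host": "rabbitmq.example.com",
            "port": 5672,
            "vhost": "/sensu-nonprod",
            "user": "sensu",
            "password": "secret",
            # Number of unacknowledged messages a consumer may hold at once.
            "prefetch": prefetch,
        }
    }


print(json.dumps(rabbitmq_config(prefetch=100), indent=2))
```

A higher prefetch lets each sensu-server hold more unacknowledged messages at a time, so it is worth raising it gradually while watching the deliver/get rate in the RabbitMQ admin panel.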

