Sensu RabbitMQ "results" queue piling up with low CPU on Sensu servers and RabbitMQ

On Monday, June 13, 2016 at 3:56:58 PM UTC-4, Neil Hooey wrote:

Occasionally our Sensu cluster gets into a state where the RabbitMQ “results” queue piles up with thousands of events and keeps growing, while every Sensu server node sits at only around 50% CPU usage.

Previously we’ve solved this by stopping all Sensu server nodes, which deletes the “results” queue, and then clearing all “history:” and “results:” keys from Redis so that fewer events are generated when the Sensu server nodes start up again.

While purging the queue and the Redis keys has worked in the past, it isn’t working this time: the messages just keep piling up.

Does anyone have any ideas about how to solve this?

We’re not running any handlers or filters, so the Sensu servers aren’t busy waiting on I/O. CPU usage on the RabbitMQ node is less than 25%. We’re running Sensu 0.24.1-1 from the Ubuntu Apt repository.
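For concreteness, the Redis cleanup step described above can be scripted with the redis-rb gem, roughly as sketched below. The host and the key patterns simply mirror the description above; none of this is Sensu-specific, so adjust it for your own setup.

    # Sketch of the Redis cleanup described above, using the redis-rb gem.
    # Host and key patterns mirror the post; adjust for your own layout.
    require "redis"

    redis = Redis.new(host: "localhost", port: 6379)

    %w[history:* results:*].each do |pattern|
      # SCAN-based iteration avoids blocking Redis the way KEYS can.
      redis.scan_each(match: pattern) do |key|
        redis.del(key)
      end
    end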

On Monday, June 13, 2016 at 2:04:58 PM UTC-6, Neil Hooey wrote:

Here are some more notes about our cluster:

  • We have 6 Sensu server nodes, each a DigitalOcean VM with 2 CPUs
  • The RabbitMQ stats for the last 10 minutes are: 204/s Publish, 123/s Deliver, 123/s Acknowledge, so roughly 80 more results are published per second than the servers consume

On Monday, June 13, 2016 at 5:00:11 PM UTC-4, Cameron Johnston wrote:

Hi Neil,

Have you looked at adjusting the value of the RabbitMQ prefetch attribute in your Sensu configuration? This attribute controls how many unacknowledged messages are retrieved from the RabbitMQ broker at once. Prefetch defaults to a value of 1; raising it can have a big impact on message throughput. See https://sensuapp.org/docs/latest/reference/rabbitmq.html for details.
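For reference, the prefetch setting sits alongside the other RabbitMQ connection attributes in Sensu’s JSON configuration, roughly as in the sketch below. The host, vhost, and credentials are placeholders rather than anyone’s real settings, and 50 is only an example value.

    {
      "rabbitmq": {
        "host": "rabbitmq.example.com",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "secret",
        "prefetch": 50
      }
    }

A larger prefetch lets each sensu-server consumer pull a batch of unacknowledged results per round trip instead of one at a time, generally trading a little more memory per consumer for much better throughput.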


On Wed, Jun 15, 2016 at 8:19 AM Neil Hooey nhooey@gmail.com wrote:

I set the prefetch parameter to 50, and that has helped a lot with throughput. However, the CPUs on the Sensu servers are still at around 50%, with negligible network traffic and disk activity, so I’m not sure what they’re doing.

Is there a way to profile Sensu server?


On Jun 17, 2016, at 15:57, Cameron Johnston cameron@heavywater.io wrote:

I believe that profiling sensu-server itself is possible using standard Ruby tools, but these won’t provide insight into the performance of the plugins (e.g. handlers, mutators) the server is executing. I tend to think that pipe handlers and mutators have an underestimated impact on Sensu’s performance.
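To make the “standard Ruby tools” suggestion concrete, here is a minimal ruby-prof sketch. It profiles a made-up workload rather than the running sensu-server daemon (hooking a profiler into the live process takes more plumbing), so fake_result_processing below is purely a hypothetical stand-in.

    # Minimal ruby-prof sketch (gem install ruby-prof).
    # "fake_result_processing" is a hypothetical stand-in for whatever code
    # path you actually want to measure; it is not part of Sensu.
    require "json"
    require "ruby-prof"

    def fake_result_processing
      JSON.parse('{"client":"web-01","check":{"name":"cpu","status":0}}')
    end

    RubyProf.start
    10_000.times { fake_result_processing }
    result = RubyProf.stop

    # Flat report of where CPU time was spent, sorted by self time.
    RubyProf::FlatPrinter.new(result).print(STDOUT)

For sensu-server itself you would wrap the code path of interest instead of a synthetic loop, but the reporting side looks the same.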


On Saturday, 18 June 2016 01:31:45 UTC+1, Neil Hooey wrote:

With enough clients and events, pipe handlers are a complete disaster and should be entirely replaced with extension handlers.

Fortunately in my case I don’t have any handlers or mutators enabled and am still seeing low throughput. I’ll try the Ruby profiling tools.
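For context on the pipe-versus-extension distinction above: a pipe handler forks a new process for every event, while an extension handler runs inside the sensu-server process itself. Below is a rough skeleton of a Sensu 0.2x handler extension. The class layout and the yield-an-output-and-status convention follow the sensu-extension gem, but the exact shape of the event argument has varied between Sensu versions, so treat this as a sketch rather than a drop-in handler.

    # Rough skeleton of an in-process handler extension, loaded from
    # /etc/sensu/extensions/. The name and body are placeholders.
    module Sensu
      module Extension
        class DebugCounter < Handler
          def name
            "debug_counter"
          end

          def description
            "counts handled events in-process instead of forking a pipe handler"
          end

          def run(event)
            @count ||= 0
            @count += 1
            # Extensions report back by yielding an output string and an
            # exit-status-style integer rather than writing to stdout.
            yield "handled #{@count} events", 0
          end
        end
      end
    end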


Did you ever get anywhere with the profiling? I’ve got what sounds like a similar problem with results piling up on the results queue.
