Sensu RabbitMQ "results" queue piling up with low CPU on Sensu servers and RabbitMQ

On Monday, June 13, 2016 at 3:56:58 PM UTC-4, Neil Hooey wrote:

Occasionally our Sensu cluster gets into a state where the RabbitMQ “results” queue piles up with thousands of events and keeps growing, while every Sensu server node sits at only around 50% CPU usage.

Previously we’ve solved this by stopping all Sensu server nodes, which deletes the “results” queue, and then clearing all “history:” and “results:” keys from Redis so that fewer events are generated when the Sensu server nodes start up again.

While purging the queue and the Redis keys has worked in the past, it isn’t working this time: the messages just keep piling up.

Does anyone have any ideas about how to solve this?

We’re not running any handlers or filters, so the Sensu servers aren’t busy waiting on I/O. CPU usage on the RabbitMQ node is less than 25%. We’re running Sensu 0.24.1-1 from the Ubuntu Apt repository.
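For concreteness, the Redis cleanup step described above can be scripted with the redis-rb gem, roughly as sketched below. The host and the key patterns simply mirror the description above; none of this is Sensu-specific, so adjust it for your own setup.

    # Sketch of the Redis cleanup described above, using the redis-rb gem.
    # Host and key patterns mirror the post; adjust for your own layout.
    require "redis"

    redis = Redis.new(host: "localhost", port: 6379)

    %w[history:* results:*].each do |pattern|
      # SCAN-based iteration avoids blocking Redis the way KEYS can.
      redis.scan_each(match: pattern) do |key|
        redis.del(key)
      end
    end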

On Monday, June 13, 2016 at 2:04:58 PM UTC-6, Neil Hooey wrote:

Here are some more notes about our cluster:

  • We have 6 Sensu server nodes, each a DigitalOcean VM with 2 CPUs
  • The RabbitMQ stats for the last 10 minutes are: 204/s Publish, 123/s Deliver, 123/s Acknowledge, so roughly 80 more results are published per second than the servers consume

On Monday, June 13, 2016 at 5:00:11 PM UTC-4, Cameron Johnston wrote:

Hi Neil,

Have you looked at adjusting the value of the RabbitMQ prefetch attribute in your Sensu configuration? This attribute controls how many unacknowledged messages are retrieved from the RabbitMQ broker at once. Prefetch defaults to a value of 1; raising it can have a big impact on message throughput. See https://sensuapp.org/docs/latest/reference/rabbitmq.html for details.
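For reference, the prefetch setting sits alongside the other RabbitMQ connection attributes in Sensu’s JSON configuration, roughly as in the sketch below. The host, vhost, and credentials are placeholders rather than anyone’s real settings, and 50 is only an example value.

    {
      "rabbitmq": {
        "host": "rabbitmq.example.com",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "secret",
        "prefetch": 50
      }
    }

A larger prefetch lets each sensu-server consumer pull a batch of unacknowledged results per round trip instead of one at a time, generally trading a little more memory per consumer for much better throughput.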


On Wed, Jun 15, 2016 at 8:19 AM Neil Hooey nhooey@gmail.com wrote:

I set the prefetch parameter to 50, and that has helped a lot with throughput. However, the CPUs on the Sensu servers are still at around 50%, with negligible network traffic and disk activity, so I’m not sure what they’re doing.

Is there a way to profile Sensu server?


On Jun 17, 2016, at 15:57, Cameron Johnston cameron@heavywater.io wrote:

I believe that profiling sensu-server itself is possible using standard Ruby tools, but these won’t provide insight into the performance of the plugins (e.g. handlers, mutators) the server is executing. I tend to think that pipe handlers and mutators have an underestimated impact on Sensu’s performance.
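To make the “standard Ruby tools” suggestion concrete, here is a minimal ruby-prof sketch. It profiles a made-up workload rather than the running sensu-server daemon (hooking a profiler into the live process takes more plumbing), so fake_result_processing below is purely a hypothetical stand-in.

    # Minimal ruby-prof sketch (gem install ruby-prof).
    # "fake_result_processing" is a hypothetical stand-in for whatever code
    # path you actually want to measure; it is not part of Sensu.
    require "json"
    require "ruby-prof"

    def fake_result_processing
      JSON.parse('{"client":"web-01","check":{"name":"cpu","status":0}}')
    end

    RubyProf.start
    10_000.times { fake_result_processing }
    result = RubyProf.stop

    # Flat report of where CPU time was spent, sorted by self time.
    RubyProf::FlatPrinter.new(result).print(STDOUT)

For sensu-server itself you would wrap the code path of interest instead of a synthetic loop, but the reporting side looks the same.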


On Saturday, 18 June 2016 01:31:45 UTC+1, Neil Hooey wrote:

With enough clients and events, pipe handlers are a complete disaster and should be entirely replaced with extension handlers.

Fortunately in my case I don’t have any handlers or mutators enabled and am still seeing low throughput. I’ll try the Ruby profiling tools.
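For context on the pipe-versus-extension distinction above: a pipe handler forks a new process for every event, while an extension handler runs inside the sensu-server process itself. Below is a rough skeleton of a Sensu 0.2x handler extension. The class layout and the yield-an-output-and-status convention follow the sensu-extension gem, but the exact shape of the event argument has varied between Sensu versions, so treat this as a sketch rather than a drop-in handler.

    # Rough skeleton of an in-process handler extension, loaded from
    # /etc/sensu/extensions/. The name and body are placeholders.
    module Sensu
      module Extension
        class DebugCounter < Handler
          def name
            "debug_counter"
          end

          def description
            "counts handled events in-process instead of forking a pipe handler"
          end

          def run(event)
            @count ||= 0
            @count += 1
            # Extensions report back by yielding an output string and an
            # exit-status-style integer rather than writing to stdout.
            yield "handled #{@count} events", 0
          end
        end
      end
    end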


Did you ever get anywhere with the profiling? I’ve got what sounds like a similar problem with results piling up on the results queue.
