Performance Increase: Can we enable RabbitMQ queue sharding with Sensu? How?


#1

We’re monitoring 30k+ servers, each with about 45 checks, using a single 16-core/64 GB RabbitMQ server and two 4-core/8 GB Sensu servers. Our results and keepalive queues are consistently piling up. Over the past month, the maximum queue depth has grown significantly, from 100k to about 300k messages as of today (our client count increases daily).
The last time this queue accumulation happened, we scaled RabbitMQ up from 8 cores to the 16 cores it has now, which solved the problem for about 3 months. The same thing happened before that, when scaling from 4 to 8 cores bought us about 8 months. Scaling further is over budget, so we can no longer scale RabbitMQ vertically.

  1. Can we somehow use the RabbitMQ sharding functionality with Sensu? I believe the bottleneck is RabbitMQ (adding more Sensu servers does not help). My previous attempts at creating a RabbitMQ cluster gave poor results due to slow queue syncing. By sharding the keepalive and results queues, we could hopefully improve performance. So is it possible?
  2. If not, is there something else I can tune to improve performance?

Extra info:

  • All servers run Debian 8.
  • Prefetch is 50. (This is my tested optimum; anything more or less degrades performance.)
  • Sensu version 0.29.
  • We don’t run any mutators, but we run some filters for handlers and send data to Whisper/Carbon through relays.
  • Our checks run with a minimum interval of 5 minutes. Needless to say, our cool-down period is not the problem.
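For completeness, the prefetch setting lives in our standard Sensu RabbitMQ transport config; a trimmed sketch of what ours looks like (host, vhost, and credentials are placeholders):

```json
{
  "rabbitmq": {
    "host": "rabbitmq.internal",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "secret",
    "prefetch": 50
  }
}
```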

I have a slight hunch that the Sensu servers’ relaying is the problem. Sending every metric from 30k+ servers to an external server could keep the Sensu servers busy relaying instead of consuming from the queue. But, alas, I don’t know how to test/prove this.

Any help will be greatly appreciated. Thanks.


#2

Hi there! :wave: That sounds like a pretty awesome Sensu deployment! We should send you a t-shirt. :smile:

I’m not an expert regarding RabbitMQ performance tuning, but I do have a few observations/questions that might help someone else help you:

  • Sensu version 0.29 is pretty old; it was released in April 2017 (changelog). Is there any possibility of an upgrade? I don’t know if it would help in your case, but there have been some recent RabbitMQ performance improvements in Sensu Core (changelog, version 1.6, released October 2018).
  • When you say you have a “hunch that the sensu servers relay is the problem” in the context of sending metrics, can you elaborate on that configuration a bit? It sounds like you’re sending metrics to a Graphite backend. Are you doing that with a Sensu TCP handler, a pipe handler plugin (run once per event), or a Sensu handler extension (long-running process)? If you’re using a pipe handler plugin, that is usually suboptimal for metrics at any kind of scale.
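For reference, the first two of those look roughly like this in Sensu Core handler config (the names, hosts, and command paths here are illustrative, not your actual setup):

```json
{
  "handlers": {
    "graphite_tcp": {
      "type": "tcp",
      "socket": {
        "host": "carbon.internal",
        "port": 2003
      },
      "mutator": "only_check_output"
    },
    "alert_pipe": {
      "type": "pipe",
      "command": "/etc/sensu/plugins/send_alert.rb"
    }
  }
}
```

The key difference at scale: the TCP handler just writes event output to a socket, while the pipe handler forks a new process per event.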

I hope this helps!


#3

@calebhailey

That sounds like a pretty awesome Sensu deployment! We should send you a t-shirt.

Thanks. I’ll take a US men’s size S, please. :wink:

On a serious note:

Is there any possibility of an upgrade?

Yes. This is one of my priorities, but it won’t be immediately possible. I know for a fact that an upgrade will help, because the last one (0.16 to 0.29) also had a major impact. In the meantime, I’m looking for other optimizations.

Are you doing that with a sensu tcp handler, pipe handler plugin (run once) or a sensu handler extension (long-running process)?

We use two types of handlers:

  • tcp - to send metrics for the 30k hosts to Carbon (Graphite/Whisper).
  • pipe - to a Ruby script which sends “check” alerts to a server in the same VPC.
  1. I’ve been thinking about using a transport handler instead of tcp, to put the metrics back on a different RabbitMQ queue for the Graphite backend to process, but I haven’t gotten a chance to load test this. Do you think this approach would be faster/better?
  2. As for the “check” alerts, we “curl” a web service under our control. I’ll face resistance, but I can make a case for sending the “check” alerts over TCP instead. Would that have a major impact?
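For reference, the transport handler I have in mind would look roughly like this (the exchange name and type are placeholders, not a tested configuration):

```json
{
  "handlers": {
    "graphite_transport": {
      "type": "transport",
      "pipe": {
        "type": "direct",
        "name": "metrics"
      },
      "mutator": "only_check_output"
    }
  }
}
```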

Thanks for the help.


#4

Are you literally shelling out to curl as a handler? If so, a TCP handler or Extension would almost certainly improve performance.

Generally speaking, I would say that pipe handlers are computationally expensive at scale, spinning up a separate executable (and the corresponding runtime, if applicable) for each event. Rewriting the pipe handler as a Sensu Extension could improve performance even if it is still communicating with the remote server via HTTP: Sensu Extensions are long-running, which helps avoid “expensive” process start/stop time. They can only be written in Ruby at this time.
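As a very rough sketch of the shape of such an extension (the alert-service URL and class names are hypothetical; a real extension would inherit from the class provided by `require "sensu/extension"` rather than the stand-in stub below):

```ruby
require "json"
require "net/http"
require "uri"

# Minimal stand-in so this sketch loads outside a Sensu server; on a real
# server you would inherit from the Handler class in the sensu-extension gem.
module Sensu
  module Extension
    class Handler
      def definition
        {}
      end
    end

    # Long-running replacement for a pipe handler that shells out to curl:
    # one persistent process, one reusable HTTP client, no fork per event.
    class AlertRelay < Handler
      def name
        "alert_relay"
      end

      def description
        "posts check alerts to the internal alert service (long-running)"
      end

      def post_init
        # Hypothetical internal endpoint; connection object is reused.
        uri = URI("http://alerts.internal:8080/events")
        @http = Net::HTTP.new(uri.host, uri.port)
      end

      def run(event)
        request = Net::HTTP::Post.new("/events", "Content-Type" => "application/json")
        request.body = event.is_a?(String) ? event : JSON.dump(event)
        response = @http.request(request)
        yield "posted alert", 0 if response.code.start_with?("2")
      rescue => error
        yield "alert relay failed: #{error}", 2
      end
    end
  end
end
```

Dropped into `/etc/sensu/extensions/`, it would be referenced by name in a handler definition like any other handler.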

SHAMELESS PLUG: This is how all of the Sensu Enterprise integrations have been implemented. They are just bundled Sensu Extensions, for optimal performance at scale. I’m not sure if we have the integration you’re looking for (see here for a complete list), but if we do, Sensu Enterprise is a drop-in replacement for Sensu Core, just sayin’! :smile:

If your current implementation is literally executing curl commands, this utility could prove useful for rewriting them in Ruby and running them as Sensu Extensions: https://jhawthorn.github.io/curl-to-ruby/ (I’m embarrassed by how much mileage I’ve gotten out of that). :laughing:

I hope that helps!


#5

Hi @t-sawyer! :wave:

I understand you are currently looking for options which might help you improve this situation without upgrading, but I suspect you may be experiencing some of the same behaviors which prompted the implementation of the RabbitMQ connection handling changes in Sensu Core 1.6, subsequently released in Sensu Enterprise 3.3 as well.

Our (results and keepalive) queues are consistently piling up. Since the past month, the max queue count has increased significantly from 100k to about 300k as of today (our clients increase daily).

Do you have a sense of whether flow control is being applied to Sensu’s connections to RabbitMQ? Using the RabbitMQ management plugin – via either the web interface or the HTTP API – you can check whether RabbitMQ is applying flow control to the Sensu connections. Prior to Sensu Core 1.6 and Enterprise 3.3, server connections under flow control are likely to see their rate of message consumption slow dramatically.
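One quick way to check via the management plugin’s HTTP API is to look at the `state` field of each entry in `/api/connections`. Here is a small Ruby sketch (the host and credentials are placeholders, and the canned sample at the bottom stands in for a live API response):

```ruby
require "json"
require "net/http"
require "uri"

# Fetch /api/connections from the RabbitMQ management HTTP API
# (default port 15672). Host and credentials are placeholders.
def fetch_connections(host: "rabbitmq.internal", user: "guest", password: "guest")
  uri = URI("http://#{host}:15672/api/connections")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password)
  response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)
end

# Return the names of connections the broker is throttling via flow control.
def connections_under_flow(connections)
  connections.select { |conn| conn["state"] == "flow" }.map { |conn| conn["name"] }
end

# Example with a canned payload shaped like the management API's response:
sample = [
  { "name" => "10.0.0.5:49152 -> 10.0.0.2:5672", "state" => "running" },
  { "name" => "10.0.0.6:49153 -> 10.0.0.2:5672", "state" => "flow" }
]
puts connections_under_flow(sample)
```

If the Sensu server connections regularly show up in that list, flow control is a likely contributor to the slow consumption you’re seeing.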

As a result of these changes, Sensu now uses multiple connections to RabbitMQ – instead of multiple channels in the same connection – to avoid scenarios where flow control negatively impacts the rate of message consumption. This does not mean that connections will no longer have flow control applied, but where Sensu servers are concerned we do expect these changes to significantly reduce the impact of flow control on message consumption rates.

Can we somehow use the RabbitMQ sharding functionality with sensu?

I believe this would require sharding Sensu itself – splitting your large Sensu installation into multiple independent instances, each with its own Redis and RabbitMQ.
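For reference, RabbitMQ’s sharding plugin works at the exchange level: you enable the plugin and attach a policy to a modulus-hash exchange. But Sensu declares and consumes its queues by fixed names, so the plugin has nothing obvious to attach to. The commands below show the plugin’s standard usage, not a tested Sensu configuration:

```shell
# Enable the sharding plugin on the broker
rabbitmq-plugins enable rabbitmq_sharding

# Shard exchanges whose names start with "shard." across 2 queues per node
rabbitmqctl set_policy results-shard "^shard\." '{"shards-per-node": 2}' \
  --apply-to exchanges

# Publishers publish to an x-modulus-hash exchange (e.g. "shard.results");
# consumers consume from the exchange name as if it were a queue.
```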

I believe the bottleneck is RabbitMQ (I know this because adding more sensu servers does not help the cause). My previous attempts at creating a RabbitMQ cluster gave poor results due to slow syncing issues. By sharding the keepalive and results queues, we can hopefully improve performance. So is it possible?

On the whole, clustering will not help with message throughput because replicating messages across the cluster introduces overhead. It’s possible that RabbitMQ CPU utilization is the bottleneck, but without upgrading to Sensu Core 1.6 or Sensu Enterprise 3.3 it’s hard to eliminate connection flow control as a potential source of the behavior you’re seeing.

I’ve been thinking about using a transport handler instead of tcp to put the metrics back in a different RabbitMQ queue for the graphite backend to process, but have not gotten a chance to load test this out. Do you think this approach will be faster/better?

Using a transport handler reuses your existing connections to RabbitMQ, so it would improve performance by removing the TCP handshake overhead for each metric currently sent by the TCP handler. That said, if RabbitMQ CPU utilization really is the bottleneck in your system, increasing the load on RabbitMQ sounds like it might hurt more than it helps.