Anyone out there running Sensu in large production environments? If so, do you have any tips?
I’m currently implementing Sensu in an environment with 1300+ machines and am running into performance issues: the server isn’t reporting data accurately, consistently, or in a timely manner.
We run multiple Sensu setups across different DCs that flex up and down with a large number of machines.
It sounds like your queues are getting backed up. We’re not doing anything special except that we run two-node setups, which gives us two sensu-servers to process messages and handle events. One of the nodes also takes the brunt of running RabbitMQ and Redis. These boxes range from 4 cores in our physical DCs to 8 cores in our cloud environments, because we have a lot more churn up and down there.
Are your boxes CPU bound? What does your queue depth look like? Every setup is different based on the number of checks, the number of machines, and what your handlers are doing. If you’re having event storms and lots of handlers are being spun up, you’ll need more capacity to deal with that. Since this sounds like a new setup, do you have a lot of events firing because you still have check tuning to do?
I’d start by looking at where you’re getting bound up. If things aren’t timely, look at how backed up your queues are getting.
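For example, a quick way to see that on the RabbitMQ node is to list the queues (on a stock Sensu setup the interesting ones should be keepalives and results):

    rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers

If messages keeps climbing while the consumer count stays flat, your sensu-servers aren’t keeping up with what the clients are publishing.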
-Bryan
Thanks for responding! The queues are definitely getting backed up, and Uchiwa is displaying data intermittently rather than consistently (e.g. sometimes critical check results are shown, but 0 clients appear as registered). I’ve tried tuning kernel parameters per RabbitMQ’s “Networking and RabbitMQ” documentation, and I’ve raised the max open file limits for the rabbitmq user (and the redis and sensu users as well), to no real avail.
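For reference, this is roughly the shape of what I’ve applied so far (the exact values are just illustrative, not recommendations):

    # /etc/security/limits.conf (or LimitNOFILE= if the services run under systemd)
    rabbitmq  soft  nofile  65536
    rabbitmq  hard  nofile  65536
    sensu     soft  nofile  65536
    sensu     hard  nofile  65536
    redis     soft  nofile  65536
    redis     hard  nofile  65536

    # /etc/sysctl.d/99-rabbitmq.conf, following the networking guide
    net.core.somaxconn = 4096
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_tw_reuse = 1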
The box I’m running on has 12 cores and 48 GB of RAM. I was recently advised to set up a secondary server like you mentioned, but the particulars were a bit ambiguous to me. I have a second server we use for testing that I could fold in, but I’m not sure to what degree. I’ve heard that the secondary box need only run sensu-server and sensu-api. Does that imply that Redis and RabbitMQ are NOT meant to be running on the secondary server?
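If it helps clarify what I’m picturing, the secondary’s /etc/sensu/config.json would look something like this, pointing back at RabbitMQ and Redis on the primary, with only sensu-server and sensu-api enabled on that box (the hostname and credentials here are made up):

    {
      "rabbitmq": {
        "host": "sensu-primary.example.com",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "secret"
      },
      "redis": {
        "host": "sensu-primary.example.com",
        "port": 6379
      },
      "api": {
        "host": "localhost",
        "port": 4567
      }
    }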
Right now, the boxes I’m working with are CPU bound, since they’re physical hosts; that could change in the future. I’m not sure what you mean by queue depth (I’m still rather new to some of the concepts re: RabbitMQ, so I apologize for my ignorance), but I’d be happy to dig that information up. Right now my clients are only reporting keepalives and nothing else (no handlers or additional checks implemented yet, because of the accordion-like way clients keep appearing and disappearing in Uchiwa).
The queues can go from 0 items to 87k (at the time of this response). Eventually they drain back down, but there’s no rhyme or reason to it that I can see so far, and that’s after tailing the RabbitMQ and Redis logs. Whenever I think I’ve spotted a culprit, it disappears and turns out to be a red herring.
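To try to line the spikes up against those logs, I’m sampling the queue depths on an interval with a rough loop along these lines (interval and log path are arbitrary):

    while true; do
      date
      rabbitmqctl list_queues name messages messages_unacknowledged consumers
      sleep 10
    done >> /tmp/sensu-queue-depths.log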
Again, thanks for the input and insight!
Just a note, in case someone years from now has a similar issue and can’t think of things to check or is at their wits’ end: look at the ARP cache on your RabbitMQ server. If the cache threshold is set to something like 1024, but you have several thousand machines trying to report in to RabbitMQ, you may need to tinker with the kernel’s ARP parameters:
e.g. net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2, net.ipv4.neigh.default.gc_thresh3
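Something along these lines (the values here are only illustrative; size gc_thresh3 comfortably above the number of hosts on the local segment, and check how close you are with ip neigh show | wc -l):

    # /etc/sysctl.d/99-arp.conf, then reload with: sysctl --system
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384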
HTH!