Anyone out there running Sensu in large production environments? If so, do you have any tips?
I’m currently implementing Sensu in an environment with 1300+ machines and am running into performance issues: the server isn’t reporting data accurately, consistently, or in a timely manner.
We run multiple Sensu setups across different DCs that flex up and down with a large number of machines.
It sounds like your queues are getting backed up. We’re not doing anything special except that we run two-node setups, which gives us two sensu-servers to process messages and handle events. One of the nodes also takes the brunt of running RabbitMQ and Redis. These boxes range from 4 cores in our physical DCs to 8 cores in our cloud environments, because we have a lot more churn up and down there.
Are your boxes CPU bound? What does your queue depth look like? Every setup is different based on the number of checks, the number of machines, and what your handlers are doing. If you’re having event storms and lots of handlers are being spun up, you’ll need more capacity to deal with that. Since this sounds like a new setup, do you have a lot of events firing because you still have check tuning to do?
I’d start by looking at where you’re getting bound up. If things aren’t timely, look at how backed up your queues are getting.
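For example, a quick way to see that on the RabbitMQ node is to list the queues (on a stock Sensu setup the interesting ones should be keepalives and results):

    rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers

If messages keeps climbing while the consumer count stays flat, your sensu-servers aren’t keeping up with what the clients are publishing.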
-Bryan
Thanks for responding! The queues are definitely getting backed up, and Uchiwa is displaying data intermittently rather than consistently (e.g. sometimes critical check results are shown, but 0 clients appear as registered). I’ve tried tuning kernel parameters per RabbitMQ’s “Networking and RabbitMQ” documentation, and I’ve raised the max open file limits for the rabbitmq user (and the redis and sensu users as well), to no real avail.
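For reference, this is roughly the shape of what I’ve applied so far (the exact values are just illustrative, not recommendations):

    # /etc/security/limits.conf (or LimitNOFILE= if the services run under systemd)
    rabbitmq  soft  nofile  65536
    rabbitmq  hard  nofile  65536
    sensu     soft  nofile  65536
    sensu     hard  nofile  65536
    redis     soft  nofile  65536
    redis     hard  nofile  65536

    # /etc/sysctl.d/99-rabbitmq.conf, following the networking guide
    net.core.somaxconn = 4096
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_tw_reuse = 1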
The box I’m running on has 12 cores and 48 GB of RAM. I was recently advised to set up a secondary server like you mentioned, but the particulars were a bit ambiguous to me. I have a second server we use for testing that I could fold in, but I’m not sure to what degree. I’ve heard that the secondary box need only run sensu-server and sensu-api. Does that imply that Redis and RabbitMQ are NOT meant to be running on the secondary server?
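If it helps clarify what I’m picturing, the secondary’s /etc/sensu/config.json would look something like this, pointing back at RabbitMQ and Redis on the primary, with only sensu-server and sensu-api enabled on that box (the hostname and credentials here are made up):

    {
      "rabbitmq": {
        "host": "sensu-primary.example.com",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "secret"
      },
      "redis": {
        "host": "sensu-primary.example.com",
        "port": 6379
      },
      "api": {
        "host": "localhost",
        "port": 4567
      }
    }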
Right now, the boxes I’m working with are CPU bound, since they’re physical hosts; that could change in the future. I’m not sure what you mean by queue depth (I’m still rather new to some of the concepts re: RabbitMQ, so I apologize for my ignorance), but I’d be happy to dig that information up. Right now my clients are only reporting keepalives and nothing else (no handlers or additional checks implemented yet, because of the accordion-like way clients keep appearing and disappearing in Uchiwa).
The queues can go from 0 items to 87k (at the time of this response). Eventually they drain back down, but there’s no rhyme or reason to it that I can see so far, and that’s after tailing the RabbitMQ and Redis logs. Whenever I think I’ve spotted a culprit, it disappears and turns out to be a red herring.
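To try to line the spikes up against those logs, I’m sampling the queue depths on an interval with a rough loop along these lines (interval and log path are arbitrary):

    while true; do
      date
      rabbitmqctl list_queues name messages messages_unacknowledged consumers
      sleep 10
    done >> /tmp/sensu-queue-depths.log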
Again, thanks for the input and insight!
Just a note, in case someone years from now has a similar issue and can’t think of things to check or is at their wits’ end: look at the ARP cache on your RabbitMQ server. If the cache threshold is set to something like 1024, but you have several thousand machines trying to report in to RabbitMQ, you may need to tinker with the kernel’s ARP parameters:
e.g. net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2, net.ipv4.neigh.default.gc_thresh3
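Something along these lines (the values here are only illustrative; size gc_thresh3 comfortably above the number of hosts on the local segment, and check how close you are with ip neigh show | wc -l):

    # /etc/sysctl.d/99-arp.conf, then reload with: sysctl --system
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384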
HTH!