We’re monitoring 30k+ servers with about 45 checks, using a single 16-core, 64 GB RabbitMQ server and two 4-core, 8 GB Sensu servers. Our results and keepalive queues are consistently piling up. Over the past month, the maximum queue depth has grown from about 100k to about 300k messages as of today (our client count increases daily).
The last time messages accumulated in the queues like this, we scaled our RabbitMQ server up from 8 cores to the current 16 cores, which solved the problem for about 3 months. The time before that, we scaled up from 4 cores to 8 cores, which solved the problem for about 8 months. Now it’s going over budget, so we can’t scale RabbitMQ vertically any further.
- Can we somehow use the RabbitMQ sharding functionality with Sensu? I believe the bottleneck is RabbitMQ, because adding more Sensu servers does not help. My previous attempts at creating a RabbitMQ cluster gave poor results because of slow syncing. By sharding the keepalive and results queues, we can hopefully improve performance. Is that possible?
- If not, is there something else that I can tune to improve performance?
Extra info:
- All servers running Debian 8.
- Prefetch 50 (This is my tested optimum, anything more or less causes degradation.)
- Sensu version 0.29
- We don’t run any mutators, but run some filters for handlers and send data to whisper/carbon through relays.
- Our checks run at a minimum interval of 5 minutes. Needless to say, our cooling-off period is not the problem.
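For a sense of scale, here is a rough back-of-the-envelope estimate of the message rates these numbers imply. It is only a sketch: it assumes the ~45 checks run on every server, that every check runs at the shortest (5 minute) interval, which makes the results rate an upper bound, and that keepalives are sent every 20 seconds, which I believe is the Sensu default but have not stated above.

```python
# Rough message-rate estimate from the figures in this post.
CLIENTS = 30_000             # "30k+ servers"
CHECKS_PER_CLIENT = 45       # ASSUMPTION: the ~45 checks run on every server
CHECK_INTERVAL_S = 5 * 60    # shortest check interval, so this gives an upper bound
KEEPALIVE_INTERVAL_S = 20    # ASSUMPTION: Sensu default keepalive interval
BACKLOG_MESSAGES = 300_000   # current maximum queue depth


def results_per_second(clients, checks, interval_s):
    """Check results published per second if every check runs every interval_s."""
    return clients * checks / interval_s


def keepalives_per_second(clients, interval_s):
    """Keepalive messages published per second."""
    return clients / interval_s


results_rate = results_per_second(CLIENTS, CHECKS_PER_CLIENT, CHECK_INTERVAL_S)
keepalive_rate = keepalives_per_second(CLIENTS, KEEPALIVE_INTERVAL_S)

# How many seconds of result traffic the backlog represents at that rate.
backlog_seconds = BACKLOG_MESSAGES / results_rate

print(f"results/s (upper bound): {results_rate:.0f}")
print(f"keepalives/s:            {keepalive_rate:.0f}")
print(f"backlog is roughly {backlog_seconds:.0f} s of result traffic")
```

Under those assumptions, the two Sensu servers together would need to sustain a few thousand messages per second just to keep the queues level.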
I have a slight hunch that the relay on the Sensu servers is the problem. Sending every metric from 30k+ servers to an external server could keep the Sensu servers busy relaying instead of consuming from the queue. But, alas, I don’t know how to test or prove this.
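One possible starting point, sketched below, would be to compare how fast messages are published to a queue with how fast the Sensu servers acknowledge them, using the RabbitMQ management HTTP API. This assumes the management plugin is enabled; the base URL, vhost, queue name, and credentials passed to `fetch_queue` are placeholders, not values from our setup. It would not prove the relay is at fault on its own, but comparing the ack rate with the relay handler enabled and then temporarily disabled might.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def fetch_queue(base_url, vhost, queue, user, password):
    """Fetch one queue's details from the RabbitMQ management API.

    All arguments are placeholders supplied by the caller; the vhost and
    queue name are URL-encoded because a vhost such as "/sensu" contains "/".
    """
    url = f"{base_url}/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def queue_rates(queue_info):
    """Summarise publish vs. ack rates from a management API queue object."""
    stats = queue_info.get("message_stats", {})
    publish = stats.get("publish_details", {}).get("rate", 0.0)
    ack = stats.get("ack_details", {}).get("rate", 0.0)
    return {
        "depth": queue_info.get("messages", 0),
        "consumers": queue_info.get("consumers", 0),
        "publish_rate": publish,
        "ack_rate": ack,
        # Positive deficit: messages arrive faster than consumers ack them.
        "deficit": publish - ack,
    }
```

Calling `queue_rates(fetch_queue(...))` for the results queue every minute or so and logging the output would show whether the deficit shrinks when the relay is switched off. If the ack rate stays flat either way, the relay is probably not the limiting factor.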
Any help will be greatly appreciated. Thanks.