I run five Sensu server instances monitoring around 200 hosts, with an incoming stream of about 500-600 check results per second. Sometimes the number of unacked messages in the keepalive queue grows to the point where the Sensu servers alert that hosts are down. Restarting the Sensu servers often fixes this, and sometimes purging the queue does, but I can’t figure out the root cause.

From the logs, it looks like some of the servers simply stop processing events, with no errors logged. All of our handlers are extensions (otherwise the hosts get horribly overloaded forking scripts many times), so I suspect something in one of them is holding things up, but I can’t think of a good way to debug it. Most of the time everything works fine, but recently it all goes haywire about once every few days.
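
To catch the moment a server stalls, one option would be to poll the broker's queue stats and flag any queue with a large unacked backlog. This is only a rough sketch, assuming the transport is RabbitMQ with the management plugin enabled (its `/api/queues` endpoint reports `messages_unacknowledged` and `consumers` per queue) and that the queue is named `keepalives`. The `stalled_queues` helper, the threshold of 1000, and the sample data are made up for illustration:

```python
import json


def stalled_queues(queues, max_unacked=1000):
    """Return (name, unacked, consumers) for queues over the unacked threshold.

    `queues` is a list of dicts shaped like a RabbitMQ management
    GET /api/queues response. The threshold is an arbitrary assumption.
    """
    flagged = []
    for q in queues:
        unacked = q.get("messages_unacknowledged", 0)
        if unacked > max_unacked:
            flagged.append((q["name"], unacked, q.get("consumers", 0)))
    return flagged


# Hypothetical sample data, not real output from my cluster.
sample = json.loads("""
[
  {"name": "keepalives", "messages_unacknowledged": 5400, "consumers": 5},
  {"name": "results", "messages_unacknowledged": 12, "consumers": 5}
]
""")

for name, unacked, consumers in stalled_queues(sample):
    print(f"{name}: {unacked} unacked across {consumers} consumers")
```

Run on a schedule against the real API response, something like this would at least timestamp when the backlog starts building, which could then be lined up with the server logs to see which handler extension ran last.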
Any tips on where to start looking would be greatly appreciated!