I’ve been building out Sensu monitoring for a cluster of about 40 machines.
I have two VMs dedicated to Sensu: RabbitMQ on both, Redis on one, and the Sensu API, client, server, and dashboard on both.
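For reference, the broker and datastore connection settings are the standard JSON drop-ins under /etc/sensu/conf.d; merged into one block for brevity, mine look roughly like this (hostnames and credentials are placeholders):

    {
      "rabbitmq": {
        "host": "sensu-vm1.example.com",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "secret"
      },
      "redis": {
        "host": "sensu-vm1.example.com",
        "port": 6379
      }
    }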
I have two Graphite machines collecting metrics, which the Sensu server echoes to RabbitMQ through the graphite mutator.
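The wiring for that is the usual graphite mutator plus an AMQP-type handler publishing to a metrics exchange; simplified, and with paths and names that are illustrative rather than exact, it looks like:

    {
      "mutators": {
        "graphite_mutator": {
          "command": "/etc/sensu/mutators/graphite_mutator.rb"
        }
      },
      "handlers": {
        "graphite": {
          "type": "amqp",
          "mutator": "graphite_mutator",
          "exchange": {
            "type": "topic",
            "name": "metrics",
            "durable": true
          }
        }
      }
    }

The Graphite boxes then consume from that metrics exchange via carbon's AMQP listener.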
I have a handful of checks and metrics, with more to come: disk, NTP, memory, load, and a few application endpoints. I have an email handler for events, forwarding to Gmail.
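A typical check-plus-handler definition, to give a sense of the shape (check name and plugin paths are illustrative):

    {
      "checks": {
        "check_disk": {
          "command": "/etc/sensu/plugins/check-disk.rb",
          "subscribers": ["all"],
          "interval": 60,
          "handlers": ["mailer"]
        }
      },
      "handlers": {
        "mailer": {
          "type": "pipe",
          "command": "/etc/sensu/handlers/mailer.rb"
        }
      }
    }

(Since mailer is a pipe handler, I assume each handled event forks its own Ruby process.)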
Right now my two Sensu machines are running a load average of around 10-20, usually about 18.
Typically my waiting processes include a few mailer handlers (eight when I glanced just now) and a few graphite mutators (four at the same glance).
I’m not sure why the mailers are running and waiting; I see them there whether or not there’s an event to report.
Given that it’s a brain-dead simple configuration, is there something obviously wrong? The machines have 2 GB of memory, running at about 75% usage (no swap).
I can think of a few things to do:

- Reduce the frequency of the checks (all at 60s right now; see the sketch after this list)
- Split RabbitMQ off onto other machines
- Split off sensu-server or sensu-api
- Dig into the code to find what’s driving the load (mailer? graphite-mutator?)
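For the first item, the change would just be raising the interval in each check definition, e.g. from 60 to 300 seconds (check name illustrative):

    {
      "checks": {
        "check_load": {
          "command": "/etc/sensu/plugins/check-load.rb",
          "subscribers": ["all"],
          "interval": 300,
          "handlers": ["mailer"]
        }
      }
    }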