Mojo, correct me if I’m wrong, but your question seems to cover both the scheduling of checks (which Greg responded to) and the execution of checks.
My question is only about the execution of checks. At the moment we have only 1 Sensu server, but for HA purposes we plan to add a second soon, so I’m thinking through how this will work. For many checks I want each Sensu agent to execute the check; no problems there. However, there are some checks that I want executed on only 1 server. This currently isn’t a problem: such a check is published to a topic that only the Sensu server subscribes to, so the Sensu server executes it. With only 1 server this works well, because the check is executed exactly once. But now I want to add a second Sensu server, ideally identical to the first, which means it would subscribe to the same topic and I would have 2 servers executing the check.
It seems there are various options, like the ones you’ve mentioned, Nick, but they seem like hacks. I’m sure I’m not the first person to face this challenge, so I was curious how others are dealing with it. If I lose 1 Sensu server, then ideally I wouldn’t have to make any changes: the 1 remaining server would handle all “single instance” checks. When the 2nd Sensu server comes back online, the two should not both execute the checks (ideally they would split them up).
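To make the "split them up, and fail over without changes" behaviour concrete, here is a rough sketch using rendezvous (highest-random-weight) hashing. This is a generic technique, not a Sensu feature. The function names and the `live_servers` list are hypothetical, and the sketch assumes each server can somehow learn which of its peers are currently alive, which is the hard part and is not shown.

```python
import hashlib


def owner(check_name, live_servers):
    """Return the one server that should run check_name.

    Rendezvous hashing: every server computes the same score for each
    (server, check) pair, and the highest score wins.
    """
    if not live_servers:
        raise ValueError("no live servers")

    def score(server):
        key = f"{server}|{check_name}".encode("utf-8")
        return int(hashlib.sha256(key).hexdigest(), 16)

    return max(live_servers, key=score)


def should_run(check_name, me, live_servers):
    """True if the server named `me` is the owner of this check right now."""
    return owner(check_name, live_servers) == me


if __name__ == "__main__":
    # Hypothetical server and check names, for illustration only.
    servers = ["sensu-1", "sensu-2"]
    for check in ["app-fail-rate", "app-response-time", "switch-snmp"]:
        print(check, "->", owner(check, servers))
```

With both servers alive, each check has exactly one owner and the checks are spread across the two. If one server drops out of `live_servers`, the survivor owns everything, and checks it already owned do not move.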
A couple of examples of these single instance checks:
Checking the aggregate fail rate or response time for an application (the aggregate is already a single metric in graphite). If more than 1 server executes the check then I’ll get multiple alerts (think of a farm with many servers, each generating the same alert for a metric that isn’t specific to any one server).
A metric check for a network device (something that can’t run the Sensu agent, so SNMP is used instead). If more than 1 server executes the check, I’m just retrieving the same value from the device more than once.
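For the first example, the core of such a check is small. The sketch below assumes the series has already been fetched from graphite as JSON, in the render API's shape of `[value, timestamp]` datapoints, and maps the latest value to the usual 0/1/2/3 check exit statuses. The metric name and thresholds are made up, and fetching the data is left out.

```python
# Conventional check exit statuses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(series, warn, crit):
    """Map the latest non-null datapoint of one series to (status, message).

    `series` is one element of graphite's JSON render output:
    {"target": "...", "datapoints": [[value, timestamp], ...]}
    """
    values = [v for v, _ts in series.get("datapoints", []) if v is not None]
    if not values:
        return UNKNOWN, "no data"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"{latest} >= {crit}"
    if latest >= warn:
        return WARNING, f"{latest} >= {warn}"
    return OK, f"{latest} < {warn}"


if __name__ == "__main__":
    # Hypothetical aggregate fail-rate series; the trailing null is skipped.
    sample = {
        "target": "app.fail_rate",
        "datapoints": [[0.5, 1393952400], [7.2, 1393952460], [None, 1393952520]],
    }
    status, message = evaluate(sample, warn=5, crit=10)
    print(status, message)
    raise SystemExit(status)
```

Because the input is one aggregate series rather than anything host-specific, every server that runs this produces an identical result, which is exactly why running it more than once only duplicates alerts.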
The single instance checks don’t have to execute on the Sensu server; I could easily create another system just for that purpose, and then perhaps use another HA solution to ensure only one of them is active at a time. Perhaps the answer is something outside of Sensu.
On Tuesday, March 4, 2014 11:01:22 AM UTC-6, Nick Stielau wrote:
Could you elaborate on the issues with the setup you described above? Is the concern HA, i.e. having the checks still run if one of the checkers goes down? Or is it throughput on each checker?
Here are some potentially unhelpful ideas to get you thinking:
Maybe have two checking servers, and update the check-graphite.rb or whatever script you are using to write to the /stashes API as a kind of scheduling semaphore. Haven’t quite thought of how that could degrade gracefully if one server went down. This would basically let one client no-op at a time.
Make a subscription-per-host, schedule the checks on both subscriptions, and make one a ‘dependency’ of the other, so they would not both alert at the same time. This would still show both events in the dashboard on failure. This would basically just prevent dual alerting.
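On the first idea, one possible answer to the graceful-degradation question is to make the semaphore a lease that expires: the holder renews it on every run, and if the holder dies the other checker takes over once the lease runs out. The sketch below uses an in-memory class as a stand-in for the shared store. `LeaseStore` and its method are hypothetical names, not the stashes API, and a real shared store would need an atomic check-and-set, which this sketch does not establish the stashes API can provide.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared key/value store with expiring entries."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (holder, expires_at)

    def acquire(self, key, holder, ttl):
        """Take or renew the lease on `key`; return True if `holder` owns it."""
        now = self._clock()
        current = self._entries.get(key)
        if current is not None:
            owner, expires_at = current
            # Someone else holds a lease that has not expired yet: back off.
            if owner != holder and expires_at > now:
                return False
        self._entries[key] = (holder, now + ttl)
        return True


if __name__ == "__main__":
    store = LeaseStore()
    # Each checker would call this at the top of the check script and
    # exit quietly (no-op) when it does not get the lease.
    for me in ["checker-a", "checker-b"]:
        if store.acquire("check-graphite", holder=me, ttl=90):
            print(me, "runs the check")
        else:
            print(me, "no-ops this interval")
```

The TTL should be longer than the check interval so the current holder keeps renewing. The cost is that after a failure the check goes unexecuted for up to one TTL before the other checker picks it up.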
On Monday, March 3, 2014 12:49:59 PM UTC-8, Brian Boyd wrote:
How are others doing this with Sensu? I have the same question.
We have multiple Sensu servers. Metrics are in graphite, so all checks are performed by querying graphite. Many metrics are not server-specific, so there should only be 1 check performed. With 1 Sensu server it’s easy because it can have its own subscription and those checks can reference the Sensu server subscription. But once I add more Sensu servers, each server would belong to that subscription and thus execute the check. Creating a unique subscription for each server creates other problems.
On Monday, February 24, 2014 6:45:26 PM UTC-6, Greg Poirier wrote:
So, at the moment we have a single machine that executes these, but it’s untenable, particularly as we transition more and more checks from things like “execute this local command on a machine” to metrics-based checks that query a metric store and examine what is returned. This is largely just a limitation of our current monitoring, though.
The model we are moving toward now will probably remove checks entirely from Sensu. It is my experience that most of the things being monitored by checks are simply applications that lack instrumentation. So we are, in nearly every case, working to replace our checks with metrics.
It is our intention to then use another service to inspect those metrics in real-time. Sensu’s responsibility will be to query services for metrics and then store those metrics in our metric store.
The model we have now where a single machine executes all of our “examine this http endpoint” monitoring has served us fairly well for quite some time, though.