server-scheduled checks

If I have two machines (monitor1, monitor2) running sensu-server, and both have a non-standalone check defined,

… will both servers schedule the check, or will one notice that the other has and not do it within the check period?

… Does only one client subscriber perform the check, or do all clients that subscribe to it?

I’m working under the assumption that internal checks (load average, disk usage) should be standalone, and external checks (http health checks) should be server-scheduled non-standalone.

Mojo

With multiple sensu-server processes, only one is scheduling checks at any given time. This is coordinated via a master lock in the shared Redis instance.
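To make the locking idea concrete, here is a minimal sketch of how a timestamp lock in a shared store can keep only one server scheduling at a time. It is an illustration under assumptions, not Sensu's actual code: `FakeRedis` is an in-memory stand-in for the shared Redis instance, and the key name `lock:master`, the 60-second timeout, and the `Server` class are all hypothetical.

```python
class FakeRedis:
    """In-memory stand-in for the shared Redis instance (illustration only)."""

    def __init__(self):
        self._data = {}

    def setnx(self, key, value):
        """Set key only if it is absent; return True when the key was set."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def getset(self, key, value):
        """Store the new value and return the previous one."""
        previous = self._data.get(key)
        self._data[key] = value
        return previous


LOCK_KEY = "lock:master"  # hypothetical key name
LOCK_TIMEOUT = 60         # hypothetical: seconds before a stale lock may be taken


class Server:
    def __init__(self, name, redis):
        self.name = name
        self.redis = redis
        self.is_master = False

    def request_master(self, now):
        """Try to become (or remain) the scheduling master at time `now`."""
        if self.is_master:
            # Refresh the timestamp so the other servers see a live lock.
            self.redis.set(LOCK_KEY, now)
        elif self.redis.setnx(LOCK_KEY, now):
            # Nobody held the lock, so this server is now the master.
            self.is_master = True
        else:
            observed = self.redis.get(LOCK_KEY)
            if now - observed >= LOCK_TIMEOUT:
                # Only the server that swaps out the stale timestamp it
                # observed wins the takeover.
                previous = self.redis.getset(LOCK_KEY, now)
                self.is_master = previous == observed
        return self.is_master
```

With two servers sharing one store, the first call wins, the second server stays passive, and it takes over only after the master has stopped refreshing the lock for the timeout period. The sketch leaves out how a paused master would resign.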

All clients with the subscription associated with the check will execute the check.
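The fan-out behaviour can be sketched as a toy publish/subscribe model. This is only an illustration: `Transport`, `Client`, and `publish_check` are hypothetical names, the transport stands in for the real message bus, and the check command is a placeholder that is never run.

```python
from collections import defaultdict


class Transport:
    """Toy pub/sub keyed by subscription name (stands in for the message bus)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subscription, client):
        self._subscribers[subscription].append(client)

    def publish(self, subscription, request):
        # Every client bound to the subscription receives the request.
        return [c.execute(request) for c in self._subscribers[subscription]]


class Client:
    def __init__(self, name):
        self.name = name

    def execute(self, request):
        # A real client would run request["command"]; this only records who ran it.
        return {"client": self.name, "check": request["name"]}


def publish_check(transport, check):
    """Publish one check request to each subscription listed on the check."""
    request = {"name": check["name"], "command": check["command"]}
    results = []
    for subscription in check["subscribers"]:
        results.extend(transport.publish(subscription, request))
    return results
```

A check published to a subscription with three clients yields three results, one per client, which is why a check meant to run once needs a subscription with exactly one member or some extra coordination.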

In our installation, we do not run any standalone checks. Having Sensu publish/schedule check requests is the model to use.

(In reply to Mojo mojo.la@gmail.com, Mon, Feb 24, 2014 at 1:06 PM.)

Thanks, this is interesting.

So what is your configuration for checks that run outside of the machine being tested? I’m talking about things like a check of a web server or REST server from outside the server itself.

Do you have one (or two or three) client(s) on a subscription for those checks?

Do the subscription names take on the flavor of the test grouping somehow?

Mojo

(In reply to Greg Poirier greg.poirier@opower.com, Mon, Feb 24, 2014 at 3:15 PM.)

So, at the moment we have a single machine that executes these, but it’s untenable, particularly as we transition more and more checks from things like “execute this local command on a machine” to metrics-based checks that query a metric store and examine what is returned. This is largely just a limitation of our current monitoring, though.

The model we are moving toward now will probably remove checks entirely from Sensu. It is my experience that most of the things being monitored by checks are simply applications that lack instrumentation. So we are, in nearly every case, working to replace our checks with metrics.

It is our intention to then use another service to inspect those metrics in real-time. Sensu’s responsibility will be to query services for metrics and then store those metrics in our metric store.

The model we have now where a single machine executes all of our “examine this http endpoint” monitoring has served us fairly well for quite some time, though.

(In reply to Mojo mojo.la@gmail.com, Mon, Feb 24, 2014 at 4:19 PM.)

How are others doing this with Sensu? I have the same question.

We have multiple sensu servers. Metrics are in graphite so all checks are performed by querying graphite. Many metrics are not server-specific so there should only be 1 check performed. With 1 sensu server it’s easy because it can have its own subscription and those checks can reference the sensu server subscription. But once I add more sensu servers, each server would belong to that subscription and thus execute the check. Creating a unique subscription for each server creates other problems.

brian

(In reply to Greg Poirier, Monday, February 24, 2014 6:45:26 PM UTC-6.)

Could you elaborate on the issues with the setup you described above? Is it HA for having the checks run if one of the checkers goes down? Is it throughput on each checker?

Here are some potentially unhelpful ideas to get you thinking:

Maybe have two checking servers, and update the check-graphite.rb or whatever script you are using to write to the /stashes API as a kind of scheduling semaphore. Haven’t quite thought of how that could degrade gracefully if one server went down. This would basically let one client no-op at a time.

Make a subscription-per-host, schedule the checks on both subscriptions, and make one a ‘dependency’ of the other, so they would not both alert at the same time. This would still show both events in the dashboard on failure. This would basically just prevent dual alerting.
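The first idea above, a shared semaphore that lets one checker no-op, could look roughly like the sketch below. It assumes the claim carries an expiry so that a surviving server can take over when the holder goes down. `StashStore` is an in-memory stand-in, not the real /stashes API, and every name and the path layout are hypothetical. A real version against an HTTP API would also have to deal with the read-then-write step not being atomic.

```python
class StashStore:
    """In-memory stand-in for a shared key/value store with expiry."""

    def __init__(self):
        self._entries = {}  # path -> (owner, expires_at)

    def acquire(self, path, owner, now, ttl):
        """Claim or renew `path`; return True if `owner` holds it afterwards."""
        current = self._entries.get(path)
        if current is not None:
            holder, expires_at = current
            if holder != owner and now < expires_at:
                return False  # another host holds a live claim
        self._entries[path] = (owner, now + ttl)
        return True


def run_single_instance_check(store, host, check_name, now, ttl, check):
    """Run `check` only if `host` holds the claim; otherwise no-op (None)."""
    if not store.acquire("semaphore/" + check_name, host, now, ttl):
        return None
    return check()
```

If the TTL is somewhat longer than the check interval, the current holder keeps renewing its claim on every run, and the other server only starts executing once a full TTL has passed without a renewal.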

(In reply to Brian Boyd, Monday, March 3, 2014 12:49:59 PM UTC-8.)

Mojo, correct me if I’m wrong, but your question seems to cover both the scheduling of checks (which Greg responded to) and the execution of checks.

My question is only around the execution of checks. At the moment we only have 1 sensu server but for HA purposes plan to add a second soon, so I’m thinking through how this will work. For many checks I want each sensu agent to execute the check - no problems there. However, there are some checks that I only want executed on 1 server. This currently isn’t a problem - the subscriber for that check is a topic that only the sensu server subscribes to - the sensu server executes these checks. With only 1 server at the moment this works well, the check is executed by only 1 server. But now I want to add a second sensu server, ideally identical, but that means it would also subscribe to the same topic and I now have 2 servers executing the check.

It seems there are various options, like what you’ve mentioned, Nick, but they seem like hacks. I’m sure I’m not the first person to have this challenge, so I was curious how others are dealing with it. If I lose 1 sensu server then ideally I wouldn’t have to make any changes; the 1 remaining server would handle all “single instance” checks. If the 2nd sensu server comes back online, they shouldn’t both execute the checks (ideally they would split them up).

A couple examples of these single instance checks:

  • Checking the aggregate fail rate or response time for an application (the aggregate is already a single metric in graphite). If more than 1 server executes the check then I’ll get multiple alerts (think of a farm with many servers, each generating the same alert for a metric that isn’t even specific to that server but rather the same metric).

  • A metric check for a network device (something that can’t run the sensu agent and therefore snmp is used). If more than 1 server executes the check I’m just retrieving the same value from the device more than once.

The single instance checks don’t have to execute on the sensu server, I could easily create another system just for that purpose, and then perhaps use another HA solution for ensuring only one of them is active at a time. Perhaps the answer is something outside of sensu.
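One way the "split them up" behaviour described above might work outside of Sensu is rendezvous (highest-random-weight) hashing. This is a swapped-in technique for illustration, not something Sensu provides. Each server computes the same owner for a check from the list of live servers. The function names are hypothetical, and how each server learns which peers are alive is assumed to be handled elsewhere.

```python
import hashlib


def owner(check_name, live_servers):
    """Return the live server with the highest hash score for this check."""

    def score(server):
        digest = hashlib.sha256(f"{server}:{check_name}".encode()).hexdigest()
        return int(digest, 16)

    # Raises ValueError if live_servers is empty.
    return max(live_servers, key=score)


def should_run(check_name, me, live_servers):
    """True when `me` is the one server responsible for this check."""
    return owner(check_name, live_servers) == me
```

With one live server, that server owns every check. With two, each check has exactly one owner, and when a server drops out only the checks it owned move to the survivor.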

brian

(In reply to Nick Stielau, Tuesday, March 4, 2014 11:01:22 AM UTC-6.)

I agree that this particular class of monitoring isn’t super elegant right now with Sensu.

I too would like to have a check executed by the sensu-client that is running on the sensu-servers, but only the current master.

Also, I want the “client name” to be “masqueraded” to be the remote thing I’m checking. (but I still want the originating host that happened to run the check to be in the client data)

https://github.com/sensu/sensu/issues/541

(at least, I think I want this)

It would be cool if this were part of the core functionality (for example, a different class of check: subscribed/standalone/server-side?). But if not, as long as the client data / masquerading feature were implemented, I think I would be happy hacking it if need be (using the redis lock as a semaphore).

(In reply to Brian Boyd brian@boydduo.com, Tue, Mar 4, 2014 at 7:27 PM.)