Sensu Checks Not Running After Sensu Server Restart In Multiple Instance Deployment

I’m running three Sensu instances, each with sensu-server, sensu-client, and sensu-api, using Redis (with Sentinel) as both the transport and the data store. When I define a health check on one Sensu server for a subscription that every server’s client is subscribed to, the check never starts running unless I first stop sensu-server, sensu-client, and sensu-api on all machines and then, only AFTER they are all down, bring them back up.

Once everything is back up, the health check runs on all three instances. But if I then stop one sensu-server, the health check stops running on the remaining instances, and bringing that server back up does not restart it. To get the health check running again I must stop ALL Sensu servers, clients, and APIs and then, AFTER all are down, start them again. Restarting them one at a time without first bringing them all down fails, just like when adding a new health check.

From what I can tell, there are no errors in any of the logs.

Is this expected behavior? If not, any ideas on what the issue could be?

After some more investigation, it looks like the checks continue if I stop a non-leader Sensu server, but stopping the leader causes them to stop. I can see in the logs that a new leader is elected after about a minute, but the health checks never resume. The new leader just prints the following every 30 seconds:

{"timestamp":"2016-08-12T18:01:40.912790+0000","level":"info","message":"i am now the leader"}

{"timestamp":"2016-08-12T18:02:10.912819+0000","level":"info","message":"determining stale clients"}

{"timestamp":"2016-08-12T18:02:10.913089+0000","level":"info","message":"determining stale check results"}

{"timestamp":"2016-08-12T18:02:40.913197+0000","level":"info","message":"determining stale check results"}

{"timestamp":"2016-08-12T18:03:10.914089+0000","level":"info","message":"determining stale clients"}

{"timestamp":"2016-08-12T18:03:10.914443+0000","level":"info","message":"determining stale check results"}

{"timestamp":"2016-08-12T18:03:40.915434+0000","level":"info","message":"determining stale clients"}

{"timestamp":"2016-08-12T18:03:40.915750+0000","level":"info","message":"determining stale check results"}

I’m not sure whether this is a bug or an issue with my setup, so I’m wondering if anyone else is seeing this or knows of a bug that causes it.

Thanks!

On Friday, August 12, 2016 at 9:43:06 AM UTC-7, Kevin Lee wrote: [original post, shown in full at the top of the thread]

The issue was that I did not realize the check definition needs to be deployed to every server, so that when leadership fails over, the server that takes over can load the check. I was under the false assumption that the definition was stored somewhere in Redis. Things are working as expected now that the check configuration is deployed on each server.
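For anyone hitting the same thing, here is a minimal sketch of a sanity check for this situation: given the check names each server actually has on disk, it reports which servers are missing a check that exists elsewhere. The helper `missing_checks`, the host names, and the inventory are all hypothetical; how you collect the names (for example, by listing the check files in each server's configuration directory) is left as an assumption.

```python
from typing import Dict, Iterable, Set


def missing_checks(checks_by_server: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Return, per server, the check names defined on some other server but absent locally.

    Servers that already have every check are omitted from the result.
    """
    names = {server: set(checks) for server, checks in checks_by_server.items()}
    # Union of every check name seen on any server.
    all_checks: Set[str] = set().union(*names.values())
    return {
        server: all_checks - local
        for server, local in names.items()
        if all_checks - local
    }


# Hypothetical inventory: the check was only deployed to the server where it was defined.
inventory = {
    "sensu-1": ["health_check"],
    "sensu-2": [],
    "sensu-3": [],
}

print(missing_checks(inventory))
```

With this inventory the result flags sensu-2 and sensu-3 as missing `health_check`, which matches the failure described above: if either of them becomes leader, it has no definition to load. An empty result means every server carries the same set of check names.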

On Friday, August 12, 2016 at 11:06:15 AM UTC-7, Kevin Lee wrote: [follow-up post with the leader-election logs, shown in full above]