I did some more investigation, and it looks like the checks continue if I stop a non-leader sensu server. However, stopping the leader causes the checks to stop. I can see in the logs that a new leader is elected after about a minute, but the health checks never resume. The new leader just keeps printing the following every 30 seconds:
{"timestamp":"2016-08-12T18:01:40.912790+0000","level":"info","message":"i am now the leader"}
{"timestamp":"2016-08-12T18:02:10.912819+0000","level":"info","message":"determining stale clients"}
{"timestamp":"2016-08-12T18:02:10.913089+0000","level":"info","message":"determining stale check results"}
{"timestamp":"2016-08-12T18:02:40.913197+0000","level":"info","message":"determining stale check results"}
{"timestamp":"2016-08-12T18:03:10.914089+0000","level":"info","message":"determining stale clients"}
{"timestamp":"2016-08-12T18:03:10.914443+0000","level":"info","message":"determining stale check results"}
{"timestamp":"2016-08-12T18:03:40.915434+0000","level":"info","message":"determining stale clients"}
{"timestamp":"2016-08-12T18:03:40.915750+0000","level":"info","message":"determining stale check results"}
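In case it helps anyone compare against their own logs, here is a rough Python sketch that tallies which messages a server logs after its most recent "i am now the leader" line. It assumes one JSON object per log line, as in the excerpt above. The function name `messages_after_leader` and the `SAMPLE` list are made up for illustration and are not part of Sensu.

```python
import json
from collections import Counter

# Message text taken from the log excerpt in this post.
LEADER_MESSAGE = "i am now the leader"


def messages_after_leader(lines):
    """Count log messages seen after the most recent leader election line."""
    counts = Counter()
    is_leader = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not log objects
        message = entry.get("message")
        if message == LEADER_MESSAGE:
            # A new election: start counting from scratch.
            counts = Counter()
            is_leader = True
        elif is_leader and message is not None:
            counts[message] += 1
    return counts


# First four lines of the excerpt from this post, used as sample input.
SAMPLE = [
    '{"timestamp":"2016-08-12T18:01:40.912790+0000","level":"info","message":"i am now the leader"}',
    '{"timestamp":"2016-08-12T18:02:10.912819+0000","level":"info","message":"determining stale clients"}',
    '{"timestamp":"2016-08-12T18:02:10.913089+0000","level":"info","message":"determining stale check results"}',
    '{"timestamp":"2016-08-12T18:02:40.913197+0000","level":"info","message":"determining stale check results"}',
]
```

Calling `messages_after_leader(SAMPLE)` gives a count per message. In my case the only entries are the two "determining stale ..." messages, so nothing related to scheduling or publishing checks shows up after the election. The same function accepts an open log file, since it only iterates over lines.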
I’m not sure if this is a bug or an issue with my setup, so I’m wondering whether anyone else is seeing this or knows of a bug that would cause it.
Thanks!
On Friday, August 12, 2016 at 9:43:06 AM UTC-7, Kevin Lee wrote:
I’m running 3 sensu instances, each with sensu-server, sensu-client, and sensu-api, using Redis (which uses Sentinel) as both the transport and the data store. When I define a health check on one sensu server for a subscription that the client on each sensu server is subscribed to, I first have to stop all sensu-server, sensu-client, and sensu-api instances on all machines and then, only AFTER they are all down, bring them all back up. If I don’t bring them all down first, the health check never starts running.
Once everything is back up, the health check runs on all three instances. If I then stop one sensu-server, the health checks stop running on the remaining instances. If I bring that server back up, the health check still does not run. To get the health check running again, I must stop ALL sensu servers, clients, and APIs and then, AFTER all are down, start them back up again. If I try to restart each one without first bringing them all down, the check fails to run, just like when adding a new health check.
From what I can tell, there are no errors in any of the logs.
Is this expected behavior? If not, any ideas on what the issue could be?