On one of our Sensu deployments, I am running into an issue
where subscription checks are not getting scheduled. The sensu-server (running
on CentOS + official sensu rpm from yum repo) is on 1.4.2 while the clients are
a mix of 1.3.3 and 1.4.2. There are over 50 clients in the deployment.
Standalone checks and keepalives from each client are working though. What
I have tried so far:
- Ensured time is in sync on sensu-master and sensu-clients
- I am using Redis, so I ran flushdb and flushall to clear the state
- Downgraded sensu-server to 1.3.3
I did manage to make things work briefly by doing the below steps:
- Stopped sensu-client, sensu-server and sensu-api on the designated master node
- Flushed Redis and restarted it to listen only on loopback
- Started sensu-client, sensu-server and sensu-api on the designated master node, connecting to Redis over loopback
- The sensu-client running on the master node showed all the subscription checks after this
- Restarted Redis after updating its config to bind to all interfaces
- All the other 49 nodes then connected and scheduled checks successfully
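For reference, the first part of that sequence can be sketched roughly as below. This is only an illustration: it assumes SysV-style `service` commands, a Redis service named `redis`, and `redis-cli` on the master node, none of which are copied from the actual deployment. The edits to the Redis bind setting remain a manual step, and `recovery_commands` and `run` are hypothetical helper names.

```python
import subprocess

# Service names as used in the steps above.
SENSU_SERVICES = ["sensu-client", "sensu-server", "sensu-api"]


def recovery_commands():
    """Return the ordered commands for the stop / flush / restart sequence."""
    cmds = [["service", name, "stop"] for name in SENSU_SERVICES]
    # Clear all Redis state (assumes redis-cli talks to the local instance).
    cmds.append(["redis-cli", "flushall"])
    # Manual step before this restart: set Redis to bind to loopback only.
    cmds.append(["service", "redis", "restart"])
    cmds += [["service", name, "start"] for name in SENSU_SERVICES]
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is stopped or flushed by accident.
    run(recovery_commands())
```

The later rebind to all interfaces and the second Redis restart are not covered here, since they depend on how the Redis config is managed.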
But after 2 days, the checks went stale again. The team has updated all the
sensu-clients to 1.4.2 but the problem persists.
Any idea what could be happening here? The setup was working without issues for over 3 months. The logs
were not very helpful - I have yet to see any errors related to publishing check requests. In fact, I don't
see any publish check requests in the logs at all.
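To make the "no publish check requests" observation reproducible, a small script like the one below can count them in the server log. It is a sketch under assumptions: that the server log is JSON lines with a `message` field, that the message text contains "publishing check request", and that the log lives at `/var/log/sensu/sensu-server.log`. Adjust these if your setup differs; `count_publish_requests` is a hypothetical helper name.

```python
import json


def count_publish_requests(lines, needle="publishing check request"):
    """Count JSON log lines whose 'message' field contains the needle."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON (e.g. stack traces).
            continue
        if not isinstance(entry, dict):
            continue
        if needle in str(entry.get("message", "")):
            count += 1
    return count


if __name__ == "__main__":
    # Assumed log location; change it to match your installation.
    path = "/var/log/sensu/sensu-server.log"
    try:
        with open(path) as handle:
            print(count_publish_requests(handle))
    except OSError as exc:
        print("could not read %s: %s" % (path, exc))
```

A count of zero over a period when checks should have run would suggest the server is not scheduling them at all, rather than clients failing to receive them.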