Hi!
I’ve got a service that is built and deployed using Jenkins. It absolutely has to be stopped to deploy a new version (no rolling upgrades possible). There’s a Sensu check that monitors the availability of that service.
So before starting the deployment, I set up a silence using sensuctl (in the Jenkins job):
sensuctl silenced create -r "deployment" -s entity:server -c service-check
I’ve got the “not_silenced” filter set up in the handler, so I’m not receiving any alarms when the service is brought down. I’ve also got a “state_changed” filter, which is essentially “event.check.occurrences == 1”, so notifications are only triggered when the event state changes. So far so good.
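To make the interaction between these two filters concrete, here’s a rough Python model of the handler’s filter chain. It’s a sketch under my own assumptions, not Sensu code: “not_silenced” is modelled as dropping any event whose check is currently silenced, “state_changed” is the `occurrences == 1` expression above, and `occurrences` resets to 1 whenever the status changes. `Event`, `notifies` and `replay` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Event:
    status: int        # 0 = OK, non-zero = warning/critical
    occurrences: int   # consecutive events with this same status
    is_silenced: bool  # True while a matching silence exists


def not_silenced(event: Event) -> bool:
    """Modelled on the not_silenced filter: drop events matching a silence."""
    return not event.is_silenced


def state_changed(event: Event) -> bool:
    """The custom filter: event.check.occurrences == 1."""
    return event.occurrences == 1


def notifies(event: Event) -> bool:
    """The handler fires only if every filter lets the event through."""
    return not_silenced(event) and state_changed(event)


def replay(statuses, silenced_flags, initial_status=0):
    """Turn a status history into events and report which ones notify."""
    # Assume the service was already OK before the first event in the list.
    previous, occurrences = initial_status, 1
    results = []
    for status, silenced in zip(statuses, silenced_flags):
        occurrences = occurrences + 1 if status == previous else 1
        previous = status
        results.append(notifies(Event(status, occurrences, silenced)))
    return results


# OK, service stopped (silenced), silence deleted while still down, then recovery.
print(replay([0, 2, 2, 2, 0], [False, True, True, False, False]))
# → [False, False, False, False, True]
```

In this model the only event that reaches the handler is the final recovery, because it is the first OK after a status change and no silence is active any more.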
Where I’m stuck is the point where the deployment has finished but the service has not actually recovered yet (god bless Java/Spring and their application startup times). I’m trying to find a nice way to clear the silence while hiding the alarm resolution (service recovery) message.
I could simply delete the silence after the deployment has finished:
sensuctl silenced delete -s entity:server -c service-check
But since the service takes some time to recover after the deployment finishes, I’d then get the service recovery notification (which I don’t want).
Next, I tried creating silences that clear automatically when the service recovers and expire after 10 minutes, but this doesn’t hide the recovery message either, I suspect because the silence is cleared the moment the service recovers.
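For reference, a silence like that maps, as far as I know, onto sensuctl’s `--expire` (seconds) and `--expire-on-resolve` flags; it’s worth confirming with `sensuctl silenced create --help` for your version. The hypothetical helper below just builds that command line so it can be called from a Python step in the Jenkins job.

```python
import subprocess


def silence_args(subscription: str, check: str, reason: str,
                 expire_seconds: int = 600,
                 expire_on_resolve: bool = True) -> list[str]:
    """Build a `sensuctl silenced create` command line (hypothetical helper)."""
    args = ["sensuctl", "silenced", "create",
            "-r", reason, "-s", subscription, "-c", check,
            # Assumed flag: silence expires after this many seconds.
            "--expire", str(expire_seconds)]
    if expire_on_resolve:
        # Assumed flag: remove the silence as soon as the check resolves.
        args.append("--expire-on-resolve")
    return args


def create_silence(subscription: str, check: str, reason: str, **kwargs) -> None:
    """Run the command built above; raises CalledProcessError on failure."""
    subprocess.run(silence_args(subscription, check, reason, **kwargs), check=True)
```

Usage: `create_silence("entity:server", "service-check", "deployment")` before stopping the service.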
So… now I’m not really sure what to do. I want to hide both the alert and recovery messages during deployment, but still receive all other events outside of the deployment.
I could either add a sufficiently long “sleep” before the silence is deleted, or write a script that polls the service status until it recovers and only then deletes the silence, but the first slows down the deployment and the second adds complexity somewhere else.
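The polling option doesn’t have to be much code, though. Here’s a rough Python sketch of a “wait for recovery, then delete the silence” step. Assumptions: `sensuctl event info <entity> <check> --format json` prints the event with the latest result under `check.status` (0 = OK), and `server` / `service-check` stand in for the real entity and check. `check_status`, `wait_for_recovery`, `finish_deployment` and the timing defaults are hypothetical; the status getter, sleep and clock are injected so the loop can be exercised without a Sensu backend.

```python
import json
import subprocess
import time
from typing import Callable


def check_status(entity: str, check: str) -> int:
    """Latest status of entity/check as reported by sensuctl (0 = OK)."""
    # Assumption: `--format json` prints the bare event object.
    out = subprocess.run(
        ["sensuctl", "event", "info", entity, check, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["check"]["status"]


def wait_for_recovery(get_status: Callable[[], int],
                      timeout: float = 600.0,
                      interval: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until get_status() returns 0; give up after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if get_status() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def finish_deployment(entity: str = "server", check: str = "service-check") -> None:
    """Delete the deployment silence only once the check reports OK again."""
    if not wait_for_recovery(lambda: check_status(entity, check)):
        # Leave the silence in place and fail the Jenkins job instead.
        raise RuntimeError(f"{check} on {entity} did not recover in time")
    # Same delete command as above, run only after recovery.
    subprocess.run(
        ["sensuctl", "silenced", "delete", "-s", f"entity:{entity}", "-c", check],
        check=True,
    )
```

Since the recovery event is then processed while the silence is still active, the OK events after deletion should have `occurrences > 1` and never pass the “state_changed” filter. One caveat: if the check hasn’t run since the service was stopped, the last status may still be the pre-deployment OK, so comparing the event’s `check.executed` timestamp with the deployment start time would make this more robust.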
I hope there is a better way to do this with Sensu. Is there a way to tell whether a previously silenced event has just cleared, or something similar? Any ideas?