Silencing alarms for the duration of deployment

Hi!

I’ve got a service that is built and deployed using Jenkins. It absolutely has to be stopped to deploy a new version (no rolling upgrades possible). There’s a Sensu check that monitors the availability of that service.

So before starting the deployment, I set up a silence using sensuctl (in the Jenkins job):

sensuctl silenced create -r "deployment" -s entity:server -c service-check

I’ve got the “not_silenced” filter set up in the handler, so I’m not receiving any alarms when the service is brought down. I’ve also got a “state_changed” filter, which is essentially “event.check.occurrences == 1”, so that notifications are only triggered when the event state has changed. So far so good.

Where I’m stuck is the point where the deployment has finished, but the service has not actually recovered yet (god bless Java/Spring and the application startup times). I’m trying to find a nice way of clearing the silence while hiding the alarm resolution (service recovery) message.

I could simply delete the silence after the deployment has finished:

sensuctl silenced delete -s entity:server -c service-check

But since the service takes some time to recover after the deployment finishes, I’ll get the notification about service recovery (which I don’t want).

Next, I tried adding silences that auto-clear when the service recovers and expire after 10 minutes, but this also fails to hide the recovery messages. I suspect that is because the silence gets cleared the moment the service recovers.

So… Now I’m not really sure what to do. I want to hide both alert and recovery messages during deployment, but still receive all other events outside of the deployment.
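For reference, a silence of that kind could be created roughly as follows. This is only a sketch of the expire-on-resolve approach described above: the wrapper name silence_for_deploy and its 600-second default are made up for illustration, and the entity and check names are the ones used earlier in this post.

```shell
# Hypothetical helper: create a silence that is removed when the check
# resolves, or after a fixed number of seconds, whichever happens first.
# Usage: silence_for_deploy ENTITY CHECK [EXPIRE_SECONDS]
silence_for_deploy() {
  local entity="$1" check="$2" seconds="${3:-600}"
  sensuctl silenced create \
    --subscription "entity:${entity}" \
    --check "$check" \
    --reason "deployment" \
    --expire "$seconds" \
    --expire-on-resolve
}
```

Calling silence_for_deploy server service-check before stopping the service would replace the plain "silenced create" call shown earlier.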

I could either add a sufficient “sleep” before the silence is deleted, or I could write a script to poll the service status until recovery before deleting the silence, but this either slows down the deployment or adds complexity somewhere else.

I hope there is a better way to do it using Sensu. Is there a way to tell whether a previously silenced event has just cleared, or something similar? Any ideas?

Hey Mort,

I think you are on the right track. Have you tried specifying an expires value rather than a sleep? I have not messed with this much in sensu-go, but from what I read here it feels like we should have some good constructs to work with: Silencing reference - Sensu Docs. I will also shamelessly link you to an older article about how to silence (in the old sensu-ruby world) for a similar use case, although maybe not quite the same.

I hope this helps and look forward to seeing if we can get you something that works for your needs.

Thanks,

Hey Mort,

I am not sure, but I think it’s intentional that the resolve events are always handled.

Consider the following scenario:

  1. An alert fires off 1000 times across a number of services.
  2. The on-call engineer silences everything so that they can continue to get monitoring intelligence but avoid any handlers being fired off for alerting (pagerduty, slack, etc).
  3. The on-call engineer resolves the issue and deletes the silence.
  4. If, say, there was an event opened with pagerduty (during the initial storm), we want to ensure that we communicate with pagerduty to let them know those events have been resolved.

Can you help me better understand the impact of always handling resolved event types in your use case?

Thanks,

Hey,

To be clear, when you say “recovery message”, do you mean a sensu event from the check with status=0 state=“ok”?

Sorry for not answering earlier. It’s an exceptional case for me: I simply don’t want to see messages triggered by routine deployments to development environments, unless they fail.

I eventually wrote a script that polls the event status using sensuctl and waits until sensu considers the service recovered. I remove the silences after that, and this effectively masks the whole deal.

It probably won’t even be an issue if I add Pagerduty to the chain and optimize how alerts are handled, but right now it’s all going to a dev chat and I don’t want to train people to ignore what’s going on in there.
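The script itself was not posted in this thread. A minimal sketch of the polling approach described above might look like the following. It assumes jq is installed, that "sensuctl event info ENTITY CHECK --format json" exposes the status code at .check.status, and that a status of 0 means recovered. The function names, timeout and interval are hypothetical.

```shell
# Sketch only: function names, the jq dependency and the defaults are assumptions.

# Print the status code (0 = OK) of the latest event for ENTITY/CHECK.
event_status() {
  local entity="$1" check="$2"
  sensuctl event info "$entity" "$check" --format json | jq -r '.check.status'
}

# Poll until the check reports OK; give up once the timeout is reached.
# Usage: wait_for_recovery ENTITY CHECK [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_recovery() {
  local entity="$1" check="$2" timeout="${3:-600}" interval="${4:-10}"
  local waited=0 status
  while :; do
    status="$(event_status "$entity" "$check" 2>/dev/null)"
    if [ "$status" = "0" ]; then
      return 0   # recovered while the silence is still in place
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $entity/$check to recover" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

In the Jenkins job this could be used as: wait_for_recovery server service-check && sensuctl silenced delete -s entity:server -c service-check. One caveat: if the check has not run since the service was stopped, the last recorded status may still be 0, so the poll should only start once the check has had a chance to fail.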

Hey,
Can you share that sensuctl script?
Would it make sense to provide this as a reusable sensuctl command asset, extending sensuctl with additional functionality that other people could use?