Reducing Alert Floods

Hi,

I was wondering what people do to reduce floods of alerts.

e.g:

  • if a load of checks are behind the same connection

  • if there are problems with the connectivity of the sensu box

In either case, I can end up with hundreds of e-mails from Sensu sitting in my inbox.

For the former, one solution would be the ability to specify a dependency for a check, so all checks might depend on a ping to the router they are behind, or on the hypervisor a load of VMs are hosted on. That way, if the connection or hypervisor goes down, that check will alert, but all of the others, while marked down, would not send alerts.

For the latter, maybe the ability to have multiple Sensu instances and to alert only if a check shows down from two or more locations, like Pingdom does. Or even just a way to alert only if at least one of a list of external IP addresses is reachable, which would show that the Sensu box itself still has connectivity.

Do any of these options exist? I've not been able to find them in the docs.

I'm interested in how others deal with this.

Thanks!

Ian

For dependencies, see here:
https://sensuapp.org/docs/latest/reference/plugins#check-definition-attributes
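To make the dependency idea concrete, here is a minimal Python sketch of the suppression logic Ian describes. It is not Sensu's implementation: the function name `should_alert`, the check names, and the shape of the `dependencies` map are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of dependency-based alert suppression.
# None of these names come from Sensu; they only illustrate the idea.

def should_alert(check, failing, dependencies):
    """Return True if a failing check should send a notification.

    check        -- name of the failing check
    failing      -- set of names of all checks that are currently failing
    dependencies -- dict mapping a check name to the checks it depends on
    """
    for dep in dependencies.get(check, []):
        if dep in failing:
            # A dependency is already down, so that check alerts instead.
            return False
    return True


# Assumed example topology: two VMs on one hypervisor behind one router.
dependencies = {
    "vm1_http": ["hypervisor_ping"],
    "vm2_http": ["hypervisor_ping"],
    "hypervisor_ping": ["router_ping"],
}
failing = {"vm1_http", "vm2_http", "hypervisor_ping"}

alerts = sorted(c for c in failing if should_alert(c, failing, dependencies))
print(alerts)
```

With the hypervisor down, only the hypervisor check notifies; the two VM checks stay marked as failing but are silent.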

For the second idea (multi-instance quorum check), that would be pretty fancy.

At work we have something like that, except it is still on the same cluster

https://github.com/Yelp/puppet-monitoring_check/blob/master/files/check-cluster.rb

You define multiple “check_foo” checks (that don’t page) and then one “cluster_check_foo” (that does page).

Our canonical use case for that is for things like check_http on webservers, where we add a long alert_after and set them to ticket us. (A webserver is down? Ticket me if it stays down for 4 hours.)

And then there is a cluster_check_http that inspects the state of all the check_http checks and pages us after 1 minute. (More than 70% of the check_http checks failing? Page me ASAP.)
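The two tiers described above can be sketched roughly as follows. This is only an illustration in Python, not the logic of check-cluster.rb: the function names, the "ticket"/"page" return values, and the argument shapes are assumptions; only the 4 hour, 1 minute, and 70% figures come from the description.

```python
# Hypothetical sketch of the two-tier alerting described above.
FOUR_HOURS = 4 * 60 * 60  # seconds
ONE_MINUTE = 60           # seconds


def individual_action(failing_for, alert_after=FOUR_HOURS):
    """Per-host check: never pages, only tickets after a long outage."""
    return "ticket" if failing_for >= alert_after else None


def cluster_action(statuses, failing_for, threshold=0.7, alert_after=ONE_MINUTE):
    """Cluster check: pages quickly when most individual checks fail.

    statuses    -- exit statuses of the individual checks (0 means OK)
    failing_for -- seconds the cluster has been over the threshold
    """
    if not statuses:
        return None
    failing_fraction = sum(1 for s in statuses if s != 0) / len(statuses)
    if failing_fraction > threshold and failing_for >= alert_after:
        return "page"
    return None


# One webserver down for ten minutes: nothing happens yet.
print(individual_action(failing_for=600))
# Eight of ten webservers failing for two minutes: page.
print(cluster_action([2] * 8 + [0] * 2, failing_for=120))
```

The point of the split is that a single broken host is a slow, low-urgency signal, while a broken majority is an outage.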

On Thu, Jan 26, 2017 at 1:45 AM, Ian Chilton ian.chilton@gmail.com wrote:

Hi Ian,

Comments inline.

> Hi,
>
> I was wondering what people do with regards to reducing floods of alerts.

I use a combination of local Sensu masters, check dependencies, filters and aggregated checks to reduce alerts.

> e.g:
>
> - if a load of checks are behind the same connection

This is an ideal place to run local Sensu master(s).

> - if there are problems with the connectivity of the sensu box

This is also solvable by local Sensu master(s).

> In either case, this results in potentially hundreds of e-mails from sensu sat in my inbox.

I also have Prometheus, InfluxDB, ELK etc. all local to the co-lo or AWS region to ensure that I don't miss events in case of connectivity issues.

> For the former, one solution would be to be able to specify a dependency for a check - so all checks might have a dependency of a ping to a router they are behind, or a hypervisor a load of VMs were hosted on. That way if the connection or hypervisor goes down, that check will alert, but all of the others, while marked down, would not send alerts.

Have a look at Sensu filters. Their typical use case is reducing alert fatigue.

https://sensuapp.org/docs/0.26/reference/filters.html
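As a rough illustration of the kind of rule a filter can express, here is a Python sketch that lets an event through on its first occurrence and then only on every Nth repeat. This is not Sensu filter syntax; the function name, the `repeat_every` parameter, and the assumption that an event carries an `occurrences` counter are all just for the example.

```python
# Hypothetical sketch of an occurrence-based filter rule.

def passes_filter(event, repeat_every=60):
    """Notify on the first occurrence, then on every repeat_every-th one."""
    occurrences = event["occurrences"]
    return occurrences == 1 or occurrences % repeat_every == 0


# Simulate a check that keeps failing for 180 consecutive runs.
notified = [n for n in range(1, 181) if passes_filter({"occurrences": n})]
print(notified)
```

A check that fails 180 times in a row then produces four notifications instead of 180.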

> For the latter, maybe the ability to have multiple sensu instances and for it to only alert if it shows down from two or more locations - like Pingdom does. Or even just a way to only alert if one or more of a list of external ip addresses are successful.
>
> Do any of these options exist? - I've not been able to see them in the docs.

Also look at check-aggregate.rb in https://github.com/sensu-plugins/sensu-plugins-sensu.

> Interested how others deal with this?

It is important for me to know whether my external services are reachable from various customer locations, so I don't combine location-based checks or alert only when multiple reachability checks fail.

On the other hand, most of my services run behind load balancers, where I’d like an alert if a certain percentage (or count) of backends fail a check. For this I use aggregated checks plus check-aggregate.rb.

https://sensuapp.org/docs/latest/reference/aggregates.html
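The percentage-of-backends idea can be sketched in Python like this. It does not reproduce check-aggregate.rb or its options: the function name, the threshold parameters, and the shape of the `results` counts are assumptions for illustration, and the return values follow the usual plugin exit-code convention (0 OK, 1 warning, 2 critical, 3 unknown).

```python
# Hypothetical sketch: turn aggregate result counts into a single status.

def evaluate_aggregate(results, warning_percent, critical_percent):
    """Map counts of per-backend results to one exit status.

    results -- dict with at least "ok" and "total" counts (assumed shape)
    """
    total = results["total"]
    if total == 0:
        return 3  # nothing reported, so the state is unknown
    non_ok_percent = 100.0 * (total - results["ok"]) / total
    if non_ok_percent >= critical_percent:
        return 2
    if non_ok_percent >= warning_percent:
        return 1
    return 0


# Four of ten backends failing: above the warning level, below critical.
status = evaluate_aggregate({"ok": 6, "total": 10},
                            warning_percent=25, critical_percent=50)
print(status)
```

One aggregate alert per pool replaces one alert per backend, which is what cuts the flood.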

Hth.

On 26-Jan-2017, at 3:15 PM, Ian Chilton <ian.chilton@gmail.com> wrote:


@shankerbalan
DevOps Consultant