Handler behavior when check goes from warning to critical


#1

Hi all,

Hopefully this is an easy one to figure out. We have set up a couple of handlers, using the community ones for notifications: one emails the team, and another creates an event in PagerDuty. Both specify the severities they should handle:

mailer.json:

  "severities": [
    "ok",
    "warning",
    "critical",
    "unknown"
  ],

pagerduty.json:

  "severities": [
    "critical",
    "ok"
  ],

Normally, this works quite well for us: we only get paged about critical events. Where we've run into trouble recently is when a check goes from warning to critical. The PagerDuty handler does fire once the alert hits critical; however, the event has already occurred more than once and the refresh value hasn't been reached yet, so it takes no action and logs "output":"only handling every 1440 occurrences".
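For context, here is a rough Ruby sketch of the repeat filter as I understand it from the community handler base class. The method name `repeat_filter_reason` is made up for illustration, and the `interval` of 60 seconds and `refresh` of 86400 seconds are assumptions picked only because they yield the 1440 in the log line; the actual values are not shown in this thread.

```ruby
# Simplified model of the handler-side repeat filter (a sketch, not the
# library's actual code). Returns the reason the event is skipped, or nil
# when the handler would go ahead.
# Assumed defaults: interval 60s, refresh 86400s (86400 / 60 = 1440).
def repeat_filter_reason(event_occurrences:, action:,
                         check_occurrences: 1, interval: 60, refresh: 86_400)
  # Too early: the check has not failed often enough yet.
  return 'not enough occurrences' if event_occurrences < check_occurrences

  if event_occurrences > check_occurrences && action == 'create'
    # Number of check runs that fit into one refresh period.
    number = refresh.fdiv(interval).to_i
    unless number.zero? || ((event_occurrences - check_occurrences) % number).zero?
      return "only handling every #{number} occurrences"
    end
  end

  nil
end

# An event already on its 5th occurrence when the handler sees it:
p repeat_filter_reason(event_occurrences: 5, action: 'create')
# => "only handling every 1440 occurrences"
```

Under these assumptions, any event whose occurrence count is already above the check's `occurrences` setting is skipped until the count lines up with the refresh period, whatever its severity.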

Does anyone have a way to avoid filtering the event when it goes from a lower severity to a higher one? We really don't want to lower our refresh interval (to make this workable and actually get the page fired, it would need to be far too low anyway).


#2

Hmm, Sensu really only has the concept of creating an event (non-zero status code) or closing an event (zero status code):

https://github.com/sensu/sensu/blob/c81c50a54bb746cfc6be952e75ed5ffcda2ccfb1/lib/sensu/server.rb#L416-L428

You only get one filter per check. The only way I can think of to do what
you want is to have two checks: one with the relaxed threshold that
goes to email, and another with a real critical threshold that
goes to PD (and doesn't fire early).
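
A rough sketch of what I mean, as a small Ruby script that prints the two check definitions as JSON. The check names, the `check_disk_placeholder` command and its flags, the `base` subscription, and the 60 second interval are all placeholders, not real plugin options; only the `mailer` and `pagerduty` handler names come from this thread.

```ruby
require 'json'

# Builds one check definition. The command strings passed in below are
# hypothetical placeholders; substitute a real plugin and its real flags.
def check_definition(command:, handlers:, interval: 60)
  {
    'command' => command,
    'subscribers' => ['base'], # assumed subscription name
    'interval' => interval,
    'handlers' => handlers
  }
end

# Two checks on the same resource: one warns early and only emails,
# the other stays OK until the real critical threshold and only pages.
def two_check_config
  {
    'checks' => {
      'disk_usage_email' => check_definition(
        command: 'check_disk_placeholder --warning 80 --critical 90',
        handlers: ['mailer']
      ),
      'disk_usage_page' => check_definition(
        command: 'check_disk_placeholder --critical 90',
        handlers: ['pagerduty']
      )
    }
  }
end

puts JSON.pretty_generate(two_check_config)
```

Because the paging check only goes non-zero at the real threshold, its event should start at occurrence 1 at that point, so the repeat filter has nothing to suppress.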


On Thu, Aug 21, 2014 at 2:27 PM, Wyatt Walter <wwalter@sugarcrm.com> wrote:



#3

This is very weird. Today we had a check go from warning to critical, and it behaved as I expected. Maybe what I saw before was just me being impatient when I didn't see a page right away.

On Saturday, August 23, 2014 11:08:56 AM UTC-5, Kyle Anderson wrote:

Hmm, sensu really only has the concept of creating an event (non-zero)
or closing an event (zero status code):

https://github.com/sensu/sensu/blob/c81c50a54bb746cfc6be952e75ed5ffcda2ccfb1/lib/sensu/server.rb#L416-L428

You only get one filter per check, the only way I can think to do what
you want is to have two checks, one with the relaxed threshold that
goes to email, and the other one with a real critical threshold that
goes to PD (and doesn't fire early).
