Jef Practice: Time-based keepalive alert fatigue filters

It’s been a while… but I have a new Jef Practice…

It’s possible to construct filter expressions that conditionally use entity- or check-level annotations, with fallback values when they are not defined, giving you a reusable filter that you can tune per entity or check.

The full story
A commonly desired filter pattern is an alert cadence based on clock time instead of on the number of alert occurrences. I’m going to show you the JavaScript filter expression I use for the agent keepalive, as an example of the type of filter pattern you can construct using annotation conditionals.

Note: the keepalive check is a little different from most other checks because it relies exclusively on check.timeout to produce a non-zero status, so what I’m showing below is only appropriate for keepalive checks. It’s possible to create similar logic for interval-scheduled service checks, but that logic would rely on check.interval instead of check.timeout. For cron-scheduled checks it’s a bit more complicated, and to really handle that it makes the most sense to build a dedicated JavaScript function as an asset (which I haven’t done yet myself).
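
As a rough illustration of the interval-based variant mentioned in the note, here is a minimal sketch written as a plain function so it can be tried outside Sensu. The function name and the alert_minutes annotation are hypothetical, and the sketch assumes occurrences advance once every check.interval seconds.

```javascript
// Hypothetical interval-based variant (not the keepalive filter itself).
// Assumes event.check.occurrences increments once per check.interval seconds.
function intervalAlertAllowed(event) {
  // The "|| {}" guard is an addition for this sketch.
  var annotations = event.entity.annotations || {};
  // "alert_minutes" is an assumed annotation name; 15 is the fallback.
  var minutes = 'alert_minutes' in annotations
    ? parseInt(annotations.alert_minutes, 10)
    : 15;
  // Occurrences expected in that many minutes at the check interval.
  var every = parseInt(60 * minutes / event.check.interval);
  return event.check.occurrences == 1 || event.check.occurrences % every == 0;
}
```

With a 60 second interval and no annotation, this would allow the first occurrence and then every 15th one.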

Okay so here we go…
My keepalive alert fatigue filter

type: EventFilter
api_version: core/v2
metadata:
  name: keepalive_alert_fatigue
spec:
  action: allow
  expressions:
    - is_incident
    - "event.check.occurrences == 1 || event.check.occurrences % parseInt( 60 * ( 'keepalive_alert_minutes' in event.entity.annotations ? parseInt(event.entity.annotations.keepalive_alert_minutes): 15) / event.check.timeout ) == 0"

The breakdown of how the filter works

  1. I’m ensuring I get an alert on first occurrence of any status change.

  2. If the entity annotation "keepalive_alert_minutes" exists, I’m converting it into an integer number of elapsed minutes; if the annotation is not defined, I’m using 15 as a fallback value.

  3. I’m calculating the expected number of occurrences for that many elapsed minutes, assuming occurrences arrive at the check.timeout cadence. This works for keepalive checks as a special case because of the way the agent keepalive warning timeout maps to the check timeout for the keepalive check.
    Note: parseInt() effectively floors the floating-point result to give a whole number of expected occurrences.

  4. I prototyped this expression using the otto JavaScript VM before testing it inside Sensu.
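
To make the breakdown concrete, here is a minimal sketch of the same expression as a standalone function that can be run against mock events outside Sensu. The function names, the mock event builder and the guard for missing annotations are my own additions for illustration; only the fields the expression reads are mocked, and the 120 second timeout is just an example value.

```javascript
// Standalone sketch of the filter expression for prototyping outside Sensu.
// Written in ES5 style so it should also run in an ES5 interpreter such as otto.
function keepaliveAlertAllowed(event) {
  // The "|| {}" guard is an addition for this sketch, not part of the filter.
  var annotations = event.entity.annotations || {};
  // Step 2: per-entity override in minutes, falling back to 15.
  var minutes = 'keepalive_alert_minutes' in annotations
    ? parseInt(annotations.keepalive_alert_minutes, 10)
    : 15;
  // Step 3: occurrences expected in that many minutes at check.timeout cadence.
  // parseInt() truncates the fractional part of the result.
  var every = parseInt(60 * minutes / event.check.timeout);
  // Step 1: always allow the first occurrence, then every Nth one after that.
  return event.check.occurrences == 1 || event.check.occurrences % every == 0;
}

// Hypothetical mock event builder with only the fields the expression reads.
function mockEvent(occurrences, timeout, annotations) {
  return {
    entity: { annotations: annotations },
    check: { occurrences: occurrences, timeout: timeout }
  };
}

// Timeout of 120s with the 15 minute default: 60 * 15 / 120 = 7.5, truncated to 7.
keepaliveAlertAllowed(mockEvent(1, 120, {})); // true (first occurrence)
keepaliveAlertAllowed(mockEvent(7, 120, {})); // true (7 % 7 == 0)
keepaliveAlertAllowed(mockEvent(8, 120, {})); // false (8 % 7 == 1)
// Annotation of "60" minutes: 60 * 60 / 120 = 30.
keepaliveAlertAllowed(mockEvent(30, 120, { keepalive_alert_minutes: '60' })); // true
```

Because of the truncation, a 15 minute setting with a 120 second timeout alerts every 7 occurrences (about 14 minutes) rather than exactly every 15 minutes.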