Handler not always sending resolve events

We have a Sensu check that runs every 5 minutes and creates other checks. Think of querying a list of >1000 endpoints and dynamically creating an individual check for any of them that are failing. We have a default handler set that sends an email to PagerDuty for each failing endpoint. This works fine; we always get separate alerts opened in PagerDuty for the failing endpoints.

If a TTL expires for one of these dynamic checks, it means the parent check hasn’t updated it and we want it to be resolved. We’ve accomplished this with a Ruby handler that looks something like this:

  handler - delete_check_result.rb
  <snip>
    check_status = @event['check']['status']
    if check_status <= 0
      client_name = @event['client']['name']
      check_name  = @event['check']['name']
      # Remove the event and its stored result via the Sensu API so the
      # check disappears from Uchiwa once it is no longer failing.
      events_response  = api_request(:DELETE, "/events/#{client_name}/#{check_name}").code
      results_response = api_request(:DELETE, "/results/#{client_name}/#{check_name}").code
    end
  </snip>

This all works fine. If a TTL expires for a check with this handler, it is removed correctly from Sensu and disappears from Uchiwa.

The issue we’re facing is that for most (~80%) of these checks, a resolve email is never sent to PagerDuty. We find this message in the Sensu log (some info redacted):

{
    "timestamp": "2019-10-23T02:43:37.546025-0700",
    "level": "info",
    "message": "handler does not handle event severity",
    "handler": {
        "type": "pipe",
        "command": "/etc/sensu/handlers/email.rb -j email_critical",
        "severities": ["critical"],
        "filters": ["filter_weekly_maintenance", "occurrences"],
        "name": "email_critical"
    },
    "event": {
        "id": "3fe567dd-88c0-4285-9103-c4ec062847df",
        "client": {
            "name": "redacted",
            "address": "redacted",
            "socket": {
                "bind": "172.17.0.1",
                "port": 3030
            },
            "subscriptions": ["redacted"],
            "version": "1.7.0",
            "timestamp": 1571823815
        },
        "check": {
            "type": "metric",
            "occurrences": 5,
            "ttl": 360,
            "ttl_status": -2,
            "handlers": ["default", "delete_check_result"],
            "metric_tags": {},
            "name": "some-dynamic-check-1",
            "status": 0,
            "output": "Resolving on request of the API",
            "additional": {
                "warnOnly": true
            },
            "executed": 1571823817,
            "issued": 1571823817,
            "interval": 30,
            "total_state_change": 0,
            "force_resolve": true,
            "history": ["0"]
        },
        "occurrences": 1,
        "occurrences_watermark": 1,
        "last_ok": 1571823817,
        "action": "resolve",
        "timestamp": 1571823817,
        "last_state_change": 1571823817,
        "silenced": false,
        "silenced_by": []
    }
}

~20% of the alerts do get an email sent and are resolved properly. I’m thinking there is an issue because the email and delete_check_result handlers fire at the same time. If delete_check_result manages to clear out the history first, the default handler refuses to send the resolve email because it can’t find the critical severity in the check history.

The strange thing is that this worked reliably for a long time; I’m not sure what changed to make this start happening. We are considering moving to Sensu Go, but that will take some time, and we’d like to get this resolved in the meantime.

Is there a better way to have a check be resolved / deleted if the TTL expires?

  1. What have you already tried? Please include links to gists and/or code blocks (if relatively small)
  • Adding a 5-second sleep between the DELETE calls to the /events & /results API in the delete_check_result handler. This led to alerts not being resolved properly; we got a large number of UNKNOWN alerts like "Last check execution was xxx seconds ago"
  • Adding an ok severity to the email handler. We started getting alerts for OK status from another extension we have running
  2. Tell us about your setup, this should include OS, version of Sensu, version of Sensu components (redis, rabbitmq), plugin versions (if applicable), anything special about your setup such as an airgapped network or strict ACLs
    CentOS 7
    Sensu 1.7.0
    rabbitmq-server 3.6.2-1
    redis 2.8.19-2

  3. Is there a Github issue related to your issue? No

  4. Is there anything else that can help us effectively help you?

Hey!
I think you may be right. That does appear to be a race condition: handlers are run in parallel in a non-deterministic order, and each one makes its own request for a copy of the event before it runs.

My best advice is to see if you can restructure your handler logic so that all conditional actions are done inside a single handler instead of running two handlers. It might be as simple as writing a small shell script that conditionally chooses which handler commands to run, and running that script as the handler command; a rough sketch of the idea follows.
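
For instance, a single wrapper handler could read the event from STDIN once and feed that same copy to each downstream command in a fixed order (adding whatever conditions you need around each one), so the email handler always sees the event before it gets deleted. Here is a rough Ruby sketch of that idea; the handler paths and arguments are placeholders for whatever you actually run:

  #!/usr/bin/env ruby
  # Hypothetical wrapper handler: runs both actions in a fixed order on the
  # same copy of the event. The commands below are illustrative placeholders.
  require 'open3'

  # Pipe handlers receive the event JSON on STDIN; read it once and reuse it.
  event = STDIN.read

  [
    '/etc/sensu/handlers/email.rb -j email_critical',
    '/etc/sensu/handlers/delete_check_result.rb'
  ].each do |command|
    _out, err, status = Open3.capture3(command, stdin_data: event)
    warn "#{command} exited #{status.exitstatus}: #{err}" unless status.success?
  end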

Also… my current understanding is that this issue is fixed in the design of Sensu Go.

In Sensu Go all handlers associated with a check are given the same copy of the event. So even though they are run in parallel, they are all given a copy of the same input event to start with.

Thanks for the response.

I wound up putting a 3-second sleep back into our delete_check_result handler, between the DELETE calls to the /events & /results APIs. This seemed to allow time for the default handler to send a resolve before the event is deleted.
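
For reference, the change amounts to something like this; a sketch based on the snippet above rather than the exact handler:

  <snip>
    check_status = @event['check']['status']
    if check_status <= 0
      client_name = @event['client']['name']
      check_name  = @event['check']['name']
      api_request(:DELETE, "/events/#{client_name}/#{check_name}").code
      # Give the default handler time to send its resolve email before the
      # result (and the check history it reads) is removed.
      sleep 3
      api_request(:DELETE, "/results/#{client_name}/#{check_name}").code
    end
  </snip>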

I had previously tried this but wound up getting a bunch of UNKNOWN alerts. That seemed to be caused by unrelated infrastructure problems, and some custom code we use to dynamically generate checks.

This will get us by for now; we are working on moving to Sensu Go. Thanks again!