We have a Sensu check that runs every 5 minutes and creates other checks. Think querying a list of >1000 endpoints and dynamically creating an individual check for any of them that are failing. We have a default handler set that sends an email to PagerDuty about each failing one. This works fine; we always get separate alerts opened in PagerDuty for the failing endpoints.
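Roughly, the parent check submits one result per failing endpoint through the local Sensu client socket. This is a simplified sketch, not our actual code; the helper and naming scheme are illustrative, and I'm assuming the standard client socket on port 3030 (which also shows up in the log below):

require 'socket'
require 'json'

# Simplified sketch: push a result for one failing endpoint to the local
# Sensu client socket (JSON over TCP, default port 3030).
def submit_result(endpoint, status, output)
  result = {
    'name'     => "some-dynamic-check-#{endpoint}", # naming is illustrative
    'status'   => status,                           # 0 = OK, 1 = warning, 2 = critical
    'output'   => output,
    'ttl'      => 360,                               # same TTL as in the log below
    'handlers' => %w[default delete_check_result]
  }
  sock = TCPSocket.new('127.0.0.1', 3030)
  sock.puts(result.to_json)
  sock.close
end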
If a TTL expires for one of these dynamic checks, that means the parent check hasn't updated it, and we want it to be resolved. We've accomplished this with a Ruby handler that looks something like this:
handler - delete_check_result.rb
<snip>
check_status = @event['check']['status']
# Fire on status 0 (OK) or negative statuses (our checks set ttl_status to -2)
if check_status <= 0
  client_name = @event['client']['name']
  check_name = @event['check']['name']
  # Delete the event first, then the stale check result
  event_response = api_request(:DELETE, "/events/#{client_name}/#{check_name}").code
  result_response = api_request(:DELETE, "/results/#{client_name}/#{check_name}").code
end
</snip>
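For completeness, the handler is registered as a plain pipe handler. The definition is roughly this (from memory; the path is an assumption that mirrors where our email handler lives):

{
  "handlers": {
    "delete_check_result": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/delete_check_result.rb"
    }
  }
}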
This all works fine: if a TTL expires for a check with this handler, the check is correctly removed from Sensu and disappears from Uchiwa.
The issue we're facing is that for most (~80%) of these checks, a resolve email is never sent to PagerDuty. We find this message in the Sensu log (some info redacted):
{
  "timestamp": "2019-10-23T02:43:37.546025-0700",
  "level": "info",
  "message": "handler does not handle event severity",
  "handler": {
    "type": "pipe",
    "command": "/etc/sensu/handlers/email.rb -j email_critical",
    "severities": ["critical"],
    "filters": ["filter_weekly_maintenance", "occurrences"],
    "name": "email_critical"
  },
  "event": {
    "id": "3fe567dd-88c0-4285-9103-c4ec062847df",
    "client": {
      "name": "redacted",
      "address": "redacted",
      "socket": {
        "bind": "172.17.0.1",
        "port": 3030
      },
      "subscriptions": ["redacted"],
      "version": "1.7.0",
      "timestamp": 1571823815
    },
    "check": {
      "type": "metric",
      "occurrences": 5,
      "ttl": 360,
      "ttl_status": -2,
      "handlers": ["default", "delete_check_result"],
      "metric_tags": {},
      "name": "some-dynamic-check-1",
      "status": 0,
      "output": "Resolving on request of the API",
      "additional": {
        "warnOnly": true
      },
      "executed": 1571823817,
      "issued": 1571823817,
      "interval": 30,
      "total_state_change": 0,
      "force_resolve": true,
      "history": ["0"]
    },
    "occurrences": 1,
    "occurrences_watermark": 1,
    "last_ok": 1571823817,
    "action": "resolve",
    "timestamp": 1571823817,
    "last_state_change": 1571823817,
    "silenced": false,
    "silenced_by": []
  }
}
~20% of the alerts do get an email sent and are resolved properly. My theory is a race condition: both the email and delete_check_result handlers fire at the same time for the resolve event. If delete_check_result manages to clear out the check result (and with it the history) first, the default (email) handler refuses to send the resolve email, because it can't find the critical severity in the check history anymore.
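This is our (possibly wrong) mental model of the server-side severity filter for resolve events, sketched in Ruby rather than quoted from the actual Sensu source:

# Paraphrased sketch of the severity filter, not the actual Sensu code:
# a resolve event is only handled if the check history still contains
# one of the handler's configured severities.
SEVERITIES = { 0 => 'ok', 1 => 'warning', 2 => 'critical' }.freeze

def handles_resolve?(handler, history)
  history.any? do |status|
    severity = SEVERITIES.fetch(status.to_i, 'unknown')
    handler['severities'].include?(severity)
  end
end

# With the event above, history == ["0"] and severities == ["critical"],
# so this returns false and the server logs
# "handler does not handle event severity".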
The strange thing is this worked reliably for a long time; I'm not sure what changed to cause this to start happening. We are considering moving to Sensu Go, but that will take some time, and we would like to get this resolved in the meantime.
Is there a better way to have a check be resolved / deleted if the TTL expires?
- What have you already tried? Please include links to gists and/or code blocks (if relatively small)
  - Adding a 5-second sleep between the DELETE calls to the /events and /results APIs in the delete_check_result handler (see the sketch below). This led to alerts not being resolved properly; we got a large number of UNKNOWN alerts like "Last check execution was xxx seconds ago".
  - Adding an "ok" severity to the email handler. We started getting alerts for OK status from another extension we have running.
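The sleep variant of the handler body looked roughly like this (reconstructed; variable names as in the snippet above):

# Variant we tried: delay the result deletion so the email handler has
# time to run before the check result disappears.
api_request(:DELETE, "/events/#{client_name}/#{check_name}")
sleep 5
api_request(:DELETE, "/results/#{client_name}/#{check_name}")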
- Tell us about your setup; this should include OS, version of Sensu, version of Sensu components (redis, rabbitmq), plugin versions (if applicable), anything special about your setup such as an airgapped network or strict ACLs
  - CentOS 7
  - Sensu 1.7.0
  - rabbitmq-server 3.6.2-1
  - redis 2.8.19-2
- Is there a Github issue related to your issue? No
- Is there anything else that can help us effectively help you?