Sensu server losing events

Anthony_Nowell1 · October 22, 2013, 6:57pm

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.

sensu-missing-alert.log (3.46 KB)

Sean_Porter · October 22, 2013, 9:09pm

There was a rogue Sensu server, that had connected to RabbitMQ and Redis, essentially acting as a black hole for occurrences.

···

On Tuesday, 22 October 2013 11:57:41 UTC-7, Anthony Nowell wrote:

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.

Anthony_Nowell1 · October 23, 2013, 6:23pm

Thanks portertech. We did a poor job of decommissioning a previous sensu-server by neglecting to remove sensu server startup scripts. Our black hole is gone, and we’re no longer running down the wrong rabbit hole.

···

On Tuesday, October 22, 2013 2:09:35 PM UTC-7, portertech wrote:

There was a rogue Sensu server, that had connected to RabbitMQ and Redis, essentially acting as a black hole for occurrences.

On Tuesday, 22 October 2013 11:57:41 UTC-7, Anthony Nowell wrote:

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.

Topic		Replies	Views
Sensu-Server crash (Redis related) Sensu Classic (EOL)	3	468	August 6, 2014
Sensu Client not sending Metrics data to the Server Sensu Classic (EOL)	30	1410	June 8, 2015
Sensu server not picking up handlers? Sensu Classic (EOL)	5	544	July 8, 2016
Sensu filter(?) for state change based on check output. How-to create? Sensu Classic (EOL)	4	756	April 1, 2019
How sensu produce an event? Sensu Classic (EOL)	5	485	February 11, 2015

Sensu server losing events

Related topics