Sensu server losing events

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.

sensu-missing-alert.log (3.46 KB)

There was a rogue Sensu server, that had connected to RabbitMQ and Redis, essentially acting as a black hole for occurrences.

···

On Tuesday, 22 October 2013 11:57:41 UTC-7, Anthony Nowell wrote:

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.

Thanks portertech. We did a poor job of decommissioning a previous sensu-server by neglecting to remove sensu server startup scripts. Our black hole is gone, and we’re no longer running down the wrong rabbit hole.

···

On Tuesday, October 22, 2013 2:09:35 PM UTC-7, portertech wrote:

There was a rogue Sensu server, that had connected to RabbitMQ and Redis, essentially acting as a black hole for occurrences.

On Tuesday, 22 October 2013 11:57:41 UTC-7, Anthony Nowell wrote:

Recently, we discovered an event that didn’t alert properly. It was set to be handled after 3 occurrences and in digging through the sensu-server log we found the 2nd and 4th event occurrences, but no mention of the 3rd occurrence (the one that would have been handled). This prompted us to audit sensu events a little closer, and we soon discovered lost events to be quite common, though most were silently missed because we only handle every 30 occurrences. Basically, it seemed like an issue that was easy to miss until we knew what to look for (grep the log by check name and client name, and then just walk the occurrence counts looking for gaps).

The attached sensu-server log snippet shows the 46th and 48th occurrences of an event, with 3 handlers rightly identified, but no mention of a 47th occurrence. Our understanding is that since the occurrence count continues to increment, the sensu-server is receiving the event, and successfully updating the count in redis, but since the 47th “handling event” message is missing, and doesn’t seem to meet any condition that would filter out any handlers (we’re not filtering or subduing, it’s not flapping, and the severity is the unchanged), it seems that handle_event was simply never called.

We’ve been digging through em-redis on the assumption that the callback block passed into @redis.hset isn’t getting called by the EventMachine thread. Currently, we’re wondering if there are thread safety issues around @redis_callbacks, specifically appending to it from one thread (in raw_call_command), and then shifting off of it from the EM thread that invokes receive_data. It might be more of a stretch, but we also wonder if it’s possible that EM invokes receive_data before our callback is even added to @redis_callbacks, since the callback is added after calling send_data.

We’d love to be wrong here and told that we’re running down the wrong rabbit-hole, but short of some other good pointers, our next steps, would likely be to test these theories by modifying em-redis directly. We’re using a single sensu-server with v0.11.0-1 (redis and rabbitmq on same server). Any insights would be appreciated.