Approach for custom checks

I see existing plugins that will substitute nicely for my existing Monit checks (CPU, disk space, temperatures, etc.), but I’d like to leverage the platform further. However, I’m struggling to figure out the standard approach to custom checks.

If we take systemd timers as a case study, I would ultimately like to alert if a timer hasn’t triggered in a timely manner, and perhaps if a timer fails too.

  1. Should the Sensu agent poll and parse systemctl list-timers? If so, should it always generate an (informational, I guess) event for each entry, or only when there’s an issue (a late timer or a failure)?

  2. Or should the timers themselves trigger an event when they are fired?

  3. Notifications can happen on an event, but how do you notify on more abstract things like the lack of an expected event?

Normally you’d write a check script that inspects whatever you care about (so yes, you can run systemctl list-timers and parse the output) and then exits with the appropriate status code: 0 for OK, 1 for WARNING, or 2 for CRITICAL (see: https://docs.sensu.io/sensu-go/latest/reference/checks/).

Add that as a check in Sensu and it will automatically create events and call any handlers you associate with the check.
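
As a rough sketch of that kind of check script, here is one way it could look in Python. Note that it swaps parsing systemctl list-timers for asking systemd directly via systemctl show with the LastTriggerUSec property. The --timestamp=unix option is an assumption that only holds on a reasonably recent systemd, and the script name, argument order and thresholds are all hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a Sensu check that flags an overdue systemd timer."""
import subprocess
import sys
import time

# Exit codes that Sensu interprets as the check status.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_last_trigger(raw):
    """Turn '@1700000000' into epoch seconds; None if the timer never ran.

    Assumes the '@<seconds>' form printed by `--timestamp=unix`.
    """
    raw = raw.strip()
    if not raw.startswith("@"):
        return None
    return int(float(raw[1:]))


def timer_status(last_run, now, warn_after, crit_after):
    """Map the age of the last trigger (in seconds) onto an exit code."""
    if last_run is None:
        return CRITICAL
    age = now - last_run
    if age >= crit_after:
        return CRITICAL
    if age >= warn_after:
        return WARNING
    return OK


def main(argv):
    """Hypothetical usage: check_timer.py UNIT WARN_SECONDS CRIT_SECONDS."""
    unit, warn_after, crit_after = argv[1], int(argv[2]), int(argv[3])
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=LastTriggerUSec",
         "--value", "--timestamp=unix"],
        capture_output=True, text=True, check=True,
    )
    last_run = parse_last_trigger(result.stdout)
    status = timer_status(last_run, time.time(), warn_after, crit_after)
    # Sensu stores whatever the check prints as the event output.
    print(f"{unit}: last triggered at {last_run}, status {status}")
    return status
```

To run it as a check command, end the script with sys.exit(main(sys.argv)) so the returned status becomes the exit code. The thresholds should be chosen relative to how often the timer is expected to fire.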

Ian

Sorry, for question 3 - check out the ‘ttl’ attribute of a check. If you set that, it creates a “dead man’s switch”: Sensu will alert if no new result for that check arrives within the specified amount of time.

Ian

Thanks @ichilton. However, my confusion now lies in the relationship between the check frequency and the timing of the timer itself.

For instance, I have a backup timer that runs once a day. Should Sensu be checking systemctl list-timers every minute? If so, what should the statuses look like over the course of a day? Would that be an event every minute, or only on change?

Or should the backup itself emit an event at the end? I can more easily see how this would work, except it’s no longer a generic check, and it becomes onerous if I have many timers across many hosts.

Ah right - I think I see better what you are trying to do now!

For that, I’d emit a Sensu event at the end of the backup with the state - was it successful or did it fail?

I’d set a TTL in the check section of that event to something longer than the normal interval between backups.

What that means is you can easily have an alert if the backup fails and you can have an alert if the backup hasn’t completed for a certain amount of time.

You can easily add an event by using the API on the agent itself, which is neat:
https://docs.sensu.io/sensu-go/latest/reference/agent/#create-monitoring-events-using-the-agent-api
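
A minimal sketch of what that could look like from the end of a backup job, in Python. It assumes the agent API is listening on its default address (127.0.0.1:3031); the check name, output text and the 25 hour TTL are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch: report a backup result to the local Sensu agent with a TTL."""
import json
import urllib.request

# Assumed default agent API address; adjust if your agent is configured differently.
AGENT_EVENTS_URL = "http://127.0.0.1:3031/events"


def build_event(name, status, output, ttl):
    """Build the event body: a check with a name, status, output and TTL in seconds."""
    return {
        "check": {
            "metadata": {"name": name},
            "status": status,
            "output": output,
            "ttl": ttl,
        }
    }


def post_event(event, url=AGENT_EVENTS_URL):
    """POST the event as JSON to the agent API and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical daily backup: a TTL of 90000 s (25 h) leaves an hour of slack.
event = build_event("nightly-backup", 0, "backup completed", 90000)
```

Calling post_event(event) as the last step of the backup sends the result. On failure you would build the event with status 2 and a useful output message instead. If no event arrives within the TTL, the dead man’s switch takes over.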

Ian

Ok thanks.

I was looking for a more generic way to monitor systemd timers (the backup was an illustrative example). For something as important as a backup it makes sense to generate a specific event, but I was hoping to monitor a generic list of timers, perhaps by looking at their “last run” times. I guess by nature that would require polling.