Generating alerts from Graphite - how to include every server that exceeds the threshold?


All the metrics I’m collecting with sensu-client get pushed into Graphite. I have a couple of data trees there, with a good number of metrics from all instances. A path in one tree leading to the CPU user metric for one node might look like:

I’m testing check-data.rb like this:

check-data.rb -t "www.....os.cpu.user" -w 30 -c 50

In other words, match .os.cpu.user for all instances in that tree, all providers, all regions, etc. It works, but only the first offender is returned. If other instances are also over the threshold, I get no alert for them. That’s a problem.

A few possible solutions:

  1. In addition to collecting metrics with sensu-client, I could also run checks locally on each instance. That seems wasteful, and it’s precisely what I’ve tried to avoid by pushing all metrics into Graphite and alerting against Graphite.

  2. Hack check-data.rb - doable in theory, but it’s a significant project.
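
To make option 2 more concrete, here is a rough sketch of the kind of loop such a hack would need: walk every series that matches the target and collect all offenders, rather than returning at the first one. It is not taken from check-data.rb. It assumes the data arrives in Graphite's JSON render format (a list of objects with "target" and "datapoints", where each datapoint is a [value, timestamp] pair), that the most recent non-null value is the one to test, and that "exceeds" means strictly greater than. The helper names and the sample series are made up for illustration.

```ruby
require 'json'

# Most recent non-null value of one series, or nil if every point is null.
# Graphite's JSON render output lists datapoints as [value, timestamp] pairs.
def latest_value(datapoints)
  pair = datapoints.reverse.find { |value, _ts| !value.nil? }
  pair && pair[0]
end

# Check every series and collect all offenders instead of stopping at the first.
# Assumption: a value must be strictly greater than a threshold to trip it.
def find_offenders(series, warn, crit)
  offenders = { critical: [], warning: [] }
  series.each do |s|
    value = latest_value(s['datapoints'])
    next if value.nil? # no usable data for this series

    if value > crit
      offenders[:critical] << [s['target'], value]
    elsif value > warn
      offenders[:warning] << [s['target'], value]
    end
  end
  offenders
end

# Hypothetical payload in the shape of a /render?format=json response.
sample = <<~JSON
  [
    {"target": "example.host1.os.cpu.user", "datapoints": [[12.0, 1400000000], [null, 1400000060]]},
    {"target": "example.host2.os.cpu.user", "datapoints": [[35.5, 1400000000], [41.0, 1400000060]]},
    {"target": "example.host3.os.cpu.user", "datapoints": [[72.5, 1400000000], [null, 1400000060]]}
  ]
JSON

result = find_offenders(JSON.parse(sample), 30, 50)
result[:critical].each { |target, value| puts "CRITICAL #{target} = #{value}" }
result[:warning].each  { |target, value| puts "WARNING #{target} = #{value}" }
```

With a result like this, a single check could report the worst status and list every affected target in its output, so that one alert covers all instances over the threshold.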