A single monitoring instance running sensu-server, RabbitMQ, and Graphite, plus its own local sensu-client that runs some checks against the metrics stored locally in Graphite.
A bunch of remote instances in the cloud, each with its own sensu-client, collecting metrics such as CPU usage, memory usage, disk, network, etc., and pushing them back home to the sensu-server.
Those metrics are then fed by sensu-server into Graphite (and are then checked by the local sensu-client running on the monitoring instance).
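For context, the remote clients reach the server through the RabbitMQ broker on the monitoring instance; the transport part of their config is the usual Sensu rabbitmq.json, along these lines (illustrative values, not my exact settings):
{
  "rabbitmq": {
    "host": "monitoring.example.com",
    "port": 5672,
    "vhost": "/sensu",
    "user": "sensu",
    "password": "..."
  }
}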
Typical client config for the dozens of remote instances:
{
  "client": {
    "name": "foobar",
    "address": "...",
    "root": "foostuff",
    ...bunch of other things...
    "subscriptions": [ "system-health", "foostuff" ]
  }
}
On the server I had a file called metric_system_health_relay_graphite.json with all the checks that the clients are supposed to run; here’s an excerpt:
{
  "checks": {
    "cpu": {
      "command": "cpu-pcnt-usage-metrics.rb --scheme :::root:::.:::provider:::.:::datacenter:::.:::group:::.:::name:::.os.cpu",
      "type": "metric",
      "interval": 60,
      "handlers": [ "default", "relay" ],
      "subscribers": [ "system-health" ]
    },
    "load": {
      "command": "load-metrics.rb --scheme :::root:::.:::provider:::.:::datacenter:::.:::group:::.:::name:::.os",
      ...
    },
    "memory": {
      "command": "memory-metrics.rb -p --scheme :::root:::.:::provider:::.:::datacenter:::.:::group:::.:::name:::.os.memory",
      ...
    },
    "disk_usage": {
      ...
    },
    "disk_io": {
      ...
    },
    "network": {
      ...
    }
  }
}
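The :::token::: placeholders in those commands are substituted at execution time with the matching attributes from each client's config (root and name are visible in the client excerpt above; provider, datacenter, and group are among the attributes I trimmed). So on client foobar the cpu check ends up writing metrics under a path roughly like this (the provider/datacenter/group values are just examples):
foostuff.aws.us-east-1.web.foobar.os.cpu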
Then graphite_relay.json has this:
{
  "relay": {
    "graphite": {
      "host": "127.0.0.1",
      "port": 2003
    }
  }
}
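That relay config is what the "relay" handler referenced in the checks above uses to find Graphite. As a side note, roughly the same forwarding could be done with Sensu's built-in TCP handler type pointed at Graphite's plaintext port; a rough sketch (the handler name graphite_tcp is made up, and this is not what I'm actually running):
{
  "handlers": {
    "graphite_tcp": {
      "type": "tcp",
      "socket": {
        "host": "127.0.0.1",
        "port": 2003
      },
      "mutator": "only_check_output"
    }
  }
}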
The local checks against Graphite are in check_foo_metrics.json on the server, like this:
{
  "checks": {
    "acc_cpu": {
      "handlers": [ "default", "slack" ],
      "command": "check-stats.rb -h localhost:8080 -t 'foostuff.*.*.*.*.os.cpu.user' -w 30 -c 50 -p -5mins",
      "interval": 60,
      "subscribers": [ "foometrics" ]
    }
  }
}
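That check runs only on the monitoring instance itself, since only its local sensu-client subscribes to "foometrics"; that client's config is essentially this (the name and address here are placeholders):
{
  "client": {
    "name": "monitor",
    "address": "127.0.0.1",
    "subscriptions": [ "foometrics" ]
  }
}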
There are 29 files total in /etc/sensu/conf.d on the monitoring server.
Today I added a few more checks to check_foo_metrics.json. Then I noticed something strange: the idle CPU reported for some, but not all, remote instances dropped from ~99% to ~85%. The problem was that this was a false reading: CPU idle was clearly ~99%, as top showed on any of the instances. After more poking around, I came to the conclusion that the extra CPU usage was generated by the sensu-client itself while it was collecting metrics, presumably because the samples were being taken right when the client was busy running its checks. I tried changing a few things; nothing worked.
Except one thing did work. I renamed metric_system_health_relay_graphite.json, the file on the server that orchestrates all the checks performed by the remote clients, to aaa_metric_system_health_relay_graphite.json and restarted everything sensu-* on the monitoring instance, thinking that this would push those checks to the head of the queue, before CPU usage has time to spike up.
BAM! CPU idle on all clients went back to ~99% as soon as I made that change.
Check the attached screenshot. You can see CPU idle hovering around 100% on all instances, and CPU user, nice, iowait, etc. staying low, until 15:30. Then I made the changes, and CPU idle dropped to ~85% on a lot of instances. Then at 16:35 I renamed that config file on the server, and CPU idle went back to ~100%.
Am I right in making all these assumptions? If so, is there a better way to control the order in which the checks are executed on the remote clients?