Hey everyone, yesterday our sensu stopped emailing us and it took me a little while to notice. I would like to be more proactive and put some monitoring in place to monitor our monitoring.
Do you all have any ideas, or systems you use for this?
Thanks
I'm also in need of such a thing, as I'm testing my sensu clusters to
replace our nagios clusters.
I was thinking of writing a kind of "end-to-end" sensu check that our
nagios servers could run via nrpe.
I'm thinking it would do things in order, and be smart about the
output to help other members of my ops team diagnose what might be
wrong (as sensu has lots of pieces):
- check ping? Is the sensu vip even pinging?
- check redis
- check rabbitmq port
- check rabbitmq end-to-end (publish and consume a test message?)
- check sensu-server? (make sure the server is at least running)
- check handler_* (try to make a handler do something and make sure it
did something?)
- check dashboard (probably just warn if down for me)
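A minimal sketch of the "do things in order" idea as a Nagios-style plugin, in Python. It only covers the TCP port steps; the ping, rabbitmq end-to-end and handler steps are left out. The ports used (6379 for redis, 5672 for rabbitmq, 8080 for the dashboard) are assumed defaults, and the step names and the `run_steps`/`main` helpers are made up for illustration.

```python
import socket

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_steps(steps):
    """Run (name, probe, severity) steps in order.

    A failing CRITICAL step stops the run, since later steps depend on it,
    and the message names the first broken piece. A failing WARNING step
    is recorded and the run continues. Returns (exit_code, message).
    """
    status, failed = OK, []
    for name, probe, severity in steps:
        if probe():
            continue
        if severity == CRITICAL:
            return CRITICAL, "CRITICAL: %s failed" % name
        status = max(status, severity)
        failed.append(name)
    if status == WARNING:
        return WARNING, "WARNING: %s failed" % ", ".join(failed)
    return OK, "OK: %d/%d steps passed" % (len(steps), len(steps))


def main(host):
    """Check one sensu host; ports are assumed defaults, adjust as needed."""
    steps = [
        ("redis port 6379", lambda: tcp_open(host, 6379), CRITICAL),
        ("rabbitmq port 5672", lambda: tcp_open(host, 5672), CRITICAL),
        ("dashboard port 8080", lambda: tcp_open(host, 8080), WARNING),
    ]
    code, message = run_steps(steps)
    print(message)
    return code
```

Wired up as a plugin, the script would end with something like `sys.exit(main("sensu.example.com"))` (hypothetical hostname), so nrpe sees the exit code and the one-line message.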
Yes, you could go through the work of setting up individual checks and
dependencies and stuff in nagios, but I want to avoid that as I am
trying to deprecate it.
At least I can start easy and get progressively more complex as I
iterate. I should be able to leverage existing nagios check scripts
that I have to do all of this easily. (except for handler checks?
Might be harder, but would be nice to know if sensu can't send emails
or whatever)
So those are my ideas. In my case my system is nagios, but it could be
as simple as an xinetd service that returns http 200 or http 500 to
something like a pingdom http check.
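As a rough sketch of the xinetd variant: xinetd hands the connection to a program via stdin/stdout, so the program only has to print a complete HTTP response. The Python below builds that response; `build_response` and `respond` are hypothetical helper names, and how `healthy` gets decided (for example by running the checks listed above) is left open.

```python
import sys


def build_response(healthy, detail):
    """Build a minimal HTTP/1.0 response: 200 when healthy, 500 otherwise."""
    status = "200 OK" if healthy else "500 Internal Server Error"
    body = detail + "\n"
    return (
        "HTTP/1.0 %s\r\n"
        "Content-Type: text/plain\r\n"
        "Content-Length: %d\r\n"
        "Connection: close\r\n"
        "\r\n"
        "%s" % (status, len(body.encode("utf-8")), body)
    )


def respond(healthy, detail, out=sys.stdout):
    """Write the response to stdout, which xinetd connects to the client."""
    out.write(build_response(healthy, detail))
    out.flush()
```

An external http check then only needs to look at the status line; the body carries the diagnostic text for whoever gets paged.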
I haven't written this yet, but I need to soon. I am also up for
suggestions and input from other sensu-users.
On Tue, Jan 14, 2014 at 5:34 PM, Micah Hoffmann <micah@pointinside.com> wrote:
[original question, shown in full at the top of the thread]
The sensu-api exposes a /health endpoint since 0.9.13 - http://sensuapp.org/docs/0.12/api-health
It will help you determine if there are consumers (sensu-server instances) connected to rabbit and how many msgs are in the results queue.
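A hedged sketch of polling that endpoint from a Nagios-style check, in Python. As far as I recall, /health takes `consumers` and `messages` query parameters and answers 204 when healthy and 503 otherwise; treat those details, the default API port 4567, and the helper names here as assumptions to verify against the linked docs.

```python
import urllib.error
import urllib.parse
import urllib.request


def health_url(base, consumers, messages):
    """Build the /health URL with the minimum-consumers and max-messages limits."""
    query = urllib.parse.urlencode({"consumers": consumers, "messages": messages})
    return "%s/health?%s" % (base.rstrip("/"), query)


def interpret(status):
    """Map an HTTP status from /health to a Nagios (exit_code, message) pair."""
    if status == 204:
        return 0, "OK: sensu-api reports healthy"
    if status == 503:
        return 2, "CRITICAL: sensu-api reports unhealthy"
    return 3, "UNKNOWN: unexpected HTTP %d" % status


def check_health(base, consumers=1, messages=100, timeout=5):
    """Query the API and translate the answer; unreachable counts as CRITICAL."""
    try:
        url = health_url(base, consumers, messages)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:  # 4xx/5xx answers land here
        status = err.code
    except OSError as err:  # connection refused, DNS failure, timeout
        return 2, "CRITICAL: sensu-api unreachable: %s" % err
    return interpret(status)
```

For example, `check_health("http://localhost:4567")` would be the call for an API on the local host, assuming no authentication is configured.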
That said, I’m a pessimistic sysadmin and I have never been 100% comfortable with /health as the only way to check sensu’s health. Something that did a more thorough end-to-end automated check, all the way through to a handler, would be interesting. Please share what you come up with!
Occasionally, on an ad-hoc basis, we will send an event via netcat and make sure both that we can see it on our sensu-dashboard and that it makes it to pagerduty, e.g. https://gist.github.com/nstielau/3797054 (change the handler to your pagerduty handler, of course)
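The same manual test could be scripted roughly like this in Python. This is a sketch, not the contents of the gist: I'm assuming the sensu-client listens on a local TCP socket on port 3030 and accepts a JSON check result with `name`, `output`, `status` and `handlers` keys. The check name and helper functions are invented for illustration.

```python
import json
import socket


def build_event(name, output, status, handlers):
    """Serialize a check result in the shape assumed for the client socket."""
    return json.dumps({
        "name": name,
        "output": output,
        "status": status,  # 0 OK, 1 warning, 2 critical
        "handlers": list(handlers),
    })


def send_event(payload, host="127.0.0.1", port=3030, timeout=5.0):
    """Send the JSON payload to the (assumed) local sensu-client socket."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload.encode("utf-8"))
```

Calling `send_event(build_event("sensu_e2e_test", "end-to-end test", 2, ["pagerduty"]))` should then raise a critical event; confirming that it shows up on the dashboard and in pagerduty is still a manual step here.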
On Wed, Jan 15, 2014 at 8:37 AM, Kyle Anderson <kyle@xkyle.com> wrote:
[reply quoted in full earlier in the thread]