Intro
I'm pleased to announce the unofficial open-sourcing of some of the
Sensu code we use at Yelp that allows us to work in a multi-team
environment. (An official announcement will come later.)
Special thanks to bobtfish and hashbrowncipher for helping me get it
into a releasable shape.
This is some of the code referenced in this talk:
Warning: Please do not try to use these unless you are already
familiar with how Sensu works. If you are deploying Sensu for the
first time, this code is not for you. (probably)
Warning: There are sure to be plenty of "yelpy" assumptions in this
code that don't apply to you. These are bugs. Please open an issue, and
bonus points for pull requests. To repeat: If you have to fork this
code to make it usable at your organization, it is a bug. (probably)
Also, docs and tests are lacking. Open an issue if something is not clear.
On to the code!
These are the handlers we use at Yelp. The big idea here is that we
have one "default" handler that behaves differently depending on the
event data that it receives. (Contrast this with most Sensu
installations, which have pre-defined handlers, so a check's behavior
is determined server-side.) This change is what allows our handlers
to be "team-aware". Each check lets the handler know which team
is responsible for it!
We use the following handlers:
* nodebot (irc)
* mailer (email)
* pagerduty (pages)
* jira (tickets)
This covers most of our needs, and runs the gamut from urgent to
not-urgent-but-important.
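
To make the "one default handler" idea concrete, here is a minimal sketch of how a single handler could pick its notification channels from the event data alone. This is not the actual handler code: the function name and the check fields ("team", "page", "ticket", "irc_channels", "notification_email") are assumptions chosen for illustration.

```python
# Hypothetical sketch: the field names below are assumptions for
# illustration, not the real event schema used by the handlers.

def choose_handlers(event):
    """Return the names of the notification channels an event should reach."""
    check = event.get("check", {})
    if not check.get("team"):
        # Without an owning team there is nobody to route to.
        raise ValueError("check does not declare a team")

    channels = []
    if check.get("irc_channels"):
        channels.append("nodebot")
    if check.get("notification_email"):
        channels.append("mailer")
    if check.get("page"):
        channels.append("pagerduty")  # urgent: wake someone up
    if check.get("ticket"):
        channels.append("jira")  # important, but not urgent
    return channels


urgent = {"check": {"name": "db_down", "team": "backend", "page": True,
                    "irc_channels": ["#backend"]}}
print(choose_handlers(urgent))
```

The point of the design is that the routing decision travels with the check, so adding a new team or changing who gets paged needs no server-side handler change.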
https://github.com/Yelp/puppet-monitoring_check
This is a custom Puppet type that lets us deploy team-aware Sensu
checks with Puppet, in a form that the above handlers understand.
(It is just a wrapper around the upstream sensu::check that adds team
handling and validation.)
https://github.com/Yelp/pysensu-yelp
This is a Python library that allows Python code to emit Sensu events
that are compatible with the above handlers. Some developers use this
library directly for special cases, but most developers use the generic
Sensu checks that every SOA service comes with. (Each SOA service has
a YAML file that declares the Sensu team responsible for it, and their
alerting parameters.)
The Why
We decided to add all this code and complexity because it lowers the
barrier to adding monitoring to a system. This is very important to
me, especially in a large org with lots of "teams". I personally
believe that this is the future of... "devops", in contrast to the
traditional "Ops" team or "NOC" that gets all alerts and acts as a
human filter.
Once you have good handlers and good libraries in place, interesting
things become possible that were crazy to think of before. Here are
some examples:
* We can use the "owner" tag in EC2 to figure out where to route
system-level alerts. (Ops doesn't own every box)
* The Infra team deploying Netflix ICE (AWS costs tracking) can easily
make the system open tickets when certain price thresholds are
reached. (important, but not urgent issue)
* Backend teams can write batch jobs that page their team's on-call if
they fail, say, 3 times in a row. (no Ops involvement necessary!)
* Shared dev boxes can route load/cpu/disk alerts to the local users
for self-regulation. (blame the heavy users, don't wake up ops)
* Alerts for a stage environment can get routed to the team that has
reserved that stage environment for use for that time. (fast feedback
to the relevant team)
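
As an illustration of the batch job example above, the "fail 3 times in a row" rule can be sketched as a small counter that only reports a critical status once the streak reaches a threshold. The class and method names here are hypothetical and are not part of any of the released libraries.

```python
# Hypothetical helper: not part of pysensu-yelp or the handlers.

class FailureStreak:
    """Track consecutive failures of a job and decide when to escalate."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, succeeded):
        """Record one run; return a Sensu-style status (0 = OK, 2 = critical)."""
        if succeeded:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
        return 2 if self.failures >= self.threshold else 0


streak = FailureStreak(threshold=3)
statuses = [streak.record(ok) for ok in (False, False, False, True)]
print(statuses)
```

In this sketch the third consecutive failure is the first one to go critical, and a single success clears the alert.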
Giving developers a Python function to send alerts is an empowering
thing. We make heavy use of the local Sensu socket
(https://github.com/sensu/sensu-docs/pull/132/files) to create events
when no such check is defined client-side.
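
For readers who have not used the client socket, here is a rough sketch of what sending an ad-hoc event looks like. It assumes a Sensu client listening on its default local socket (127.0.0.1, port 3030) and uses only the Python standard library. The "team" and "page" attributes are assumed custom fields for illustration; check the handler code for the keys it actually reads.

```python
import json
import socket


def build_event(name, status, output, team, page=False):
    """Build a check result; "team" and "page" are assumed custom attributes."""
    return {
        "name": name,
        "status": status,  # 0 = OK, 1 = warning, 2 = critical
        "output": output,
        "team": team,
        "page": page,
    }


def send_event(event, host="127.0.0.1", port=3030):
    """Write the event as JSON to the local Sensu client socket."""
    payload = json.dumps(event).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


event = build_event("nightly_batch", 2, "job failed 3 times in a row",
                    team="backend", page=True)
# send_event(event)  # requires a Sensu client listening on localhost:3030
```

Because the event is created by the code that knows about the failure, no check definition has to exist on the client beforehand.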
Conclusion
Three cheers for a modern, flexible, and dynamic monitoring framework
upon which to build great things!