Open Sourcing of Yelp's Sensu Stuff


#1

Intro

I'm pleased to announce the unofficial open-sourcing of some of the
Sensu code we use at Yelp that allows us to work in a multi-team
environment. (An official announcement will come later.)
Special thanks to bobtfish and hashbrowncipher for helping me get it
into a releasable shape.

This is some of the code referenced in this talk:
http://www.slideshare.net/solarkennedy/sensu-yelp-a-guided-tour
https://vimeo.com/92770954


Warning: Please do not try to use these unless you are already
familiar with how Sensu works. If you are deploying Sensu for the
first time, this code is not for you. (probably)

Warning: There are sure to be plenty of "yelpy" assumptions in this
code that don't apply to you. These are bugs. Please open an issue, and
bonus points for pull requests. To repeat: If you have to fork this
code to make it usable at your organization, it is a bug. (probably)
Also, docs and tests are lacking. Open an issue if something is not clear.

Onto the code!


https://github.com/Yelp/sensu_handlers
These are the handlers we use at Yelp. The big idea here is that we
have one "default" handler that behaves differently depending on the
event data that it receives. (Contrast this with most Sensu
installations, which have pre-defined handlers, so a check's behavior
is determined server-side.) This change is what allows our handlers
to be "team-aware". Each check lets the handler know which team
is responsible for it!

We use the following handlers:
* nodebot (irc)
* mailer (email)
* pagerduty (pages)
* jira (tickets)

This covers most of our needs, and covers the gamut of urgent =>
not-urgent + important.
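
To make the "one default handler" idea a bit more concrete, here is a
rough Python sketch of dispatching on event data. It is only an
illustration, not the code in the repo: the field names `team`, `page`,
`ticket`, `irc_channels` and `notification_email` are assumptions made
up for this example, so check the repo for the real ones.

```python
def choose_actions(event):
    """Return the notification channels implied by the check's own data.

    All field names below are hypothetical; the point is only that the
    check, not the server config, decides where the alert goes.
    """
    check = event.get("check", {})
    if check.get("team") is None:
        # No owning team declared, so there is nobody to route to.
        return []

    actions = []
    if check.get("irc_channels"):
        actions.append("nodebot")    # irc
    if check.get("notification_email"):
        actions.append("mailer")     # email
    if check.get("page"):
        actions.append("pagerduty")  # pages
    if check.get("ticket"):
        actions.append("jira")       # tickets
    return actions


example_event = {
    "check": {
        "name": "disk_usage",
        "team": "backend",
        "page": True,
        "irc_channels": ["#backend"],
    }
}
example_actions = choose_actions(example_event)
```

With the example event above, `example_actions` ends up as
`["nodebot", "pagerduty"]`: the check asked for IRC and a page, and
nothing server-side had to know about the backend team in advance.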


https://github.com/Yelp/puppet-monitoring_check
This is a custom Puppet type that lets us use Puppet to deploy
team-aware Sensu checks that the above handlers understand.
(It is just a wrapper around the upstream sensu::check that does team
stuff + validation.)


https://github.com/Yelp/pysensu-yelp
This is a Python library that allows Python code to emit Sensu events
that are compatible with the above handlers. Some developers use this
library directly for special things; most developers use the generic
Sensu checks that every SOA service comes with. (Each SOA service has
a YAML file that declares the Sensu team responsible for it and its
alerting parameters.)

The Why

We decided to add all this code and complexity because it lowers the
barrier to adding monitoring to a system. This is very important to
me, especially in a large org with lots of "teams". I personally
believe that this is the future of.... "devops", in contrast to the
traditional "Ops" team or "NOC" that gets all alerts and acts as a
human filter.

Once you have good handlers and good libraries in place, interesting
things become possible that were crazy to think of before. Here are
some examples:

* We can use the "owner" tag in EC2 to figure out where to route
system-level alerts. (Ops doesn't own every box)
* The Infra team deploying Netflix ICE (AWS costs tracking) can easily
make the system open tickets when certain price thresholds are
reached. (important, but not urgent issue)
* Backend teams can write batch jobs that page their team's on-call if
they fail, say, 3 times in a row. (no Ops involvement necessary!)
* Shared dev boxes can route load/cpu/disk alerts to the local users
for self-regulation. (blame the heavy users, don't wake up ops)
* Alerts for a stage environment can get routed to the team that has
reserved that stage environment for that time. (fast feedback
to the relevant team)
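
As an illustration of the batch job example, here is a minimal Python
sketch of "page only after 3 failures in a row". The class name and
the threshold handling are made up for this example. It just shows the
counting logic a job could keep on its own side before deciding to
emit a paging event.

```python
from dataclasses import dataclass


@dataclass
class FailureStreak:
    """Track consecutive failures of a job (hypothetical helper)."""

    threshold: int = 3  # page once this many runs fail in a row
    streak: int = 0

    def record(self, succeeded):
        """Record one run; return True if the on-call should be paged."""
        if succeeded:
            self.streak = 0  # any success resets the count
            return False
        self.streak += 1
        return self.streak >= self.threshold


tracker = FailureStreak()
runs = [False, False, True, False, False, False]
should_page = [tracker.record(ok) for ok in runs]
```

Here `should_page` is `[False, False, False, False, False, True]`: the
success in the middle resets the streak, so only the last run pages.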

Giving developers a Python function to send alerts is an empowering
thing. We make heavy use of the local Sensu socket
(https://github.com/sensu/sensu-docs/pull/132/files) to create events
when no such check is defined client-side.
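
For anyone who hasn't used the client socket, here is a minimal
sketch of what "a Python function to send alerts" can look like. It
assumes the Sensu client is listening on its default local socket
(127.0.0.1, port 3030) and accepts a JSON check result with `name`,
`status` and `output`. The `team` key and the function names are
assumptions for illustration. This is not the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, team, **extra):
    """Build a check result dict; `team` and `extra` are custom attributes."""
    result = {"name": name, "status": status, "output": output, "team": team}
    result.update(extra)
    return result


def send_result(result, host="127.0.0.1", port=3030):
    """Send one JSON check result to the local Sensu client socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Status follows the usual convention: 0 OK, 1 warning, 2 critical.
example_result = build_result(
    "nightly_batch_job", 2, "job failed 3 times in a row", "backend", page=True
)
```

Calling `send_result(example_result)` on a box with a running Sensu
client would then create the event, even though no check named
`nightly_batch_job` is defined on that client.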

Conclusion

Three cheers for a modern, flexible, and dynamic monitoring framework
upon which to build great things!


#2

This is really good, thanks for open-sourcing this.

We’re going through a similar process of de-opsifying our alerting over here.
It’s actually proving popular with the teams as they get much better feedback.

We’re also using the pattern of people shipping alerting config along with their service config.

Would love to hear more about how this is working for people since the OP.

Cheers,

Andrew

http://github.com/hmrc


#3

We have also released many of our custom Sensu plugins, along with misc code:
https://github.com/yieldbot/sensu-yieldbot-plugins

At some point it may be nice to gather all of these links and shared practical knowledge/configurations in a central location for others.



#4

Hmm, I never did an official announcement on the yelp thing. Consider
it official?

@matty I agree. With such a flexible framework, it isn't super obvious
how to put Sensu together into a big holistic solution.
We could just make a page on sensu-docs? Case studies or
something? Real-world examples?



#5

That sounds like a good idea. I was also thinking of maybe putting it on either the Development page or the Downloads page. The Downloads page may be nice, as at some point it will contain a list of all available gems, their current versions, and descriptions. We could simply add a section called ‘User Sponsored’, and any non-official plugins or those maintained by an outside group could be placed there.
