Sensu Check Retry (Soft State)

#1

I’m trying to determine what in Sensu would equate to the soft/hard states and check obsession from Nagios and its clones.

From reading the docs and searching the mailing lists, occurrences seem to be the current way to mimic that functionality, though it doesn’t seem to handle spammy alerts quite as well.

Correct me if my thinking is wrong, but if occurrences is set to a low value like 2, the check still has to wait two interval cycles before it triggers a handler. So a check that runs every 5 min wouldn’t trigger a handler for 10 min. If the occurrence value is set too low, like 1, you can get a bunch of spammy alerts.

An example would be something on the system spiking CPU for a very short period of time, like 1-2 seconds. With an occurrence value of 1 this could trigger a spammy handler, whereas if the check were run again it would come back OK and no handler would be triggered.
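A minimal sketch of the occurrences trade-off described above, in plain Ruby rather than Sensu configuration. The helper name `runs_that_alert` and the sample status series are made up for illustration; the only assumption is that a handler fires on the run where the count of consecutive failures reaches `occurrences`.

```ruby
# Hypothetical helper: replay a series of check exit statuses (0 = OK,
# non-zero = failing) and return the indexes of the runs that would reach
# the handler when alerts wait for `occurrences` consecutive failures.
def runs_that_alert(statuses, occurrences:)
  streak = 0
  alerts = []
  statuses.each_with_index do |status, run|
    streak = status.zero? ? 0 : streak + 1
    alerts << run if streak == occurrences
  end
  alerts
end

spike  = [0, 2, 0, 0, 0]  # one bad run (a short CPU spike), then recovery
outage = [0, 2, 2, 2, 2]  # sustained failure starting at run 1

p runs_that_alert(spike,  occurrences: 1)  # => [1]
p runs_that_alert(spike,  occurrences: 2)  # => []
p runs_that_alert(outage, occurrences: 2)  # => [2]
```

With `occurrences: 1` the one-run spike reaches the handler. With `occurrences: 2` the spike is suppressed, but a real outage is only reported on its second failing run, one interval later.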

Is there a recommended way to force checks to retry upon failure before triggering a handler?

v/r

STEVE

#2

The potential solutions that I have thought of or come across so far are below.

Sensu checking issues.

If you have occurrences set to a low value like 2, the check still has to wait two interval cycles to trigger a handler. So if you have a check that runs every 5 min, this wouldn’t trigger a handler for 10 min. If the occurrence value is set too low, like 1, you can get a bunch of spammy alerts.

Option 1: Spend cycles balancing the (occurrence vs retry vs interval) per check

Pros:

  • Allows you to have less spammy alerts

Cons:

  • Takes cycles to balance the interval vs occurrence vs retry options

  • Might miss alerts while the system is still being tuned

  • Potential for lots of trial and error

Option 2: Set checks to a small occurrence value but a larger refresh value

(Checks become more sensitive, but alert (trigger handlers) less often on the same events.)

Pros:

  • Simpler configuration setup

  • Don’t miss on critical events for a larger window of time

Cons:

  • Potentially leads to a lot of false-positive alerts (spammy)

  • Potential for lots of trial and error

Option 3: Set checks to a small run interval but a higher occurrence trigger

Pros:

  • Potentially don’t miss on critical events for a larger window of time (depends on default values)

Cons:

  • Takes cycles to balance the interval vs occurrence options

  • Might miss alerts while the system is still being tuned

  • Potentially drives up load on system

  • Potential for lots of trial and error
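To put rough numbers on the three options, here is a small sketch. `tradeoff` is a hypothetical helper, not part of Sensu. It assumes the handler fires on the Nth consecutive failing run, which lands (N - 1) intervals after the first failing run.

```ruby
# Hypothetical helper: summarise the cost of one interval/occurrences pairing.
def tradeoff(interval_s:, occurrences:)
  {
    # Delay measured from the first failing run, not from the fault itself.
    seconds_from_first_failure_to_alert: (occurrences - 1) * interval_s,
    # How often the check command runs, as a rough proxy for added load.
    runs_per_hour: 3600 / interval_s
  }
end

[
  { interval_s: 300, occurrences: 2 },  # 5 min interval, alert on 2nd failure
  { interval_s: 300, occurrences: 1 },  # sensitive: alert on 1st failure
  { interval_s: 60,  occurrences: 5 }   # Option 3: short interval, high threshold
].each do |settings|
  result = tradeoff(**settings)
  puts "interval=#{settings[:interval_s]}s occurrences=#{settings[:occurrences]} " \
       "delay=#{result[:seconds_from_first_failure_to_alert]}s " \
       "runs/hour=#{result[:runs_per_hour]}"
end
```

The delay is counted from the first failing run; the fault itself may have started up to one interval earlier, which is where the "two times the interval" worst case comes from. The 60-second check with 5 occurrences alerts 240 seconds after the first failure but runs 60 times an hour instead of 12.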

In reply to Steve Bambling's post of Monday, February 22, 2016 at 6:12:26 AM UTC-5.

#3

So I’m a new guy with Sensu and I’m also struggling with the relationship between checks -> occurrences/refresh -> handlers -> filters.
While I can see this is much more flexible than what Nagios-like environments provide, it feels so much harder and more awkward to achieve similar results.

I don’t have any solutions, but maybe we can help each other, unless someone else can chime in, as we’re trying to do just about the same thing.

So, I have something that looks a bit like this:

https://gist.github.com/dmsimard/2c9cbee0d803ba83220c

  • I have a client, this client is subscribed to default.
  • I have a check: check-cpu that is published to default.
    I don’t want this check to “trigger” my handler unless 5 consecutive checks have gone bad, thus “occurrences” is set to 5.
  • I have a handler: Fairly irrelevant to the issue, just a bot that handles notification logic to IRC - this works.
  • I have a filter: Basically extracted from this as is. The idea being that okay, notify me once you have a problem but then don’t bother me for the next hour.

Now, with that, I have check-cpu notifying my handler and triggering an event on the first occurrence.

So this is more than likely the filter, where “occurrence == 1” is overriding the “occurrences” parameter from the check.

So I feel the filter should be something more like (pseudocode):
"eval: event[:occurrences] == check[:occurrences] || event[:occurrences] % check[:occurrences] == 0"

Or, more accurately, with a custom check parameter?

"eval: event[:occurrences] == check[:occurrences] || event[:occurrences] % check[:retry] == 0"

However, I have no clue what variables I can access from the eval and how to access them to do something like that, it’s a black box.
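Setting Sensu's filter syntax aside, the rule the first pseudocode line is reaching for can be written as an ordinary Ruby predicate. This is only a sketch: `notify?` is a made-up name, and the event hash layout (a top-level `:occurrences` counter plus a `:check` hash carrying the configured threshold) is an assumption based on the description above, not a statement about what a filter eval can see.

```ruby
# Hypothetical predicate: should this event reach the handler?
# event[:occurrences]         - running count of consecutive failures
# event[:check][:occurrences] - threshold configured on the check
def notify?(event)
  count     = event[:occurrences]
  threshold = event[:check][:occurrences]
  # Fire when the threshold is first reached, then on every further multiple.
  count == threshold || (count % threshold).zero?
end

check = { name: "check-cpu", occurrences: 5 }

p notify?(occurrences: 1,  check: check)  # => false
p notify?(occurrences: 5,  check: check)  # => true
p notify?(occurrences: 7,  check: check)  # => false
p notify?(occurrences: 10, check: check)  # => true
```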

I’ve searched around and it looks like some people have just given up on trying to handle this flow in Sensu and are just handling the notification interval logic right within the handler.

For example, Yelp: https://github.com/Yelp/puppet-monitoring_check

In their handlers: https://github.com/Yelp/sensu_handlers/blob/ee91619406502a2e77512d5f529d34ca8b2dab31/files/base.rb#L163-L214

I’m about to do something similar if it keeps up but I’d rather avoid having to manage that logic.


#4

> However, I have no clue what variables I can access from the eval and how to
> access them to do something like that, it's a black box.

Take a look at the Sensu server log when an event comes in; anything in
that JSON dictionary is fair game.
I think the docs are pretty good, but nothing beats being able to see
*real* event data from your own logs to see what you actually have to
work with.

I think I did an "OK" job of covering this in my intermediate Sensu training:
https://github.com/solarkennedy/sensu-training/tree/master/intermediate/lectures/Handlers%2C%20Filters%2C%20and%20Subdued%20Checks

Contact me off list and I'll give you a free coupon if you want, but
it sounds like you already have a good grasp of them.

> I've searched around and it looks like some people have just given up on
> trying to handle this flow in Sensu and are just handling the notification
> interval logic right within the handler.
> For example, Yelp: https://github.com/Yelp/puppet-monitoring_check
> In their handlers:
> https://github.com/Yelp/sensu_handlers/blob/ee91619406502a2e77512d5f529d34ca8b2dab31/files/base.rb#L163-L214

I was part of the team that wrote the Yelp handlers.
In retrospect... I still think it was worth it. I'd say just being able
to implement exponential backoff on the alerts was enough to make it
worth it.

Another thing to mention in retrospect was that at Yelp, we knew we
were going to train lots of engineers to use this system. (not just a
one-man-ops-shop)
We figured if they are going to learn new words for this, they might
as well be words that make sense to *us*. (check_every, alert_after,
realert_every)
The idea was that you could read it in a sentence for humans:
check_apache: check_every => 5m, alert_after => 10m

I also think it is a good testament to Sensu's flexibility that we
were able to write our own logic as we see fit without any changes to
Sensu itself.
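As a rough sketch of how settings like check_every and alert_after could map onto occurrence counts, with exponential backoff on re-alerts. This is not the Yelp handler code: the helper names, the use of seconds, and the rule that alerts go out when the number of failures past the alert_after window is a power of two are all assumptions made for illustration.

```ruby
# True for 1, 2, 4, 8, ...
def power_of_two?(n)
  n > 0 && (n & (n - 1)).zero?
end

# Hypothetical decision: alert on this failing run or stay quiet?
def alert_now?(occurrences:, check_every_s:, alert_after_s:)
  grace_runs = alert_after_s / check_every_s  # failures tolerated silently
  attempts   = occurrences - grace_runs       # failures past the grace window
  power_of_two?(attempts)
end

# check_apache: check_every => 5m, alert_after => 10m
alerting_runs = (1..12).select do |n|
  alert_now?(occurrences: n, check_every_s: 300, alert_after_s: 600)
end

p alerting_runs  # => [3, 4, 6, 10]
```

Under these assumptions the first alert goes out on the third consecutive failure (10 minutes after the first one), and the gaps between later alerts double each time.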

> I'm about to do something similar if it keeps up but I'd rather avoid having
> to manage that logic.

Yeah, I'd advise avoiding it as much as you can. As soon as you define a
custom base handler you have to modify existing handlers to use it.

In reply to David Moreau Simard's message of Mon, Feb 22, 2016 at 2:37 PM.

#5

Thanks for the reply Kyle.

What I’m trying to understand is whether it is possible to filter against a value from a field other than the one we’re testing.

I think I could get my workflow to work properly if I am able to do that.

For example, consider the following:

{
  "filters": {
    "first-occurrence": {
      "attributes": {
        "occurrences": "eval: value == 1"
      }
    }
  }
}


This is a filter called ‘first-occurrence’ that will only trigger on the first occurrence (evaluating the field “occurrences” of the event).

But let’s pretend I only want the filter to let things through if the value of the event “occurrences” is equal or greater than the value of the check “occurrences” - so, basically, the same default behavior as using only the “occurrences” field on the check without a filter.

So, in order to do that, I’d need to compare these two values together.

My understanding from the documentation is that you can filter on any field from the event or the check but only against themselves.

So you could filter that the “environment” key of a client is equal to “production” with that hardcoded in but you couldn’t filter that the “environment” key is equal to the “environment” key of a check.

I don’t really need to do the above, I’m just trying to come up with simple examples to show that I can do this and then I can work on that.

I tried different ways of comparing values from different fields but I’m getting errors like the following:

{"timestamp":"2016-02-23T21:31:33.777479+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: event[:occurrences] >= check[:occurrences]","value":67,"error":"undefined local variable or method `event' for Kernel:Module"}

{"timestamp":"2016-02-23T21:34:44.144980+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= check[:occurrences]","value":608,"error":"undefined local variable or method `check' for Kernel:Module"}

{"timestamp":"2016-02-23T21:41:33.794716+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= attributes['check']['occurrences']","value":72,"error":"undefined local variable or method `attributes' for Kernel:Module"}

{"timestamp":"2016-02-23T21:43:33.776419+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= check['occurrences']","value":73,"error":"undefined local variable or method `check' for Kernel:Module"}

The structure of the data available is pretty opaque.
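The errors above are consistent with the expression being evaluated in a scope where `value` is the only piece of event data that is defined. A small standalone illustration of that behaviour follows. It is not Sensu's implementation: `eval_with_value` is a made-up helper that uses Ruby's `binding.eval`, and returning `:error` is just a convenient way to show the failure.

```ruby
# Illustration only: evaluate a filter-style expression in a scope whose
# only event data is the local variable `value` (the helper's own locals,
# raw_eval_string and expression, are also visible but not useful).
def eval_with_value(raw_eval_string, value)
  expression = raw_eval_string.sub(/\Aeval:\s*/, "")
  !!binding.eval(expression)
rescue NameError
  # event, check, attributes and so on are not defined in this scope.
  :error
end

p eval_with_value("eval: value == 1", 1)                    # => true
p eval_with_value("eval: value == 1", 67)                   # => false
p eval_with_value("eval: value >= check[:occurrences]", 67) # => :error
```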

Thanks,

In reply to Kyle Anderson's message of Tuesday, 23 February 2016 11:08:24 UTC-5.

#6

Here is the code that does this work:
https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L203-L219

Yeah... I agree, it looks like you don't have access to the whole event
dictionary if you are filtering on a particular attribute.
A developer with more expertise on this would have to confirm.
@portertech?

In reply to David Moreau Simard's message of Tue, Feb 23, 2016 at 1:46 PM.

#7

Thanks for pointing me in the right direction.

Just thinking out loud here …

So basically that method ends up calling:

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L214

->

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L180-L192

Which then uses:

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/sandbox.rb#L12-L18

I hacked filter.rb to look at the contents of “hash_one” and “hash_two”; we get:

hash_one

occurrences:eval: value == 1 || value % 60 == 0

hash_two

id:8ebcd346-d34e-410d-aea2-30001d42305c

client:{:name=>"snip", :address=>"192.168.66.17", :subscriptions=>["default"], :redact=>, :socket=>{:bind=>"127.0.0.1", :port=>3030}, :safe_mode=>false, :datacenter=>"snip", :keepalive=>{:thresholds=>{:warning=>180, :critical=>300}, :handlers=>["errbot"]}, :version=>"0.22.0", :timestamp=>1456352103}

check:{:command=>"/usr/local/bin/check-cpu.rb", :handlers=>["errbot"], :interval=>60, :occurrences=>5, :subscribers=>["default"], :standalone=>false, :refresh=>1800, :name=>"check-cpu", :issued=>1456352106, :executed=>1456352106, :duration=>1.097, :output=>"CheckCPU TOTAL WARNING: total=89.09 user=69.29 nice=0.0 system=11.17 idle=10.91 iowait=8.38 irq=0.0 softirq=0.25 steal=0.0 guest=0.0\n", :status=>1, :history=>["0", "0", "0", "1", "1", "1", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "1", "1", "1"], :total_state_change=>25}

occurrences:3

action:create

timestamp:1456352108

So really it takes the key given in hash_one (occurrences) and declares value_two as the value of that key in hash_two:
https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L205

What I don’t understand is the black magic that goes behind setting the variable “value” to the actual value of the key.

It’s just two strings:

eval_attribute_value(value_one, value_two)

Either way, it doesn’t look like I’ll be able to do what I want without modifying the way values are sent for comparison.

I suck at Ruby, but I’ll try and see if I can come up with something.
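One shape such a change could take, sketched as standalone Ruby. This is hypothetical and is not how Sensu 0.22 behaves: `attributes_match?` and `eval_with_event` are invented names, and the idea is simply to walk the filter attributes against the event as described above while also exposing the whole event to the expression next to `value`. The sample event reuses the values from the dump (check occurrences 5, event occurrences 3).

```ruby
# Hypothetical variant: the expression can see `value` and the full `event`.
def eval_with_event(raw_eval_string, value, event)
  expression = raw_eval_string.sub(/\Aeval:\s*/, "")
  !!binding.eval(expression)
end

# Walk the filter attributes against the event. `scope` is the part of the
# event currently being compared; `event` is always the whole thing.
def attributes_match?(filter_attrs, event, scope = event)
  filter_attrs.all? do |key, expected|
    actual = scope[key]
    if expected.is_a?(Hash) && actual.is_a?(Hash)
      attributes_match?(expected, event, actual)
    elsif expected.is_a?(String) && expected.start_with?("eval:")
      eval_with_event(expected, actual, event)
    else
      expected == actual
    end
  end
end

event = {
  check: { name: "check-cpu", interval: 60, occurrences: 5, refresh: 1800 },
  occurrences: 3,
  action: "create"
}
filter = { occurrences: "eval: value >= event[:check][:occurrences]" }

p attributes_match?(filter, event)                        # => false (3 < 5)
p attributes_match?(filter, event.merge(occurrences: 5))  # => true
```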

···

On Tuesday, 23 February 2016 23:12:41 UTC-5, Kyle Anderson wrote:

Here is the code that does this work:
https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L203-L219

Yea… I agree it looks like you don’t have access to the whole event
dictionary if you are filtering on a particular attribute.
A developer with more expertise on this could would have to confirm.
@portertech?

On Tue, Feb 23, 2016 at 1:46 PM, David Moreau Simard m...@dmsimard.com wrote:

Thanks for the reply Kyle.

What I’m trying to understand is if it is possible to filter against a value
other than one from the field we’re testing on ?
I think I could get my workflow to work properly if I am able to do that.

For example, consider the following:

{
“filters”: {
“first-occurrence”: {
“attributes”: {
“occurrences”: “eval: value == 1”
}
}
}
}

This is a filter called ‘first-occurrence’ that will only trigger on the
first occurrence (evaluating the field “occurrences” of the event).

But let’s pretend I only want the filter to let things through if the value
of the event “occurrences” is equal or greater than the value of the check
“occurrences” - so, basically, the same default behavior as using only the
“occurrences” field on the check without a filter.
So, in order to do that, I’d need to compare these two values together.

My understanding from the documentation is that you can filter on any field
from the event or the check but only against themselves.
So you could filter that the “environment” key of a client is equal to
“production” with that hardcoded in but you couldn’t filter that the
“environment” key is equal to the “environment” key of a check.

I don’t really need to do the above, I’m just trying to come up with simple
examples to show that I can do this and then I can work on that.
I tried different ways of comparing values from different fields but I’m
getting errors like the following:

{"timestamp":"2016-02-23T21:31:33.777479+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: event[:occurrences] >= check[:occurrences]","value":67,"error":"undefined local variable or method `event' for Kernel:Module"}
{"timestamp":"2016-02-23T21:34:44.144980+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= check[:occurrences]","value":608,"error":"undefined local variable or method `check' for Kernel:Module"}
{"timestamp":"2016-02-23T21:41:33.794716+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= attributes['check']['occurrences']","value":72,"error":"undefined local variable or method `attributes' for Kernel:Module"}
{"timestamp":"2016-02-23T21:43:33.776419+0000","level":"error","message":"filter attribute eval error","raw_eval_string":"eval: value >= check['occurrences']","value":73,"error":"undefined local variable or method `check' for Kernel:Module"}

The structure of the data available is pretty opaque.

Thanks,

On Tuesday, 23 February 2016 11:08:24 UTC-5, Kyle Anderson wrote:

On Mon, Feb 22, 2016 at 2:37 PM, David Moreau Simard m...@dmsimard.com wrote:

So I’m a new guy with Sensu and I’m also struggling with the relationship
between checks -> occurrences/refresh -> handlers -> filters.
While I can see this is much more flexible than what Nagios-like
environments provide, it feels so much harder and more awkward to achieve
similar results.

I don’t have any solutions, but maybe we can help each other, unless someone
else can chime in, as we’re trying to do just about the same thing.

So, I have something that looks a bit like this:
https://gist.github.com/dmsimard/2c9cbee0d803ba83220c

I have a client; this client is subscribed to default.
I have a check: check-cpu, which is published to default.
I don’t want this check to “trigger” my handler unless 5 consecutive checks
have gone bad, thus “occurrences” is set to 5.
I have a handler: fairly irrelevant to the issue, just a bot that handles
notification logic to IRC - this works.
I have a filter: basically extracted from this as is. The idea being: okay,
notify me once you have a problem, but then don’t bother me for the next hour.

Now, with that, I have check-cpu notifying my handler and triggering an
event on the first occurrence.
So this is more than likely the filter’s “occurrences == 1” overriding
the “occurrences” parameter from the check.

So I feel the filter should be something more like (pseudocode):
“eval: event[:occurrences] == check[:occurrences] || event[:occurrences] % check[:occurrences] == 0”

Or, more accurately, with a custom check parameter?
“eval: event[:occurrences] == check[:occurrences] || event[:occurrences] % check[:retry] == 0”

However, I have no clue what variables I can access from the eval or how to
access them to do something like that; it’s a black box.
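To make the intended rule concrete, here is a minimal sketch of the first pseudocode variant as a plain Ruby method. It assumes the event is a hash with a top-level `:occurrences` count and a nested `:check` hash carrying the configured `:occurrences` threshold; `notify?` is a hypothetical name, and whether a filter eval can actually see both fields is a separate question.

```ruby
# Hypothetical helper, not part of Sensu: decide whether an event should
# reach the handler, given the event's running occurrence count and the
# threshold configured on the check.
def notify?(event)
  seen = event[:occurrences]
  # Assumption: a check without an explicit threshold behaves like threshold 1.
  threshold = event[:check][:occurrences] || 1
  # True on the threshold itself and on every multiple of it afterwards.
  # (seen == threshold is already covered by the modulo test.)
  (seen % threshold).zero?
end

event = { occurrences: 5, check: { occurrences: 5 } }
puts notify?(event)
```

With a threshold of 5 this lets occurrences 5, 10, 15 and so on through, and holds back everything in between.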

Take a look at the sensu server log when an event comes in; anything in
that JSON dictionary is fair game.
I think the docs are pretty good, but nothing beats being able to see
real event data from your own logs to see what you actually have to
work with.

I think I did an “OK” job of covering this in my intermediate sensu
training:

https://github.com/solarkennedy/sensu-training/tree/master/intermediate/lectures/Handlers%2C%20Filters%2C%20and%20Subdued%20Checks

Contact me off list and I’ll give you a free coupon if you want, but
it sounds like you already have a good grasp of them.

I’ve searched around and it looks like some people have just given up on
trying to handle this flow in Sensu and are just handling the notification
interval logic right within the handler.
For example, Yelp: https://github.com/Yelp/puppet-monitoring_check

In their handlers:

https://github.com/Yelp/sensu_handlers/blob/ee91619406502a2e77512d5f529d34ca8b2dab31/files/base.rb#L163-L214

I was part of the team that wrote the Yelp handlers.
In retrospect… I still think it was worth it. I’d say just being able
to implement exponential backoff on the alerts was enough to make it
worth it.

Another thing to mention in retrospect was that at Yelp, we knew we
were going to train lots of engineers to use this system. (not just a
one-man-ops-shop)
We figured if they are going to learn new words for this, they might
as well be words that make sense to us. (check_every, alert_after,
realert_every)
The idea was that you could read it in a sentence for humans:
check_apache: check_every => 5m, alert_after => 10m
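As a rough illustration of how those three words could drive alerting, here is a minimal Ruby sketch. It is loosely inspired by the idea described above and is not a copy of the linked Yelp handler code; the method names, the use of seconds for all durations, and the convention that `realert_every: -1` means exponential backoff are assumptions made for this example.

```ruby
# Hypothetical helpers, not the Yelp implementation.

# True for 1, 2, 4, 8, ... (used for exponential backoff).
def power_of_two?(n)
  n > 0 && (n & (n - 1)).zero?
end

# occurrences   - consecutive failures seen so far
# check_every   - check interval in seconds
# alert_after   - how long a failure must persist before the first alert (seconds)
# realert_every - alert on every Nth failure after the first alert;
#                 -1 is assumed here to mean "back off exponentially"
def alert?(occurrences:, check_every:, alert_after:, realert_every: 1)
  # Failures that are expected to pass silently before the first alert.
  silent = check_every > 0 ? alert_after / check_every : 0
  attempts = occurrences - silent
  return false if attempts < 1
  return power_of_two?(attempts) if realert_every == -1
  ((attempts - 1) % realert_every).zero?
end

# check_apache: check_every => 5m, alert_after => 10m
(1..4).each do |n|
  puts "occurrence #{n}: #{alert?(occurrences: n, check_every: 300, alert_after: 600)}"
end
```

Under these assumptions the first two failures stay silent and the third one (10 minutes after the first) alerts.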

I also think it is a good testament to Sensu’s flexibility that we
were able to write our own logic as we see fit without any changes to
Sensu itself.

I’m about to do something similar if it keeps up, but I’d rather avoid
having to manage that logic.

Yea, I’d advise avoiding it as much as you can. As soon as you define a
custom base handler you have to modify existing handlers to use it.

On Monday, 22 February 2016 06:12:26 UTC-5, Steve Bambling wrote:

I’m trying to determine what in Sensu would equate to the soft/hard state,
along with check obsession, from Nagios and its clones.
From reading the docs and searching the mailing lists, occurrences seem to
be the current way to mimic that functionality, though it doesn’t seem to
handle spammy alerts as well.

Correct me if my thinking is wrong, but if you have occurrences set to a
low value like 2, the check still has to wait two times the interval cycle
to trigger a handler. So if you have a check that runs every 5 min, this
wouldn’t trigger a handler for 10 min. If the occurrences value is set as
low as 1, then you can get a bunch of spammy alerts.

An example would be something on the system spiking CPU for a very short
period of time, like 1-2 seconds; with an occurrences value of 1 this could
trigger a spammy handler, whereas if the check were run again it would come
back OK and no handler would be triggered.

Is there a recommended method to force checks to retry upon failure before
triggering a handler?

v/r

STEVE

#8

I gave up trying to make filter.rb able to compare more complex things and went with an approach with a mutator instead.

The objective would be to get the mutator to create a key that would contain both the event occurrences and the check occurrences so that we can compare them inside a filter.

It doesn’t seem to work like the documentation implies, though, and someone else just noticed the same thing.

There’s an issue on github about it: https://github.com/sensu/sensu/issues/1175
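For illustration only, the mutation itself could look something like the sketch below: a function that takes the event as a JSON string and returns it with one extra key holding both counts. The key name `occurrence_summary` and the function name are hypothetical, and this shows only the data transformation; as noted above, getting a filter to see the mutated event did not behave the way the documentation implied.

```ruby
require 'json'

# Hypothetical mutation: copy the event-level and check-level occurrence
# values side by side under a single new key.
def add_occurrence_summary(event_json)
  event = JSON.parse(event_json, symbolize_names: true)
  event[:occurrence_summary] = {
    event: event[:occurrences],
    check: event.dig(:check, :occurrences)
  }
  JSON.generate(event)
end

sample = JSON.generate(occurrences: 3, check: { name: "check-cpu", occurrences: 5 })
puts add_occurrence_summary(sample)
```

A command-style mutator would wrap this by reading the event JSON from STDIN and printing the result to STDOUT.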


On Friday, 26 February 2016 11:03:44 UTC-5, David Moreau Simard wrote:

Thanks for pointing me in the right direction.

Just thinking out loud here …

So basically that method ends up calling:

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L214

->

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L180-L192

Which then uses:

https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/sandbox.rb#L12-L18

I hacked filter.rb to look at the contents of “hash_one” and “hash_two”; we get:

hash_one:

occurrences: eval: value == 1 || value % 60 == 0

hash_two:

id: 8ebcd346-d34e-410d-aea2-30001d42305c
client: {:name=>"snip", :address=>"192.168.66.17", :subscriptions=>["default"], :redact=>, :socket=>{:bind=>"127.0.0.1", :port=>3030}, :safe_mode=>false, :datacenter=>"snip", :keepalive=>{:thresholds=>{:warning=>180, :critical=>300}, :handlers=>["errbot"]}, :version=>"0.22.0", :timestamp=>1456352103}
check: {:command=>"/usr/local/bin/check-cpu.rb", :handlers=>["errbot"], :interval=>60, :occurrences=>5, :subscribers=>["default"], :standalone=>false, :refresh=>1800, :name=>"check-cpu", :issued=>1456352106, :executed=>1456352106, :duration=>1.097, :output=>"CheckCPU TOTAL WARNING: total=89.09 user=69.29 nice=0.0 system=11.17 idle=10.91 iowait=8.38 irq=0.0 softirq=0.25 steal=0.0 guest=0.0\n", :status=>1, :history=>["0", "0", "0", "1", "1", "1", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "1", "1", "1"], :total_state_change=>25}
occurrences: 3
action: create
timestamp: 1456352108

So really it takes the key given in hash_one (occurrences) and declares value_two as the value of that key in hash_two:
https://github.com/sensu/sensu/blob/81cf45d176c60bd03d09a9a0d677c1a823de42c2/lib/sensu/server/filter.rb#L205

What I don’t understand is the black magic that goes behind setting the variable “value” to the actual value of the key.

It’s just two strings:

eval_attribute_value(value_one, value_two)

Either way, it doesn’t look like I’ll be able to do what I want without modifying the way values are sent for comparison.

I suck at ruby but I’ll try and see if I can come up with something.
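One way to picture the “black magic” is the simplified model below. It is a sketch of the flow traced above, not Sensu’s actual code: `MiniSandbox` and the exact method bodies are assumptions written for this example. The point it illustrates is that `value` is just a method parameter, and the expression is evaluated in a scope where that parameter is the only piece of event data visible.

```ruby
# Simplified model, not Sensu's real implementation.
module MiniSandbox
  # `value` is an ordinary parameter. Evaluating the expression with this
  # method's binding means the expression can see `value`, but it cannot
  # see the rest of the event.
  def self.eval(expression, value = nil)
    Kernel.eval(expression, binding)
  end
end

# Strip the "eval:" prefix, evaluate, and treat any error as "no match".
def eval_attribute_value(raw_eval_string, value)
  code = raw_eval_string.sub(/\Aeval:\s*/, "")
  !!MiniSandbox.eval(code, value)
rescue StandardError, ScriptError
  false
end

# For each key in the filter, look up the same key in the event and compare.
def filter_attributes_match?(filter_attributes, event)
  filter_attributes.all? do |key, expected|
    actual = event[key]
    if expected == actual
      true
    elsif expected.is_a?(Hash) && actual.is_a?(Hash)
      filter_attributes_match?(expected, actual)
    elsif expected.is_a?(String) && expected.start_with?("eval:")
      eval_attribute_value(expected, actual)
    else
      false
    end
  end
end

event = { occurrences: 3, check: { occurrences: 5 } }
# Works: the expression only refers to `value` (the event's occurrences).
puts filter_attributes_match?({ occurrences: "eval: value == 1 || value % 3 == 0" }, event)
# Fails: `check` is not defined inside the evaluation scope.
puts filter_attributes_match?({ occurrences: "eval: value >= check[:occurrences]" }, event)
```

In this model, comparing two different fields would require passing more than the single looked-up value into the evaluation scope, which matches the conclusion above.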
