Checks constantly appearing and disappearing

I am finding Sensu very unreliable. I am constantly getting client timeouts, but also intermittent check results.

I am running a disk check (from Nagios), and a given client will appear and disappear with a WARNING status… but the disk space on the client has remained constant.

The dashboard provides no way to drill into a check that does not appear as an “event”, so I can’t go and check the specific result of a supposedly passing test to see what it last returned and when.

And they really think it can replace Nagios?

Absolutely.

The timeouts, are you referring to keepalive events? Are your clocks not synced, or do they have significant drift?

Sensu doesn’t set the check result severity, the check plugin does. Is there a pattern to the occurrences?

The Sensu Dashboard does lack much of the desired functionality, but development continues, and I am hopeful for its future. Sensu makes it easy to leverage a tool best suited for storing historical data, such as Logstash or Splunk.

I hope the community can help you address your issues.

Sean.


On May 29, 2014 7:47 PM, “zzz” megamic@gmail.com wrote:

I am finding Sensu very unreliable. I am constantly getting client timeouts, but also intermittent check results.

I am running a disk check (from Nagios), and a given client will appear and disappear with a WARNING status… but the disk space on the client has remained constant.

The dashboard provides no way to drill into a check that does not appear as an “event”, so I can’t go and check the specific result of a supposedly passing test to see what it last returned and when.

And they really think it can replace Nagios?

When you say client timeouts, do you mean the keepalive events? I’m guessing you are on a cloud provider, have a firewall killing long-lived connections, or have a misconfigured RabbitMQ.

Try adding these options to your /etc/rabbitmq/rabbitmq.config. Note that this is not the full config, only the options you should potentially add or change:

{rabbit, [
  {heartbeat, 25},
  {tcp_listen_options, [binary,
    {packet, raw},
    {reuseaddr, true},
    {backlog, 128},
    {nodelay, true},
    {exit_on_close, false},
    {keepalive, true}]}
]}

The dashboard feedback is typical of newer converts to Sensu. The habit of wanting to look at a green dashboard is hard to break. Sensu gives you the primitives to create a green dashboard, though.

You can use the client history API to get the status of recent checks. The sensu-cli gem makes it easy to query: run sensu-cli client history NODE.
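For anyone who would rather script this than use the gem, here is a minimal Python sketch of the same idea. It assumes the history is served at /clients/NODE/history on the Sensu API and that each entry carries check, history, last_execution and last_status fields; the function names, the sample data and the status labels are illustrative, so check them against your Sensu version before relying on them.

```python
import json
from urllib.request import urlopen

# Nagios-style exit codes; anything else is reported as UNKNOWN.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def fetch_history(api_url, client):
    """GET <api_url>/clients/<client>/history and decode the JSON body.

    The endpoint path is an assumption about the Sensu API in use.
    """
    with urlopen(f"{api_url}/clients/{client}/history") as resp:
        return json.load(resp)


def summarize_history(entries):
    """Map each check name to (last status label, last execution time)."""
    summary = {}
    for entry in entries:
        label = STATUS_NAMES.get(entry["last_status"], "UNKNOWN")
        summary[entry["check"]] = (label, entry["last_execution"])
    return summary


# Hypothetical sample in the assumed response shape (not real output).
sample = [
    {"check": "check_disk", "history": [0, 1, 1],
     "last_execution": 1401419955, "last_status": 1},
    {"check": "keepalive", "history": [0, 0, 0],
     "last_execution": 1401419960, "last_status": 0},
]

for name, (label, executed) in summarize_history(sample).items():
    print(name, label, executed)
```

With a reachable API, summarize_history(fetch_history("http://localhost:4567", "NODE")) would give the same kind of summary for a live client; the host and port there are placeholders. This also shows passing checks, which never appear as events.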

You can also test the check on the box that it’s firing on. You don’t say whether it is sitting right at the threshold of firing. Is it flapping? The client logs should also help you diagnose this.

As far as replacing Nagios goes, I monitor a multi-billion-dollar ecommerce business with it. It works, and the composability and extensibility of Sensu are what make it awesome.

Thanks,

Bryan


Absolutely.

The timeouts, are you referring to keepalive events? Are your clocks not synced, or do they have significant drift?

Yes, I get these 180-second client timeout messages on the dashboard. Then they disappear. My servers all use NTP and don’t have significant clock skew.
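To make the clock question concrete: as far as I understand it, the server judges a client by comparing the timestamp carried in its last keepalive against the server's own clock, so a slow client clock looks the same as a late keepalive. A minimal Python sketch of that comparison follows. The 180-second critical threshold matches the dashboard message; the 120-second warning threshold and the function name are assumptions, not Sensu's actual code.

```python
# Assumed thresholds in seconds: 180 matches the dashboard message quoted
# above; 120 for the warning level is an assumption.
WARNING_AFTER = 120
CRITICAL_AFTER = 180


def keepalive_status(keepalive_timestamp, server_now,
                     warning=WARNING_AFTER, critical=CRITICAL_AFTER):
    """Return 0 (OK), 1 (WARNING) or 2 (CRITICAL) for a client's last keepalive.

    keepalive_timestamp is stamped by the client's clock, server_now by the
    server's clock, so any skew between the two is added to the apparent age.
    """
    age = server_now - keepalive_timestamp
    if age >= critical:
        return 2
    if age >= warning:
        return 1
    return 0


server_now = 1401419955
# Keepalive sent 20 s ago, clocks in sync.
print(keepalive_status(server_now - 20, server_now))
# Same keepalive, but the client clock runs 200 s slow.
print(keepalive_status(server_now - 20 - 200, server_now))
```

If the clocks really are in sync, the remaining explanation is that keepalives are arriving late or not at all, which points back at the transport.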

Sensu doesn’t set the check result severity, the check plugin does. Is there a pattern to the occurrences?

Not at all. Check results just appear and disappear on the dashboard randomly. Checks that have been resolved stay red for far too long and never go away. For example, one of my clients started publishing this result:

{"timestamp":"2014-05-30T13:19:15.487728+1000","level":"info","message":"publishing check result","payload":{"client":"dev04.corp.f7","check":{"name":"check_disk","issued":1401419955,"command":"/usr/lib64/nagios/plugins/check_disk -w 80% -c 10% -p /","executed":1401419955,"output":"DISK WARNING - free space: / 5564 MB (30% inode=84%);| /=12499MB;3805;17124;0;19027\n","status":1,"duration":0.007}}}
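To see whether the client keeps publishing this result while the dashboard entry comes and goes, the client log can be filtered for such lines. Here is a minimal Python sketch, assuming the log holds one JSON object per line in the shape shown above; the function name is illustrative and the log location depends on your install.

```python
import json


def check_results(lines):
    """Yield (timestamp, client, check name, status, output) for every
    "publishing check result" record; other or malformed lines are skipped."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not JSON, e.g. a stray stack trace line
        if not isinstance(record, dict):
            continue
        if record.get("message") != "publishing check result":
            continue
        payload = record["payload"]
        check = payload["check"]
        yield (record["timestamp"], payload["client"], check["name"],
               check["status"], check["output"].strip())


# The log line from this post, kept as a raw string so the JSON "\n" escape
# inside the output field is preserved.
LINE = r'{"timestamp":"2014-05-30T13:19:15.487728+1000","level":"info","message":"publishing check result","payload":{"client":"dev04.corp.f7","check":{"name":"check_disk","issued":1401419955,"command":"/usr/lib64/nagios/plugins/check_disk -w 80% -c 10% -p /","executed":1401419955,"output":"DISK WARNING - free space: / 5564 MB (30% inode=84%);| /=12499MB;3805;17124;0;19027\n","status":1,"duration":0.007}}}'

for result in check_results([LINE, "not a json line"]):
    print(result)
```

Passing an open log file instead of the list gives a timeline of what the client actually published, which can then be compared with what the dashboard shows.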

But the result never appears on the dashboard, even though the dashboard knows about this server and displays the result from other clients.

Also, what does “handle” mean in the check? Is it required for the result to display on the dashboard? I don’t want to handle the check now, just display it on the dashboard. Sometimes when I enable “handle” and set the handler to “debug”, it suddenly appears on the dashboard, then disappears later.

The Sensu Dashboard does lack much of the desired functionality, but development continues, and I am hopeful for its future. Sensu makes it easy to leverage a tool best suited for storing historical data, such as Logstash or Splunk.

The problem is that it is very opaque. If a client doesn’t appear as an event, do I assume it is OK, or has the server somehow forgotten about it, or does it not care that the client is not responding to check requests?


On Friday, May 30, 2014 1:11:27 PM UTC+10, portertech wrote:

I hope the community can help you address your issues.

Sean.


When you say client timeouts, do you mean the keepalive events? I’m guessing you are on a cloud provider, have a firewall killing long-lived connections, or have a misconfigured RabbitMQ.

Try adding these options to your /etc/rabbitmq/rabbitmq.config. Note that this is not the full config, only the options you should potentially add or change:

{rabbit, [
  {heartbeat, 25},
  {tcp_listen_options, [binary,
    {packet, raw},
    {reuseaddr, true},
    {backlog, 128},
    {nodelay, true},
    {exit_on_close, false},
    {keepalive, true}]}
]}

I will try to add this, although I am configuring RabbitMQ through Puppet, so I’m not sure how to do it.


On Friday, May 30, 2014 1:30:28 PM UTC+10, agent462 wrote:

The dashboard feedback is typical of newer converts to Sensu. The habit of wanting to look at a green dashboard is hard to break. Sensu gives you the primitives to create a green dashboard, though.

You can use the client history API to get the status of recent checks. The sensu-cli gem makes it easy to query: run sensu-cli client history NODE.

You can also test the check on the box that it’s firing on. You don’t say whether it is sitting right at the threshold of firing. Is it flapping? The client logs should also help you diagnose this.

As far as replacing Nagios goes, I monitor a multi-billion-dollar ecommerce business with it. It works, and the composability and extensibility of Sensu are what make it awesome.

Thanks,

Bryan
