Sensu self-monitoring

Hi all,

We’d like your advice on the following topics:

  1. What is the approach to monitoring Sensu's own health (self-monitoring)?

  2. If the Sensu server is down, can we configure the server to send an alert? Also, what happens to the check results collected by the Sensu client?

We'd appreciate your support on this.

Thank you.

If the Sensu server is down, it can't send an alert :)

As far as I know (last time I checked), results are lost if the server
goes down. We wrote a client caching layer for results, but that PR was
rejected by the Sensu maintainers [0] (with a reasonable explanation, I must
say). I think the approach they want to take is to handle this at the
transport layer (which is sensible, but we needed more control at the
application layer).
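
A minimal Python sketch of what a client-side caching layer can look like (this is not the code from PR #1392): results are buffered in a bounded queue while the transport rejects them, then replayed in order once it accepts them again, paced 10x faster than the normal interval, as Moises describes later in the thread. `ResultCache`, the `send` callback, and the `maxlen` cap are hypothetical names used only for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class ResultCache:
    """Hypothetical client-side buffer for results the transport could not accept."""

    def __init__(self, send: Callable[[Dict], bool], interval: float,
                 maxlen: int = 1000, speedup: int = 10):
        self.send = send          # returns True if the transport accepted the result
        self.interval = interval  # normal publishing interval, in seconds
        self.speedup = speedup    # replay the backlog this many times faster
        # When full, appending silently drops the oldest cached result.
        self.backlog: Deque[Dict] = deque(maxlen=maxlen)

    def publish(self, result: Dict) -> None:
        # While a backlog exists, queue new results behind it to keep ordering.
        if self.backlog or not self.send(result):
            self.backlog.append(result)

    def replay(self, sleep: Callable[[float], None] = time.sleep) -> int:
        """Drain the backlog at `speedup` times the normal rate; return how many were sent."""
        sent = 0
        while self.backlog:
            if not self.send(self.backlog[0]):
                break  # transport still unavailable; keep the rest for later
            self.backlog.popleft()
            sent += 1
            sleep(self.interval / self.speedup)
        return sent
```

Capping the queue trades completeness for bounded memory during long outages; whether to drop the oldest or the newest results when it fills up is a policy choice.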

I would also be curious to see what others are doing. We run a separate
Sensu server that only monitors our other Sensu servers. The monitored
Sensu servers in turn monitor this 'meta monitor', so if any one of them
goes down, the other one alerts.
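
One way to implement the 'meta monitor' pattern is a check that each Sensu server runs against its peer's API. Below is a minimal Python sketch, assuming the Sensu 1.x API's `/health` endpoint (usually on port 4567) answers 204 when healthy and 412 when its transport or Redis conditions fail; verify this against the API documentation for your version. With `consumers=1`, the check should also fail when no sensu-server process is consuming from the transport, again assuming the parameter behaves as documented. The script name, peer URL, and threshold are hypothetical. It exits with the usual check codes (0 OK, 2 CRITICAL, 3 UNKNOWN), so the monitoring server's handlers can alert when its peer is down.

```python
import sys
import urllib.error
import urllib.request
from typing import Tuple

# Sensu 1.x check exit codes follow the Nagios convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status: int) -> Tuple[int, str]:
    """Map the HTTP status of the peer's /health endpoint to (exit code, message).

    Assumption: 204 means healthy, 412 means the transport/Redis conditions failed.
    """
    if status == 204:
        return OK, "peer Sensu API healthy"
    if status == 412:
        return CRITICAL, "peer Sensu reports unhealthy transport or Redis"
    return UNKNOWN, f"unexpected HTTP status {status}"


def probe(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Query the peer and classify the answer; an unreachable peer counts as CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:  # non-2xx responses (e.g. 412) raise HTTPError
        return classify(err.code)
    except OSError as err:  # URLError, refused connections and timeouts are OSErrors
        return CRITICAL, f"peer Sensu API unreachable: {err}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical host): check_sensu_peer.py http://sensu-peer.example:4567
    code, message = probe(sys.argv[1].rstrip("/") + "/health?consumers=1")
    print(message)
    sys.exit(code)
```

Treating "unreachable" as CRITICAL rather than UNKNOWN is deliberate: an unreachable peer is exactly the failure the meta monitor exists to catch.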

[0] https://github.com/sensu/sensu/pull/1392 (sensu/sensu PR #1392, "Client Metrics Caching")

On Wed, Sep 20, 2017 at 10:07 PM, Kaye <sanpascual.keeshia@gmail.com> wrote:

1. What is the approach on monitoring Sensu health (Self-monitoring)?
2. If the Sensu server is down, can we configure the server to send an
alert? Also what happens to the check results extracted by Sensu client?

-
Moy

Another alternative is a heartbeat (dead man's switch) service, for example:
https://www.opsgenie.com/features#heartbeat-monitoring

https://deadmanssnitch.com/
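
Heartbeat services like these act as a dead man's switch: something on the Sensu side pings the service on a schedule, and the service alerts through its own channels when the pings stop, so alerting no longer depends on the Sensu server being up. A minimal Python sketch of the decision such a service makes; the `period` and `grace` parameters are assumptions for illustration, not the API of OpsGenie, Dead Man's Snitch, or any other provider.

```python
from datetime import datetime, timedelta
from typing import Optional


def heartbeat_overdue(last_ping: Optional[datetime], now: datetime,
                      period: timedelta, grace: timedelta) -> bool:
    """Return True when the monitored job has missed its heartbeat.

    A ping is expected every `period`; `grace` absorbs scheduling jitter
    before alerting. Design choice (an assumption): a job that has never
    pinged counts as overdue.
    """
    if last_ping is None:
        return True
    return now - last_ping > period + grace
```

For the switch to cover the Sensu server itself, the ping has to stop when the server stops. In Sensu 1.x, for example, a subscription check is scheduled by the server and would stop running when the server is down, whereas a standalone check is scheduled by the client and would keep pinging.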

On Wed, Sep 20, 2017 at 8:23 PM, Moises Silva <moises.silva@gmail.com> wrote:

If the Sensu server is down, it can’t send an alert :)

As far as I know (last time I checked), results are lost if the server goes down. We wrote a client caching layer for results, but that PR was rejected by the Sensu maintainers [0] (with a reasonable explanation, I must say). I think the approach they want to take is to handle this at the transport layer (which is sensible, but we needed more control at the application layer).

I would also be curious to see what others are doing. We run a separate Sensu server that only monitors our other Sensu servers. The monitored Sensu servers in turn monitor this ‘meta monitor’, so if any one of them goes down, the other one alerts.

[0] https://github.com/sensu/sensu/pull/1392

Moy

Thanks, all, for your responses! They are very informative, and we now have a two-fold approach to self-monitoring.

Also @Moises, we’re currently looking for a way to ensure client check results are captured despite loss of connectivity, and we're looking at your caching mechanism:

  1. How are the events handled once the connection is back? And if there are two of the same event, are they sent as one event?

  2. How do we configure this mechanism in the transport layer? Do you have any documentation on this that might help us?

Thank you so much! We’d really appreciate your advice on this. :)

Also @Moises we're currently looking for a way to ensure client checks are
captured despite loss of connectivity and looking at your caching mechanism -
1. How are the events handled once connection is back? And what if there 2
of same events, are they sent as one event?

The caching implementation only handles metrics at the moment: once
connectivity returns, it sends metrics at 10x the normal rate until all
cached data has been transmitted. It's not a lot of work to make it work
for checks, but you'd need to decide what you want to do there, e.g.
whether you want to keep the history (say, the last 20 events). If a disk
check fails while connectivity is down, and minutes later the check no
longer fails and connectivity is restored, why would you want to send an
alert for something that is no longer a problem (e.g. the disk space issue
already resolved on its own)? If all you want is to keep a history of when
something failed, even if it got resolved, that would require more work,
because you'd need to transmit the check history to the server somehow
(currently checks only report their current status to the server).
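
The two options for check results can be contrasted in a short Python sketch: collapsing keeps only the most recent result per check name, so a disk check that failed and then recovered during the outage replays as a single OK, while history mode keeps up to the last N results per check (20 in the example above), so the failure is still visible after reconnecting. `CheckBuffer` and its fields are hypothetical; as noted, checks currently report only their current status, so the server side would also need changes to accept a history.

```python
from collections import deque
from typing import Deque, Dict, List


class CheckBuffer:
    """Hypothetical buffer for check results produced while the transport is down."""

    def __init__(self, history: int = 1):
        # history=1 collapses repeats to the latest result per check;
        # history>1 keeps that many recent results per check.
        self.history = history
        self.by_check: Dict[str, Deque[Dict]] = {}  # dicts keep insertion order

    def add(self, result: Dict) -> None:
        name = result["name"]
        if name not in self.by_check:
            self.by_check[name] = deque(maxlen=self.history)
        self.by_check[name].append(result)  # oldest result dropped when full

    def drain(self) -> List[Dict]:
        """Return buffered results (grouped per check, oldest first) and clear the buffer."""
        out = [r for results in self.by_check.values() for r in results]
        self.by_check.clear()
        return out
```

With `history=1`, two results for the same check are sent as one (the latest); any larger value corresponds to the "keep the history" option.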

2. How do we configure this mechanism in the Transport layer? Do you have
a documentation on this that might help us?

Sorry, not really. Perhaps the Sensu 2.0 docs cover it? I know the core
devs were looking at adding this feature for 2.0, but I don't know if it
made it in.

On Thu, Sep 21, 2017 at 10:38 PM, Kaye <sanpascual.keeshia@gmail.com> wrote:

Adding https://healthchecks.io to the mix

On Thursday, 21 September 2017 05:53:41 UTC+2, Kyle Anderson wrote:

Another alternative is a type of heartbeat service
https://www.opsgenie.com/features#heartbeat-monitoring

https://deadmanssnitch.com/