Distributed sensu monitoring setup

Hi all,

I’m designing a monitoring system with Sensu in a distributed setup.

Suppose we have 4 sites, each of which may or may not be located in a
different country.

  • Site A and B
    • Lots of servers at each site
    • We can set up Sensu servers at both sites
  • Site C and D
    • A few servers at each site
    • We can’t have a Sensu server here
    • No direct internet access
  • All sites are connected to each other via VPN, but there is no link
    between C and D

Now I think we’d like to

  • Have separate Sensu servers in A and B to monitor site-local clients
    • Because we don’t want the Sensu server in A to monitor the clients
      in B and vice versa
      • We don’t want a flood of alerts when the A-B link is down
  • Have each server monitor the other server
    • If the A-B link fails, we’ll get just two alerts from each server.
  • Probably use standalone checks in sites C and D
    • Because the machines in C and D are likely to have different
      monitoring configurations, and there are only a few of them. Also,
      bandwidth is precious here.
    • I guess we need a RabbitMQ cluster (and Redis) between A and B,
      with all clients in C and D posting their results to the cluster.
      • (I guess) we can use RabbitMQ federation and Redis replication
        between A and B.
      • The clients in C and D post to a virtual address provided by VRRP.
        • We can use Keepalived here.

This way we’ll need at least two RabbitMQ instances with separate
namespaces (/sensu and /sensu-shared?), two Redis instances with
replication, and four Sensu servers in A and B. Am I misunderstanding
something?

The problem here is that, as you can see, it’s quite complicated.
Could we make it simpler?

Thanks,

This is one of the most common questions on the list. I think I'll do
a PR on the docs to try to help answer these questions better.
(Obviously there is a lot going on: there are many different possible
configurations, plus taste, philosophy, etc.)

I personally wouldn't cluster across A/B. A and B sound like
datacenters (regions), and I would treat them in isolation.
That means keeping the A and B clusters distinct: not federated, no
Redis sharing. Just make a local A cluster and a local B cluster. (Use
Uchiwa to get the single-pane-of-glass feel.)

I would make C and D post to just one of the A or B clusters, not
shared. Then use Sensu's dependency feature and make all Sensu checks
on any server in C / D depend on the VPN link check, so you don't get
an alert flood. (This will require some tuning.)

You have to balance "simple" with "robust", which are the same most of
the time, but you are also trying not to get alert floods.
One big mega Sensu federated cluster is "simple", but not robust
(imho). Isolated A and B clusters with C and D connecting to *one* of
them (maybe C => A, D => B?) is probably the most robust way to do
it, and has the easiest failure modes given your constraints above.
Use dependencies to avoid the alert flood
(http://sensuapp.org/docs/latest/checks#check-dependencies).
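
For readers who want to see the idea concretely, here is a minimal Python sketch of dependency-based suppression. It is not Sensu's implementation: the check names (`vpn_link_c`, `disk_usage`, `http_health`) and the `should_alert` helper are made up for illustration. It only models the rule "skip the alert when a check it depends on is already failing".

```python
# Hypothetical model of dependency-based alert suppression.
# Check names and the should_alert helper are illustrative, not Sensu APIs.

def should_alert(check, failing_checks, dependencies):
    """Return True if a failing check should raise an alert.

    check          -- name of the check that just failed
    failing_checks -- set of names of checks that are currently failing
    dependencies   -- dict mapping a check name to the checks it depends on
    """
    for dep in dependencies.get(check, []):
        if dep in failing_checks:
            # A dependency is already failing, so this alert is suppressed.
            return False
    return True


# Every check on a site C server depends on the VPN link check.
dependencies = {
    "disk_usage": ["vpn_link_c"],
    "http_health": ["vpn_link_c"],
}

# The VPN link to C is down: only the link check itself should alert.
failing = {"vpn_link_c", "disk_usage", "http_health"}
alerts = sorted(c for c in failing if should_alert(c, failing, dependencies))
print(alerts)  # ['vpn_link_c']
```

With the link check healthy, a failing `disk_usage` would alert as usual, which is the "tuning" part: each check needs the right dependency listed.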

On Wed, Aug 27, 2014 at 9:19 PM, Mitsutoshi Aoe <maoe@foldr.in> wrote:

···

Hi Kyle,

Thank you for the reply.

I was thinking of a cluster across A/B to monitor C and D because
the servers in C and D are important to us. We’d like to keep monitoring
even if one of the Sensu servers is down.

It would probably be great if the check dependencies in Sensu accepted
negative dependencies, namely “Handle this check only when that check
is failing”.

This way the Sensu clients in C and D would just need to post their
results to whichever Sensu server in A or B is alive, and we wouldn’t
get a flood of alerts.

Thanks,

On Thursday, August 28, 2014 at 23:35:31 UTC+9, Kyle Anderson wrote:

···

I still recommend local Sensu clusters in A and B (I recommend
building for 3 in a quorum) so that no single Sensu server causes an
outage.

I think the negative check dependency is... a strange artifact and will
lead to confusion. I would KISS.

You can still achieve what you want by having a simple topology (C => A,
D => B, single registration) and just having A and B ping the VPN
endpoints of C and D (and make D a dependency of C's ping, so you
only get one alert).
But having C and D clients register with A and B as a cluster will be
a pain. Any network instability will lead to massive split brains.

On Thu, Aug 28, 2014 at 9:25 PM, Mitsutoshi Aoe <maoe@foldr.in> wrote:

···

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

  1. No alert floods in case of a Sensu server or VPN failure
  2. Highly available monitoring for C and D that is robust even if
     a Sensu server or a VPN link is down

The fixed topology like { C => A, D => B } is sufficient for 1, but
it doesn’t help with item 2, because if the A<->C link is down
we can’t tell whether the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does, instead of using negative
check dependencies. For simplicity, we’ll focus on C => A here.

  • Set up a load balancer in C
    • We can use LVS, for example
    • The real servers are the RabbitMQ instances in A and B
    • The scheduling algorithm should probably be sh (source hashing)
    • We can set up two balancers with VRRP for availability
  • All clients in C post their results to the virtual IP provided by
    the load balancers.
  • If the A-C link is down, LVS removes the RabbitMQ instances in A
    from the list of real servers. Therefore all results from C will go
    to the RabbitMQ in B.
  • After the keepalive threshold time has passed, the Sensu server in A
    realizes it hasn’t received anything from the servers in C, and then
    it runs the keepalive handlers.

I think we could make a new handler for the last step. The handler
searches for the stale client on the Sensu server in B through the
Sensu API. If the client exists on the Sensu server in B, the handler
removes it from the Sensu server in A. Otherwise it’s an actual outage,
and the Sensu server in A sends an alert.
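
The decision this handler makes could be sketched roughly as follows, in Python. This is only an illustration of the logic described above, not a working Sensu handler: the function name and the three callables (`exists_in_b`, `remove_from_a`, `send_alert`) are hypothetical stand-ins for calls to the Sensu API in each site and to a notification channel, and the client names are invented. It also ignores timing, for instance a client that has not yet registered with B.

```python
# Hypothetical sketch of the pruning handler's decision logic.
# The three callables are stand-ins; a real handler would talk to the
# Sensu APIs in A and B and to whatever notification channel is in use.

def handle_stale_client(client_name, exists_in_b, remove_from_a, send_alert):
    """Decide what to do when A stops receiving keepalives from a client.

    exists_in_b   -- callable(name) -> bool, asks the Sensu server in B
    remove_from_a -- callable(name), deletes the client from A's registry
    send_alert    -- callable(name), notifies operators of a real outage
    Returns "pruned" or "alerted" so callers can log the outcome.
    """
    if exists_in_b(client_name):
        # The client failed over to B, so A should simply forget it.
        remove_from_a(client_name)
        return "pruned"
    # Neither server sees the client: treat it as a real outage.
    send_alert(client_name)
    return "alerted"


# Example run with in-memory stand-ins for the two servers.
clients_in_b = {"c-web-01"}
removed, alerted = [], []

for name in ["c-web-01", "c-db-01"]:
    outcome = handle_stale_client(
        name,
        exists_in_b=lambda n: n in clients_in_b,
        remove_from_a=removed.append,
        send_alert=alerted.append,
    )
    print(name, outcome)
```

Here `c-web-01` is known to B and gets pruned from A, while `c-db-01` is missing from both and triggers the alert.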

This might sound complicated, but the implementation shouldn’t be
that difficult. It would be great if sensu-client could take a list of
RabbitMQ addresses and automatically try the next one if a connection
fails. That way we wouldn’t need to set up the load balancers at all.
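
As a rough illustration of that wished-for behaviour (sensu-client is not being described here; this is a standalone Python sketch), a client could walk an ordered list of broker addresses and use the first one that accepts a TCP connection. The host names are placeholders, 5672 is the default AMQP port, and the `connect` parameter exists only so the example can run without a network. A real client would also need the AMQP handshake and reconnect logic, which are left out.

```python
import socket

# Hypothetical sketch: try a list of broker addresses in order.
# Host names are placeholders; only TCP reachability is checked.

def connect_first_available(addresses, timeout=3.0,
                            connect=socket.create_connection):
    """Return (address, connection) for the first reachable broker.

    addresses -- list of (host, port) tuples, in order of preference
    connect   -- callable((host, port), timeout) -> connection object;
                 injectable so the logic can be exercised offline
    Raises ConnectionError if no address is reachable.
    """
    errors = []
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:
            # Remember why this broker failed and move on to the next one.
            errors.append((address, exc))
    raise ConnectionError(f"no broker reachable: {errors}")


def fake_connect(address, timeout):
    # Pretend the broker in site A is unreachable (A-C link down).
    if address[0] == "rabbitmq.site-a.example":
        raise OSError("unreachable")
    return f"connection to {address[0]}"


brokers = [("rabbitmq.site-a.example", 5672),
           ("rabbitmq.site-b.example", 5672)]
address, conn = connect_first_available(brokers, connect=fake_connect)
print(address[0])  # rabbitmq.site-b.example
```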

What do you think?

Thanks,

On Saturday, August 30, 2014 at 0:01:57 UTC+9, Kyle Anderson wrote:

···

Sounds too complicated for my tastes, but it could work. I think there
are some serious race conditions in there.
(It sounds like this special pruning handler would run on the
keepalive and have to activate at the warning threshold? Otherwise, if
it activates at the critical threshold, alerts would already have been
sent.)
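
To make the threshold concern concrete, here is a small Python sketch of a two-level keepalive check. The 120-second warning and 180-second critical values are assumptions chosen for illustration (check your own Sensu configuration), and the function names are hypothetical. The point is only that pruning has to happen in the window between the two thresholds, before anything pages.

```python
# Hypothetical model of keepalive thresholds; the numbers are assumptions.
WARNING_SECONDS = 120
CRITICAL_SECONDS = 180


def keepalive_status(seconds_since_last_keepalive):
    """Classify a client by how long it has been silent."""
    if seconds_since_last_keepalive >= CRITICAL_SECONDS:
        return "critical"
    if seconds_since_last_keepalive >= WARNING_SECONDS:
        return "warning"
    return "ok"


def action_for(status):
    """Prune at warning so nothing pages before the handler has run."""
    actions = {
        "ok": "none",
        "warning": "run pruning handler",
        "critical": "page operators",
    }
    return actions[status]


for silence in (30, 150, 200):
    status = keepalive_status(silence)
    print(silence, status, action_for(status))
```

With these numbers the pruning handler gets a 60-second window, which is where the race lives: if the client has not shown up on the other server by then, the alert goes out anyway.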

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it).
C may also alert if the network issue is not more widespread. Two
alerts aren't much of a flood, considering they give you the benefit
of simple-to-understand monitoring.

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged, if you ask me. Distributed
systems are hard, split brains suck, Redis doesn't tolerate them well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

On Sun, Aug 31, 2014 at 6:04 PM, Mitsutoshi Aoe <maoe@foldr.in> wrote:

···

> (It sounds like this special pruning handler would happen on the
> keepalive and have to activate at the warning threshold?

I think that’s right.

> Why not just make the sensu servers in D and C ping the A and B vpn endpoints?

Because we’d like to keep monitoring the individual servers in C,
rather than just the endpoints, even if the A-C link is down. Maybe I
should have emphasized earlier that the servers in C and D are
really important to us. They must keep running even if site A or B
is completely down, and we need to take action immediately if one
of the servers is failing.

I guess I’ll give it a try.

Thanks,

On Monday, September 1, 2014 at 10:59:13 UTC+9, Kyle Anderson wrote:

···

Sounds too complicated for my tastes, but it could work. I think there
is some serious race conditions in there.
(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold? Otherwise if
it activates on the crit threshold, alerts would have already been
sent?)

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it)
C may alert also if the network issues is not more widespread, the two
alerts isn’t that much of a flood considering it gives you the benefit
of simple to understand monitoring?

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged if you ask me. Distributed
systems are hard, split brains suck, redis doesn’t tolerate it well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

On Sun, Aug 31, 2014 at 6:04 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

  1. No alert floods in case of a sensu server or VPN failure
  2. Highly available monitoring for C and D which is robust even if
    a sensu server or a VPN link is down

The fixed topology like { C => A, D => B } is sufficient for 1 but
that doesn’t help the item 2 because if the A<->C link is down
we can’t tell if the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.

  • Set up a load balancer in C
    • We can use LVS for example
    • The real servers are the RabbitMQ instances in A and B
    • The scheduling algorithm should probably be sh (source hash)
    • We can set up two balancers with VRRP for availability
  • All clients in C post the results to the virtual IP provided by the load
    balancers.
  • If the A-C link is down, LVS removes the RabbitMQ instances in A
    from the list of real servers. Therefore all results from C will goes to
    the RabbitMQ in B.
  • After keepalive threshold time has passed, the sensu server in A
    realizes it hasn’t received anything from the servers in C, then it runs
    keepalive handlers.

I think we could make a new handler for the last step. The handler
searches for the stale client in the sensu server in B through sensu
API. If the client exists in the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s actually outage. The sensu
server in A sends an alert.

This might sound complicated but the implementation shouldn’t be
that difficult. It would be great if sensu-client could have a list of
RabbitMQ addresses and try next one automatically if failed.
This way we wouldn’t need to set up the load balancers at all.

What do you think?

Thanks,

2014年8月30日土曜日 0時01分57秒 UTC+9 Kyle Anderson:

I still recommend local sensu clusters in A and B (I recommend
building for 3 in a quorum) so that no one sensu server makes an
outage.

I think the negative check dependency is… a strange artifact and will
lead to confusion. I would KISS.

You can still achieve what you want by having a simple topology (c=>a,
d=>b, single registration) but just have A and B just ping the vpn
endpoints of C and D. (and make D a dependency of C’s ping, so you
only get one alert)
But having C and D clients register with A and B as a cluster will be
pain. Any network instability will lead to massive split brains.

On Thu, Aug 28, 2014 at 9:25 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Hi Kyle,

Thank you for the reply.

I was thinking of a cluster across A/B to monitor C and D because
the servers in C and D are important to us. We’d like to keep monitoring
even if one of the sensu servers is down.

Probably it would be great if the check dependencies in sensu would
accept negative dependencies, namely “Handle this check only when
that check is failing”.

This way the sensu clients in C and D would just need to post the
results to whichever sensu server in A or B is alive, and we won’t get a
flood of alerts.

Thanks,

On Thursday, August 28, 2014 11:35:31 PM UTC+9, Kyle Anderson wrote:

This is one of the most common questions on the list; I think I’ll do
a PR on the docs to try to help answer these questions better.
(Obviously there is a lot going on, there are many different possible
configurations, and taste / philosophy, etc)

I personally wouldn’t cluster across A/B. A and B sound like
Datacenters (regions) and I would treat them in isolation.
That means keeping the A and B clusters distinct: not federated, no
redis sharing. Just make a local A cluster and a local B cluster. (use
Uchiwa to get the single pane of glass feel)

I would make C and D post to just one of the A or B clusters, not
shared. Then use Sensu’s Dependency feature and make the dependencies
for all sensu checks on any server on C / D dependent on the vpn link
so you don’t get an alert flood. (this will require some tuning)

You have to balance “simple” with “robust”, which are the same most of
the time, but you are also trying to not get alert floods.
One big mega sensu federated cluster is “simple”, but not robust
(imho). Isolated A and B clusters, with C and D each connecting to one of
them (maybe C => A, D => B?), is probably the most robust way to do
it, and has the easiest failure modes given your above constraints.
Use dependencies to avoid the alert flood
(http://sensuapp.org/docs/latest/checks#check-dependencies)
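
As a rough sketch of what that could look like, the snippet below adds a VPN-link dependency to a check definition and prints the resulting JSON. It assumes the `dependencies` attribute takes a list of check names, optionally written as `client/check`, as described on the check-dependencies page linked above. The client name `vpn-gw-a`, the check names, and the command are hypothetical.

```python
import json


def with_vpn_dependency(check: dict, vpn_check: str) -> dict:
    """Return a copy of a check definition that depends on the VPN link check."""
    updated = dict(check)
    dependencies = list(check.get("dependencies", []))
    if vpn_check not in dependencies:
        dependencies.append(vpn_check)
    updated["dependencies"] = dependencies
    return updated


# Hypothetical standalone check running on a machine in site C.
check_disk = {"command": "check-disk.rb", "interval": 60, "standalone": True}

# Hypothetical "client/check" name for the A-C VPN link check run from site A.
config = {"checks": {"check_disk": with_vpn_dependency(check_disk, "vpn-gw-a/check_vpn_to_c")}}

print(json.dumps(config, indent=2))
```

With something like this in place, events for `check_disk` would be suppressed while the VPN check itself is failing, which is where the tuning mentioned above comes in.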

On Wed, Aug 27, 2014 at 9:19 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Hi all,

I’m designing a monitoring system with sensu in a distributed setup.

Suppose we have 4 sites and each site may or may not be located in
different countries.

  • Site A and B
    • Lots of servers for each site
    • We can set up Sensu servers in both sites
  • Site C and D
    • A few servers for each site
    • We can’t have a Sensu server here
    • No direct internet access
  • Every site is connected to each other via VPN but there is no link
    between C and D

Now I think we’d like to

  • Have separate sensu servers in A and B to monitor site-local
    clients
    • Because we don’t want the sensu server in A to monitor the
      clients in B and vice versa
      • We don’t want a flood of alerts when the A-B link is down
  • Each server monitors the other server
    • If the A-B link fails, we’ll get just two alerts from each
      server.
  • Probably use standalone checks in site C and D
    • Because the machines in C and D are likely to have different
      monitoring configurations, and the number of machines is a few.
      Also bandwidth is precious here.
    • I guess we need a RabbitMQ cluster (and Redis) between A and B,
      and make all clients in C and D post the results to the cluster.
      • (I guess) we can use RabbitMQ federation and Redis
        replication between A and B.
      • The clients in C and D are posting to a virtual address
        with VRRP.
        • We can use Keepalived here.

This way we’ll need at least two RabbitMQ with separate namespaces
(/sensu and /sensu-shared?), two Redis with replication, and four
sensu servers in A and B. Am I misunderstanding something?

The problem here is that, as you can see, it’s quite complicated.
Could we make it simpler?

Thanks,


On Monday, September 1, 2014 7:19:05 AM UTC+5, Mitsutoshi Aoe wrote:

(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold?

I think that’s right.

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?

Because we’d like to keep monitoring the individual servers in C
rather than just the endpoints even if the A-C link is down. Maybe I
should have emphasized earlier that the servers in C and D are
really important to us. They must keep running even if one of
the sites A or B is completely down, and we need to take action
immediately if one of the servers is failing.

I guess I’ll give it a try.

Thanks,

On Monday, September 1, 2014 10:59:13 AM UTC+9, Kyle Anderson wrote:

Sounds too complicated for my tastes, but it could work. I think there
are some serious race conditions in there.
(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold? Otherwise if
it activates on the crit threshold, alerts would have already been
sent?)

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it)
C may also alert if the network issue is not more widespread, but two
alerts aren’t much of a flood considering they give you the benefit
of simple-to-understand monitoring.

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged if you ask me. Distributed
systems are hard, split brains suck, redis doesn’t tolerate it well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

On Sun, Aug 31, 2014 at 6:04 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

  1. No alert floods in case of a sensu server or VPN failure
  2. Highly available monitoring for C and D which is robust even if
    a sensu server or a VPN link is down

A fixed topology like { C => A, D => B } is sufficient for 1 but
that doesn’t help with item 2, because if the A<->C link is down
we can’t tell if the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.

  • Set up a load balancer in C
    • We can use LVS for example
    • The real servers are the RabbitMQ instances in A and B
    • The scheduling algorithm should probably be sh (source hash)
    • We can set up two balancers with VRRP for availability
  • All clients in C post the results to the virtual IP provided by the load
    balancers.
  • If the A-C link is down, LVS removes the RabbitMQ instances in A
    from the list of real servers. Therefore all results from C will go to
    the RabbitMQ in B.
  • After the keepalive threshold time has passed, the sensu server in A
    realizes it hasn’t received anything from the servers in C, then it runs
    keepalive handlers.
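
To make the failover step concrete, here is a toy model of source-hash selection restricted to the real servers that pass health checks. It is not how LVS implements its sh scheduler internally; it only illustrates the behaviour described above, and the server names are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def pick_real_server(source_ip: str, health: Dict[str, bool]) -> Optional[str]:
    """Pick a healthy real server deterministically from the client's source IP."""
    healthy = sorted(name for name, is_up in health.items() if is_up)
    if not healthy:
        return None
    # Hash the source address so the same client keeps hitting the same
    # broker for as long as the set of healthy servers does not change.
    digest = hashlib.sha256(source_ip.encode("utf-8")).digest()
    return healthy[int.from_bytes(digest[:8], "big") % len(healthy)]


if __name__ == "__main__":
    # Both links up: the choice depends only on the source address.
    print(pick_real_server("10.3.0.11", {"rabbitmq-a": True, "rabbitmq-b": True}))
    # A-C link down: every client in C ends up on the broker in B.
    print(pick_real_server("10.3.0.11", {"rabbitmq-a": False, "rabbitmq-b": True}))
```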

I think we could make a new handler for the last step. The handler
searches for the stale client in the sensu server in B through the sensu
API. If the client exists in the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s an actual outage, and the sensu
server in A sends an alert.

This might sound complicated but the implementation shouldn’t be
that difficult. It would be great if sensu-client could have a list of
RabbitMQ addresses and automatically try the next one if a connection fails.
This way we wouldn’t need to set up the load balancers at all.
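
A minimal sketch of that wished-for client behaviour, assuming a plain TCP connection attempt is a good enough reachability test, could look like this. It is not something sensu-client does as described in this thread; the function names and broker hostnames are hypothetical.

```python
import socket
from typing import Callable, Iterable, Tuple

Address = Tuple[str, int]


def tcp_connect(address: Address, timeout: float = 3.0) -> socket.socket:
    """Open a TCP connection to one broker address."""
    return socket.create_connection(address, timeout=timeout)


def connect_to_first_available(
    addresses: Iterable[Address],
    connect: Callable[[Address], object] = tcp_connect,
):
    """Try each broker address in order and return the first that connects."""
    errors = []
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as exc:
            # Remember the failure and fall through to the next address.
            errors.append((address, exc))
    raise ConnectionError(f"no broker reachable: {errors}")


if __name__ == "__main__":
    # Hypothetical broker addresses in sites A and B.
    brokers = [("rabbitmq-a.example.internal", 5672), ("rabbitmq-b.example.internal", 5672)]
    try:
        address, connection = connect_to_first_available(brokers)
        print("connected to", address)
        connection.close()
    except ConnectionError as exc:
        print(exc)
```

The `connect` parameter is injectable so the ordering logic can be exercised without any network access.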

What do you think?

Thanks,
