Distributed sensu monitoring setup

Mitsutoshi_Aoe · August 28, 2014, 4:19am

Hi all,

I’m designing a monitoring system with sensu in a distributed setup.

Suppose we have 4 sites, and each site may or may not be located in a
different country.

* Site A and B
    - Lots of servers at each site
    - We can set up Sensu servers at both sites
* Site C and D
    - A few servers at each site
    - We can’t have a Sensu server here
    - No direct internet access
* Every site is connected to the others via VPN, but there is no link between
  C and D

Now I think we’d like to

* Have separate sensu servers in A and B to monitor site-local clients
    - Because we don’t want the sensu server in A to monitor the clients in
      B and vice versa
        + We don’t want a flood of alerts when the A-B link is down
        + Each server monitors the other server
            * If the A-B link fails, we’ll get just two alerts from each server.
* Probably use standalone checks in site C and D
    - Because the machines in C and D are likely to have different monitoring
      configurations, and there are only a few machines. Also bandwidth is
      precious here.
    - I guess we need a RabbitMQ cluster (and Redis) between A and B, and
      make all clients in C and D post their results to the cluster.
        + (I guess) we can use RabbitMQ federation and Redis replication
          between A and B.
        + The clients in C and D post to a virtual address with VRRP.
            * We can use Keepalived here.

This way we’ll need at least two RabbitMQ with separate namespaces
(/sensu and /sensu-shared?), two Redis with replication, and four sensu
servers in A and B. Am I misunderstanding something?

The problem here is that, as you can see, it’s quite complicated.
Could we make it simpler?

Thanks,

Kyle_Anderson · August 28, 2014, 2:35pm

This is one of the most common questions on the list. I think I'll do
a PR on the docs to try to help answer these questions better.
(Obviously there is a lot going on: there are many different possible
configurations, and a lot comes down to taste / philosophy, etc.)

I personally wouldn't cluster across A/B. A and B sound like
datacenters (regions) and I would treat them in isolation.
That means: keep the A and B clusters distinct, not federated, with no
redis sharing. Just make a local A cluster and a local B cluster. (Use
Uchiwa to get the single-pane-of-glass feel.)

I would make C and D post to just one of the A or B clusters, not
shared. Then use Sensu's dependency feature and make every sensu check
on any server in C / D depend on the vpn link check, so you don't get
an alert flood. (This will require some tuning.)

You have to balance "simple" with "robust", which are the same most of
the time, but you are also trying to avoid alert floods.
One big mega federated sensu cluster is "simple", but not robust
(imho). Isolated A and B clusters, with C and D each connecting to *one* of
them (maybe C => A, D => B?), is probably the most robust way to do
it, and has the easiest failure modes given your constraints above.
Use dependencies to avoid the alert flood
(http://sensuapp.org/docs/latest/checks#check-dependencies)
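
A minimal Python sketch of the dependency idea, for illustration only. It assumes each check in C / D lists the VPN link check as a dependency (written as `check` or `client/check`), and that alerting is skipped while that dependency has an open event. The names `gateway-a`, `vpn-link-c`, `c-web-01` and `should_alert` are made up for this example; the actual behaviour is whatever the Sensu docs linked above describe.

```python
from typing import Iterable, Set, Tuple

# An open event is identified by (client name, check name).
EventKey = Tuple[str, str]


def parse_dependency(dep: str, default_client: str) -> EventKey:
    """Turn "check" or "client/check" into a (client, check) pair."""
    if "/" in dep:
        client, check = dep.split("/", 1)
        return (client, check)
    # A bare check name refers to a check on the same client.
    return (default_client, dep)


def should_alert(client: str, dependencies: Iterable[str],
                 open_events: Set[EventKey]) -> bool:
    """Return False when any dependency of the check already has an open event."""
    for dep in dependencies:
        if parse_dependency(dep, client) in open_events:
            return False
    return True


# Hypothetical setup: every check in site C depends on the VPN link check
# that an (assumed) client "gateway-a" runs against C's VPN endpoint.
deps = ["gateway-a/vpn-link-c"]
link_down = {("gateway-a", "vpn-link-c")}

print(should_alert("c-web-01", deps, link_down))  # link check is failing
print(should_alert("c-web-01", deps, set()))      # link check is healthy
```

With this shape, a VPN outage produces one alert for the link check instead of one alert per check per server in C.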

Mitsutoshi_Aoe · August 29, 2014, 4:25am

Hi Kyle,

Thank you for the reply.

I was thinking of a cluster across A/B to monitor C and D because
the servers in C and D are important to us. We’d like to keep monitoring
even if one of the sensu servers is down.

It would probably be great if the check dependencies in sensu
accepted negative dependencies, namely “Handle this check only when
that check is failing”.

This way the sensu clients in C and D would just need to post their results
to whichever sensu server in A or B is alive, and we wouldn’t get a
flood of alerts.

Thanks,

Kyle_Anderson · August 29, 2014, 3:01pm

I still recommend local sensu clusters in A and B (I recommend
building for 3 in a quorum) so that no single sensu server causes an
outage.

I think the negative check dependency is... a strange artifact and will
lead to confusion. I would KISS.

You can still achieve what you want with a simple topology (c=>a,
d=>b, single registration) and just have A and B ping the vpn
endpoints of C and D. (and make D a dependency of C's ping, so you
only get one alert)
But having C and D clients register with A and B as a cluster will be
a pain. Any network instability will lead to massive split brains.
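
As a rough illustration of the "ping the vpn endpoints" check, here is a small Python sketch. It uses a TCP connect as a stand-in for ICMP ping (which usually needs extra privileges) and maps the result to the exit codes Sensu checks use (0 OK, 2 CRITICAL). The host name and the function names are assumptions made up for this example.

```python
import socket

# Exit codes that Sensu (like Nagios) interprets as check status.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_result(name: str, reachable: bool):
    """Map reachability to an (exit code, output line) pair."""
    if reachable:
        return OK, f"OK: {name} is reachable"
    return CRITICAL, f"CRITICAL: {name} is unreachable"


# Example with a made-up result. A real check script would call
# endpoint_reachable(), print the line and exit with the code.
code, line = check_result("vpn-c.example.internal", reachable=False)
print(code, line)
```

The checks for servers behind that endpoint would then list this check as a dependency.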

Mitsutoshi_Aoe · September 1, 2014, 1:04am

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

1) No alert floods in case of a sensu server or VPN failure
2) Highly available monitoring for C and D which is robust even if
a sensu server or a VPN link is down

The fixed topology like { C => A, D => B } is sufficient for 1, but
it doesn’t help with item 2, because if the A<->C link is down
we can’t tell if the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.

* Set up a load balancer in C
    - We can use LVS for example
    - The real servers are the RabbitMQ instances in A and B
    - The scheduling algorithm should probably be sh (source hash)
    - We can set up two balancers with VRRP for availability
* All clients in C post their results to the virtual IP provided by the load
  balancers.
* If the A-C link is down, LVS removes the RabbitMQ instances in A
  from the list of real servers. Therefore all results from C will go to
  the RabbitMQ in B.
* After the keepalive threshold time has passed, the sensu server in A
  realizes it hasn’t received anything from the servers in C, then it runs
  the keepalive handlers.

I think we could make a new handler for the last step. The handler
searches for the stale client in the sensu server in B through the sensu
API. If the client exists in the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s an actual outage, and the sensu
server in A sends an alert.
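
The handler logic above could be sketched in Python roughly like this. It is only an illustration: the decision function takes the three actions as callables so it can be tried without a running Sensu, and the two HTTP helpers assume the Sensu API's `/clients/<name>` route on whatever base URL you use. All function names are hypothetical. Treating an unreachable peer API as "not seen" (and therefore alerting) is my assumption about the safer default.

```python
import urllib.error
import urllib.request
from typing import Callable


def client_exists(api_base: str, client: str, timeout: float = 5.0) -> bool:
    """Ask a Sensu API whether it knows the client (assumed GET /clients/<name>)."""
    try:
        with urllib.request.urlopen(f"{api_base}/clients/{client}", timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def delete_client(api_base: str, client: str, timeout: float = 5.0) -> None:
    """Remove the client from a Sensu API (assumed DELETE /clients/<name>)."""
    req = urllib.request.Request(f"{api_base}/clients/{client}", method="DELETE")
    with urllib.request.urlopen(req, timeout=timeout):
        pass


def handle_stale_keepalive(client: str,
                           seen_by_peer: Callable[[str], bool],
                           remove_local: Callable[[str], None],
                           send_alert: Callable[[str], None]) -> str:
    """Prune the client locally if the peer site sees it, otherwise alert."""
    try:
        alive_elsewhere = seen_by_peer(client)
    except OSError:
        # Peer API unreachable: we cannot prove the client is alive, so alert.
        alive_elsewhere = False
    if alive_elsewhere:
        remove_local(client)
        return "pruned"
    send_alert(client)
    return "alerted"


def unreachable_peer(name: str) -> bool:
    raise OSError("peer API unreachable")


# Dry run with fakes instead of real API calls.
peer_clients = {"c-db-01"}
actions = []
result = handle_stale_keepalive(
    "c-db-01",
    seen_by_peer=lambda name: name in peer_clients,
    remove_local=lambda name: actions.append(("delete", name)),
    send_alert=lambda name: actions.append(("alert", name)),
)
print(result, actions)
```

In a real handler, `seen_by_peer` would wrap `client_exists` against the API in B and `remove_local` would wrap `delete_client` against the API in A.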

This might sound complicated, but the implementation shouldn’t be
that difficult. It would be great if sensu-client could take a list of
RabbitMQ addresses and automatically try the next one if a connection fails.
This way we wouldn’t need to set up the load balancers at all.
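
To make the "list of RabbitMQ addresses" idea concrete, here is a small Python sketch of ordered failover. This is not something sensu-client does in this thread; it only shows the shape of the behaviour. It opens a plain TCP connection (a real client would go on to do the AMQP handshake), and the host names are made up.

```python
import socket
from typing import Callable, Sequence, Tuple

Address = Tuple[str, int]


def tcp_connect(addr: Address, timeout: float = 3.0) -> socket.socket:
    """Open a plain TCP connection to the broker address."""
    return socket.create_connection(addr, timeout=timeout)


def connect_first_available(addresses: Sequence[Address],
                            connect: Callable[[Address], object] = tcp_connect):
    """Try each address in order; return (address, connection) for the first success."""
    last_error = None
    for addr in addresses:
        try:
            return addr, connect(addr)
        except OSError as err:
            # Remember the failure and move on to the next broker.
            last_error = err
    raise ConnectionError(f"no broker reachable: {last_error}")


def fake_connect(addr: Address) -> str:
    """Stand-in for a real connection: pretend the A-C link is down."""
    if addr[0] == "rabbitmq.site-a.example":
        raise OSError("A-C link is down")
    return f"connection to {addr[0]}"


brokers = [("rabbitmq.site-a.example", 5672), ("rabbitmq.site-b.example", 5672)]
addr, conn = connect_first_available(brokers, connect=fake_connect)
print(addr[0])
```

Listing A first keeps the C => A preference while the link is up and only falls back to B when connecting to A fails.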

What do you think?

Thanks,

Kyle_Anderson · September 1, 2014, 1:59am

Sounds too complicated for my tastes, but it could work. I think there
are some serious race conditions in there.
(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold? Otherwise, if
it activates at the crit threshold, alerts would have already been
sent?)
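
The timing concern can be illustrated with a short Python sketch. It assumes keepalive thresholds of 120 seconds (warning) and 180 seconds (critical); treat those numbers as an assumption and check your own client configuration. The function names are made up.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def keepalive_status(seconds_since_last: float,
                     warning: float = 120, critical: float = 180) -> int:
    """Classify a client by the age of its last keepalive."""
    if seconds_since_last >= critical:
        return CRITICAL
    if seconds_since_last >= warning:
        return WARNING
    return OK


def prune_window(warning: float = 120, critical: float = 180) -> float:
    """Seconds the pruning handler has between the warning and critical events."""
    return critical - warning


print(keepalive_status(130))  # past warning, not yet critical
print(prune_window())         # time left to prune before the critical alert
```

If the pruning handler only runs at the critical threshold, the alert and the pruning race each other; running it at warning leaves the window above to finish first.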

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it).
C may also alert if the network issue is not more widespread. Two
alerts aren't much of a flood, considering it gives you the benefit
of simple-to-understand monitoring.

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged if you ask me. Distributed
systems are hard, split brains suck, redis doesn't tolerate it well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

Mitsutoshi_Aoe · September 1, 2014, 2:19am

> (It sounds like this special pruning handler would happen on the
> keepalive and have to activate at the warning threshold?

I think that’s right.

> Why not just make the sensu servers in D and C ping the A and B vpn endpoints?

Because we’d like to keep monitoring the individual servers in C,
rather than just the endpoints, even if the A-C link is down. Maybe I
should have emphasized earlier that the servers in C and D are
really important to us. They must keep running even if one of
sites A or B is completely down, and we need to take action
immediately if one of the servers is failing.

I guess I’ll give it a try.

Thanks,

2014年9月1日月曜日 10時59分13秒 UTC+9 Kyle Anderson:

···

Sounds too complicated for my tastes, but it could work. I think there
is some serious race conditions in there.
(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold? Otherwise if
it activates on the crit threshold, alerts would have already been
sent?)

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it)
C may alert also if the network issues is not more widespread, the two
alerts isn’t that much of a flood considering it gives you the benefit
of simple to understand monitoring?

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged if you ask me. Distributed
systems are hard, split brains suck, redis doesn’t tolerate it well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

On Sun, Aug 31, 2014 at 6:04 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

No alert floods in case of a sensu server or VPN failure

Highly available monitoring for C and D which is robust even if
a sensu server or a VPN link is down

The fixed topology like { C => A, D => B } is sufficient for 1 but
that doesn’t help the item 2 because if the A<->C link is down
we can’t tell if the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.

Set up a load balancer in C

We can use LVS for example

The real servers are the RabbitMQ instances in A and B

The scheduling algorithm should probably be sh (source hash)

We can set up two balancers with VRRP for availability

All clients in C post the results to the virtual IP provided by the load
balancers.

If the A-C link is down, LVS removes the RabbitMQ instances in A
from the list of real servers. Therefore all results from C will goes to
the RabbitMQ in B.

After keepalive threshold time has passed, the sensu server in A
realizes it hasn’t received anything from the servers in C, then it runs
keepalive handlers.

I think we could make a new handler for the last step. The handler
searches for the stale client in the sensu server in B through sensu
API. If the client exists in the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s actually outage. The sensu
server in A sends an alert.

This might sound complicated but the implementation shouldn’t be
that difficult. It would be great if sensu-client could have a list of
RabbitMQ addresses and try next one automatically if failed.
This way we wouldn’t need to set up the load balancers at all.

What do you think?

Thanks,

2014年8月30日土曜日 0時01分57秒 UTC+9 Kyle Anderson:

I still recommend local sensu clusters in A and B (I recommend
building for 3 in a quorum) so that no one sensu server makes an
outage.

I think the negative check dependency is… a strange artifact and will
lead to confusion. I would KISS.

You can still achieve what you want by having a simple topology (c=>a,
d=>b, single registration) but just have A and B just ping the vpn
endpoints of C and D. (and make D a dependency of C’s ping, so you
only get one alert)
But having C and D clients register with A and B as a cluster will be
pain. Any network instability will lead to massive split brains.

On Thu, Aug 28, 2014 at 9:25 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Hi Kyle,

Thank you for the reply.

I was thinking of a cluster across A/B to monitor C and D because
the servers in C and D are important to us. We’d like to keep monitoring
even if one of the sensu servers is down.

It would probably be great if the check dependencies in sensu
accepted negative dependencies, namely “Handle this check only when
that check is failing”.

This way the sensu clients in C and D would just need to post the
results to whichever sensu server in A or B is alive, and we wouldn’t
get a flood of alerts.

Thanks,

On Thursday, August 28, 2014 23:35:31 UTC+9, Kyle Anderson wrote:

This is one of the most common questions on the list, I think I’ll do
a PR on the docs to try to help answer these questions better.
(Obviously there is a lot going on, there are many different possible
configurations, and taste / philosophy, etc)

I personally wouldn’t cluster across A/B. A and B sound like
Datacenters (regions) and I would treat them in isolation.
That means no: keep the A and B clusters distinct, not federated, with no
redis sharing. Just make a local A cluster and a local B cluster. (use
Uchiwa to get the single pane of glass feel)

I would make C and D post to just one of the A or B clusters, not
shared. Then use Sensu’s Dependency feature and make the dependencies
for all sensu checks on any server on C / D dependent on the vpn link
so you don’t get an alert flood. (this will require some tuning)

You have to balance “simple” with “robust”, which are the same most of
the time, but you are also trying to not get alert floods.
One big mega sensu federated cluster is “simple”, but not robust
(imho). Isolated A and B clusters with C and D each connecting to one of
them (maybe C => A, D => B?) is probably the most robust way to do
it, and has the easiest failure modes given your above constraints.
Use dependencies to avoid the alert flood
(http://sensuapp.org/docs/latest/checks#check-dependencies)
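As a hedged illustration of the dependency idea, this builds a standalone check definition whose "dependencies" attribute points at a VPN ping check. The check names, the plugin command and the "client/check" form of the dependency are assumptions. Dependencies are also assumed to be honoured by the handler side, so it is worth confirming against the docs linked above.

```python
import json


def dependent_check(command, vpn_check, interval=60):
    """Build a standalone check that is suppressed while vpn_check has an open event."""
    return {
        "command": command,
        "interval": interval,
        "standalone": True,
        "handlers": ["default"],
        # Hypothetical dependency name, written as "client/check".
        "dependencies": [vpn_check],
    }


config = {
    "checks": {
        "check_disk": dependent_check("check-disk.rb", "sensu-a/check_vpn_c"),
    }
}

# Emit the JSON that would be dropped into the client's conf.d directory.
print(json.dumps(config, indent=2))
```

Every check on a server in C would carry the same dependency, so a VPN failure raises one alert for the link rather than one per check.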


Liza_Jenifer · October 15, 2014, 7:17pm

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies.

On Monday, September 1, 2014 7:19:05 AM UTC+5, Mitsutoshi Aoe wrote:

(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold?

I think that’s right.

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?

Because we’d like to keep monitoring the individual servers in C
rather than just the endpoints even if the A-C link is down. Maybe I
should have emphasized earlier that the servers in C and D are
really important to us. They must keep running even if either
site A or B is completely down, and we need to take action
immediately if one of the servers is failing.

I guess I’ll give it a try.

Thanks,

On Monday, September 1, 2014 10:59:13 UTC+9, Kyle Anderson wrote:

Sounds too complicated for my tastes, but it could work. I think there
are some serious race conditions in there.
(It sounds like this special pruning handler would happen on the
keepalive and have to activate at the warning threshold? Otherwise if
it activates on the crit threshold, alerts would have already been
sent?)
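To make the threshold question concrete, a small sketch of how a keepalive's age maps to a status. The 120 and 180 second values are assumed to be Sensu's usual defaults (they can be overridden per client), and the "at or past the threshold" comparison is an assumption as well.

```python
import time

# Assumed keepalive thresholds, in seconds.
WARNING_AFTER = 120
CRITICAL_AFTER = 180


def keepalive_status(last_seen, now=None, warning=WARNING_AFTER, critical=CRITICAL_AFTER):
    """Return 0 (ok), 1 (warning) or 2 (critical) from the age of the last keepalive."""
    now = time.time() if now is None else now
    age = now - last_seen
    if age >= critical:
        return 2
    if age >= warning:
        return 1
    return 0


def should_try_pruning(status):
    """Run the pruning step at warning, before critical alerts would go out."""
    return status == 1
```

With these numbers the pruning handler has roughly a 60 second window between warning and critical, which is where the race condition concern comes from.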

Why not just make the sensu servers in D and C ping the A and B vpn endpoints?
If the A-C link goes down, D will alert because it is watching the
endpoint (not because clients were registered with it)
C may also alert if the network issue is not more widespread. Two
alerts aren’t that much of a flood, considering it gives you the benefit
of simple-to-understand monitoring.

Sensu is great in that it is very flexible, but setting up complicated
cluster topologies should be discouraged if you ask me. Distributed
systems are hard, split brains suck, redis doesn’t tolerate it well,
and nobody likes alert floods.

Go forth and pick a path. Maybe blog about it for others to learn from
your experience?

On Sun, Aug 31, 2014 at 6:04 PM, Mitsutoshi Aoe ma...@foldr.in wrote:

Having local clusters in A and B makes sense.

Regarding the negative check dependencies, yes, I have to admit
that they are confusing.

I’d like to accomplish two things here:

1. No alert floods in case of a sensu server or VPN failure

2. Highly available monitoring for C and D which is robust even if
a sensu server or a VPN link is down

A fixed topology like { C => A, D => B } is sufficient for item 1, but
it doesn’t help with item 2 because if the A<->C link is down
we can’t tell whether the servers in C are working fine.

I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.

Set up a load balancer in C

We can use LVS for example

The real servers are the RabbitMQ instances in A and B

The scheduling algorithm should probably be sh (source hash)

We can set up two balancers with VRRP for availability

All clients in C post the results to the virtual IP provided by the load
balancers.

If the A-C link is down, LVS removes the RabbitMQ instances in A
from the list of real servers. Therefore all results from C will go to
the RabbitMQ in B.
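The balancing behaviour described above can be sketched as a toy model. This is not LVS's actual sh implementation: the SHA-256 hash, the server names and the health-check stand-in are all assumptions made for illustration.

```python
import hashlib
import ipaddress

# Hypothetical real servers: two RabbitMQ instances per site.
REAL_SERVERS = ["rabbitmq-a1", "rabbitmq-a2", "rabbitmq-b1", "rabbitmq-b2"]


def healthy_servers(servers, link_to_a_up):
    """Stand-in for the balancer's health checks: drop A's instances when A-C is down."""
    if link_to_a_up:
        return list(servers)
    return [s for s in servers if not s.startswith("rabbitmq-a")]


def pick_real_server(client_ip, healthy):
    """Map a client IP to one healthy real server by hashing the source address."""
    if not healthy:
        raise RuntimeError("no healthy real servers")
    digest = hashlib.sha256(ipaddress.ip_address(client_ip).packed).digest()
    return healthy[int.from_bytes(digest[:4], "big") % len(healthy)]
```

The same client IP always lands on the same server while the server list is unchanged, and once A's instances are dropped every client in C ends up on a RabbitMQ in B.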

After the keepalive threshold has passed, the sensu server in A
realizes it hasn’t received anything from the servers in C and runs its
keepalive handlers.

I think we could make a new handler for the last step. The handler
searches for the stale client on the sensu server in B through the sensu
API. If the client exists on the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s an actual outage, and the sensu
server in A sends an alert.

This might sound complicated but the implementation shouldn’t be
that difficult. It would be great if sensu-client could take a list of
RabbitMQ addresses and automatically try the next one if a connection fails.
That way we wouldn’t need to set up the load balancers at all.

What do you think?

Thanks,

