Having local clusters in A and B makes sense.
Regarding the negative check dependencies, yes, I have to admit
that they are confusing.
I’d like to accomplish two things here:
1. No alert floods in case of a sensu server or VPN failure
2. Highly available monitoring for C and D which is robust even if
   a sensu server or a VPN link is down
A fixed topology like { C => A, D => B } is sufficient for item 1, but
it doesn’t help with item 2, because if the A<->C link is down
we can’t tell whether the servers in C are working fine.
I’m now thinking it might be possible to imitate what the ec2_node
handler in sensu-community-plugins does instead of the negative
check dependencies. For simplicity, we’ll focus on C => A here.
- Set up a load balancer in C
  - We can use LVS for example
  - The real servers are the RabbitMQ instances in A and B
  - The scheduling algorithm should probably be sh (source hash)
  - We can set up two balancers with VRRP for availability
- All clients in C post the results to the virtual IP provided by the load
  balancers.
  - If the A-C link is down, LVS removes the RabbitMQ instances in A
    from the list of real servers. Therefore all results from C will go to
    the RabbitMQ in B.
  - After the keepalive threshold has passed, the sensu server in A
    realizes it hasn’t received anything from the servers in C, and it runs
    the keepalive handlers.
I think we could make a new handler for the last step. The handler
searches for the stale client on the sensu server in B through the sensu
API. If the client exists on the sensu server in B, it removes the client
from the sensu server in A. Otherwise it’s an actual outage, and the sensu
server in A sends an alert.
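To make the handler idea a bit more concrete, here is a rough sketch of the decision logic in Python. It is only an illustration, not an existing Sensu handler: the function names are made up, and it assumes the sensu API in B answers `GET /clients/<name>` with 200 for a known client and 404 otherwise. If the API in B cannot be reached at all, the sketch falls back to alerting, since we cannot prove the client is alive.

```python
import urllib.error
import urllib.request


def client_exists(api_base, client_name, opener=urllib.request.urlopen):
    """Return True if the sensu API at api_base knows client_name.

    Assumption: GET <api_base>/clients/<name> returns 200 for a registered
    client and 404 for an unknown one.
    """
    url = f"{api_base}/clients/{client_name}"
    try:
        with opener(url, timeout=5) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other HTTP error is treated as "peer not usable"


def handle_keepalive(client_name, peer_has_client):
    """Decide what the server in A should do about a stale client.

    peer_has_client is a callable(name) -> bool that asks the server in B.
    Returns "remove_local_client" or "alert" (hypothetical action names).
    """
    try:
        seen_on_peer = peer_has_client(client_name)
    except OSError:
        # The API in B is unreachable (URLError is an OSError): alert.
        return "alert"
    return "remove_local_client" if seen_on_peer else "alert"
```

In a real handler, `peer_has_client` would be something like `lambda name: client_exists("http://sensu-b.example:4567", name)` (host and port are placeholders), and the "remove_local_client" branch would then delete the client on A through its own API.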
This might sound complicated, but the implementation shouldn’t be
that difficult. It would be great if sensu-client could take a list of
RabbitMQ addresses and try the next one automatically when a connection
fails. That way we wouldn’t need to set up the load balancers at all.
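The "list of RabbitMQ addresses" behaviour could look roughly like the sketch below. This is not how sensu-client works today, as far as I know; it only shows the try-the-next-broker idea at the TCP level. The function name is made up, and a real client would open an AMQP connection rather than a bare socket.

```python
import socket


def connect_first_available(brokers, connect=socket.create_connection, timeout=3.0):
    """Try each (host, port) in order and return (address, connection)
    for the first broker that accepts a connection.

    Raises ConnectionError if none of the brokers is reachable.
    """
    errors = []
    for address in brokers:
        try:
            return address, connect(address, timeout)
        except OSError as err:
            # Remember the failure and fall through to the next broker.
            errors.append((address, err))
    raise ConnectionError(f"no broker reachable: {errors}")
```

With a list such as `[("rabbitmq.a.example", 5672), ("rabbitmq.b.example", 5672)]` (placeholder host names), clients in C would prefer A and only fall back to B when the A-C link is down.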
What do you think?
Thanks,
On Saturday, August 30, 2014 at 0:01:57 UTC+9, Kyle Anderson wrote:
I still recommend local sensu clusters in A and B (I recommend
building for 3 in a quorum) so that no single sensu server can cause an
outage.
I think the negative check dependency is… a strange artifact and will
lead to confusion. I would KISS.
You can still achieve what you want by having a simple topology (C => A,
D => B, single registration) and just having A and B ping the VPN
endpoints of C and D. (and make D a dependency of C’s ping, so you
only get one alert)
But having C and D clients register with A and B as a cluster will be
a pain. Any network instability will lead to massive split brains.
On Thu, Aug 28, 2014 at 9:25 PM, Mitsutoshi Aoe ma...@foldr.in wrote:
Hi Kyle,
Thank you for the reply.
I was thinking of a cluster across A/B to monitor C and D because
the servers in C and D are important to us. We’d like to keep monitoring
even if one of the sensu servers is down.
Probably it would be great if the check dependencies in sensu would
accept negative dependencies, namely “Handle this check only when
that check is failing”.
This way the sensu clients in C and D would just need to post the results
to whichever sensu server in A or B is alive, and we wouldn’t get a
flood of alerts.
Thanks,
On Thursday, August 28, 2014 at 23:35:31 UTC+9, Kyle Anderson wrote:
This is one of the most common questions on the list, I think I’ll do
a PR on the docs to try to help answer these questions better.
(Obviously there is a lot going on, there are many different possible
configurations, and taste / philosophy, etc)
I personally wouldn’t cluster across A/B. A and B sound like
Datacenters (regions) and I would treat them in isolation.
That means no, have the A and B clusters distinct, not federated, no
redis sharing. Just make a local A cluster and a local B cluster. (use
Uchiwa to get the single pane of glass feel)
I would make C and D post to just one of the A or B clusters, not
shared. Then use Sensu’s Dependency feature and make the dependencies
for all sensu checks on any server on C / D dependent on the vpn link
so you don’t get an alert flood. (this will require some tuning)
You have to balance “simple” with “robust”, which are the same most of
the time, but you are also trying to not get alert floods.
One big mega sensu federated cluster is “simple”, but not robust
(imho). Isolated A and B clusters with C and D connecting to one of
them (maybe C => A, D => B?) is probably the most robust way to do
it, and has the easiest failure modes given your above constraints.
Use dependencies to avoid the alert flood
(http://sensuapp.org/docs/latest/checks#check-dependencies)
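For what it's worth, the effect of a dependency can be pictured with a small Python sketch. This is not Sensu's implementation, just the idea: an event is suppressed when any check it depends on is already failing. The check names and the function are invented for the illustration.

```python
def should_alert(check_name, dependencies, failing):
    """Return True if an alert for check_name should go out.

    dependencies maps a check name to the checks it depends on;
    failing is the set of checks that are currently failing.
    The alert is suppressed when any dependency is itself failing.
    """
    return not any(dep in failing for dep in dependencies.get(check_name, ()))


# Hypothetical setup: every check on site C depends on the A-C VPN check.
dependencies = {"c-disk": ["vpn-a-c"], "c-load": ["vpn-a-c"]}
failing = {"vpn-a-c", "c-disk", "c-load"}

# Only the VPN check itself produces an alert while the link is down.
alerts = sorted(name for name in failing if should_alert(name, dependencies, failing))
```

With the link down, `alerts` contains only the VPN check, which is the "one alert instead of a flood" behaviour. Once the link is back, a failing `c-disk` alerts normally.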
On Wed, Aug 27, 2014 at 9:19 PM, Mitsutoshi Aoe ma...@foldr.in wrote:
Hi all,
I’m designing a monitoring system with sensu in a distributed setup.
Suppose we have 4 sites and each site may or may not be located in
different countries.
- Site A and B
  - Lots of servers for each site
  - We can set up Sensu servers in both sites
- Site C and D
  - A few servers for each site
  - We can’t have a Sensu server here
  - No direct internet access
- Every site is connected to the others via VPN, but there is no link
  between C and D
Now I think we’d like to
- Have separate sensu servers in A and B to monitor site-local clients
  - Because we don’t want the sensu server in A to monitor the clients
    in B and vice versa
- We don’t want a flood of alerts when the A-B link is down
  - Each server monitors the other server
  - If the A-B link fails, we’ll get just two alerts, one from each server.
- Probably use standalone checks in site C and D
  - Because the machines in C and D are likely to have different
    monitoring configurations, and there are only a few machines. Also
    bandwidth is precious here.
- I guess we need a RabbitMQ cluster (and Redis) between A and B,
  and make all clients in C and D post the results to the cluster.
  - (I guess) we can use RabbitMQ federation and Redis replication
    between A and B.
  - The clients in C and D post to a virtual address with VRRP.
  - We can use Keepalived here.
This way we’ll need at least two RabbitMQ with separate namespaces
(/sensu and /sensu-shared?), two Redis with replication, and four sensu
servers in A and B. Am I misunderstanding something?
The problem here is that, as you can see, it’s quite complicated.
Could we make it simpler?
Thanks,