Going back and reading through the logs again, I think I flipped around what was happening between the server and clients in my mind when I wrote the original post… Oops. This is quite a bit different from what I originally posted.
It looks like the server continues to publish checks, but the clients aren’t executing them. On one client during the last incident, there’s a 12-minute (11 minutes, 58 seconds) gap where no checks were executed. The checks stop exactly when the RabbitMQ logs show the node going offline, then resume when the node comes back online. Of course, once they start again there’s a pile of duplicate check requests waiting for the client. There’s no message about RabbitMQ reconnecting on this particular client, so I don’t believe it was connected to the downed node (that node was actually rebooted, so I don’t think it was reusing the same connection).
I can’t go back far enough in the RabbitMQ web interface to see messages from the last time this happened, but I did see messages pile up; perhaps it was simply check requests… sigh. This issue has kept me up a couple of nights, and now I’m not so sure of this.
We do have the heartbeat property set in the rabbitmq connection settings on the client; I was thinking that might help this issue, but it didn’t seem to make a difference. It’s set to 30s, and I see a timeout of 30s in the RabbitMQ interface (it was 580s before adding the heartbeat property).
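For reference, the client’s rabbitmq settings look roughly like this — the heartbeat value is the real one, but the vhost, user, and password below are placeholders rather than our actual values:

    {
      "rabbitmq": {
        "host": "localhost",
        "port": 5672,
        "vhost": "/sensu",
        "user": "sensu",
        "password": "REDACTED",
        "heartbeat": 30
      }
    }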
Now I’m lost
It is possible that Redis was available but keys couldn’t be set for some reason. Does the /health endpoint do anything like setting a key? The /health API endpoint seemed to function just fine, but not /info, /checks, /clients, etc.
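For anyone wanting to poke at the same thing, I was just hitting the API locally with curl (assuming the default Sensu API port of 4567; adjust if yours differs):

    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:4567/health
    curl -s http://localhost:4567/info
    curl -s http://localhost:4567/checks
    curl -s http://localhost:4567/clients

During the outage only the first of these came back; the others just hung until timeout.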
Thanks for the responses so far!
Regarding RabbitMQ questions:
I’m not sure what you mean by the term “mode” in the context of RabbitMQ clustering. I originally deployed this as a single node, then added two more via the usual “rabbitmqctl stop_app; rabbitmqctl join_cluster rabbit@node; rabbitmqctl start_app” in the documentation. I then also replaced the original node with a new one after removing it from the cluster.
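Spelled out, that was roughly the following on each node being added (the node name here is just a placeholder):

    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@node1
    rabbitmqctl start_app
    rabbitmqctl cluster_status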
We do also have a couple of nodes which are federating off this cluster into another datacenter. In order for the federation to fail over properly when a node in the cluster was down (the one which originally defined the federated queue), I had to add a policy to replicate those queues since they are created as durable queues. It looks something like this from the rabbitmqctl list_policies command:
sensu    ha-federation    all    ^federation:    {"ha-mode":"all"}    0
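If it helps, that policy was created with something along these lines (on the sensu vhost):

    rabbitmqctl set_policy -p sensu ha-federation "^federation:" '{"ha-mode":"all"}'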
I verified that no queues besides the federation queues are durable. We also don’t see any errors in the RabbitMQ logs about queues or exchanges being unavailable; the logs simply note that the one node in the cluster went down and that clients reconnect to the surviving RabbitMQ nodes.
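The durable check was done with something like this (vhost name assumed to match the policy above):

    rabbitmqctl -p sensu list_queues name durable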
On Sat, Jun 14, 2014 at 4:21 PM, Kyle Anderson kyle@xkyle.com wrote:
What mode are you clustering them in?
But if you can see the queue count rising, that means the queue is
“working” in some sense. I thought you might have had a strict node
count or something.
But it sounds like the queue is filling up, so clients are adding
events, but sensu-servers are not pulling them off.
The logs for the sensu servers would reveal which was master at the
time, and what it was doing.
It is also possible that Redis was available, but the sensu servers
couldn’t get a master lock?
On Sat, Jun 14, 2014 at 2:15 PM, Wyatt Walter wwalter@sugarcrm.com wrote:
Sure, there’s not a lot there. I configured clustering dynamically.
http://pastebin.com/Xp7grpUi
On Saturday, June 14, 2014 10:10:08 AM UTC-5, Kyle Anderson wrote:
Can you pastebin your rabbitmq config?
On Fri, Jun 13, 2014 at 2:32 PM, Wyatt Walter wwa...@sugarcrm.com wrote:
Hello,
We are currently using Sensu in a 3-node cluster in AWS. Each node runs RabbitMQ and the Sensu API, dashboard, and server services. Every once in a while (it has happened 3 times this week) one of the nodes drops offline. (I believe this is due to something in the underlying hypervisor and am not interested in fixing that yet; what I am interested in is fixing what happens while it’s down…) When this node goes away, the Sensu API and server processes essentially stop doing anything. Any calls to the API service just time out, and no new checks are scheduled. This lasts until the downed node comes back online.
I was fortunate to have it happen while I was actually sitting at my desk today, and poked around during the outage. Redis is available, and RabbitMQ seems fine (except for the one node being offline). I have the RabbitMQ web management plugin enabled, and can log in and see results and keepalives flowing into the cluster from clients, but the messages just stack up. Once the node comes back, the servers get slammed with keepalive events because no keepalive has been processed during the whole time that single node was down.
If I stop the same node cleanly in the cluster, everything keeps flowing just like I would expect. However, it seems like when this node drops off, something about the way Sensu is configured is trying to reach that node. I have the host for RabbitMQ set to ‘localhost’ on each of the servers, so the services should just be connecting locally. I also don’t believe that the node that went down was the master today.
Is there something that causes Sensu to reach out to other nodes in the RabbitMQ cluster that’s not timing out properly?
Running Sensu 0.12.6 with RabbitMQ 3.2.1 and Erlang R14B.
Thanks!