RabbitMQ clustering, node failure and Sensu


#1

Hello,

We are currently using Sensu in a 3-node cluster in AWS. Each node runs RabbitMQ and the Sensu API, dashboard, and server services. Every once in a while (it has happened 3 times this week) one of the nodes drops offline. (I believe this is due to something in the underlying hypervisor and am not interested in fixing that yet; what I am interested in is fixing what happens while it’s down…) When this node goes away, the Sensu API and server processes essentially stop doing anything. Any calls to the API service just time out, and no new checks are scheduled. This lasts until the downed node comes back online.

I was fortunate to have it happen while I was actually sitting at my desk today, and poked around during the outage. Redis is available, and RabbitMQ seems fine (except for the one node being offline). I have the RabbitMQ web management plugin enabled, and can log in and see results and keepalives flowing into the cluster from clients, but the messages just stack up. Once the node comes back, the servers get slammed with keepalive events because no keepalive has been processed the whole time that single node was down.

If I stop the same node cleanly in the cluster, everything keeps flowing just like I would expect. However, when this node drops off uncleanly, it seems that something about the way Sensu is configured keeps trying to reach that node. I have the host for RabbitMQ set to ‘localhost’ on each of the servers, so the services should just be connecting locally. I also don’t believe that the node that went down was the master today.

Is there something that causes Sensu to reach out to other nodes in the RabbitMQ cluster that’s not timing out properly?

Running Sensu 0.12.6 with RabbitMQ 3.2.1 and Erlang R14B.

Thanks!


#2

Can you pastebin your rabbitmq config?

On Fri, Jun 13, 2014 at 2:32 PM, Wyatt Walter <wwalter@sugarcrm.com> wrote:

> Is there something that causes Sensu to reach out to other nodes in the
> RabbitMQ cluster that's not timing out properly?


#3

Sure, there’s not a lot there. I configured clustering dynamically.

http://pastebin.com/Xp7grpUi

On Saturday, June 14, 2014 10:10:08 AM UTC-5, Kyle Anderson wrote:

> Can you pastebin your rabbitmq config?


#4

What mode are you clustering them in?

But if you can see the queue count rising, that means the queue is
"working" in some sense. I thought you might have had a strict node
count or something.

But it *sounds* like the queue is filling up, so clients are adding
events, but sensu-servers are not pulling them off.

The logs for the sensu servers would reveal which was master at the
time, and what it was doing.

It is also possible that Redis was available, but the sensu servers
couldn't get a master lock?

On Sat, Jun 14, 2014 at 2:15 PM, Wyatt Walter <wwalter@sugarcrm.com> wrote:

> Sure, there's not a lot there. I configured clustering dynamically.
>
> http://pastebin.com/Xp7grpUi

#5

Check to see if the Sensu server connections/channels are being blocked.
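One way to check this is through the HTTP API that the RabbitMQ management plugin exposes. The following is a minimal sketch, not a definitive tool: it assumes the management plugin is listening on its default port and that each entry returned by `/api/connections` carries a `state` field. The URL, credentials, and the choice of which states count as throttled are assumptions for illustration.

```python
import base64
import json
import urllib.request

# Assumption: these are the connection states worth flagging.
THROTTLED_STATES = {"blocking", "blocked", "flow"}


def fetch_connections(base_url, user, password, timeout=5):
    """Fetch the connection list from the management plugin's HTTP API."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/connections")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    # A short timeout keeps the check itself from hanging on a dead node.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def throttled_connections(connections):
    """Return (name, user, state) for every connection in a throttled state."""
    return [
        (conn.get("name", "?"), conn.get("user", "?"), conn["state"])
        for conn in connections
        if conn.get("state") in THROTTLED_STATES
    ]
```

Calling `throttled_connections(fetch_connections("http://localhost:15672", "guest", "guest"))` against a surviving node during an outage (the URL and credentials here are placeholders) would list any Sensu connections that the broker is holding back.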

On Jun 14, 2014 2:21 PM, “Kyle Anderson” kyle@xkyle.com wrote:

> But it sounds like the queue is filling up, so clients are adding
> events, but sensu-servers are not pulling them off.


#6

When going back and reading through the logs again, I think that I flipped around what was happening between the server and clients in my mind when I wrote the original post… Oops. This is quite a bit different from what I originally posted.

It looks like the server continues to publish checks, but the clients aren’t executing them. On one client during the last incident, there’s a 12-minute gap (11 minutes, 58 seconds) where no checks are executed. The checks stop exactly when the RabbitMQ logs show that the node went offline, then resume when the node comes back online. Of course, once they start again, there’s a ton of duplicate checks piled up waiting for the client. There’s no message about RabbitMQ reconnecting on this specific client, so I don’t believe this client was connected to the down node (that node was actually rebooted, so I don’t think it was reusing the same connection).
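A gap like this can be located mechanically rather than by eye. The sketch below is illustrative only: it assumes the client log is one JSON object per line with a `timestamp` field formatted like `2014-06-13T14:20:00.123456-0500`, and the function names and the two-minute threshold are hypothetical choices, not anything Sensu provides.

```python
import json
from datetime import datetime, timedelta

# Assumption: timestamps look like 2014-06-13T14:20:00.123456-0500.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_timestamps(lines):
    """Extract timestamps from JSON log lines, skipping lines that do not parse."""
    stamps = []
    for line in lines:
        try:
            record = json.loads(line)
            stamps.append(datetime.strptime(record["timestamp"], TIMESTAMP_FORMAT))
        except (ValueError, KeyError, TypeError):
            continue
    return stamps


def find_gaps(timestamps, threshold=timedelta(minutes=2)):
    """Return (start, end, duration) for consecutive entries further apart than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta > threshold:
            gaps.append((earlier, later, delta))
    return gaps
```

Running `find_gaps(parse_timestamps(open(path)))` over each client's log, and comparing the gap boundaries with the node-down and node-up times in the RabbitMQ logs, would show whether every client stalls for the same window.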

I can’t go back far enough in the RabbitMQ web interface to see messages from the last time this happened, but I did see messages pile up; perhaps they were simply check requests… sigh. This issue has kept me up a couple of nights and now I’m not so sure of this :(

We do have the heartbeat property set in the rabbitmq connection settings in the client. I was thinking that might help with this issue, but it didn’t seem to make a difference. It’s set to 30s, and I see a timeout of 30s in the RabbitMQ interface (it was 580s before adding the heartbeat property).
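For reference, a connection block with a heartbeat might look roughly like the output of the sketch below. Only the `heartbeat` value of 30 and the `localhost` host come from this thread; the port, vhost, user, and password are placeholder assumptions, and the helper function itself is hypothetical.

```python
import json


def rabbitmq_settings(host="localhost", heartbeat=30):
    """Build a Sensu-style RabbitMQ connection block with a heartbeat in seconds."""
    return {
        "rabbitmq": {
            "host": host,            # assumption: 'localhost', as on the servers
            "port": 5672,            # assumption: default AMQP port
            "vhost": "/sensu",       # assumption: placeholder vhost
            "user": "sensu",         # assumption: placeholder credentials
            "password": "secret",
            "heartbeat": heartbeat,  # seconds between AMQP heartbeats
        }
    }


print(json.dumps(rabbitmq_settings(), indent=2))
```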

Now I’m lost :(

It is possible that Redis was available, but keys were unable to be set for some reason. Does the /health endpoint do anything like setting a key? The /health API endpoint seemed to function just fine, but /info, /checks, /clients, etc. did not.

Thanks for the responses so far!

Regarding RabbitMQ questions:

I’m not sure what you mean by the term “mode” in the context of RabbitMQ clustering. I originally deployed this as a single node, then added two more via the usual “rabbitmqctl stop_app; rabbitmqctl join_cluster rabbit@node; rabbitmqctl start_app” in the documentation. I then also replaced the original node with a new one after removing it from the cluster.

We do also have a couple of nodes which are federating off this cluster into another datacenter. In order for the federation to fail over properly when a node in the cluster was down (the one which originally defined the federated queue), I had to add a policy to replicate those queues since they are created as durable queues. It looks something like this from the rabbitmqctl list_policies command:

'sensu ha-federation all ^federation: {"ha-mode":"all"} 0'
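The policy pattern is a regular expression applied to queue and exchange names, so only names beginning with `federation:` are mirrored. A small sketch of that matching, using the pattern from the policy above and purely hypothetical queue names:

```python
import re

# Pattern taken from the policy listed above.
POLICY_PATTERN = r"^federation:"


def mirrored_queues(queue_names, pattern=POLICY_PATTERN):
    """Return the queue names that the policy pattern would match."""
    regex = re.compile(pattern)
    return [name for name in queue_names if regex.search(name)]


# Hypothetical queue names, for illustration only.
queues = ["federation: sensu -> rabbit@dc2", "keepalives", "results", "client-a-1402690000"]
print(mirrored_queues(queues))
```

Under this assumption the keepalive, result, and per-client queues stay unmirrored and live only on the node where they were declared.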

I verified that no queues besides the federation queues are durable. We also don’t see any errors in the RabbitMQ logs regarding queues or exchanges being unavailable, only that it noticed the one node in the cluster went down, and that clients reconnected to the surviving RabbitMQ nodes.

On Sat, Jun 14, 2014 at 4:21 PM, Kyle Anderson kyle@xkyle.com wrote:

> What mode are you clustering them in?