Transport is not connected, false keepalives


#1

Hi Folks,

I have been having issues with keepalives for some time now; every so often we get keepalive storms. We got some keepalive alerts today, but I can't seem to pinpoint the issue. It looks like one of our Sensu servers closed the connection, but I'm not sure why. According to the load balancer logs, the connection was closed by the client.

2-node Sensu cluster (v1.4.3)
3-node RabbitMQ cluster (v3.7.7) behind an AVI load balancer, with SSL terminated on the front end
3-node Redis cluster (redis-sentinel, v3.2.12)
OS of all nodes: RHEL 7.5

Any ideas on what might be causing this issue?

Thanks,
Fearghal


#2

Hey Fearghal,

We observed something similar here, with a slightly different setup. We sometimes have massive hiccups in our RabbitMQ cluster (basically, we lose access to most of the cluster, sometimes with split-brains, data and credentials included :+1: Thankfully it’s a test environment :slight_smile: ). Although the cluster knows how to recover, the Sensu servers and clients end up weirdly connected afterwards: the servers still receive the keepalives from the clients (at least they show up correctly through the API), and the servers keep publishing checks, but the checks never arrive at the clients. So, for us, it looks like the client -> server communication is fine, but server -> client just doesn’t work. We usually end up restarting all the clients and it works again.

Our setup looks like this: we have 3 RabbitMQ nodes in a cluster, to which the servers and the clients connect directly (no load balancer or SSL in the middle). We are running Sensu Core 1.4.2 with 2 servers, and the clients are mostly on the same version. Only 1 Redis here.

So, ultimately, it only happens when things go really bad here.
However, although I haven’t tried it yet, it looks like the Sensu 1.6 RabbitMQ improvements could help in this case?

IMPROVEMENT : Sensu Core 1.6 implements changes in RabbitMQ communication by using two discrete connections to the transport instead of two channels on a single connection, thereby doubling the number of concurrent connections RabbitMQ receives. This change prevents check result processing rates from being impacted by check execution request publishing rates and reduces the possibility of false keepalive alerts under certain conditions.


#3

Have you tried using Sensu’s built-in support for RabbitMQ clusters? We typically recommend avoiding load-balancing middleware. See here for more information: https://docs.sensu.io/sensu-core/latest/reference/rabbitmq/#configure-sensu-to-use-the-rabbitmq-cluster
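For reference, the cluster support described in the linked docs is just an array of connection objects in the `rabbitmq` definition: Sensu attempts each member in order and fails over to the next on disconnect, so no load balancer is needed in between. A minimal sketch (hostnames, credentials, and certificate paths below are placeholders, not values from this thread; the SSL block would only apply if you move TLS termination onto the RabbitMQ nodes themselves):

```json
{
  "rabbitmq": [
    {
      "host": "rabbit1.example.com",
      "port": 5671,
      "vhost": "/sensu",
      "user": "sensu",
      "password": "secret",
      "heartbeat": 30,
      "prefetch": 50,
      "ssl": {
        "cert_chain_file": "/etc/sensu/ssl/cert.pem",
        "private_key_file": "/etc/sensu/ssl/key.pem"
      }
    },
    {
      "host": "rabbit2.example.com",
      "port": 5671,
      "vhost": "/sensu",
      "user": "sensu",
      "password": "secret"
    },
    {
      "host": "rabbit3.example.com",
      "port": 5671,
      "vhost": "/sensu",
      "user": "sensu",
      "password": "secret"
    }
  ]
}
```

This goes on both the servers and the clients (e.g. in `/etc/sensu/conf.d/rabbitmq.json`), which is why some folks reach for a load balancer instead of updating configs everywhere, but it keeps the failover logic in Sensu itself.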


#4

Sorry for the delayed response @Jonathan_Ballet. It’s good to know that we are not the only folks experiencing this. Unfortunately, ours is production :frowning: so we would definitely like to get this resolved.

I did see the fixes in 1.6 and was holding off on the upgrade for a little while, but I will definitely push it out to our staging and non-prod environments.

We would have to keep an eye on it for a while, as the keepalive storms aren’t exactly consistent :frowning:

Will update this thread down the line if the issue remains or is resolved.

Thanks again for your feedback :metal:


#5

Hi @calebhailey,

We actually used the built-in support for the RabbitMQ cluster on initial install, but decided to go with a load balancer so that, if we ever had to rebuild a member of the RabbitMQ cluster, we wouldn’t have to update the client configs across all our hosts. If the recommendation is not to use a load balancer, though, I would be happy to change this too.

We also saw the keepalive issues when using the built-in method for a RabbitMQ cluster, though, so hopefully the update to 1.6.x will resolve this :slight_smile:

Will try it out and update you.

Thanks for the recommendations.