We observed something similar here, with a slightly different setup. We sometimes have massive hiccups in our RabbitMQ cluster (basically, we lose access to most of the cluster, sometimes with split-brains, data and credentials included; thankfully it's a test environment). In those cases, although the cluster knows how to recover, the Sensu servers and clients end up in a weird state: the servers still receive the keepalives from the clients (at least they show up correctly through the API) and the servers keep publishing checks, but those checks never reach the clients. So, for us, it looks like the client -> server communication is fine but server -> client just doesn't work. We usually end up restarting all the clients and everything works again.
Our setup looks like this: we have 3 RabbitMQ nodes in a cluster, to which the Sensu servers and clients connect directly (no load-balancer or SSL in between). We are running Sensu Core 1.4.2 with 2 servers, and the clients are mostly on the same version. Only 1 Redis here.
So, ultimately, it only happens when things go really bad here.
However, although I haven't tried it yet, it looks like the Sensu 1.6 RabbitMQ improvements would help in this case?
IMPROVEMENT: Sensu Core 1.6 implements changes in RabbitMQ communication by using two discrete connections to the transport instead of two channels on a single connection, thereby doubling the number of concurrent connections RabbitMQ receives. This change prevents check result processing rates from being impacted by check execution request publishing rates and reduces the possibility of false keepalive alerts under certain conditions.
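To make the changelog concrete: an AMQP connection serializes all of its channels' frames onto one TCP socket, so a burst of check-request publishes can queue ahead of keepalive/result traffic on the same connection. Here is a toy stdlib model of that head-of-line blocking (not Sensu or pika code; the frame counts and names are made up for illustration):

```python
from collections import deque

class Connection:
    """Toy model: an AMQP connection serializes frames from all of
    its channels onto a single TCP socket (a FIFO here)."""
    def __init__(self):
        self.socket_fifo = deque()

    def send(self, frame):
        self.socket_fifo.append(frame)

    def frames_ahead_of(self, frame):
        # How many frames must drain before this one goes out.
        return list(self.socket_fifo).index(frame)

# Pre-1.6 pattern: publish channel and keepalive channel share one connection.
shared = Connection()
for i in range(10_000):
    shared.send(("publish", i))            # burst of check execution requests
shared.send(("keepalive", "client-1"))     # stuck behind the whole burst
print(shared.frames_ahead_of(("keepalive", "client-1")))   # 10000

# 1.6 pattern: two discrete connections; the keepalive socket has no backlog.
publish_conn, keepalive_conn = Connection(), Connection()
for i in range(10_000):
    publish_conn.send(("publish", i))
keepalive_conn.send(("keepalive", "client-1"))
print(keepalive_conn.frames_ahead_of(("keepalive", "client-1")))   # 0
```

It's only an analogy, but it matches the symptom above: one direction of traffic starving the other while the API still looks healthy.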