Hi Guys,
We’ve recently found that several of our servers running sensu-client version 0.12.2-1 on CentOS 6.4 x86_64 hang immediately after writing the following error to /var/log/sensu/sensu-client.log:
{"timestamp":"2014-08-16T01:32:41.841667+1000","level":"error","message":"detected missing amqp heartbeats"}
Once the sensu-client has logged the above error, its process does not exit, but it appears to have hung: it no longer writes anything to its logs, and the following event is reported in the sensu server dashboard:
lwa0002.mydomain.com
keepalive
No keep-alive sent from client in over 180 seconds

The only way I have found to recover from this state is to restart the sensu-client. I’ve looked at our sensu server performance graphs to see what the server was doing at the time the sensu-clients logged “detected missing amqp heartbeats”. Its load had spiked to well over 40, so I’m confident that the high load on the sensu server is what caused the clients to log the error.

Even so, my argument is that this is possibly a bug in the sensu-client. It shouldn’t hang after logging the error; it should either recover once it starts getting a response from RabbitMQ again or, at the very least, exit cleanly. The sensu-client shouldn’t just hang around waiting for some sort of human intervention.
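As a stopgap until the underlying behaviour is sorted out, the manual restart could in principle be automated with a small cron-driven watchdog. The following is only a rough sketch, not anything that ships with Sensu. It assumes the hang always looks the way it does above: the last JSON line in the log is the heartbeat error and the log file has not been written to since. The function names, the 180-second threshold (borrowed from the keepalive warning), and the `service sensu-client restart` command are assumptions for illustration.

```python
import json
import os
import subprocess
import time

LOG_PATH = "/var/log/sensu/sensu-client.log"
HEARTBEAT_ERROR = "detected missing amqp heartbeats"
STALE_AFTER = 180  # seconds; assumed threshold, mirrors the keepalive warning


def last_log_entry(lines):
    """Return the last line that parses as a JSON object, or None."""
    for line in reversed(lines):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            # Not JSON (e.g. a truncated line); keep looking further back.
            continue
        if isinstance(entry, dict):
            return entry
    return None


def looks_hung(lines, seconds_since_write, stale_after=STALE_AFTER):
    """True if the log ends with the heartbeat error and has gone quiet."""
    entry = last_log_entry(lines)
    if entry is None:
        return False
    return (
        entry.get("level") == "error"
        and entry.get("message") == HEARTBEAT_ERROR
        and seconds_since_write >= stale_after
    )


def check_and_restart(log_path=LOG_PATH):
    """Read the client log and restart the client if it looks hung."""
    with open(log_path) as fh:
        lines = fh.readlines()
    age = time.time() - os.path.getmtime(log_path)
    if looks_hung(lines, age):
        # Assumption: the client is managed by a SysV init script.
        subprocess.call(["service", "sensu-client", "restart"])
        return True
    return False
```

Calling `check_and_restart()` from a script run by cron every few minutes would be one way to use this. It only papers over the symptom, and it reads the whole log on each run, so it is not a substitute for the client recovering or exiting on its own.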
Any advice or input regarding this issue would be very much appreciated.
Cheers,
Tom