Hello,
I was hoping someone would be able to help me with some problems I’ve been having with a new install of Sensu and its associated supporting technologies.
I have set up each part of the stack in its own LXC container, and it was working reasonably well until about four days ago. There are separate containers for redis, rabbitmq and sensu (server and api are in the same container).
I previously had sensu set up on a standalone VPS - rabbitmq, redis and sensu all on the same, underpowered, machine. When I swapped over to the new machine and its containers, I just changed the rabbitmq details on the clients to point to the new server.
My first issue became apparent when I installed uchiwa - clients appeared, but complained about “no keepalive sent from client for “n” seconds” (where “n” can be anything from a few hundred seconds to several hundred thousand). The strange thing is that clients that had not been switched over to the new server also appeared in uchiwa - which I struggle to understand, TBH.
Now I’m seeing that there has been no communication from client <-> server for the last four days. As far as I recall, I haven’t intervened in that time to change anything. I’ve tried restarting the sensu server and api, with no success. The sensu-server log has entries like this:
{"timestamp":"2016-04-18T08:31:23.780419+0100","level":"info","message":"pruning check result aggregations"}
{"timestamp":"2016-04-18T08:31:28.508192+0100","level":"info","message":"publishing check request","payload":{"name":"cpu_metrics","issued":1460964688,"command":"/etc/sensu/plugins/cpu-metrics.rb -s stats.(hostname -s)"},"subscribers":["default metrics"]}
{"timestamp":"2016-04-18T08:31:43.781529+0100","level":"info","message":"pruning check result aggregations"}
{"timestamp":"2016-04-18T08:31:53.779365+0100","level":"info","message":"determining stale clients"}
{"timestamp":"2016-04-18T08:31:53.779693+0100","level":"info","message":"determining stale check results"}
{"timestamp":"2016-04-18T08:31:58.509449+0100","level":"info","message":"publishing check request","payload":{"name":"cpu_metrics","issued":1460964718,"command":"/etc/sensu/plugins/cpu-metrics.rb -s stats.(hostname -s)"},"subscribers":["default metrics"]}
I think there would usually be entries about clients talking back to the server here…
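One way to confirm that impression is to tally the distinct messages in the server log rather than eyeballing it. The following is a minimal Python sketch, assuming the log is one JSON object per line (as in the excerpt above). The function names and the log path in the usage note are my own placeholders, not anything Sensu provides.

```python
import json
from collections import Counter


def tally_messages(lines):
    """Count (level, message) pairs in JSON-lines log output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON (e.g. partial writes).
            continue
        if not isinstance(entry, dict):
            continue
        counts[(entry.get("level"), entry.get("message"))] += 1
    return counts


def tally_file(path):
    """Tally a whole log file; the path is supplied by the caller."""
    with open(path) as fh:
        return tally_messages(fh)
```

Calling something like tally_file("/var/log/sensu/sensu-server.log") (an assumed path, adjust to your install) and printing the result of .most_common() shows at a glance which message types are present. If only the periodic housekeeping and "publishing check request" messages show up, nothing from the clients is reaching the server.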
Furthermore, sensu client logs on the clients are very quiet:
{"timestamp":"2016-04-18T07:53:15.112287+0100","level":"warn","message":"reconnecting to transport"}
{"timestamp":"2016-04-18T07:53:17.407308+0100","level":"error","message":"[amqp] Detected TCP connection failure"}
{"timestamp":"2016-04-18T07:53:21.628348+0100","level":"info","message":"reconnected to transport"}
Obviously one or more components are not communicating properly with the others. I’ve checked rabbitmq and redis, and both seem alive and working correctly. I’m just at a loss as to where to look to troubleshoot…
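Given the "[amqp] Detected TCP connection failure" entry in the client log, one basic thing that could be checked from each client and from the sensu container is whether the rabbitmq and redis ports are reachable at all. Below is a rough Python sketch of such a check. The host names in the usage note are hypothetical, and the ports are only the usual defaults (5672 for plain AMQP, 6379 for redis); a setup using SSL for rabbitmq would typically be on 5671 instead.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(services, timeout=3.0):
    """Map each service name to whether its (host, port) is reachable."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in services.items()
    }
```

For example, check_services({"rabbitmq": ("rabbitmq.example.internal", 5672), "redis": ("redis.example.internal", 6379)}) returns a dict of True/False per service. A successful TCP connect only shows that the port is open from that machine; it says nothing about whether the AMQP credentials or vhost are correct.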
Thanks in advance
Jerry