Sensu problems - communication(?)

Hello,

I was hoping someone would be able to help me with some problems I’ve been having with a new install of Sensu and its associated supporting technologies.

I have set up all the parts of the stack in separate LXC containers, and it was working kind of OK until, apparently, 4 days ago. Redis, RabbitMQ and Sensu each have their own container (the Sensu server and API share one).

I previously had Sensu set up on a standalone VPS, with RabbitMQ, Redis and Sensu all on the same underpowered machine. When I swapped over to the new machine and its containers, I just changed the RabbitMQ details to point to the new server.

My first issue became apparent when I installed Uchiwa: clients appeared, but complained about "no keepalive sent from client for 'n' seconds" (where 'n' can be anything from a few hundred to several hundred thousand). The strange thing is that clients that had not been switched over also appeared in Uchiwa, which I struggle to understand, TBH.

Now I’m seeing that there has been no communication from client <-> server for the last four days. As far as I recall, I haven’t intervened in that time to change anything. I’ve tried restarting the sensu server and api, with no success. The sensu-server log has entries like this:

{"timestamp":"2016-04-18T08:31:23.780419+0100","level":"info","message":"pruning check result aggregations"}
{"timestamp":"2016-04-18T08:31:28.508192+0100","level":"info","message":"publishing check request","payload":{"name":"cpu_metrics","issued":1460964688,"command":"/etc/sensu/plugins/cpu-metrics.rb -s stats.$(hostname -s)"},"subscribers":["default metrics"]}
{"timestamp":"2016-04-18T08:31:43.781529+0100","level":"info","message":"pruning check result aggregations"}
{"timestamp":"2016-04-18T08:31:53.779365+0100","level":"info","message":"determining stale clients"}
{"timestamp":"2016-04-18T08:31:53.779693+0100","level":"info","message":"determining stale check results"}
{"timestamp":"2016-04-18T08:31:58.509449+0100","level":"info","message":"publishing check request","payload":{"name":"cpu_metrics","issued":1460964718,"command":"/etc/sensu/plugins/cpu-metrics.rb -s stats.$(hostname -s)"},"subscribers":["default metrics"]}

I think there would usually be entries about clients talking back to the server here…

Furthermore, sensu client logs on the clients are very quiet:

{"timestamp":"2016-04-18T07:53:15.112287+0100","level":"warn","message":"reconnecting to transport"}
{"timestamp":"2016-04-18T07:53:17.407308+0100","level":"error","message":"[amqp] Detected TCP connection failure"}
{"timestamp":"2016-04-18T07:53:21.628348+0100","level":"info","message":"reconnected to transport"}

Obviously one or more components are not communicating properly with the others. I've checked RabbitMQ and Redis, and both seem alive and working correctly. I'm just at a loss as to where to look to troubleshoot…

Thanks in advance

Jerry

Can you confirm you restarted the sensu components after changing the
rabbitmq details? They need to be restarted, otherwise they will keep
using the stale configuration.

If yes, then the place to start troubleshooting is the sensu client
itself. Can you connect manually (e.g. with netcat) from the client to
the rabbitmq host and port listed in the sensu configuration? The TCP
connection failure in the sensu client logs is kinda the first error
to resolve.
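
If netcat isn't installed in the container, something like the sketch
below does roughly the same check from Python. It is only an
illustration: it assumes the client's rabbitmq settings sit in a JSON
file with a top-level "rabbitmq" object holding "host" and "port" (the
path /etc/sensu/conf.d/rabbitmq.json is a guess, adjust it to your
layout), and the function names are made up for this example.

```python
import json
import socket


def load_rabbitmq_endpoint(config_text, default_port=5672):
    """Return (host, port) from a config with a top-level "rabbitmq" object.

    The key layout is an assumption about how the client is configured.
    """
    settings = json.loads(config_text)["rabbitmq"]
    return settings["host"], int(settings.get("port", default_port))


def can_connect(host, port, timeout=5.0):
    """Attempt a plain TCP connection, similar in spirit to `nc -vz host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures alike.
        print(f"connection to {host}:{port} failed: {exc}")
        return False


def check_from_config(path="/etc/sensu/conf.d/rabbitmq.json"):
    """Read the (assumed) config file and report whether the broker is reachable."""
    with open(path) as fh:
        host, port = load_rabbitmq_endpoint(fh.read())
    reachable = can_connect(host, port)
    print(f"{host}:{port} reachable: {reachable}")
    return reachable
```

Run check_from_config() on one of the affected clients. A successful
connect only shows that the TCP path is open; it says nothing about
SSL, vhost or credentials, so treat it as a first step only.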

On Mon, Apr 18, 2016 at 1:28 AM, Jerry Steele <ticktockhouse@gmail.com> wrote:
