Sensu UI Going Offline

We’re working on transitioning from Sensu Core to Sensu Go, and I have set up Test and Production instances in two separate datacenters.

The web UI randomly goes offline on our instances. Sometimes it takes a day; sometimes it drops off within minutes of a service restart.

The backend API continues to operate and still processes handlers for incoming alerts.

$ sudo netstat -tulpn | grep sensu
tcp        0      0*               LISTEN      727/sensu-backend
tcp        0      0*               LISTEN      727/sensu-backend
tcp        0      0*               LISTEN      726/sensu-agent
tcp        0      0*               LISTEN      726/sensu-agent
tcp6       0      0 :::8080                 :::*                    LISTEN      727/sensu-backend
tcp6       0      0 :::8081                 :::*                    LISTEN      727/sensu-backend
udp        0      0*                           726/sensu-agent
udp        0      0*                           726/sensu-agent

Healthy instance (note the extra listener on port 3000, which is the web UI):

tcp        0      0*               LISTEN      2039/sensu-backend
tcp        0      0*               LISTEN      2039/sensu-backend
tcp        0      0*               LISTEN      997/sensu-agent
tcp        0      0*               LISTEN      997/sensu-agent
tcp6       0      0 :::8080                 :::*                    LISTEN      2039/sensu-backend
tcp6       0      0 :::8081                 :::*                    LISTEN      2039/sensu-backend
tcp6       0      0 :::3000                 :::*                    LISTEN      2039/sensu-backend
udp        0      0*                           997/sensu-agent
udp        0      0*                           997/sensu-agent

These instances are under minimal load: 8-12 agents reporting in, each with fewer than 10 checks on 60-second intervals. Backend instances have 4 GB of RAM and 2 CPUs, and memory and CPU usage on them are minimal.

Running Sensu Go 6.2.5 on Ubuntu 18.04. Agent and backend installs are handled via Chef, while assets, checks, and handlers are managed via Terraform and are identical across all four instances. We’re using TLS for agent communication, and each instance uses the built-in default standalone etcd.

Reading through other threads, such as Intermittent connection refused from sensu-backend, I realized our checks had no default timeout, which could be causing issues. I set a hard 30-second timeout on all our checks and handlers, but that had no effect.
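For anyone following along, the hard timeout can be set directly in the check definition. A minimal sketch of the shape (the check name and command here are placeholders, not our actual config):

```yaml
---
type: CheckConfig
api_version: core/v2
metadata:
  name: example-check                # placeholder name
spec:
  command: check-disk-usage.rb -w 80 -c 90   # placeholder command
  subscriptions:
    - system                         # placeholder subscription
  interval: 60
  timeout: 30   # hard timeout in seconds; the check process is killed after this
  publish: true
```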

Looks like others are having similar issues from this github issue: Sensu Go WebUI Randomly Crashing · Issue #4139 · sensu/sensu-go · GitHub

Also: Sensu WebUI offline - restarting service solves the problem - #12 by ahoiroman

As a follow-up: after adding monitoring of the UI HTTP interface, two of the servers dropped last night, one at 0700 UTC and the other at 0739 UTC.
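The monitoring itself can be as simple as a Sensu check that curls the dashboard port. A sketch of that idea (the check name and subscription are illustrative, not our exact config):

```yaml
---
type: CheckConfig
api_version: core/v2
metadata:
  name: webui-alive                  # illustrative name
spec:
  command: curl --fail --silent --output /dev/null http://localhost:3000/
  subscriptions:
    - backend                        # illustrative subscription for the backend hosts
  interval: 60
  timeout: 10
  publish: true
```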

Logs for the crash at 0700:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 06:30" --until "2021-02-18 07:30" | cat

Logs for the crash at 0739:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 07:00" --until "2021-02-18 08:00" | cat


Was there any noticeable load spike on the systems around that time, especially disk I/O load?

What I’m seeing in that log is a lot of etcd timeouts. If the system was under heavy disk I/O load from something else firing on the system (like a cron job doing a lot of disk activity), that might be a factor here.
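If sysstat is available on those hosts, something like this can confirm or rule out a disk spike in the crash window (times here match the 0700 crash; adjust as needed):

```shell
# Live view: extended per-device stats every 5 seconds; watch %util and await
iostat -x 5

# Historical view, if sar has been collecting (on Ubuntu, the sysstat package)
sar -d -s 06:30:00 -e 07:30:00
```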

The fact that the web UI thread seems to be the only thing that doesn’t adequately self-rescue afterwards and gets wedged is a problem too, but the etcd timeouts look like the start of the situation.

I’ll be honest, this one has been hard for me to replicate in my testing. The web UI not recovering from the etcd disruption feels like a race condition, and it’s just hard to nail down how to trigger it in order to sort it out.

I see you’re on the bug report; can you attach the logs there?

Been doing some more digging into this today.

Now that I have monitoring of the Sensu UI in place, we’re only seeing consistent crashing in one of the datacenters. It looks like hypervisor-level I/O latency spikes correspond to when the UI tips over. We’re doing some more research into what is going on there.

I would agree that this is the start of the problem, but the fact that the UI is the only thing that doesn’t recover is concerning. As for replicating it, I’m trying to think of a good way to simulate a disk-level I/O hiccup at the hypervisor level; basically, a complete stop in I/O for a second.
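One guest-side option (untried here, and definitely only for a disposable test VM) would be device-mapper’s suspend/resume, which blocks all I/O to a volume until it is resumed:

```shell
# CAUTION: disposable test VM only. Assumes the data volume is a
# device-mapper device named vg0-data -- adjust to your layout.
sudo dmsetup suspend /dev/mapper/vg0-data   # all I/O to the volume now blocks
sleep 5                                     # hold the stall for a few seconds
sudo dmsetup resume /dev/mapper/vg0-data    # unfreeze
```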

I’ll get the logs posted on the issue.

Once we have a good way to replicate the problem, we’ll be able to sort out how to fix the UI recovery bug. Then maybe we can make this a QA test too so it doesn’t regress.

I was able to artificially reproduce this repeatedly in a VMware ESXi environment this morning. On a test node, while things were running fine, I limited the disk on the VM to 16 IOPS (the lowest it allows), then set it back to unlimited once the UI tipped over. As usual, the UI was the only component that did not recover properly.

It looks like with VirtualBox you could use bandwidthctl to limit the I/O on a specific machine/disk at runtime: 7.19. VBoxManage bandwidthctl
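From the linked docs, the commands would be roughly as follows (the VM name, storage controller, and medium here are placeholders; see the linked docs for the unit modifiers accepted by --limit):

```shell
# Create a disk bandwidth group on the VM and attach it to the disk
VBoxManage bandwidthctl "sensu-test" add DiskThrottle --type disk --limit 10M
VBoxManage storageattach "sensu-test" --storagectl "SATA" --port 0 --device 0 \
  --type hdd --medium disk.vdi --bandwidthgroup DiskThrottle

# Tighten the limit at runtime, then raise it again once the UI tips over
VBoxManage bandwidthctl "sensu-test" set DiskThrottle --limit 100k
```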


Thanks for the reproducer!

Just FYI: 6.2.7, just released, includes several fixes, among them a fix for what was determined to be the underlying problem causing the dashboard to go unresponsive after an etcd load spike.