We’re working on transitioning off Sensu Core to Sensu Go, and I have set up Test and Production instances in two separate datacenters.
The web UI is randomly going offline in our instances. Sometimes it takes a day, sometimes it drops off within minutes after a service restart.
The backend API continues to operate and still processes handlers for incoming alerts.
Unhealthy instance (nothing is listening on port 3000, the web UI port):
$ sudo netstat -tulpn | grep sensu
tcp 0 0 127.0.0.1:2379 0.0.0.0:* LISTEN 727/sensu-backend
tcp 0 0 127.0.0.1:2380 0.0.0.0:* LISTEN 727/sensu-backend
tcp 0 0 127.0.0.1:3030 0.0.0.0:* LISTEN 726/sensu-agent
tcp 0 0 127.0.0.1:3031 0.0.0.0:* LISTEN 726/sensu-agent
tcp6 0 0 :::8080 :::* LISTEN 727/sensu-backend
tcp6 0 0 :::8081 :::* LISTEN 727/sensu-backend
udp 0 0 127.0.0.1:8125 0.0.0.0:* 726/sensu-agent
udp 0 0 127.0.0.1:3030 0.0.0.0:* 726/sensu-agent
Healthy instance:
tcp 0 0 127.0.0.1:2379 0.0.0.0:* LISTEN 2039/sensu-backend
tcp 0 0 127.0.0.1:2380 0.0.0.0:* LISTEN 2039/sensu-backend
tcp 0 0 127.0.0.1:3030 0.0.0.0:* LISTEN 997/sensu-agent
tcp 0 0 127.0.0.1:3031 0.0.0.0:* LISTEN 997/sensu-agent
tcp6 0 0 :::8080 :::* LISTEN 2039/sensu-backend
tcp6 0 0 :::8081 :::* LISTEN 2039/sensu-backend
tcp6 0 0 :::3000 :::* LISTEN 2039/sensu-backend
udp 0 0 127.0.0.1:8125 0.0.0.0:* 997/sensu-agent
udp 0 0 127.0.0.1:3030 0.0.0.0:* 997/sensu-agent
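For anyone who wants to catch this state without eyeballing netstat, here is a rough sketch of a helper that reports which expected listeners are missing. The function name `missing_ports` is made up for illustration, and the port list is just what shows up in the healthy output above (8080, 8081, and 3000 for the web UI); adjust it if your backend is configured differently.

```shell
#!/bin/sh
# Hypothetical helper: print each expected port that has no LISTEN line
# in netstat-style input read from stdin.
# Example (not run here): sudo netstat -tulpn | missing_ports 8080 8081 3000
missing_ports() {
    input=$(cat)
    for port in "$@"; do
        # Field 4 is the local address, field 6 is the socket state.
        if ! printf '%s\n' "$input" | awk -v p="$port" '
            $6 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
            END { exit !found }'; then
            echo "$port"
        fi
    done
}
```

Fed the unhealthy output above, this should print only 3000; on the healthy instance it should print nothing. It only tells you the listener is gone, not why.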
These instances are under minimal load, with 8-12 agents reporting in, each with <10 checks on 60-second intervals. Backend instances have 4 GB of RAM and 2 CPUs. Memory and CPU usage are minimal on the backend servers.
Running Sensu 6.2.5 on Ubuntu 18.04. Agent and Backend installs are handled via Chef, while assets, checks, and handlers are managed via Terraform and are identical across all 4 instances. We’re using TLS for agent communication, and each instance is using the built-in default standalone etcd instance.
Reading through other threads such as "Intermittent connection refused from sensu-backend", I realized our checks had no default timeout, which could be causing issues. I set a hard timeout of 30 seconds on all our checks and handlers, and that had no effect.