Sensu UI Going Offline

wduncanfraser · February 17, 2021, 9:01pm

We’re working on transitioning off Sensu Core to Sensu Go, and I have set up Test and Production instances in two separate datacenters.

The web UI is randomly going offline in our instances. Sometimes it takes a day, sometimes it drops off within minutes after a service restart.

The backend API continues to operate, and will process handlers for alerts coming in.

$ sudo netstat -tulpn | grep sensu
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      727/sensu-backend
tcp        0      0 127.0.0.1:2380          0.0.0.0:*               LISTEN      727/sensu-backend
tcp        0      0 127.0.0.1:3030          0.0.0.0:*               LISTEN      726/sensu-agent
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      726/sensu-agent
tcp6       0      0 :::8080                 :::*                    LISTEN      727/sensu-backend
tcp6       0      0 :::8081                 :::*                    LISTEN      727/sensu-backend
udp        0      0 127.0.0.1:8125          0.0.0.0:*                           726/sensu-agent
udp        0      0 127.0.0.1:3030          0.0.0.0:*                           726/sensu-agent

Healthy instance:

tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      2039/sensu-backend
tcp        0      0 127.0.0.1:2380          0.0.0.0:*               LISTEN      2039/sensu-backend
tcp        0      0 127.0.0.1:3030          0.0.0.0:*               LISTEN      997/sensu-agent
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      997/sensu-agent
tcp6       0      0 :::8080                 :::*                    LISTEN      2039/sensu-backend
tcp6       0      0 :::8081                 :::*                    LISTEN      2039/sensu-backend
tcp6       0      0 :::3000                 :::*                    LISTEN      2039/sensu-backend
udp        0      0 127.0.0.1:8125          0.0.0.0:*                           997/sensu-agent
udp        0      0 127.0.0.1:3030          0.0.0.0:*                           997/sensu-agent

These instances are under minimal load, with 8-12 agents reporting in, each with <10 checks on 60-second intervals. Backend instances have 4 GB of RAM and 2 CPUs. Memory and CPU usage are minimal on the backend servers.

Running Sensu 6.2.5 on Ubuntu 18.04. Agent and Backend installs are handled via Chef, while assets, checks, and handlers are managed via Terraform and are identical across all 4 instances. We’re using TLS agent communication, and each instance is using the built-in default standalone etcd instance.

Reading through other threads, such as "Intermittent connection refused from sensu-backend", I realized our checks had no default timeout, which could be causing issues. I set a hard timeout of 30 seconds on all our checks and handlers, and that had no effect.

wduncanfraser · February 17, 2021, 9:28pm

Looks like others are having similar issues, per this GitHub issue: Sensu Go WebUI Randomly Crashing · Issue #4139 · sensu/sensu-go · GitHub

Also: Sensu WebUI offline - restarting service solves the problem - #12 by ahoiroman

wduncanfraser · February 18, 2021, 3:46pm

As a follow-up, I added monitoring of the UI HTTP interface, and two of the servers dropped last night: one at 0700 UTC and the other at 0739 UTC.
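For anyone wanting a similar probe, here is a minimal sketch of the kind of check described above. It is not the monitoring actually used in this thread. It assumes the ports shown in the netstat output earlier (8080 for the backend API, 3000 for the web UI); the function names are made up for illustration.

```python
import socket


def port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def ui_wedged(host="127.0.0.1", api_port=8080, ui_port=3000):
    """True when the API port answers but the web UI port does not.

    That is the state seen in the unhealthy netstat output: 8080 and 8081
    are still listening while 3000 is gone.
    """
    return port_is_listening(host, api_port) and not port_is_listening(host, ui_port)


if __name__ == "__main__":
    print("web UI wedged:", ui_wedged())
```

A plain TCP connect only shows whether something is listening; an HTTP request against the UI would be a stricter check.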

Logs for crash at 0700:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 06:30" --until "2021-02-18 07:30" | cat

https://gist.github.com/wduncanfraser/c55691024df1bf5d2e6d485790fd766a#file-0700_crash-txt

0700_crash.txt

$ sudo journalctl --unit sensu-backend --since "2021-02-18 06:30" --until "2021-02-18 07:30" | cat
-- Logs begin at Mon 2020-02-10 16:53:57 UTC, end at Thu 2021-02-18 15:34:23 UTC. --
Feb 18 06:30:27 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/entity_configs/\\\" range_end:\\\"/sensu.io/entity_configt\\\" serializable:true \" with result \"range_response_count:6 size:2091\" took too long (105.880394ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:30:27Z"}
Feb 18 06:30:27 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/silenced/\\\" range_end:\\\"/sensu.io/silencee\\\" serializable:true \" with result \"range_response_count:0 size:6\" took too long (106.899425ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:30:27Z"}
Feb 18 06:31:03 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/bb543e23-4268-4535-ba03-1c9704dfee41\\\" \" with result \"range_response_count:1 size:108\" took too long (135.517112ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:03Z"}
Feb 18 06:31:24 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/assets/shared-services/\\\" range_end:\\\"/sensu.io/assets/shared-services0\\\" serializable:true \" with result \"range_response_count:4 size:3532\" took too long (278.990635ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:24Z"}
Feb 18 06:31:24 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/bb543e23-4268-4535-ba03-1c9704dfee41\\\" range_end:\\\"/sensu.io/rings/metricsd/backends/items/\\\\377\\\" limit:2 \" with result \"range_response_count:1 size:108\" took too long (309.386898ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:24Z"}
Feb 18 06:31:29 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/bb543e23-4268-4535-ba03-1c9704dfee41\\\" \" with result \"range_response_count:1 size:108\" took too long (117.374689ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:29Z"}
Feb 18 06:31:48 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/bb543e23-4268-4535-ba03-1c9704dfee41\\\" \" with result \"range_response_count:1 size:108\" took too long (262.673634ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:48Z"}
Feb 18 06:31:48 <snip>-sensutest01 sensu-backend[17071]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/entity_configs/\\\" range_end:\\\"/sensu.io/entity_configt\\\" serializable:true \" with result \"range_response_count:6 size:2091\" took too long (288.056677ms) to execute","pkg":"etcdserver","time":"2021-02-18T06:31:48Z"}

This file has been truncated.

Logs for crash at 0739:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 07:00" --until "2021-02-18 08:00" | cat

https://gist.github.com/wduncanfraser/c55691024df1bf5d2e6d485790fd766a#file-0739_crash-txt

0739_crash.txt

$ sudo journalctl --unit sensu-backend --since "2021-02-18 07:00" --until "2021-02-18 08:00" | cat
-- Logs begin at Mon 2020-02-10 16:53:57 UTC, end at Thu 2021-02-18 15:39:22 UTC. --
Feb 18 07:00:53 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"Starting auto-compaction at revision 537626 (retention: 2 revisions)","pkg":"compactor","time":"2021-02-18T07:00:53Z"}
Feb 18 07:00:53 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/shared-services/linux/items/<snip>-ciproxy01.ad.mta-telco.com\\\" \" with result \"range_response_count:1 size:108\" took too long (255.474857ms) to execute","pkg":"etcdserver","time":"2021-02-18T07:00:53Z"}
Feb 18 07:00:53 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"Finished auto-compaction at revision 537626","pkg":"compactor","time":"2021-02-18T07:00:53Z"}
Feb 18 07:04:04 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/6c78293a-0614-419f-a55c-4463a22a199c\\\" \" with result \"range_response_count:1 size:108\" took too long (743.207224ms) to execute","pkg":"etcdserver","time":"2021-02-18T07:04:04Z"}
Feb 18 07:04:14 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/shared-services/linux/items/<snip>-ciworker01.ad.mta-telco.com\\\" \" with result \"range_response_count:1 size:109\" took too long (131.533888ms) to execute","pkg":"etcdserver","time":"2021-02-18T07:04:14Z"}
Feb 18 07:05:53 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"Starting auto-compaction at revision 538567 (retention: 2 revisions)","pkg":"compactor","time":"2021-02-18T07:05:53Z"}
Feb 18 07:05:53 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"Finished auto-compaction at revision 538567","pkg":"compactor","time":"2021-02-18T07:05:53Z"}
Feb 18 07:06:24 <snip>-sensu01 sensu-backend[16172]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/handlers/shared-services/default\\\" limit:1 \" with result \"range_response_count:1 size:121\" took too long (111.392617ms) to execute","pkg":"etcdserver","time":"2021-02-18T07:06:24Z"}

This file has been truncated.

jspaleta · February 19, 2021, 12:53am

hey,

Was there any noticeable load spike on the systems around that time? Especially disk i/o load.

What I’m seeing in that log is a lot of etcd timeouts. If the system was under heavy disk I/O load from something else firing on the system (like a cron job doing a lot of disk activity), that might be a factor here.

The fact that the web-ui thread seems to be the only thing that doesn’t adequately self-rescue afterwards and gets wedged is a problem too, but the fact that the etcd service looks to be timing out seems to be the start of the situation.

I’ll be honest, this one has been hard for me to replicate in my testing. The web-ui not recovering from the etcd disruption feels like a race condition… and ugh… it’s just hard to nail down how to trigger it to sort it out.

I see you’re on the bug report. Can you attach the logs there?

wduncanfraser · February 19, 2021, 2:48am

Been doing some more digging into this today.

Now that I have monitoring of the Sensu UI in place, we’re only seeing consistent crashing in one of the datacenters. It looks like we’re seeing hypervisor-level I/O latency spikes that correspond to when the UI tips over. We’re doing some more research into what is going on there.

I would agree that this is the start of the problem, but the fact that the UI is the only thing that doesn’t recover is concerning. As for replicating it, I’m trying to think of a good way to simulate a disk-level I/O hiccup at the hypervisor level: basically, a complete stop in I/O for a second.

I’ll get the logs posted on the issue.

jspaleta · February 19, 2021, 2:51am

yep,
Once we have a good way to replicate the problem, we’ll be able to sort out how to fix the UI recovery bug. Then maybe we can make this a QA test too so it doesn’t regress.

wduncanfraser · February 19, 2021, 3:24pm

I was able to artificially reproduce this repeatedly in a VMware ESXi environment this morning. On a test node, while things were running fine, I would limit the disk on the VM to 16 IOPS (the lowest it allows), and then set it back to unlimited once the UI tipped over. As usual, the UI was the only component that did not recover properly.

It looks like with VirtualBox you could use bandwidthctl to limit the I/O on a specific machine/disk at runtime: 7.19. VBoxManage bandwidthctl
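As a rough, untested sketch of that VirtualBox idea: the helper below only builds the `VBoxManage bandwidthctl ... set` command line. It assumes a disk bandwidth group has already been created with `bandwidthctl add ... --type disk` and attached to the VM's disk. The VM name `sensu-test` and group name `SensuDisk` are hypothetical. Note that bandwidthctl limits throughput rather than IOPS, so this is not an exact equivalent of the ESXi IOPS cap used above.

```python
import subprocess


def bandwidthctl_set(vm_name, group, limit):
    """Build the VBoxManage command that changes a bandwidth group's limit.

    limit is passed through unchanged, e.g. "1M"; see the VBoxManage
    manual for the accepted unit suffixes.
    """
    return ["VBoxManage", "bandwidthctl", vm_name, "set", group, "--limit", limit]


def apply_limit(vm_name, group, limit):
    """Run the command on the VirtualBox host; raises if VBoxManage fails."""
    subprocess.run(bandwidthctl_set(vm_name, group, limit), check=True)


if __name__ == "__main__":
    # Hypothetical names: print the command instead of running it.
    print(" ".join(bandwidthctl_set("sensu-test", "SensuDisk", "1M")))
```

The idea would be to drop the limit while the backend is healthy, wait for the UI to tip over, then raise it again and see which components recover.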

jspaleta · February 19, 2021, 6:01pm

thanks for the reproducer!

jspaleta · April 2, 2021, 3:04am

Hey,
Just FYI, 6.2.7 was just released with several fixes, including a fix for what was determined to be the underlying problem causing the dashboard to go unresponsive after an etcd load spike.

madhatter22 · June 22, 2023, 9:23pm

Hi, I’m seeing this behaviour in 6.10.0: the web UI stops listening on port 3000, and there are etcd warnings about requests taking >100ms.
I haven’t yet correlated the time the UI dies with the etcd warnings, as there are no log messages from the UI backend.
Has there been a regression?
I’ll try increasing the etcd timeouts as suggested in Sensu Go WebUI Randomly Crashing · Issue #4139 · sensu/sensu-go · GitHub, but with lower values, since the warnings I’m getting are >100ms and <300ms.
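One way to do that correlation is to pull the "took too long" durations out of the journal output and list those close to the time the UI stopped answering. The sketch below assumes the line format shown in the gist excerpts earlier in this thread (a syslog prefix followed by a JSON object with `msg` and `time` fields) and only handles durations printed in µs, ms, or s. The function names and the 10-minute window are illustrative choices, not anything from Sensu.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# One line taken from the 0739 excerpt above, used as sample input.
SAMPLE = (
    r'Feb 18 07:04:04 <snip>-sensu01 sensu-backend[16172]: '
    r'{"component":"etcd","level":"warning","msg":"read-only range request '
    r'\"key:\\\"/sensu.io/rings/metricsd/backends/items/'
    r'6c78293a-0614-419f-a55c-4463a22a199c\\\" \" with result '
    r'\"range_response_count:1 size:108\" took too long (743.207224ms) to execute",'
    r'"pkg":"etcdserver","time":"2021-02-18T07:04:04Z"}'
)

SLOW_RE = re.compile(r"took too long \(([0-9.]+)(µs|ms|s)\)")
UNIT_TO_MS = {"µs": 0.001, "ms": 1.0, "s": 1000.0}


def parse_slow_request(line):
    """Return (utc_time, duration_ms) for a slow-request warning, else None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        entry = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    match = SLOW_RE.search(entry.get("msg", ""))
    if match is None or "time" not in entry:
        return None
    duration_ms = float(match.group(1)) * UNIT_TO_MS[match.group(2)]
    when = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc), duration_ms


def slow_requests_near(lines, outage, window=timedelta(minutes=10)):
    """List slow requests logged within `window` of the UI outage time."""
    hits = []
    for line in lines:
        parsed = parse_slow_request(line)
        if parsed is not None and abs(parsed[0] - outage) <= window:
            hits.append(parsed)
    return hits


if __name__ == "__main__":
    outage = datetime(2021, 2, 18, 7, 10, tzinfo=timezone.utc)  # example time
    for when, ms in slow_requests_near([SAMPLE], outage):
        print(when.isoformat(), f"{ms:.1f} ms")
```

Feeding it the output of `journalctl --unit sensu-backend` and the timestamp at which port 3000 stopped answering would show whether the slow requests cluster just before the UI goes away.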
