High CPU after loss of connectivity in etcd cluster

I’m currently in the process of setting up a Sensu Go cluster to migrate to. I currently have 5 nodes (3 in one datacentre and 2 in another). We have been having some brief interruptions in network connectivity, which cause the cluster to log a loss of communication with some of the cluster members. Following the connection issues, Sensu immediately starts reporting that read-only requests are taking too long, and the CPU on the server maxes out.

The cluster seems to recover after a few seconds, and running the sensuctl cluster health command shows everything is healthy. However, the slow-response messages do not stop (possibly because the CPU is maxed out on all cluster members), and the CPU never seems to recover. Everything else seems to be OK.

The only way I’ve been able to resolve the CPU issue is to restart the backend on every cluster member.

I’m running the latest version of Sensu Go on RHEL7 servers. There are currently only 5 agents connected (one running on each server), so there’s next to no load on the servers themselves.

Here’s an example of the logs when the issue occurs and just after. It’s a little hard to tell when everything comes back, as it doesn’t seem to log when a new leader is elected or found.

Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer e4448e91885c4656 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read e4448e91885c4656 on stream MsgApp v2 (read tcp 10.3.0.43:40764-\u003e10.2.0.24:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer e4448e91885c4656 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer d316a74a763c5880 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read d316a74a763c5880 on stream Message (read tcp 10.3.0.43:55846-\u003e10.2.0.43:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer d316a74a763c5880 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer f90cbaedc35cc694 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read f90cbaedc35cc694 on stream Message (read tcp 10.3.0.43:35924-\u003e10.2.0.44:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer f90cbaedc35cc694 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"licensing","level":"error","msg":"unexpected event received: {Error  [] etcdserver: no leader}","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"tessend","error":"etcdserver: no leader","level":"error","msg":"ring event error","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"schedulerd","level":"error","msg":"error in entity watcher","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:24 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:24+01:00"}
Jun 12 09:46:29 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:29+01:00"}
Jun 12 09:46:29 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:29+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-processes\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-processet\\\" \" with result \"error:etcdserver: request timed out\" took too long (7.261483399s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-60s\\\" range_end:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-60t\\\" \" with result \"error:etcdserver: request timed out\" took too long (9.734125535s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-cpu\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-cpv\\\" \" with result \"error:etcdserver: request timed out\" took too long (12.433451699s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/entities/default/server-5.test\\\" limit:1 \" with result \"error:etcdserver: request timed out\" took too long (12.818445239s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:34 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:34+01:00"}
Jun 12 09:46:34 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:34+01:00"}
Jun 12 09:46:38 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:38+01:00"}
Jun 12 09:46:38 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-30s\\\" range_end:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-30t\\\" \" with result \"error:etcdserver: request timed out\" took too long (7.870088896s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:38+01:00"}
Jun 12 09:46:39 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:39+01:00"}
Jun 12 09:46:39 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:39+01:00"}
Jun 12 09:46:42 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer e4448e91885c4656 could not connect: dial tcp 10.2.0.24:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:42+01:00"}
Jun 12 09:46:44 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:44+01:00"}
Jun 12 09:46:44 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:44+01:00"}
Jun 12 09:46:45 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:45+01:00"}
Jun 12 09:46:45 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-ports\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-portt\\\" \" with result \"error:etcdserver: request timed out\" took too long (13.730537472s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:45+01:00"}

This continues until 09:47:13, at which point it stops reporting connectivity issues and only logs the “read-only range request took too long” messages.

Looking at our CPU graphs, the CPU on all the servers maxes out between 09:46:00 and 09:46:30 (we sample every 30 seconds).
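In case it helps anyone correlating this kind of log output with CPU spikes: here’s a rough Python sketch (not part of Sensu, just a diagnostic helper I put together) that strips the syslog prefix from each journald line, parses the JSON payload, and tallies the lost-peer events and the “took too long” durations. The two sample lines are adapted from the excerpt above.

```python
import json
import re

# Sample journald lines, adapted (and abbreviated) from the excerpt above.
SAMPLE = [
    'Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer e4448e91885c4656 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}',
    'Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request took too long (12.433451699s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}',
]

# Matches the duration etcd reports for slow read-only range requests.
TOOK = re.compile(r"took too long \(([\d.]+)s\)")

def summarise(lines):
    slow, lost = [], 0
    for line in lines:
        # The JSON payload starts at the first "{" after the syslog prefix.
        payload = json.loads(line[line.index("{"):])
        msg = payload.get("msg", "")
        m = TOOK.search(msg)
        if m:
            slow.append(float(m.group(1)))
        elif "lost the TCP streaming connection" in msg:
            lost += 1
    return {
        "lost_streams": lost,
        "slow_requests": len(slow),
        "max_seconds": max(slow, default=0.0),
    }

print(summarise(SAMPLE))
# → {'lost_streams': 1, 'slow_requests': 1, 'max_seconds': 12.433451699}
```

Feeding it the full journal for each backend (e.g. piping `journalctl -u sensu-backend` into it) made it easier for me to see that the slow-request warnings keep coming long after the peer health checks recover.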

Hi @Adam_McDonagh

Which version of Sensu Go are you running? We included several performance enhancements in 5.9 that should really help.

That being said, we believe we’ve identified the cause of a similar problem that can still occur in the latest release under specific conditions; see https://github.com/sensu/sensu-go/issues/3012. We expect the patch to be included in the next release, 5.10, which should land this week along with several other performance improvements.

While it’s hard to determine whether your high CPU usage has the same cause, the context you gave us makes me believe so. I’ll try to reach out to you once the new release is available.

Thanks

Hi, thanks for the reply.

We’re currently running the latest version (sensu-go-backend-5.9.0-3043.x86_64).

Good news if 5.10 will address the issue I’m having. I did find that issue, but didn’t think it applied, since there’s no mention of it being triggered after a node goes down. I’ll look out for 5.10 and update as soon as I see it.

If I still have issues, I’ll let you know.

Thanks!