High CPU after loss of connectivity in etcd cluster

I'm currently in the process of setting up a Sensu Go cluster to migrate to. It currently has 5 nodes (3 in one datacentre and 2 in another). We have been having some brief interruptions in network connectivity, causing the cluster to log a loss of communication with some of the cluster members. Following the connection issues, Sensu immediately starts reporting that read-only requests are taking too long, and the CPU on the backend servers maxes out.

The cluster seems to recover after a few seconds, and running the sensuctl cluster health command shows everything is healthy… However, the slow-response messages do not stop (possibly because the CPU is maxed out on all cluster members), and the CPU never seems to recover… Everything else seems to be OK.
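
For reference, this is roughly how I'm checking the cluster state. The member-list subcommand and the etcdctl query are extra sanity checks I'm assuming are available (etcdctl installed separately, embedded etcd on the default client port 2379):

sensuctl cluster health
sensuctl cluster member-list

# Optional check directly against the embedded etcd (assumes the default client port 2379)
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status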

The only way I seem to be able to resolve the CPU issue is to restart the backend on every cluster member.
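
Concretely, the workaround is a rolling restart of the backend service on each member, something like this (service name taken from the systemd unit shipped with the RPM):

# Run on each cluster member, one at a time
sudo systemctl restart sensu-backend
# Check that CPU usage has dropped before moving on to the next member
top -b -n 1 | head -n 15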

I'm running the latest version of Sensu Go on RHEL7 servers. There are currently only 5 agents connected (one running on each server), so there's next to no load on the servers themselves.

Here’s an example of the logs when the issue occurs, and just after… It’s a little hard to tell when everything comes back, as it doesn’t seem to log when a new leader is elected or found.

Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer e4448e91885c4656 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read e4448e91885c4656 on stream MsgApp v2 (read tcp 10.3.0.43:40764-\u003e10.2.0.24:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:15 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer e4448e91885c4656 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:15+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer d316a74a763c5880 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read d316a74a763c5880 on stream Message (read tcp 10.3.0.43:55846-\u003e10.2.0.43:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer d316a74a763c5880 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer f90cbaedc35cc694 (stream Message reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"error","msg":"failed to read f90cbaedc35cc694 on stream Message (read tcp 10.3.0.43:35924-\u003e10.2.0.44:2380: i/o timeout)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:16 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"lost the TCP streaming connection with peer f90cbaedc35cc694 (stream MsgApp v2 reader)","pkg":"rafthttp","time":"2019-06-12T09:46:16+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"licensing","level":"error","msg":"unexpected event received: {Error  [] etcdserver: no leader}","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"tessend","error":"etcdserver: no leader","level":"error","msg":"ring event error","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:17 server-3 sensu-backend: {"component":"schedulerd","level":"error","msg":"error in entity watcher","time":"2019-06-12T09:46:17+01:00"}
Jun 12 09:46:24 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:24+01:00"}
Jun 12 09:46:29 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:29+01:00"}
Jun 12 09:46:29 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:29+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-processes\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-processet\\\" \" with result \"error:etcdserver: request timed out\" took too long (7.261483399s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-60s\\\" range_end:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-60t\\\" \" with result \"error:etcdserver: request timed out\" took too long (9.734125535s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-cpu\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-cpv\\\" \" with result \"error:etcdserver: request timed out\" took too long (12.433451699s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-cpu\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-cpv\\\" \" with result \"error:etcdserver: request timed out\" took too long (12.433451699s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:31 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/entities/default/server-5.test\\\" limit:1 \" with result \"error:etcdserver: request timed out\" took too long (12.818445239s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:31+01:00"}
Jun 12 09:46:34 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:34+01:00"}
Jun 12 09:46:34 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:34+01:00"}
Jun 12 09:46:38 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:38+01:00"}
Jun 12 09:46:38 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-30s\\\" range_end:\\\"/sensu.io/events/default/server-5.test/metrics-wrapper-os-only-30t\\\" \" with result \"error:etcdserver: request timed out\" took too long (7.870088896s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:38+01:00"}
Jun 12 09:46:39 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:39+01:00"}
Jun 12 09:46:39 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:39+01:00"}
Jun 12 09:46:42 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer e4448e91885c4656 could not connect: dial tcp 10.2.0.24:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:42+01:00"}
Jun 12 09:46:44 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer d316a74a763c5880 could not connect: dial tcp 10.2.0.43:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:44+01:00"}
Jun 12 09:46:44 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"health check for peer f90cbaedc35cc694 could not connect: dial tcp 10.2.0.44:2380: i/o timeout (prober \"ROUND_TRIPPER_RAFT_MESSAGE\")","pkg":"rafthttp","time":"2019-06-12T09:46:44+01:00"}
Jun 12 09:46:45 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"timed out waiting for read index response (local node might have slow network)","pkg":"etcdserver","time":"2019-06-12T09:46:45+01:00"}
Jun 12 09:46:45 server-3 sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/events/default/server-5.test/check-ports\\\" range_end:\\\"/sensu.io/events/default/server-5.test/check-portt\\\" \" with result \"error:etcdserver: request timed out\" took too long (13.730537472s) to execute","pkg":"etcdserver","time":"2019-06-12T09:46:45+01:00"}

This continues until 09:47:13, at which time it no longer reports any connectivity issues, only the “read-only range request took too long” messages.

Looking at our CPU graphs, the CPU on all the servers maxes out between 09:46:00 and 09:46:30 (we sample every 30 seconds).


Hi @Adam_McDonagh

Which version of Sensu Go are you running? We included several performance enhancements in 5.9 which should really help.

That being said, we believe we have identified the cause of a similar problem that might still occur in the latest release under specific conditions; see https://github.com/sensu/sensu-go/issues/3012. We expect this patch to be included in the next release, 5.10, which should land this week along with several other performance improvements.

While it's hard to determine whether your high CPU usage has the same cause, the context you gave us makes me believe it does. I'll try to reach out to you once the new release is available.

Thanks

Hi, thanks for the reply.

We’re currently running the latest version (sensu-go-backend-5.9.0-3043.x86_64).
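
For anyone comparing, this is how I'm checking what's installed. The rpm query matches the package name above; the version subcommand is what I'd expect the backend binary to support, so treat it as an assumption:

rpm -q sensu-go-backend
sensu-backend version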

It's good news if 5.10 will address the issue I'm having. I did find that GitHub issue, but didn't think it applied, since there's no mention of it being triggered after a node goes down. I'll look out for 5.10 and update as soon as I see it.

If I still have issues, I'll let you know.

Thanks!

I’m seeing the same CPU issue, though I haven’t been able to track down the triggering cause. Connectivity issues seem like a less likely explanation in a Kubernetes cluster in the same datacenter(?).

I tried switching to an external etcd cluster. Even though it seems to take a bit longer for the issue to appear after booting up the cluster (12-24 h instead of 30-60 minutes), the issue is still there, and it is sensu-backend rather than etcd itself that is using the CPU.
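
For context, the switch to an external etcd cluster was done roughly like this. The flag names are what I recall from the Sensu Go 5.x docs and the endpoints are placeholders, so double-check both against your own setup:

# Run sensu-backend against an external etcd cluster instead of the embedded one
# (endpoint URLs below are placeholders)
sensu-backend start \
  --no-embed-etcd \
  --etcd-client-urls "http://etcd-1:2379,http://etcd-2:2379,http://etcd-3:2379"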

@palourde , would you happen to know if the https://github.com/sensu/sensu-go/issues/3012 fix applies to external etcd as well?

@mattias Yes, it does apply to external etcd clusters as well.

We had to delay the 5.10 release a bit, but it should be available by the end of the day if everything goes as planned and we don't encounter any major bugs during QA!

Sorry, figuring out this issue and fixing it has been more challenging than anticipated; the fix is now slated to land in 5.11 rather than 5.10.

I can confirm that 5.10 has not resolved my specific issue… I was hoping the CPU fixes mentioned in the changelog would have some impact :frowning:

Any approximate timelines for 5.11?

Hi @Adam_McDonagh,

We had to take a step back and revisit the patch. That being said, it’s now merged and available on master. Due to the holidays this week (Canada & USA), the 5.11.0 release is planned for next week. However, you don’t necessarily have to wait, since you could build and deploy your own binaries from the sensu-go repo.
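
If you do want to try it ahead of the release, building from master looks roughly like this, assuming a recent Go toolchain and that the backend entry point is still under cmd/sensu-backend in the repo:

git clone https://github.com/sensu/sensu-go.git
cd sensu-go
go build -o bin/sensu-backend ./cmd/sensu-backend
# Deploy the resulting binary to a test node and restart the service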

I’ll update this post once 5.11.0 is available!

Thanks

Hello @Adam_McDonagh,

Sensu Go 5.11 is now available for download. You can view the release notes here:

https://docs.sensu.io/sensu-go/5.11/release-notes/#5-11-0-release-notes
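
On RHEL7 with the Sensu package repository already configured, the upgrade should be roughly as follows (package and service names taken from earlier in the thread):

sudo yum update sensu-go-backend
sudo systemctl restart sensu-backend
sensu-backend version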

Regards,
Richard.


Thanks! Deployed on the test system this afternoon. Fingers crossed!
