Frequent Etcd timeouts

I am running Sensu-Go-Backend version 5.10.2 on CentOS 7 in a 3-node cluster. Once every day or so, the CPU of one or more backend servers goes to 100% and stays there until the sensu-backend process is restarted. The logs contain many messages like this (though not on every server, and not consistently):
sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request "key:\"/sensu.io/silenced/default/\" range_end:\"/sensu.io/silenced/default0\" " with result "range_response_count:0 size:7" took too long (138.43301ms) to execute","pkg":"etcdserver","time":"2019-07-02T16:18:51-07:00"}
There are ~90 clients publishing checks to these servers, and most check results are handed to a TCP socket handler. Do you have a recommendation for how I can configure this cluster to avoid this load issue?

I have experienced a similar issue, but with even fewer clients. I’ve been told that this is because my storage backend is not fast enough to handle the etcd data replication for the cluster. FWIW, my cluster is running on Hyper-V VMs, and it has been recommended that we consider switching from VMs to physical hardware. I’m curious about your configuration: is it running on VMs as well?

I am running on VMs. This is a 3-member cluster running on Amazon EC2, and the instances are type i3.xlarge. These are Amazon’s mid-tier, storage-optimized servers… this is already way more hardware than I expected to have to throw at a project to support 90 clients.

Hi!

Using i3.xlarge instances for ~90 agents is definitely excessive, and you shouldn’t require that much hardware. For reference, we were able to handle more than 12,000 events per second on i3.2xlarge instances, which could correspond to 4000 entities running 14 checks at a 5-second interval.
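To make the sizing comparison concrete, here is a rough back-of-the-envelope sketch of the event rate. The function name is made up for illustration, and the 14 checks at a 5-second interval for the ~90 agent case is purely an assumption borrowed from the reference figures above, not something reported in this thread.

```python
def events_per_second(entities: int, checks_per_entity: int, interval_seconds: float) -> float:
    """Approximate steady-state event rate, assuming every entity runs
    every check exactly once per interval (a simplification)."""
    return entities * checks_per_entity / interval_seconds


# Reference workload quoted above: 4000 entities, 14 checks, 5-second interval.
print(events_per_second(4000, 14, 5))  # 11200.0

# Hypothetical: ~90 agents with the same check count and interval (assumed values).
print(events_per_second(90, 14, 5))  # 252.0
```

Under those assumptions, a ~90 agent deployment produces only a small fraction of the event rate in the reference test, which is why this hardware should be more than sufficient.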

I believe you might be encountering this bug: https://github.com/sensu/sensu-go/issues/3012. The patch has been merged to master, and it will be released next week as part of the 5.11.0 release if everything goes well. However, you don’t necessarily have to wait, since you could build and deploy your own binaries.

Thanks

Hi! I just wanted to let you know that Sensu Go 5.11.0 is now available. I’d love to know whether it helps with the high resource usage problem you saw.

Thank you for checking back with me. I upgraded to 5.11 this morning but unfortunately my issue remains. I am still seeing the Etcd timeout messages regularly.

Hi @marmot,

The “read-only range request [...] took too long” messages do not necessarily indicate a problem. What concerns me is the CPU usage. Based on our tests, upgrading to 5.11.0 and restarting the cluster should solve this issue. If you are still facing this problem, I’d recommend opening a new issue on GitHub and providing as much information as possible (configuration, logs, etc.).

OK. I should clarify: the CPU load issue I was seeing does appear to be fixed in version 5.11. If the Etcd timeouts are not considered a problem, I will leave them for now. Thank you for the attention.
