I am running the Sensu Go backend (sensu-backend) version 5.10.2 on CentOS 7 in a 3-node cluster. Once every day or so, the CPU of one or more backend servers goes to 100% until the sensu-backend process is restarted. The logs contain lots of messages like this (though not on every server and not always consistently):
sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\"/sensu.io/silenced/default/\" range_end:\"/sensu.io/silenced/default0\" \" with result \"range_response_count:0 size:7\" took too long (138.43301ms) to execute","pkg":"etcdserver","time":"2019-07-02T16:18:51-07:00"}
There are ~90 clients publishing checks to these servers, and most check results are handed to a TCP socket handler. Do you have a recommendation for how I can configure this cluster to avoid this load issue?
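For context, the handler is a standard Sensu Go TCP socket handler, defined roughly like the sketch below (the name, host, and port are placeholders rather than my real values) and registered with sensuctl create --file handler.json:

{
  "type": "Handler",
  "api_version": "core/v2",
  "metadata": {
    "name": "tcp_handler",
    "namespace": "default"
  },
  "spec": {
    "type": "tcp",
    "socket": {
      "host": "tcp-handler.example.com",
      "port": 4242
    },
    "timeout": 10
  }
}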
I am running on VMs. This is a 3-member cluster running on Amazon EC2, and the instances are type i3.xlarge. These are Amazon’s mid-tier, storage-optimized servers… this is already far more hardware than I expected to have to throw at a project supporting 90 clients.
Hi! Using i3.xlarge instances for ~90 agents is definitely excessive, and you shouldn’t require that much hardware. For reference, we were able to handle more than 12,000 events per second on i3.2xlarge instances, which could correspond to 4,000 entities running 14 checks at a 5-second interval.
I believe you might be encountering this bug: https://github.com/sensu/sensu-go/issues/3012. The patch has been merged to master, and it will be released next week as part of the 5.11.0 release if everything goes well. However, you don’t necessarily have to wait, since you could build and deploy your own binaries.
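If you decide to build from source, the steps would look roughly like this sketch (it assumes a working Go toolchain on the build host; the cmd/sensu-backend package path is my assumption about the repository layout, so double-check it against the project README):

# fetch the repository and build the backend from master, which contains the fix
git clone https://github.com/sensu/sensu-go.git
cd sensu-go
go build -o bin/sensu-backend ./cmd/sensu-backend
# copy bin/sensu-backend over the packaged binary on each node, then restart the service
sudo systemctl restart sensu-backend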
Thanks
Hi! I just wanted to let you know that Sensu Go 5.11.0 is now available. I’d love to know whether it helps with the high resource usage problem you saw.
Thank you for checking back with me. I upgraded to 5.11 this morning but unfortunately my issue remains. I am still seeing the Etcd timeout messages regularly.
Hi @marmot,
The "read-only range request [...] took too long" messages do not necessarily indicate a problem; etcd emits this warning whenever a request takes longer than 100 ms, which can simply reflect transient disk or network latency. What concerns me is the CPU usage. Upgrading to 5.11.0 and restarting the cluster should solve this issue based on the tests we made. If you are still facing this problem, I’d recommend opening a new issue on GitHub and providing as much information as possible (configuration, logs, etc.).
OK, I should clarify: the CPU load issue I was seeing does appear to be fixed in 5.11. If the etcd timeouts are not thought to be a problem, I will leave them alone for now. Thank you for your attention.