I am running the Sensu Go backend (sensu-backend) version 5.10.2 on CentOS 7 in a 3-node cluster. Once every day or so, the CPU of one or more backend servers goes to 100% until the sensu-backend process is restarted. The logs contain lots of messages like this (though not on every server and not always consistently):
sensu-backend: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\"/sensu.io/silenced/default/\" range_end:\"/sensu.io/silenced/default0\" \" with result \"range_response_count:0 size:7\" took too long (138.43301ms) to execute","pkg":"etcdserver","time":"2019-07-02T16:18:51-07:00"}
There are ~90 clients publishing checks to these servers, and most check results are handed to a TCP socket handler. Do you have a recommendation for how I can configure this cluster to avoid this load issue?
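For context, the handler is a standard Sensu Go TCP socket handler, defined roughly like the sketch below (the name, host, and port are placeholders rather than my real values) and registered with sensuctl create --file handler.json:

{
  "type": "Handler",
  "api_version": "core/v2",
  "metadata": {
    "name": "tcp_handler",
    "namespace": "default"
  },
  "spec": {
    "type": "tcp",
    "socket": {
      "host": "tcp-handler.example.com",
      "port": 4242
    },
    "timeout": 10
  }
}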
I am running on VMs. This is a 3-member cluster running on Amazon EC2, and the instances are type i3.xlarge. These are Amazon’s mid-tier, storage-optimized servers… this is already far more hardware than I expected to have to throw at a project supporting 90 clients.
Hi! Using i3.xlarge instances for ~90 agents is definitely excessive, and you shouldn’t require that much hardware. For reference, we were able to handle more than 12,000 events per second on i3.2xlarge instances, which could correspond to 4,000 entities running 14 checks at a 5-second interval.
I believe you might be encountering this bug: https://github.com/sensu/sensu-go/issues/3012. The patch has been merged to master, and it will be released next week as part of the 5.11.0 release if everything goes well. However, you don’t necessarily have to wait, since you could build and deploy your own binaries.
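If you decide to build from source, the steps would look roughly like this sketch (it assumes a working Go toolchain on the build host; the cmd/sensu-backend package path is my assumption about the repository layout, so double-check it against the project README):

# fetch the repository and build the backend from master, which contains the fix
git clone https://github.com/sensu/sensu-go.git
cd sensu-go
go build -o bin/sensu-backend ./cmd/sensu-backend
# copy bin/sensu-backend over the packaged binary on each node, then restart the service
sudo systemctl restart sensu-backend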
Thanks
Hi! I just wanted to let you know that Sensu Go 5.11.0 is now available. I’d love to know whether it helps with the high resource usage problem you saw.
Thank you for checking back with me. I upgraded to 5.11 this morning but unfortunately my issue remains. I am still seeing the Etcd timeout messages regularly.
Hi @marmot,
The "read-only range request [...] took too long" messages do not necessarily indicate a problem; etcd emits this warning whenever a request takes longer than 100 ms, which can simply reflect transient disk or network latency. What concerns me is the CPU usage. Upgrading to 5.11.0 and restarting the cluster should solve this issue based on the tests we made. If you are still facing this problem, I’d recommend opening a new issue on GitHub and providing as much information as possible (configuration, logs, etc.).
OK, I should clarify: the CPU load issue I was seeing does appear to be fixed in 5.11. If the etcd timeouts are not thought to be a problem, I will leave them alone for now. Thank you for your attention.