Sensu Go cloud deployment issues
We are testing Sensu Go for a large-scale deployment in a cloud (AWS) environment and are running into some serious capacity limitations.
The Setup
Specs:
- 1 external etcd node, c5.2xlarge, with a dedicated IO-optimized NVMe disk for etcd data (Debian 10)
- 5 sensu-backend nodes, c5.large (Debian 10)
- 1 load balancer, HAProxy (Debian 10)
- up to 25,000 agents (Debian 8 and 9)
- 40 checks
- Sensu Go v6.2.0 OSS
Limitations
- Sensu Go cannot register agents quickly enough, roughly 1,000 to 1,500 per minute, which is unacceptable for disaster recovery because keepalive events start to trigger. If we connect agents faster than that, the system panics (see: sensugo-v6-1-0-ce-gets-into-a-panic-when-connecting-more-than-1500-agents-at-once).
- The system cannot handle frequent checks, e.g. 10 checks per minute across 15K agents: etcd's CPU utilization maxes out.
- Similarly, with a larger number of checks (25) at a 30-minute interval, etcd's CPU utilization also maxes out.
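For a rough sense of the event write rate these scenarios imply (assuming Sensu Go's default 20-second keepalive interval; each event ultimately lands as an etcd write), a back-of-envelope sketch:

```shell
# events/sec = agents * checks / interval_in_seconds
echo "25000 agents, keepalive every 20s: $((25000 / 20)) events/s"        # 1250
echo "15000 agents * 10 checks/min:      $((15000 * 10 / 60)) events/s"   # 2500
echo "25000 agents * 25 checks / 30min:  $((25000 * 25 / 1800)) events/s" # ~347
```

Even the "slow" 30-minute scenario sustains hundreds of writes per second on top of the keepalive stream, which is consistent with etcd CPU being the first resource to saturate.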
Representative backend errors:
{"component":"keepalived","error":"internal error: rpc error: code = Unavailable desc = transport is closing","level":"error","msg":"error updating event","time":"2021-02-26T05:20:31Z"}
Feb 26 05:20:31 ip-172-31-82-204 sensu-backend[595]: {"check_name":"keepalive","check_namespace":"default","component":"eventd","entity_name":"sensu-agent-1_46","entity_namespace":"default","error":"internal error: rpc error: code = Unavailable desc = transport is closing","event_id":"2d927dfb-8b31-4af4-9353-42dac274b17e","level":"error","msg":"error handling event","time":"2021-02-26T05:20:31Z"}
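One knob that bears directly on this keepalive load is the agent keepalive interval. A sketch of an agent.yml that lengthens it (the values here are illustrative, not something we have validated; the settings themselves are standard Sensu Go agent options):

```yaml
# agent.yml -- longer keepalives reduce etcd write pressure,
# at the cost of slower failure detection (values are illustrative)
keepalive-interval: 60            # default: 20 seconds
keepalive-warning-timeout: 180    # must be greater than the interval
keepalive-critical-timeout: 300
```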
etcd can handle a far higher transaction rate and payload size than this. With this configuration we see memory and IOPS underutilized while the CPU is maxed out. We have tried various configuration tuning without success so far. We are deliberately running a single etcd node to avoid network latency and save the CPU time spent on cluster synchronization. We suspect an application-level limitation in Sensu Go.
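To back the claim that etcd itself is not the bottleneck, raw write throughput can be measured with the `benchmark` tool from the etcd source tree. A sketch (the endpoint and payload sizes are illustrative, and it requires a running etcd to point at, so run it against a test node, not production):

```shell
# Measure raw etcd write throughput with etcd's bundled benchmark tool.
# Endpoint and payload sizes are illustrative; the tool writes real keys.
benchmark put \
  --endpoints=http://127.0.0.1:2379 \
  --clients=100 --conns=100 \
  --key-size=8 --val-size=256 \
  --total=100000
```

Comparing the reported puts/sec against the event rates above would show how much headroom the etcd node has on its own.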
I would like the Sensu Go team to confirm whether Sensu Go has such limitations for cloud deployments, or to recommend ways to improve performance and capacity. I'd be happy to share more logs and traces if needed.