Sensu Go cloud deployment issues
We are testing Sensu Go for a large-scale deployment in a cloud (AWS) environment and are running into some serious capacity limitations.
The Setup
Specs:
- 1 external etcd node, c5.2xlarge, with a dedicated IO-optimized NVMe disk for etcd data (Debian 10)
- 5 sensu-backend nodes, c5.large (Debian 10)
- 1 load balancer, HAProxy (Debian 10)
- up to 25,000 agents (Debian 8 and 9)
- 40 checks
- Sensu Go v6.2.0 OSS
Limitations
- Sensu Go cannot register agents quickly enough, roughly 1,000 to 1,500 per minute, which is unacceptable for disaster recovery because keepalive events start to trigger. If we connect agents faster than that, the system panics (see: sensugo-v6-1-0-ce-gets-into-a-panic-when-connecting-more-than-1500-agents-at-once).
- The system cannot handle frequent checks, e.g. 10 checks per minute across 15K agents: etcd's CPU utilization maxes out.
- Similarly, with a larger number of checks (25) at a 30-minute interval, etcd's CPU utilization also maxes out.
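For a rough sense of the event write rate these scenarios imply (assuming Sensu Go's default 20-second keepalive interval; each event ultimately lands as an etcd write), a back-of-envelope sketch:

```shell
# events/sec = agents * checks / interval_in_seconds
echo "25000 agents, keepalive every 20s: $((25000 / 20)) events/s"        # 1250
echo "15000 agents * 10 checks/min:      $((15000 * 10 / 60)) events/s"   # 2500
echo "25000 agents * 25 checks / 30min:  $((25000 * 25 / 1800)) events/s" # ~347
```

Even the "slow" 30-minute scenario sustains hundreds of writes per second on top of the keepalive stream, which is consistent with etcd CPU being the first resource to saturate.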
Representative backend errors:
{"component":"keepalived","error":"internal error: rpc error: code = Unavailable desc = transport is closing","level":"error","msg":"error updating event","time":"2021-02-26T05:20:31Z"}
Feb 26 05:20:31 ip-172-31-82-204 sensu-backend[595]: {"check_name":"keepalive","check_namespace":"default","component":"eventd","entity_name":"sensu-agent-1_46","entity_namespace":"default","error":"internal error: rpc error: code = Unavailable desc = transport is closing","event_id":"2d927dfb-8b31-4af4-9353-42dac274b17e","level":"error","msg":"error handling event","time":"2021-02-26T05:20:31Z"}
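One knob that bears directly on this keepalive load is the agent keepalive interval. A sketch of an agent.yml that lengthens it (the values here are illustrative, not something we have validated; the settings themselves are standard Sensu Go agent options):

```yaml
# agent.yml -- longer keepalives reduce etcd write pressure,
# at the cost of slower failure detection (values are illustrative)
keepalive-interval: 60            # default: 20 seconds
keepalive-warning-timeout: 180    # must be greater than the interval
keepalive-critical-timeout: 300
```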
etcd can handle a far higher transaction rate and payload size than this. With this configuration we see memory and IOPS underutilized while the CPU is maxed out. We have tried various configuration tuning without success so far. We are deliberately running a single etcd node to avoid network latency and save the CPU time spent on cluster synchronization. We suspect an application-level limitation in Sensu Go.
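To back the claim that etcd itself is not the bottleneck, raw write throughput can be measured with the `benchmark` tool from the etcd source tree. A sketch (the endpoint and payload sizes are illustrative, and it requires a running etcd to point at, so run it against a test node, not production):

```shell
# Measure raw etcd write throughput with etcd's bundled benchmark tool.
# Endpoint and payload sizes are illustrative; the tool writes real keys.
benchmark put \
  --endpoints=http://127.0.0.1:2379 \
  --clients=100 --conns=100 \
  --key-size=8 --val-size=256 \
  --total=100000
```

Comparing the reported puts/sec against the event rates above would show how much headroom the etcd node has on its own.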
I would like the Sensu Go team to confirm whether Sensu Go has such limitations for cloud deployments, or to recommend ways to improve performance and capacity. I'd be happy to share more logs and traces if needed.