SensuGO Cloud Deployment Issues

We are testing Sensu Go for a large-scale deployment in a cloud (AWS) environment, and we are running into serious capacity limitations.

The Setup

  • 1 external etcd node, c5.2xlarge, with a dedicated I/O-optimized NVMe disk for etcd data (Debian 10)
  • 5 sensu-backend nodes, c5.large (Debian 10)
  • 1 LB (HAProxy) (Debian 10)
  • up to 25,000 clients (Debian 8 and 9)
  • 40 checks
  • Sensu Go v6.2.0 OSS


  1. Sensu Go cannot register agents quickly enough, roughly 1,000 to 1,500 per minute, which is a real problem for disaster recovery because keepalive events start to trigger. If we connect agents faster than that, the system panics.

  2. The system cannot handle frequent checks, e.g. 10 checks per minute with 15,000 agents: etcd's CPU utilization maxes out.

  3. Similarly, with a larger number of checks (25 checks on a 30-minute interval), etcd's CPU utilization also maxes out.

    {"component":"keepalived","error":"internal error: rpc error: code = Unavailable desc = transport is closing","level":"error","msg":"error updating event","time":"2021-02-26T05:20:31Z"}
    Feb 26 05:20:31 ip-172-31-82-204 sensu-backend[595]: {"check_name":"keepalive","check_namespace":"default","component":"eventd","entity_name":"sensu-agent-1_46","entity_namespace":"default","error":"internal error: rpc error: code = Unavailable desc = transport is closing","event_id":"2d927dfb-8b31-4af4-9353-42dac274b17e","level":"error","msg":"error handling event","time":"2021-02-26T05:20:31Z"}
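One mitigation we are considering for (1) is staggering agent startup during recovery instead of letting all clients reconnect at once. A minimal sketch of the idea (a hostname-derived delay; this is our own workaround, not a Sensu feature, and the 300 s window is just an example):

```shell
#!/bin/sh
# Derive a deterministic per-host delay so thousands of agents do not
# all re-register in the same minute after a disaster-recovery restart.
SPLAY_WINDOW=300                               # spread restarts over 5 minutes (example)
HOST_HASH=$(hostname | cksum | cut -d' ' -f1)  # stable hash of the hostname
DELAY=$((HOST_HASH % SPLAY_WINDOW))            # delay in 0..SPLAY_WINDOW-1 seconds
echo "delaying sensu-agent start by ${DELAY}s"
# On each client you would then run:
# sleep "$DELAY" && systemctl start sensu-agent
```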

etcd can handle a much higher transaction rate and payload size than this. With this configuration, we can see memory and IOPS being underutilized while the CPU is saturated. We have tried various configuration tuning with no luck so far. We are using a single etcd node to avoid network delay and the CPU cost of cluster synchronization. We suspect an application-level limitation in Sensu Go.
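For context, the tuning we experimented with was along these lines, using etcd's standard environment variables (the values below are illustrative examples, not recommendations):

```ini
# /etc/default/etcd -- example values only
ETCD_QUOTA_BACKEND_BYTES=8589934592    # raise the 2 GiB default backend quota
ETCD_SNAPSHOT_COUNT=100000             # fewer snapshots = less I/O, more memory
ETCD_HEARTBEAT_INTERVAL=100            # ms; single node, so defaults apply
ETCD_ELECTION_TIMEOUT=1000             # ms
ETCD_AUTO_COMPACTION_MODE=revision
ETCD_AUTO_COMPACTION_RETENTION=2
```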

I would like the Sensu Go team to confirm whether Sensu Go has such limitations for cloud deployments, or to recommend ways to improve performance and capacity. I'd be happy to share more logs and traces if needed.



Sensu makes use of etcd leases and watches as part of its operation, and these etcd features are probably CPU-bound. So if you are benchmarking against raw etcd key creation/deletion transaction throughput (which is disk-I/O-bound), that could account for the difference in performance expectations.
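To illustrate the distinction: the number usually quoted for etcd capacity comes from raw key-value throughput, which you can sanity-check with etcdctl's built-in perf check. A passing result there does not rule out the lease/watch bottleneck described above, because that check does not exercise leases or watches. (The endpoint below is a placeholder; `check perf` requires a reachable etcd and etcdctl v3.)

```shell
# Raw KV throughput check only -- does NOT model Sensu's lease/watch load.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd.example.internal:2379 \
  check perf --load=l
```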

What version of etcd are you running externally? I know the Sensu engineering team has been looking to upgrade the embedded etcd from 3.3 to 3.4, but performance regressions in etcd itself have raised concerns, so they are closely tracking the etcd 3.5 milestone to see whether those regressions are fixed in the yet-to-be-released etcd 3.5.

We are using etcd 3.5-pre.

Unfortunately, we suffer from similar problems:

Our old setup used a single t2.micro Redis instance, which was sufficient to handle 10k+ clients…