Sensu Go v6.1.0-CE backend gets into a panic state when connecting more than 1,500 agents at once

I am benchmarking Sensu Go for a larger deployment. Whenever I try to connect more than 1,500 sensu-agents at once, the backend gets into a panic state: it does not admit any new agents and drops any that are already connected. It works fine if I connect the agents in smaller batches of 1,000 or less.

In the sensu-backend logs, I found:
    Nov 27 09:43:39 ip-172-31-82-204 sensu-backend[469]: {"level":"warn","ts":"2020-11-27T09:43:39.707Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-999b3656-1d41-48d3-9dd4-e3cf228f742a/172.31.91.226:2379","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
    Nov 27 09:43:39 ip-172-31-82-204 sensu-backend[469]: {"agent":"sensu-client-43_23","component":"agentd","error":"context canceled","level":"error","msg":"error querying the entity config","namespace":"default","time":"2020-11-27T09:43:39Z"}
    Nov 27 09:43:39 ip-172-31-82-204 sensu-backend[469]: {"address":"34.228.52.3:40214","agent":"sensu-client-43_23","component":"agentd","error":"context canceled","level":"error","msg":"failed to start session","namespace":"default","time":"2020-11-27T09:43:39Z"}

I wasn’t able to find anything about this in the documentation or KB.
I would like to know whether this is a bug or an application limitation.

My system:

  • sensu-backend version 6.1.0+ce, community edition, built 2020-10-19, built with go1.15.3
  • sensu-agent version 6.1.0+ce, community edition, built 2020-10-19, built with go1.15.3
  • external etcd v3.5.0

You are likely running into limits with your etcd deployment. Please have a look at your etcd logs and metrics for more information.
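For example, assuming your etcd client endpoint is the 172.31.91.226:2379 address that shows up in your backend log (adjust the URL and add TLS flags as needed for your setup), something like the following gives a quick picture of member health and of the disk-latency metrics that usually reveal an overloaded etcd:

    # Member health and status (leader, DB size, raft term)
    etcdctl --endpoints=http://172.31.91.226:2379 endpoint health
    etcdctl --endpoints=http://172.31.91.226:2379 endpoint status --write-out=table

    # Prometheus metrics; high WAL fsync / backend commit latencies mean the
    # disk cannot keep up with the write load
    curl -s http://172.31.91.226:2379/metrics | \
      grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds'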

We’ve loaded Sensu Go with up to 14,000 agents in lab tests with etcd 3.3. In these tests, etcd was running in a 3-node cluster on commodity Intel 660p M.2 NVMe SSDs. We have not tested with etcd 3.4 or etcd 3.5 under load.

When etcd becomes overloaded, Sensu will stop functioning correctly. This is a fundamental limitation of the software.

You can improve etcd throughput by using better hardware, or by increasing etcd’s heartbeat interval and election timeout.
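If you go the tuning route, the relevant knobs on the etcd side are the heartbeat interval and election timeout, both in milliseconds. The values below are only illustrative; choose them based on your measured disk and network latency:

    # Defaults are --heartbeat-interval=100 and --election-timeout=1000;
    # keep the election timeout roughly 10x the heartbeat interval
    etcd --heartbeat-interval=300 --election-timeout=3000   # plus your existing flags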

We’ve developed postgres support for the enterprise version of Sensu in order to accommodate larger clusters. We have scaled Sensu clusters up to 40,000 agents in a lab setting with the enterprise edition, using postgres.
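If you want to try it, the Postgres event store is enabled in the enterprise edition by creating a PostgresConfig resource with sensuctl, roughly like the following; the resource name, DSN, and pool size here are placeholders you would replace with your own:

    cat > postgresconfig.yml <<'EOF'
    ---
    type: PostgresConfig
    api_version: store/v1
    metadata:
      name: postgres-event-store
    spec:
      dsn: "postgresql://sensu:REPLACE_ME@postgres.example.com:5432/sensu_events"
      pool_size: 20
    EOF
    sensuctl create --file postgresconfig.yml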

Thank you for taking the time to respond. You are right, etcd is the cause of the limit.

Initially we tried the above-mentioned configuration on AWS with:

  • sensu-backend c5.large (2 vCPU, 4GB RAM)
  • external etcd c5.large (2 vCPU, 4GB RAM) (General purpose SSD)

Then, as you (thankfully) suggested, we tried the same specs with an additional disk for the etcd data store:

  • Provisioned IOPS SSD with 5000 IOPS

This gave only slightly improved results: we were able to connect 2,000 sensu-agents successfully, but not more than that, which does not look good enough for a large fault-tolerant monitoring system.

Moreover, we tried the enterprise version and configured an AWS RDS Postgres db.m5.large as you suggested (with an optimized disk). I am sorry to say that it made no difference; we could not go any further than 2,000 connections at once…

We are looking for a cloud-based monitoring solution. We are trying to find a unit configuration and then plan a scalable monitoring infrastructure for 200k web servers. I don’t think we can get specific on-demand hardware from a cloud provider, nor would that be feasible.

We are using a single etcd node as the ideal case; it gets far more stressed when we use an etcd cluster of just 3 nodes. It will become impossible to deploy it multi-region for FT and HA.

So my questions:

  • Can Sensu Go scale to something as big as 200k agents in a cloud deployment (not in a lab environment)? If so, do you have any ideas about the etcd specification, since it is the limiting component?
  • How do you recommend incorporating FT & HA in a deployment for such a large number of clients spread all around the world?