Sensu classic vs Sensu go scalability

We have been running sensu classic with the snssq-ng transport in aws eks/k8s. This setup has scaled very well and serves about 15k clients (with 10-15 checks each) distributed across about 10 datacenters (each datacenter has one or more api and server container pods). We use uchiwa to display the events of all datacenters. The whole setup consumes less than 1 cpu core in k8s. Our sensu clients are not real clients but lambda functions that act like sensu clients. We monitor aws services like rds and other services using this setup.

Due to the deprecation of sensu v1 we took a look at sensu-go oss.

In our test setup we use an external etcd cluster (installed via the bitnami helm chart).

We run our lambda functions against the sensu-go api and create entities and events similar to our old solution.
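For readers who have not seen this pattern, here is a minimal sketch of what such a lambda-style client might send. It is not our actual code: the helper names (`build_event`, `event_url`), the backend URL, and the entity and check names are all hypothetical, and the body layout assumes the sensu-go core/v2 events API with proxy entities. The sketch only builds the request; the HTTP call and authentication are left out.

```python
import json


def build_event(entity_name, check_name, status, output, namespace="default"):
    """Build an event body for a proxy entity (hypothetical helper).

    Assumes the sensu-go core/v2 event layout: an entity section and a
    check section, each carrying its name under "metadata".
    """
    return {
        "entity": {
            "entity_class": "proxy",
            "metadata": {"name": entity_name, "namespace": namespace},
        },
        "check": {
            "metadata": {"name": check_name},
            "status": status,    # 0 = ok, 1 = warning, 2 = critical
            "output": output,
            "interval": 60,      # assumed check interval in seconds
        },
    }


def event_url(base_url, event):
    """Return the per-event endpoint (assumed: .../events/<entity>/<check>)."""
    ns = event["entity"]["metadata"]["namespace"]
    entity = event["entity"]["metadata"]["name"]
    check = event["check"]["metadata"]["name"]
    return f"{base_url}/api/core/v2/namespaces/{ns}/events/{entity}/{check}"


if __name__ == "__main__":
    # Hypothetical entity, check, and backend address, for illustration only.
    event = build_event("rds-example-db", "rds-cpu", 2, "CPU above threshold")
    print(event_url("http://sensu-backend.example:8080", event))
    print(json.dumps(event, indent=2))
```

With roughly 15k entities and 10-15 checks each, every check interval turns into a large burst of writes like this one, which is the load pattern behind the problems described in this thread.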

The new setup does not handle more than 3k clients without problems like these:

I understand that sensu-go has a commercial offering which might scale better. But it looks like there are some general issues here. It would be great if sensu-go oss scaled similarly to sensu classic.

I could not post these issues…

With regard to the external etcd, it looks like you’re hitting a deficiency in the wrapper script logic that the docker image uses to deal with the init.

The script currently uses SENSU_HOSTNAME for the init wait loop, which is probably inappropriate for the external etcd service in your configuration. You’ll probably need a custom docker image until we can get this sorted out in the official images.

Here’s a post from earlier this month where I walk through how I make custom changes to the entrypoint. Hopefully this will get you unstuck for now.

We tried it without external etcd before that. It performed even worse. That alone isn’t the issue.

At my org we had a number of the same issues in some of our larger production clusters, while our smaller ones were fine. We switched to the postgres datastore (Datastore reference - Sensu Docs), which sadly is a commercial feature. In our case we were already paying for it to get SSO (the biggest factor), other features, and support, so it was a relatively painless decision to move ahead and not dig deeper. We have not had the same issues in those clusters since we switched over.

I would like to see better scalability that does not require the commercial version. One of the biggest reasons some of the larger environments use OSS rather than commercial is the way it’s priced (per node); it creates a disincentive to use horizontal scaling. @runningman84 I know you have been a long-time sensu community member and contributor; I want to thank you for your efforts over the years. I know this thread has the right eyes on it, or I would be reaching out on your behalf to make sure it gets attention. If you are at all interested and willing to explore a commercial agreement, I know there is flexibility, especially with volume.

While I can’t speak to all of the problems at play, as I understand it some of them are hard limitations of etcd itself, such as its max db size.

We did some tests using prometheus this week, and it could easily handle our additional 15k clients. We lose some information, like check outputs, because it does not fit the prometheus data model. So we just have metrics like sensu_check_status and sensu_check_occurrences, plus a simple alarm: alert if sensu_check_status is > 1 and sensu_check_occurrences is > 5. We store some basic metadata like aws_account number or aws_service_type as metric labels. One drawback is that you do not get a web ui like uchiwa where you can do a full text search.
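To make the data model concrete, here is a rough sketch of how one check result could be rendered as prometheus text-format samples, together with the alarm condition from above expressed as a plain function. The helper names, the `check` label, and the account number are made up for illustration; in a real setup the condition would live in a prometheus alerting rule rather than in python.

```python
def render_metric(name, labels, value):
    """Render one sample in the prometheus text exposition format.

    Simplified: label values are assumed to contain no quotes,
    backslashes, or newlines, so no escaping is done.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def should_alert(status, occurrences, min_status=1, min_occurrences=5):
    """Mirror the alarm: status > 1 and occurrences > 5."""
    return status > min_status and occurrences > min_occurrences


if __name__ == "__main__":
    # Hypothetical labels; the account number is a placeholder.
    labels = {"aws_account": "123456789012", "check": "rds-cpu"}
    print(render_metric("sensu_check_status", labels, 2))
    print(render_metric("sensu_check_occurrences", labels, 6))
    print(should_alert(status=2, occurrences=6))
```

Note that the check output has no place in this model: a sample is only a name, a label set, and a number, which is exactly the information we lose compared to sensu events.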

This gives us a few options:

  • Keep running the sensu 1.x stuff as a fork (which is outdated but running fine so far)
  • Wait for a better sensu oss version (which could be a dead end if the real problem is not fixable because of the etcd layer)
  • Buy sensu commercial features (which is probably a financial no go based on our high numbers of nodes/clients)
  • Migrate to prometheus (which is default for most k8s setups anyway)

I would really like to continue to contribute to the sensu project, but it looks like under the current circumstances prometheus will be the best option for the time being.