Irregular check execution

I’m running Sensu Backend and Agents v5.14.2 in a three-node cluster. As I’ve added more checks, I’ve seen the timing of these checks become more irregular, to the point that the metrics I collect come in at wildly variable intervals. Checks that should execute every 10 seconds can be a few seconds late, and checks that run once per hour can occasionally be 20 minutes late. I/O and CPU on the servers are not struggling; the only thing I’m seeing in the logs is intermittent etcd “took too long to execute” messages. I have 104 clients collecting metrics that come in at about 250/second, though no individual check is scheduled at less than a 10-second interval. How can I alleviate what looks like a performance problem here?


Have you taken a look at
and the associated blog post:

My understanding is you’ll get the biggest scaling benefit by moving to a PostgreSQL event store, but you may also be able to tune the backend worker and buffer configuration.
Not sure which worker/buffer you’ll need to adjust.

I guess I’d start with the eventd workers/buffer and then increase the pipelined workers/buffer.

I’d focus on tuning the buffers: from the discussion I’m reading in the GitHub issues, it seems a small buffer can impact how the agents are running.
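For reference, a minimal sketch of what those settings look like in the backend configuration file (the flag names come from the Sensu backend configuration reference; the values below are illustrative starting points, not recommendations, and the file path may differ on your install):

```yaml
# /etc/sensu/backend.yml -- illustrative values, tune for your workload
eventd-workers: 100        # goroutines processing incoming events
eventd-buffer-size: 100    # event queue depth before eventd applies backpressure
pipelined-workers: 100     # goroutines executing handler pipelines
pipelined-buffer-size: 100
keepalived-workers: 100    # goroutines processing agent keepalives
keepalived-buffer-size: 100
```

These can also be passed as `--eventd-workers`-style flags to `sensu-backend start`; a restart of the backend is needed for changes to take effect.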

Thank you for the suggestions. I increased the recommended settings as follows:
eventd-buffer-size: 1000
eventd-workers: 1000
keepalived-buffer-size: 1000
pipelined-buffer-size: 1000
pipelined-workers: 1000
etcd-heartbeat-interval: 250
etcd-election-timeout: 1250
Unfortunately, this does not seem to have solved the problem. I will look into the PostgreSQL event store option next.
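If you do go the PostgreSQL route, my understanding is the datastore is enabled by creating a `PostgresConfig` resource; a sketch along these lines, assuming a reachable database (the DSN, resource name, and pool size here are placeholders):

```yaml
---
type: PostgresConfig
api_version: store/v1
metadata:
  name: sensu-postgres
spec:
  # Placeholder connection string -- substitute your own host, credentials, and database
  dsn: "postgresql://sensu:password@pg.example.com:5432/sensu_events"
  # Maximum number of connections the backend may open to PostgreSQL
  pool_size: 20
```

Applied with something like `sensuctl create --file postgres-config.yml`; once configured, event storage moves off etcd, which is where the scaling benefit comes from.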

Actually, it looks like the PostgreSQL datastore requires an Enterprise license. I’ll see if I have that option.