I’m running Sensu Backend and Agents v5.14.2 in a three-node cluster. As I’ve added more checks, their timing has become increasingly irregular, to the point that the metrics I collect arrive at wildly variable intervals: checks scheduled to execute every 10 seconds can run a few seconds late, and checks that run once per hour can occasionally be 20 minutes late. I/O and CPU on the servers are not under pressure; the only thing I’m seeing in the logs is intermittent etcd “took too long to execute” messages. I have 104 clients collecting metrics that arrive at about 250/second, though no individual check is scheduled at less than a 10-second interval. How can I alleviate what looks like a performance problem here?
My understanding is that you’ll get the biggest scaling benefit by moving to a Postgres event store, but you may also be able to tune the backend worker and buffer configuration.
I’m not sure exactly which workers/buffers you’ll need to adjust; I’d start with the eventd workers/buffer and then increase the pipelined workers/buffer.
I’d focus on tuning the buffers first: from discussion in the GitHub issues, it seems that a small buffer can affect how the agents run.
Thank you for the suggestions. I increased the recommended worker and buffer settings as follows:
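(Roughly the following in /etc/sensu/backend.yml — the values below are illustrative rather than exact, and the Sensu Go 5.x defaults are 100 for each:)

```yaml
# /etc/sensu/backend.yml — raise worker counts and buffer sizes
# (illustrative values; defaults are 100 for each setting)
eventd-workers: 200         # goroutines processing incoming events
eventd-buffer-size: 1000    # events queued before producers block
pipelined-workers: 200      # goroutines running filter/mutate/handle pipelines
pipelined-buffer-size: 1000 # events queued awaiting pipeline processing
```

The backend reads this file at startup, so sensu-backend needs a restart on each node for the changes to take effect.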
Unfortunately this does not seem to have solved the problem. I will look into the PostgreSQL event store option next.
Actually, it looks like the PostgreSQL datastore requires an Enterprise license. I’ll see if I have that option.