resultChan buffer full Log messages

jtenzer · June 15, 2021, 3:54pm

My backend logs are inundated with the following error:

{"component":"store","key":"/sensu.io/checks/","level":"warning","msg":"resultChan buffer is full, watch events are not processed fast enough, incoming events from the watcher will be blocked","time":"2021-06-15T15:45:22Z"}
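To see whether these warnings line up with a schedule (a cron job, for instance), it can help to count them per minute. Below is a minimal Python sketch. It assumes each backend log line is a bare JSON object like the one above, which may not hold if your log collector adds a prefix. The function name is made up for illustration.

```python
import json
from collections import Counter


def count_resultchan_warnings(lines):
    """Count 'resultChan buffer is full' warnings per (key, minute)."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if "resultChan buffer is full" not in str(entry.get("msg", "")):
            continue
        # Truncate the timestamp to minute resolution, e.g. "2021-06-15T15:45"
        minute = str(entry.get("time", ""))[:16]
        counts[(entry.get("key", ""), minute)] += 1
    return counts
```

Feeding it the lines of the backend log gives a mapping you can sort by minute to look for a repeating pattern.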

Server setup is as follows:
EC2 m5.xlarge, 60 GB gp3 volume, embedded etcd, Sensu Go 5.21.5
177 entities (17 agents, 164 proxy)
318 Checks

We have already done the following:

upgraded to latest 5.21 version (we have not evaluated 6.x yet for issues in our environment)
upgraded volume to gp3
upgraded instance from m5.large (2 vCPU) to m5.xlarge (4 vCPU)

Backend was running at about 49% cpu usage prior to instance upsize, now running at ~19% CPU regularly

jspaleta · June 15, 2021, 7:01pm

Hey,

So those messages are associated with etcd watch event notifications coming in too fast for the sensu-backend to process, which overruns the watch event buffer. Etcd watchers are used to notify the sensu-backend service when a resource in the etcd store has changed; in the case of your log message, the checks have changed. I’d expect these watch events to only be happening if you are updating your resources.

Do you perhaps have some automation running that is updating the check definitions in a bulk fashion? If so, maybe this automation is firing more frequently than is required?
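As a rough analogy only (this is not how sensu-backend is implemented), a bounded buffer behaves like this: once it is full, a producer that keeps sending has to wait until the consumer catches up. The Python sketch below uses a standard-library queue with a made-up size of 3 and reports which events would have had to wait.

```python
import queue


def offer_events(buffer, events):
    """Enqueue events without blocking; return the ones that found the buffer full."""
    would_block = []
    for event in events:
        try:
            buffer.put_nowait(event)
        except queue.Full:
            # A blocking producer would stall here until the consumer drains the buffer.
            would_block.append(event)
    return would_block


# Nothing is consuming, so only the first 3 of 5 events fit.
buffer = queue.Queue(maxsize=3)
would_block = offer_events(buffer, range(5))
```

A bulk update of a few hundred resources is the kind of burst that could fill such a buffer if the consumer is slow at that moment.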

jtenzer · June 16, 2021, 1:44pm

Hmm. Never thought of that to be a possible culprit. Yes, one of our team members set up an automation that is doing just that. Let me adjust the time frame and see what happens. Thanks.

For reference, the automation is a simple for loop through our yaml files and doing a sensuctl create on all of them. Is there a “preferred” way of automated resource updates that we should be leveraging rather than the simple way we are doing it?

jspaleta · June 16, 2021, 6:26pm

You know, that’s a very good question.

jspaleta · June 16, 2021, 7:23pm

Quick follow-up tuning question.

Is the atime support on the etcd store filesystem disabled?

Having atime enabled for the etcd store filesystem may cause spurious watch notifications to trigger.
Especially if you also have fstrim enabled to run periodically, fstrim with atime enabled may create a lot of spurious watch notifications from etcd… I think.

jtenzer · June 16, 2021, 10:09pm

We have not used any etcd flags, so however the default installation is configured, that’s our situation.

jspaleta · June 16, 2021, 10:12pm

Hey,

So after discussion with engineering about this, we’re hopeful this is primarily a historic problem that has been solved in check scheduler fixes over the last year or so and that new releases will behave better for you.

Consensus is, for currently supported Sensu releases, these messages would only show up on a sensu-backend that was under heavy CPU load, to the extent that the sensu-backend wasn’t able to service check watch notifications from etcd fast enough. This could be a short spike in CPU pressure for the host system (for example, when the embedded etcd process is doing a snapshot).

I’m going to be doing some local testing on an under-provisioned cluster and attempt to reproduce the log messages using only scripted sensuctl against the latest sensu-backend release. If I’m successful at that, I’ll try to correlate the behavior with health metrics using the dashboard from:
https://github.com/sensu/sensu-perf

jspaleta · June 16, 2021, 10:15pm

This wouldn’t be etcd flags… the atime support would be a mount option for the host filesystem used for the etcd store. fstrim would typically be run as a host service or cron job (configured to run periodically, and dependent on the host OS configuration).
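One way to check the mount options on Linux is to read /proc/mounts and look for noatime on the filesystem that holds the etcd store. Here is a small Python sketch of that check. The device names and the /var/lib/sensu mount point in the sample are assumptions for illustration only; substitute whatever filesystem actually holds your etcd data.

```python
def mount_options(mounts_text, mount_point):
    """Return the option set for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))  # a later entry shadows an earlier one
    return found


def atime_updates_disabled(options):
    """True if the filesystem is mounted with noatime."""
    return "noatime" in options


# Hypothetical sample; on a real host, read the text from /proc/mounts instead.
sample = (
    "/dev/nvme0n1p1 / xfs rw,relatime 0 0\n"
    "/dev/nvme1n1 /var/lib/sensu ext4 rw,noatime 0 0\n"
)
etcd_fs_options = mount_options(sample, "/var/lib/sensu")
```

If the result lacks noatime, the filesystem is still recording access times in some form (relatime is a common default).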

jspaleta · June 16, 2021, 10:17pm

Another question.

Are you running a sensu-backend cluster with multiple backends or is this a single backend configuration?

If this is a multiple member sensu-backend cluster…are all members of the cluster showing these messages?

jtenzer · June 16, 2021, 10:33pm

Single backend. And I’ll get the mount options to you when I can. However, it’s a default EBS attachment mount if you can infer anything from that, but again I can get the options later tonight.

Regarding newer releases, we have a ticket in the queue to evaluate 6.x. Though, since editing the frequency of our auto-updater from 5 to 15 min, I have yet to have the referenced error in my logs.

jspaleta · June 16, 2021, 10:45pm

I’ll keep the 5 to 15 minute thing in mind when I test. That’s still a little odd.

How long does it take that script to run typically?
What’s scheduling the script: a cron job, a systemd timer, or something else?

And can you share the script?

I don’t need the resource definitions it’s updating, but if you can tell me how many check definitions that script is updating, that will give me a target to test against. Right now I’ll just assume it’s updating 400-ish checks every X minutes, until I hear otherwise. But I would like to see the specific loop logic so I can test the exact same logic.

From a best practice standpoint, you might want to add some logic that stores a hash of the resource collection and only does the update if the hash has changed. You’re potentially asking the etcd store to do a lot of unnecessary work by reinjecting the same files every X minutes. But even so, I’m surprised you’re hitting that warning message, unless you’re just unlucky enough to be running that script exactly when the host system is seeing a spike in CPU load.
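A minimal Python sketch of the "hash of the resource collection" idea, under some assumptions: the config lives in one directory tree, and the last applied digest is kept in a small state file outside that tree. The function names and the state file are hypothetical, not anything Sensu provides.

```python
import hashlib
from pathlib import Path


def collection_digest(config_dir):
    """SHA-256 over the relative path and contents of every file, in sorted order."""
    root = Path(config_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def collection_changed(config_dir, state_file):
    """Return (changed, current_digest) compared with the digest in state_file."""
    current = collection_digest(config_dir)
    state = Path(state_file)
    previous = state.read_text().strip() if state.exists() else None
    return current != previous, current
```

The caller would run the sensuctl loop only when changed is true, and write the new digest to the state file only after the loop succeeds, so that a failed run is retried next time.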

jtenzer · June 16, 2021, 11:03pm

The script usually runs for under a minute. It’s a kubernetes cron job (backend is NOT on k8s)

I don’t have the exact script atm, but it is essentially a bash script that does:

git pull
sensuctl configure
cd into the config folder
for each file in the config folder, sensuctl create -f

This is done for entities, assets, checks etc…

About 400 resources is a good estimate.

jspaleta · June 17, 2021, 12:34am

Any chance that script is doing something fancy and running multiple sensuctl instances in parallel?

jtenzer · June 17, 2021, 1:50am

It shouldn’t be. But now that I think about it, there may be another instance of it for a different namespace, though that namespace is super small, with under 20 total objects.

jtenzer · June 21, 2021, 3:41pm

Just an update: since changing to 15 min, I have only received 1 error for the resultChan issue. I think, for the most part, this did the trick…

jtenzer · July 21, 2021, 5:31pm

Last Update,

I did what you suggested and re-wrote the script using a hash map for changed config files, and now we only process those that are new, or changed.
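For anyone landing here later, this is roughly the shape of that approach as a Python sketch, not the actual script. Assumptions: resource files end in .yaml or .yml, the hash map is kept as a JSON manifest file, and sensuctl is already configured on the host. The function names and the manifest are hypothetical.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def file_digests(config_dir, suffixes=(".yaml", ".yml")):
    """Map each resource file's relative path to the SHA-256 of its contents."""
    root = Path(config_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in suffixes
    }


def sensuctl_create(path):
    """Apply one resource file; raises if sensuctl exits non-zero."""
    subprocess.run(["sensuctl", "create", "-f", str(path)], check=True)


def apply_changed(config_dir, manifest_path, create=sensuctl_create):
    """Apply only new or changed files and return their relative paths."""
    manifest = Path(manifest_path)
    previous = json.loads(manifest.read_text()) if manifest.exists() else {}
    applied = []
    for rel_path, digest in file_digests(config_dir).items():
        if previous.get(rel_path) == digest:
            continue  # unchanged since the last successful apply
        create(Path(config_dir) / rel_path)
        # Record each success immediately so a later failure does not lose it.
        previous[rel_path] = digest
        manifest.write_text(json.dumps(previous, indent=2))
        applied.append(rel_path)
    return applied
```

Note that this sketch does nothing about files that were deleted from the repo; those resources would stay in Sensu until removed some other way.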
