resultChan buffer full Log messages

My backend logs are inundated with the following error:

{"component":"store","key":"/","level":"warning","msg":"resultChan buffer is full, watch events are not processed fast enough, incoming events from the watcher will be blocked","time":"2021-06-15T15:45:22Z"}

Server setup is as follows:
EC2 m5.xlarge, 60 GB gp3 volume. embedded etcd, sensugo 5.21.5
177 entities (17 agents, 164 proxy)
318 Checks

We have already done the following:

upgraded to latest 5.21 version (we have not evaluated 6.x yet for issues in our environment)
upgraded volume to gp3
upgraded instance from m5large (2 vCPU) to m5.xlarge (4 vCPU)

Backend was running at about 49% cpu usage prior to instance upsize, now running at ~19% CPU regularly


So those messages are associated etcd watch event notifications coming in too fast for the sensu-backend to process and overrunning the watch event buffer. An etcd watchers are being used to notify the sensu-backend service when a resource in the etcd store has changed, in the case of your log message the checks have change. I’d expect these watch events to only be happening if you are updating your resources.

Do you perhaps have some automation running that is updating the check definitions in a bulk fashion? If so maybe this autiomation is firing more frequently than is required?

Hmm. Never thought of that to be a possible culprit. Yes, one of our team members set up an automation that is doing just that. Let me adjust the time frame and see what happens. Thanks.

For reference, the automation is a simple for loop through our yaml files and doing a sensuctl create on all of them. Is there a “preferred” way of automated resource updates that we should be leveraging rather than the simple way we are doing it?

You know, that a very good question.

quick follow up tuning question.

Is the atime support on the etcd store filesystem disabled?

Having atime enabled for the etcd store filesystem may cause spurious watch notifications to trigger.
Especially if you also have fstrim enabled to run periodically, fstrim with atime enabled may create a lot of spurious watch notifications from etcd… i think.

We have not used any etcd flags, so however the default installation is configured, that’s our situation.


So after discussion with engineering about this, we’re hopeful this is primarily a historic problem that has been solved in check scheduler fixes over the last year or so and that new releases will behave better for you.

Consensus is, for current support sensu releases, these messages would only show up on a sensu-backend that was under heavy cpu load, to the extent that the sensu-backend wasn’t able to service check watch notifications from etcd fast enough. This would only be true on sensu-backends that were under cpu pressure. This could a short spike in cpu pressure for the host system (for example when embedded etcd process is doing a snapshot)

I’m going to be doing some local testing on an under provisioned cluster and attempt to create the log messages just using scripted sensuctl using latest sensu backend release. If I’m successful at that (I’ll try to see if I can correlate the behavior with any health metrics using the dashboard from:

This wouldn’t be etcd flags… the atime support would be a mount option for the host filesystem used for the etcd store. fstrim would typically be run as host service or cronjob (configured to run periodically and would be host OS configuration dependent)

Another question.

Are you running a sensu-backend cluster with multiple backends or is this a single backend configuration?

If this is a multiple member sensu-backend cluster…are all members of the cluster showing these messages?

Single backend. And I’ll get the mount options to you when I can. However, it’s a default EBS attachment mount if you can infer anything from that, but again I can get the options later tonight.

Regarding newer releases, we have a ticket in the queue to evaluate 6.x. Though, since editing the frequency of our auto-updater from 5 to 15 min, I have yet to have the referenced error in my logs.

I keep the 5 to 15 minute thing in mind when I test. That’s still a little odd.

How long does it take that script to run typically?
What’s scheduling the script cronjob or systemd.timer? Or something else?

And can you share the script?

I don’t need the resource definitions its updating, but if you can tell me how many check definition that script is updating, that will give me target to test. Right now I’ll just assume its updating 400-ish checks every X minutes, until i hear otherwise. But I would like to see the specific loop logic so I test the exact same logic.

For a best practice standpoint, you might want to add some logic that stores a hash of the resource collection and only do the update if the hash has changed. You’re potentially asking the etcd store to do a lot of unnecessary work by reinjecting the same files every X number of minutes. But even so, I’m surprised your hitting that warning message, unless your just unlucky enough to be doing that script exactly when that host system is spiking a cpu load.

The script usually runs for under a minute. It’s a kubernetes cron job (backend is NOT on k8s)

I don’t have exact script atm, but is essentially a bash script that does

Git pull
Sensuctl configure
Cd config folder
For * in config folder do sensuctl create -f

This is done for entities, assets, checks etc…

About 400 resources is a good estimate.

Any chance that script is doing something fancy and running multiple sensuctl instances in parallel?

It shouldn’t be. But now that I think about it. There may be another instance of it for a different namespace, though that namespace is super small with under 20 total objects.

Just an update, since changing to 15 min, i have only received 1 error for the resultchan issue. I think, for the most part, this did the trick…

Last Update,

I did what you suggested and re-wrote the script using a hash map for changed config files, and now we only process those that are new, or changed.

1 Like