Hey all,
we started getting alerts from the sensu-go healthcheck check, and I noticed that we get a timeout when trying to query the health endpoint (http://sensu-go:8080/health).
we are using PostgreSQL RDS as the Sensu Go datastore.
I've noticed 2 error messages in the Sensu backend logs:
"could not get the cluster member list"
"unable to retrieve Postgres health"
both with a "context canceled" error.
Sensu Go is listening on port 8080, I tried a restart, and network connectivity from Sensu to RDS is OK.
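for reference, this is roughly how I reproduce the timeout from the backend host (the 10-second limit is just a value I picked for testing):
# query the health endpoint with a hard client-side timeout
curl --max-time 10 -sS http://sensu-go:8080/health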
any ideas?
It’s difficult to determine what the issue may be with the information provided. So that we have everything we need to help you troubleshoot quickly, can you please send over the following information (feel free to DM anything you want to keep private):
• Is this a problem with the check, or is your cluster down at the moment?
• It will be helpful to have logs from each of your Sensu backends (e.g., journalctl -u sensu-backend --since '1 week ago' > sensu-backend-$(date +%Y-%m-%d).log).
• The output of `sensuctl cluster health`
• May I have a copy of your check definition?
• How many entities do you have?
• How many checks?
• Are you using embedded etcd?
• What is the underlying disk on your Sensu backends, or your external etcd (if you are using external)? Do you have provisioned IOPS volumes? If so, what did you provision?
Once we receive those pieces of information from you, we’ll work to find what’s causing your issue.
I'm not sure if the problem is in the check, because I do see strange behavior. The cluster looks like it's working, but it's slow.
95% of the log file is these two lines:
Feb 08 09:22:16 sensu-go sensu-backend[4401]: {"error":"context canceled","level":"error","msg":"unable to retrieve postgres health","time":"2023-02-08T09:22:16Z"}
Feb 08 09:22:16 sensu-go sensu-backend[4401]: {"component":"store","error":"context canceled","level":"error","msg":"could not get the cluster member list","time":"2023-02-08T09:22:16Z"}
I get: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I have this check on other servers and it works; it also worked here before.
this is the check command:
http-json --url http://sensu-go:8080/health --query ".ClusterHealth.[0].Healthy" --expression "== true"
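for completeness, the equivalent sensuctl definition would look roughly like this (the check name, interval, and subscription shown here are illustrative, not our exact values):
# illustrative sensuctl definition for the health check above
sensuctl check create healthcheck \
  --command 'http-json --url http://sensu-go:8080/health --query ".ClusterHealth.[0].Healthy" --expression "== true"' \
  --interval 60 \
  --subscriptions sensu-backend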
we have 4 sensu-backends in the cluster and 67 sensu-agents
263 checks
we are using PostgreSQL RDS
we are using gp2 with 180 IOPS (in AWS, the instance type is m5.xlarge)
Thanks for sending over that info. I am curious if your sensu backends are in a reboot loop. Can you check the age of the sensu-backend process on each of your backends?
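For example, something like this on each backend will show how long the process has been up (assuming systemd hosts):
# when the sensu-backend unit last (re)started
systemctl status sensu-backend | grep Active
# or the elapsed time of the process itself
ps -o etime= -p "$(pgrep -x sensu-backend)"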
It’s hard to say exactly what’s happening, but I believe there are some performance issues here. Have you implemented some checks around the performance of your nodes?
If you haven't implemented this already, it may be helpful to alert when key thresholds (e.g., CPU, memory, disk I/O) are crossed.
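As a rough sketch, a CPU check could be added with sensuctl like this (the plugin, asset name, thresholds, interval, and subscription below are all illustrative; substitute whatever performance plugins you actually use):
# hypothetical CPU-usage check; replace the asset/plugin with your own
sensuctl check create check-cpu \
  --command 'check-cpu-usage -w 75 -c 90' \
  --interval 60 \
  --subscriptions system \
  --runtime-assets sensu/check-cpu-usage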
I noticed that you're using an even number of Sensu backends. Sensu currently utilizes etcd for configuration storage, so we recommend an odd number of Sensu backends when using embedded etcd. With 4 members, etcd quorum is 3, so your cluster can only tolerate the loss of 1 member; with 5 members, quorum is still 3, so you can lose 2. It may be prudent to add an additional backend to your cluster so you have the fault tolerance to survive a 2-member failure. Here is a link to the etcd docs on this issue:
A question about your AWS instances: are all of your backends and RDS/Postgres instances in the same region?
Also, I noticed that you're utilizing m5.xlarge; our recommendation is to use the m5d.xlarge, which has a 150-GB NVMe SSD directly attached to the instance host and is optimal for sustained disk IOPS. This is critical for etcd performance on the Sensu backends:
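In the meantime, if you want a rough read on what etcd sees from the current gp2 volumes, a small fdatasync-heavy fio run against the etcd data directory is a reasonable proxy (adjust --directory to wherever your embedded etcd data actually lives):
# small, fsync-heavy writes similar to etcd's WAL pattern
fio --name=etcd-disk-check --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/sensu/sensu-backend/etcd --size=22m --bs=2300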
Finally, it will be helpful to see metrics from each of your backends; this may reveal some issues that we aren’t aware of yet:
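For a first pass, something like this on each backend, filtered to etcd-related series, would be helpful (this assumes your backends expose the Prometheus-format /metrics endpoint on the default API port; the grep pattern is just a suggestion):
# dump backend metrics and keep the etcd-related series
curl -s http://localhost:8080/metrics | grep -E '^etcd_' | head -n 40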