Sensu Backend Healthcheck timeout

Hey all,
we started getting alerts from the sensu-go healthcheck check, and I noticed that we get a timeout when trying to query the health endpoint (http://sensu-go:8080/health).

we are using PostgreSQL RDS as the Sensu Go datastore.
I've noticed two error messages in the Sensu backend logs:
"could not get the cluster member list"
"unable to retrieve postgres health"

both with a "context canceled" error

Sensu Go is listening on port 8080, I've tried restarting it, and network connectivity from Sensu to RDS is OK.
Any ideas?
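
For reference, hitting the endpoint directly with something like the following just times out instead of returning the health JSON (the 5-second cap is only an example value):

# query the backend health endpoint with a short client-side timeout
curl --max-time 5 -sS http://sensu-go:8080/health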

Operating system information

  • Debian

Package version

  • Sensu Go: 6.9.1
  • Etcd:
  • PostgreSQL: 13.7

Hey @Itamar_Shpak, :wave:

It’s difficult to determine what the issue may be with the information provided. So that we have everything we need to help you troubleshoot quickly, can you please send over the following information (feel free to DM anything you want to keep private):
• Is this a problem with the check, or is your cluster down at the moment?
• It will be helpful to have logs from each of your Sensu backends (e.g., journalctl -u sensu-backend --since '1 week ago' > sensu-backend-$(date +%Y-%m-%d).log).
• The output of `sensuctl cluster health`
• May I have a copy of your check definition?
• How many entities do you have?
• How many checks?
• Are you using embedded etcd?
• What is the underlying disk on your Sensu backends, or on your external etcd (if you are using external etcd)? Do you have provisioned-IOPS volumes? If so, how much did you provision?

Once we receive those pieces of information from you, we’ll work to find what’s causing your issue.
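
If it is easier, a rough sketch like this (backend-1, backend-2, ... are placeholder hostnames; adjust for your environment and SSH access) collects the per-backend logs and the cluster health output in one pass:

# gather a week of sensu-backend logs from each backend over SSH
for host in backend-1 backend-2 backend-3; do
  ssh "$host" "journalctl -u sensu-backend --since '1 week ago'" \
    > "sensu-backend-$host-$(date +%Y-%m-%d).log"
done
# cluster health as reported by sensuctl (run wherever sensuctl is configured)
sensuctl cluster health > sensuctl-cluster-health.txt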

Hey, thanks for the quick response !

  1. I'm not sure if the problem is in the check, because I do see some strange behavior. The cluster looks like it's working, but slowly.
  2. 95% of the file created is this -
    Feb 08 09:22:16 sensu-go sensu-backend[4401]: {"error":"context canceled","level":"error","msg":"unable to retrieve postgres health","time":"2023-02-08T09:22:16Z"}
    Feb 08 09:22:16 sensu-go sensu-backend[4401]: {"component":"store","error":"context canceled","level":"error","msg":"could not get the cluster member list","time":"2023-02-08T09:22:16Z"}
  3. I get: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  4. I have this check on other servers and it works there / it also worked here before.
    This is the check command (see the curl/jq sketch after this list):
    http-json --url http://sensu-go:8080/health --query ".ClusterHealth.[0].Healthy" --expression "== true"
  5. we have 4 sensu-backends in the cluster and 67 sensu-agents
  6. 263 checks
  7. we are using PostgreSQL RDS
  8. we are using gp2 with 180 IOPS (in AWS; the instance type is m5.xlarge)
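
Here is the curl/jq sketch mentioned in item 4, assuming jq is installed on the host running the check; it pulls out the same field the check asserts on and should print "true" on a healthy cluster:

# fetch the health payload and extract the field the http-json check evaluates
curl -sS --max-time 5 http://sensu-go:8080/health | jq '.ClusterHealth[0].Healthy'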

thanks !

Hey @Itamar_Shpak,

Thanks for sending over that info. I am curious if your sensu backends are in a reboot loop. Can you check the age of the sensu-backend process on each of your backends?

systemctl status sensu-backend
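
If it is more convenient, something like the following (run on each backend) shows the exact time the service last started, which makes it easy to compare across the cluster:

# timestamp of the last time the sensu-backend unit entered the active state
systemctl show sensu-backend --property=ActiveEnterTimestamp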

Hey, nope, they've been up for a few hours/days since the last time I restarted them.

Hey @Itamar_Shpak,

It’s hard to say exactly what’s happening, but I believe there are some performance issues here. Have you implemented some checks around the performance of your nodes?

If you haven’t implemented this already, it may be helpful to alert when useful thresholds are met.

I noticed that you're using an even number of Sensu backends. Sensu currently utilizes etcd for configuration storage, so we recommend an odd number of Sensu backends when using embedded etcd. With 4 members, quorum is 3 and the cluster only tolerates one member failure; with 5 members, quorum is still 3, so two members can fail. It may be prudent to add an additional backend to your cluster so you have the fault tolerance to survive a 2-member failure. Here is a link to the etcd docs on this issue:

A question about your AWS instances, are all of your backends and RDS/Postgres instances in the same region?

Also, I noticed that you're utilizing m5.xlarge; our recommendation is to use m5d.xlarge with a 150 GB NVMe SSD directly attached to the instance host, which is optimal for sustained disk IOPS. This is critical for etcd performance on the Sensu backends:

Finally, it will be helpful to see metrics from each of your backends; this may reveal some issues that we aren’t aware of yet:

curl -X GET \
 http://127.0.0.1:8080/metrics
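
If the embedded etcd metrics are exposed on that endpoint (I'm assuming they are here; worth confirming on your build), filtering for the disk-latency histograms is a quick way to spot slow storage:

# etcd disk fsync/commit latency histograms; sustained high values point at slow disks
curl -s http://127.0.0.1:8080/metrics | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration'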

Hey, thanks for the reply. It looks like we had some internal network issues; everything looks to be working well now.
Thanks for helping and for your time!


Hey @Itamar_Shpak,

I’m glad to help, and happy this is resolved!

