SensuGo - check does not exist but events are still occurring

Hi all,

I’m running a 5-node Sensu Go cluster, version 5.14.0#071880b (build 071880b248c40729fb4b7b848df2964612c29707, built 2019-10-08T16:52:58Z), on Ubuntu 14.04.

I had a check that executed the following command on my agents every 60 seconds: env
I have deleted the check, but I can see that its events are still occurring every 60 seconds.

I have already tried re-creating the check with the same name and deleting it again, deleting all the events, deleting the entities, and stopping and starting the agent - with no luck.
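
For reference, these are roughly the sensuctl commands I used for those steps (the real entity name is redacted here):

sensuctl check delete test
sensuctl event delete DELETED test
sensuctl entity delete DELETED
sudo service sensu-agent restart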

I couldn’t find any related docs about this issue.
Please advise.


Hey,
Is there any way you could share a redacted version of the event? It might provide some clues. You aren’t making use of the agent’s events API via an external process, are you? Looking at the detailed event JSON would help rule that out.
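
To be clear about what I mean by the agent’s events API: an external process could be pushing events at the agent’s local HTTP API. A minimal sketch of what that would look like, assuming the default agent API port 3031:

curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"check": {"metadata": {"name": "test"}, "status": 1, "output": "created externally"}}' \
  http://127.0.0.1:3031/events

If something like that is still running somewhere, it would keep producing events for a check that no longer exists on the backend.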

The easiest way to export an event for review is with sensuctl:

sensuctl event info --format json <entity> <check>

Otherwise, there may be an internal problem with the sensu-backend. Have you tried restarting the sensu-backends the agent is communicating with? You should be able to restart them one at a time.
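
On each backend node in turn, that’s roughly the following (a sketch; adjust to however your backends are supervised):

sudo service sensu-backend restart
sudo service sensu-backend status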

The backend scheduler may be using an in-memory cached representation of the check, and the logic meant to watch for check changes and refresh that cache could be having problems.

-jef

sensuctl event info --format json ************** test :

{
	"timestamp": 1575964143,
	"entity": {
		"entity_class": "agent",
		"system": {
			"hostname": "sensugo",
			"os": "linux",
			"platform": "ubuntu",
			"platform_family": "debian",
			"platform_version": "14.04",
			"network": {
				"interfaces": [{
						"name": "lo",
						"addresses": [
							"127.0.0.1/8",
							"::1/128"
						]
					},
					{
						"name": "eth0",
						"mac": "DELETED",
						"addresses": [
							"DELETED",
							"DELETED"
						]
					},
					{
						"name": "docker0",
						"mac": "DELETED",
						"addresses": [
							"172.17.0.1/16"
						]
					}
				]
			},
			"arch": "amd64"
		},
		"subscriptions": [
			"DELETED",
			"DELETED",
			"DELETED",
			"DELETED",
			"DELETED",
			"DELETED",
			"entity:DELETED"
		],
		"last_seen": 1575963171,
		"deregister": false,
		"deregistration": {},
		"user": "agent",
		"redact": [
			"password",
			"passwd",
			"pass",
			"api_key",
			"api_token",
			"access_key",
			"secret_key",
			"private_key",
			"secret"
		],
		"metadata": {
			"name": "DELETED",
			"namespace": "default",
			"labels": {
				"region": "DELETED"
			},
			"annotations": {

			}
		},
		"sensu_agent_version": "5.14.0#071880b"
	},
	"check": {
		"command": "env",
		"handlers": [
			"slack-handler"
		],
		"high_flap_threshold": 0,
		"interval": 60,
		"low_flap_threshold": 0,
		"publish": true,
		"runtime_assets": null,
		"subscriptions": [
			"system"
		],
		"proxy_entity_name": "",
		"check_hooks": null,
		"stdin": false,
		"subdue": null,
		"ttl": 0,
		"timeout": 180,
		"round_robin": false,
		"duration": 0.001506105,
		"executed": 1575964143,
		"history": [{
				"status": 1,
				"executed": 1575962943
			},
			{
				"status": 1,
				"executed": 1575963003
			},
			{
				"status": 1,
				"executed": 1575963063
			},
			{
				"status": 1,
				"executed": 1575963123
			},
			{
				"status": 1,
				"executed": 1575963183
			},
			{
				"status": 1,
				"executed": 1575963243
			},
			{
				"status": 1,
				"executed": 1575963303
			},
			{
				"status": 1,
				"executed": 1575963363
			},
			{
				"status": 1,
				"executed": 1575963423
			},
			{
				"status": 1,
				"executed": 1575963483
			},
			{
				"status": 1,
				"executed": 1575963543
			},
			{
				"status": 1,
				"executed": 1575963603
			},
			{
				"status": 1,
				"executed": 1575963663
			},
			{
				"status": 1,
				"executed": 1575963723
			},
			{
				"status": 1,
				"executed": 1575963783
			},
			{
				"status": 1,
				"executed": 1575963843
			},
			{
				"status": 1,
				"executed": 1575963903
			},
			{
				"status": 1,
				"executed": 1575963963
			},
			{
				"status": 1,
				"executed": 1575964023
			},
			{
				"status": 1,
				"executed": 1575964083
			},
			{
				"status": 1,
				"executed": 1575964143
			}
		],
		"issued": 1575964143,
		"output": "",
		"state": "failing",
		"status": 1,
		"total_state_change": 0,
		"last_ok": 1575905583,
		"occurrences": 976,
		"occurrences_watermark": 976,
		"silenced": [
			"entity:DELETED"
		],
		"output_metric_format": "",
		"output_metric_handlers": [
			"DELETED"
		],
		"env_vars": null,
		"metadata": {
			"name": "test",
			"namespace": "default",
			"labels": {
				"region": "DELETED"
			},
			"annotations": {}
		}
	},
	"metadata": {
		"namespace": "default"
	}
}

Restarting all the sensu-backend processes helped.

Hey!
Glad the remediation helped.

Follow-up questions:
Are you using embedded etcd or external etcd for your backend cluster?

How much load is the backend cluster under?
How many Sensu events per minute are being processed?
How many Sensu checks per minute are being processed?
Are the backend systems experiencing memory pressure and swapping?
Are the backend systems under high CPU load?

I’m trying to get a sense of what condition is stressing the backend so this can be reproduced.

There may be useful errors in the backend logs indicating a problem with etcd operations that could give us pointers.
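
It may also be worth asking the backends what they think of the etcd cluster state, for example (assuming sensuctl is configured against this cluster):

sensuctl cluster health
sensuctl cluster member-list

Unhealthy or flapping members there would point in the same direction.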

If you are willing to share the raw sensu-backend logs with me, that might be helpful as well. I know you are sensitive to information leaking; if you prefer, you can send the logs to me via email (jef@sensu.io) instead of posting them here, so I can share them internally with the engineering team. It’s just a matter of what you are comfortable with.

Hi @jspaleta,
I really appreciate your help.

I will try to provide as much context as I can. About your questions:

Running on an embedded etcd cluster.
How much load is the backend cluster under? - None, I’ve got only 20 checks.
How many Sensu events per minute are being processed? - About 30.
How many Sensu checks per minute are being processed? - 8.
Are the backend systems experiencing memory pressure and swapping? - Nope.
Are the backend systems under high CPU load? - Nope.

I’m not sure if I can share the raw logs; I will check tomorrow morning.

If you can’t share the raw logs, you should be able to grep for “error” messages and share redacted versions of those.
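
Something like this sketch should do it; adjust the log path or unit name to wherever your backend logs actually live:

grep -i error /var/log/sensu/sensu-backend.log
# or, on a systemd-managed install:
journalctl -u sensu-backend --since "1 hour ago" | grep -i error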

Hmm, generally when there is a failure of this kind, something is stressing etcd throughput and causing etcd to misbehave.

Other potential problems include:
slow disk I/O - are you using a spinning disk instead of an SSD?
high network latency between the etcd peers - exactly what counts as “high” is a little fuzzy. (See the quick checks sketched below.)
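
A couple of quick checks for both (a sketch, assuming the sysstat package is installed and that you can reach the peers directly):

# disk latency and utilization on each backend node
iostat -x 1 5
# round-trip latency between etcd peers, run from one peer to another
ping -c 10 <peer-hostname>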

There’s definitely some potential here for someone to build a useful Grafana dashboard that users could point at their operational install to discover these sorts of bottlenecks.
There is a dashboard as part of the sensu-perf repository that the engineering team uses when performance tuning under load. I wonder if that dashboard could easily be adapted for general use, to help users like you discover potential resource bottlenecks…like the disk or network I/O issues.

Here’s the link to the performance tuning repo:

I’m running on SSD disks (AWS EBS) with NICs of up to 25 Gbps.
I would like to run the perf test on my cluster; where can I find the /root/go/bin/loadit binary file?

Hey,
loadit is part of the sensu-go codebase. Right now it’s meant as a development tool…so it’s not provided as part of the end-user packaging. It puts your system under very aggressive load. You can pull the source and build it from the sensu-go repo…but doing that in production is going to be…inadvisable.
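
Roughly, building it from source would look like this (a sketch, assuming a working Go toolchain; the exact package path inside the repo may have moved, so locate it first):

git clone https://github.com/sensu/sensu-go.git
cd sensu-go
# find the loadit main package, then build from whatever directory that turns up
grep -rl --include='*.go' loadit .
go build -o loadit ./<path-to-loadit-package>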

I think it would be more interesting for you to use the Grafana dashboard and install a Prometheus exporter on those systems, so you can collect metrics from sensu-backend and from the system’s Prometheus exporter and try to catch something.
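
For example, before wiring up the scraper you can confirm the metrics are actually exposed (a sketch, assuming the default backend API port 8080 and a node_exporter on its default port 9100):

# sensu-backend's built-in Prometheus metrics
curl -s http://localhost:8080/metrics | head
# host-level metrics from node_exporter
curl -s http://localhost:9100/metrics | head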

-jef

I already implemented the Grafana dashboard, but with Graphite metrics (scraping the Prometheus endpoint with Telegraf). I needed to change the metric path for all graphs from Prometheus to Graphite, but it looks OK.

I changed my Sensu Go setup a bit: 5 sensu-backend nodes (c4.xlarge) + a 5-node etcd cluster (c5n.xlarge).
BTW - I watched Sean Porter’s video and he said that etcd should run as a 3-node cluster, but most designs that use etcd recommend a 5-node setup. What is the reason to use a maximum of 3 nodes? (https://www.youtube.com/watch?v=EhmzmEJ4EmI)

About the perf tests - I have a few clusters that I’ve set up for test purposes.