Some checks intermittently stop running for a few minutes or hours

I’m running sensu-server and sensu-agent version 6.4.2. My setup is below.

3-node sensu-backend / etcd cluster

3-node cluster of sensu-server

11 agents running on 11 nodes

I’m facing an issue where a check suddenly stops running (it no longer gets scheduled). It resumes on its own after a few minutes or a few hours. During this period, there are no logs for that particular check. The issue is very inconsistent.

I’m not sure whether it has anything to do with how the check is scheduled (cron or interval). I have tried both.

Below are my test checks.

{
  "api_version":"core/v2",
  "type":"Check",
  "metadata":{
    "namespace":"default",
    "name":"check1", 
    "annotations": {
      "fatigue_check/occurrences": "5",
      "fatigue_check/interval": "3600",
      "sensu.io.json_attributes": "{\"type\":\"standard\",\"occurrences\":5,\"refresh\":3600}"
    }
  },
  "spec":{
    "command":"python3.6 /etc/sensu/plugins/check1.py", 
    "subscriptions":[
      "worker"
    ],
    "publish":true,
    "round_robin":true,
    "interval":60,
    "handlers":[
      "tester_handler"
    ],
    "proxy_entity_name":"proxyclient",
    "timeout":50
  }
}
{
    "api_version": "core/v2",
    "type": "Check",
    "metadata": {
        "namespace": "default",
        "name": "check2",
        "labels": {},
        "annotations": {
            "fatigue_check/occurrences": "5",
            "fatigue_check/interval": "3600",
            "sensu.io.json_attributes": "{\"type\":\"standard\",\"occurrences\":5,\"refresh\":3600}"
        }
    },
    "spec": {
        "command": "python3.6 /etc/sensu/plugins/check2.py",
        "subscriptions": [
            "worker"
        ],
        "publish": true,
        "round_robin": true,
        "cron": "*/2 * * * *",
        "handlers": [
            "alert_handler",
            "resolve_handler",
            "tester_handler"
        ],
        "proxy_entity_name": "proxyclient",
        "timeout": 110
    }
}
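
One way to pin down exactly when executions were skipped is to compare the `executed` timestamps in the check's event history against the expected schedule. The following is only a rough sketch, assuming the history has already been pulled out of the event JSON (for example from `sensuctl event info ... --format json`). The function name `find_execution_gaps`, the `tolerance` parameter, and the sample timestamps are made up for illustration. For check1 the expected interval would be 60 seconds; for check2, with cron `*/2 * * * *`, it would be 120.

```python
from typing import Dict, List, Tuple


def find_execution_gaps(history: List[Dict], expected_interval: int,
                        tolerance: float = 1.5) -> List[Tuple[int, int]]:
    """Return (previous_run, next_run) pairs that are too far apart.

    `history` is assumed to look like event.check.history: a list of
    dicts that each carry an `executed` Unix timestamp.
    """
    executed = sorted(entry["executed"] for entry in history)
    gaps = []
    for earlier, later in zip(executed, executed[1:]):
        # Allow some slack so ordinary scheduling jitter is not reported.
        if later - earlier > expected_interval * tolerance:
            gaps.append((earlier, later))
    return gaps


# Hypothetical history for a 60-second check (timestamps are invented).
sample_history = [
    {"status": 0, "executed": 1000},
    {"status": 0, "executed": 1060},
    {"status": 0, "executed": 1120},
    {"status": 0, "executed": 1900},  # 780 seconds after the previous run
    {"status": 0, "executed": 1960},
]

print(find_execution_gaps(sample_history, expected_interval=60))
# -> [(1120, 1900)]
```

The event history only keeps a limited number of recent executions, so this has to be run soon after a gap to catch it.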

Hey there,

Without logs, or knowing more about your environment, it’s difficult to know why Sensu’s behaving this way. It seems like this may be due to disk performance, so knowing more about your environment (specifically what sort of disks are used in your environment) would be super helpful.

I was able to capture logs stating: error : etcd-server no leader, msg : error scheduling check

Hmmm, that sounds like your environment is in bad shape. When you check the health of your deployment, what does it say?
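
If it helps, `sensuctl cluster health` should report this, and the backend API also exposes a `/health` endpoint. Below is a minimal sketch of how the JSON from that endpoint could be summarized. It assumes the response contains a `ClusterHealth` list whose entries have `Name`, `Err`, and `Healthy` fields; please verify that against your version's API docs. The function name `unhealthy_members` and the sample payload are hypothetical, not output from a real cluster.

```python
import json
from typing import Dict, List


def unhealthy_members(health: Dict) -> List[str]:
    """Describe every cluster member that is not reported as healthy.

    Assumes `health` is the parsed JSON body of the backend /health
    endpoint, with a `ClusterHealth` list of per-member entries.
    """
    problems = []
    for member in health.get("ClusterHealth") or []:
        if not member.get("Healthy", False):
            name = member.get("Name", "unknown")
            reason = member.get("Err") or "no error message reported"
            problems.append(f"{name}: {reason}")
    return problems


# Invented payload for illustration only; member names and the error
# text are placeholders, not captured from a real deployment.
sample_payload = json.loads("""
{
  "Alarms": null,
  "ClusterHealth": [
    {"MemberID": 1, "Name": "backend-1", "Err": "", "Healthy": true},
    {"MemberID": 2, "Name": "backend-2", "Err": "example error text", "Healthy": false},
    {"MemberID": 3, "Name": "backend-3", "Err": "", "Healthy": true}
  ]
}
""")

for line in unhealthy_members(sample_payload):
    print(line)
```

An empty result means every member reported itself as healthy at that moment, so it is worth running this repeatedly around the time a check goes missing.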