Troubleshooting handler workflows using the Sensu Agent API

One question that often comes up in the Sensu Community Slack channel is “How can I troubleshoot a handler that isn’t working the way I expect it to?”

In this post, I’ll walk you through how I typically use the Sensu agent API to troubleshoot handlers that aren’t doing what I expect them to do.

What you’ll need:

  • A working Sensu deployment with an agent and backend
  • A system that has curl installed

Getting started

To get started, let’s add the Slack handler asset and create a handler definition that uses it:

sensuctl asset add sensu/sensu-slack-handler

type: Handler
api_version: core/v2
metadata:
  labels:
    sensu.io/managed_by: sensuctl
  name: slack-testing
  namespace: default
spec:
  command: sensu-slack-handler --channel '#alerts' --username 'sensu'
  env_vars:
  - SLACK_WEBHOOK_URL=https://hooks.slack.com/services/123456789/ABCDEFGHIJK/0987654321
  filters:
  - is_incident
  - not_silenced
  runtime_assets:
  - sensu/sensu-slack-handler
  timeout: 10
  type: pipe
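The handler definition still has to be loaded into the backend before sensuctl can show it. Here is a minimal sketch of one way to do that, assuming the YAML is saved to a file (the filename `slack-testing.yml` and the `apply_handler` helper are hypothetical) and that it starts with a `type: Handler` line, which `sensuctl create` needs in order to know what kind of resource it is reading:

```shell
# Hypothetical helper: load a handler definition file into the backend,
# then print the stored handler so we can confirm it was created.
apply_handler() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "handler definition not found: $file" >&2
    return 1
  fi
  # Create (or update) the resource, then read it back from the backend.
  sensuctl create --file "$file" && sensuctl handler info slack-testing
}
```

With the definition saved, `apply_handler slack-testing.yml` would create the handler and print it back.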

We can verify that our asset and handler definitions have been created successfully by running:

sensuctl asset info sensu/sensu-slack-handler

and

sensuctl handler info slack-testing

Once we’ve verified that our definitions are present, we can move on to testing the handler.

Testing the handler

If you’ve not already seen the agent API docs, the TL;DR here is that we can use the agent API to create mock events for testing handlers.

In our case, we’re going to post a mock event to the agent API using curl:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "testing-slack-handler"
    },
    "status": 2,
    "output": "this is a test event to see if Slack works",
    "handlers": [
      "slack-testing"
    ]
  }
}' \
http://127.0.0.1:3031/events
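Since we’ll be posting variations of this event several times, it can help to wrap the request in a small shell function. This is only a sketch: `mock_event_json`, `post_mock_event`, and the `AGENT_API_URL` variable are names made up for this post, and the helper assumes the values passed in contain no characters that need JSON escaping.

```shell
# Agent API events endpoint used throughout this post; override if yours differs.
AGENT_API_URL="${AGENT_API_URL:-http://127.0.0.1:3031/events}"

# Hypothetical helper: print a mock check event as JSON.
# Arguments: CHECK_NAME STATUS HANDLER [OUTPUT]
mock_event_json() {
  # Assumes the arguments need no JSON escaping (no quotes or backslashes).
  printf '{"check":{"metadata":{"name":"%s"},"status":%d,"output":"%s","handlers":["%s"]}}\n' \
    "$1" "$2" "${4:-mock event for handler testing}" "$3"
}

# Hypothetical helper: POST the mock event to the agent API and print
# only the HTTP status code that the agent returns.
post_mock_event() {
  mock_event_json "$@" | curl -sS -X POST \
    -H 'Content-Type: application/json' \
    -d @- \
    -o /dev/null -w '%{http_code}\n' \
    "$AGENT_API_URL"
}
```

Running `post_mock_event testing-slack-handler 2 slack-testing` would then send the same kind of event as the curl command above.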

Under normal circumstances, we’d see an alert pop up in our Slack channel shortly after posting the event.

But what happens if we don’t see an alert pop up?

We can examine the logs on most Linux variants using `journalctl -u sensu-backend | grep slack-testing`. However, we’re not going to see much unless we’ve set the log level on the backend to “debug.”

So let’s edit our backend config file by adding the line:

log-level: "debug"

And then restart the backend via systemctl restart sensu-backend.
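If the debug entries still don’t show up, it’s worth confirming that the setting is in the file the backend actually reads. The sketch below assumes the default config path of `/etc/sensu/backend.yml`; the `configured_log_level` helper is a hypothetical name, and your deployment may use a different file.

```shell
# Assumed default backend config path; adjust if your backend uses another file.
BACKEND_CONFIG="${BACKEND_CONFIG:-/etc/sensu/backend.yml}"

# Hypothetical helper: print the log-level line from a backend config file,
# or a short note if the file has no such line (or cannot be read).
configured_log_level() {
  grep -E '^log-level:' "$1" 2>/dev/null || echo "no log-level line found in $1"
}
```

Calling `configured_log_level "$BACKEND_CONFIG"` prints the line we just added if it is present.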

At this point, we could comb back through the journald log entries, but that can be a bit cumbersome. Instead, open two terminal windows side by side. In one window, run `journalctl -fu sensu-backend | grep slack-testing`, and in the other, post the mock event again via:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "testing-slack-handler"
    },
    "status": 2,
    "output": "this is a test event to see if Slack works",
    "handlers": [
      "slack-testing"
    ]
  }
}' \
http://127.0.0.1:3031/events

That’s when you’ll get a message that might look something like this:

May 01 16:27:15 backend.example.com sensu-backend[3424]: {"assets":["sensu/sensu-slack-handler"],"check":"testing-slack-handler","component":"pipelined","entity":"agent.example.com","event_uuid":"21a86c57-f6ec-495d-b436-7bc6f7bd495c","handler":"slack-testing","level":"info","msg":"event pipe handler executed","namespace":"default","output":"Usage:\n sensu-slack-handler [flags]\n sensu-slack-handler [command]\n\nAvailable Commands:\n help Help about any command\n version Print the version number of this plugin\n\nFlags:\n -c, --channel string The channel to post messages to (default \"#general\")\n -t, --description-template string The Slack notification output template, in Golang text/template format (default \"{{ .Check.Output }}\")\n -h, --help help for sensu-slack-handler\n -i, --icon-url string A URL to an image to use as the user avatar (default \"https://www.sensu.io/img/sensu-logo.png\")\n -u, --username string The username that messages will be sent as (default \"sensu\")\n -w, --webhook-url string The webhook url to send messages to (default \"https://hooks.slack.com/services/123456789/ABCDEFGHIJK/0987654321\")\n\nUse \"sensu-slack-handler [command] --help\" for more information about a command.\n\nError executing sensu-slack-handler: error executing handler: invalid_token\n","status":1,"time":"2020-05-01T16:27:15-04:00"}
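A log line like this is long, and the part we care about is buried at the end of the `output` field. As a rough sketch, a small filter can pull out just the error message. The `handler_errors` name is hypothetical, and the pattern assumes the message starts with “Error executing” as it does in the entry above; other handlers may word their errors differently.

```shell
# Hypothetical helper: read sensu-backend log lines on stdin and print only
# the "Error executing ..." message embedded in the handler's output field.
handler_errors() {
  # Match up to the first backslash or double quote, which is where the
  # escaped newline that ends the message begins inside the JSON string.
  grep -o 'Error executing [^\\"]*'
}
```

For example, `journalctl -u sensu-backend -o cat | grep slack-testing | handler_errors` would reduce each matching entry to its error message.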

Oh! Now we have something interesting! The handler exited with status 1, and the error indicates that our token is invalid. At this point, we can correct the token in the handler’s SLACK_WEBHOOK_URL and then test again to see if we receive the notification.

Note that this approach can be used to test any handler workflow. We might even modify our JSON payload to test annotation-based overrides for a particular check, using something like:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "testing-slack-handler",
      "annotations": {
        "sensu.io/plugins/slack/config/channel": "#monitoring"
      }
    },
    "status": 2,
    "output": "this is a test event to see if this sends to the #monitoring channel",
    "handlers": [
      "slack-testing"
    ]
  }
}' \
http://127.0.0.1:3031/events

This would instead send our alert to the “#monitoring” channel.

Other plugins like the automated remediation handler or the fatigue check filter could be tested in a similar manner by changing the event body to include the annotations that those plugins use and listing the handler under test in `handlers`:

curl -X POST \
-H 'Content-Type: application/json' \
-d '{
  "check": {
    "metadata": {
      "name": "testing-autoremediation-handler",
      "annotations": {
      "io.sensu.remediation.config.actions": "[\n  {\n    \"description\": \"Perform this action once after Nginx has been down for 30 seconds.\",\n    \"request\": \"systemd-start-nginx\",\n    \"occurrences\": [ 3 ],\n    \"severities\": [ 1,2 ]\n  },\n  {\n    \"description\": \"Perform this action once after Nginx has been down for 120 seconds.\",\n    \"request\": \"systemd-restart-nginx\",\n    \"occurrences\": [ 12 ],\n    \"severities\": [ 1,2 ]\n  }\n]\n"
      }
    },
    "status": 2,
    "output": "this is a test event to see if autoremediation works",
    "handlers": [
      "slack-testing"
    ]
  }
}' \
http://127.0.0.1:3031/events

Summing it up

Putting this all together, we can test workflows in an ad hoc manner by posting mock events to the agent API and reviewing the backend logs to see whether handlers, and the filters applied to them, work as we expect.