Last_ok field in events submitted to agent API not being consumed

I’m implementing a plugin for LibreNMS that translates alerts into Sensu events via the agent API. Because LibreNMS only generates alerts on failure, I’m finding that the incident start / last ok values are not being set correctly.

I tried manually setting a last_ok value in the events I’m generating, but it’s being ignored (i.e. the UI still reflects the timestamp of the last actual ok event):

{
  "check": {
    "metadata": {
      "name": "lnms-test-rule",
      "namespace": "default",
      "annotations": {
        "generated-by": "LibreNMS",
        "acknowledged": "false"
      }
    },
    "command": "LibreNMS: macros.device = 1",
    "executed": 1586183569,
    "interval": 300,
    "issued": 1586183569,
    "last_ok": 1586183269,
    "output": "This is a test alert",
    "status": 2
  },
  "entity": {
    "metadata": {
      "name": "ndc-ipam-001",
      "namespace": "default"
    },
    "system": {
      "hostname": "ndc-ipam-001"
    }
  }
}

So instead of last_ok being set to 1586183269, the resulting critical event has a last_ok of the last time the event was resolved.

Is what I’m trying to do possible, or should I just spoof an “ok” event before raising the alert?

You won’t be able to set last_ok yourself; it’s managed internally.

Does LibreNMS not generate an ‘ok’ event that you could ingest as well? Otherwise, how are you going to clear the alerts?

It does, which clears the alert as expected, but if the same fault is raised a second time, Sensu only knows about that single ‘ok’ instance.

This makes it incorrectly report that the alert has been open for much longer than it actually has: despite the test alert above only being raised at around 2pm today, the dashboard stated the alert had been running since last week (the last time I raised and cleared the alert).

Sounds like it’s intended behaviour, so I can send a fake “ok” to the agent to set the timer before the real alert is sent, which should do the trick.

I think I understand the problem… the first injection of the event has a non-zero status… and this is probably something we’ll need to think about for synthetic events that aren’t scheduled checks running on a regular interval.

Can you do me a favor and paste a redacted JSON event, captured while the event is in the bad state, for me to look at?

I’m most interested in the check.history from the event record.

Example sensuctl + jq command to grab what I’m looking for:

sensuctl event info --format json thing1 diagnostic_check | jq .check.history

[
  {
    "status": 0,
    "executed": 1586320680
  },
  {
    "status": 0,
    "executed": 1586320740
  },
  {
    "status": 0,
    "executed": 1586320800
  },
  {
    "status": 0,
    "executed": 1586320860
  },
  {
    "status": 0,
    "executed": 1586320920
  }
]

Sure - I’ve just sent a test event to the socket, so in the UI we have:

atl-csw-001 > lnms-interface-speed-reduced
Incident started 2 days ago and has continued for 3 consecutive executions.

In the JSON we have:

[
  {
    "status": 1,
    "executed": 1586203381
  },
  {
    "status": 1,
    "executed": 1586203681
  },
  {
    "status": 1,
    "executed": 1586203981
  },
  {
    "status": 1,
    "executed": 1586204281
  },
  {
    "status": 1,
    "executed": 1586204581
  },
  {
    "status": 1,
    "executed": 1586204881
  },
  {
    "status": 1,
    "executed": 1586205181
  },
  {
    "status": 1,
    "executed": 1586205481
  },
  {
    "status": 0,
    "executed": 1586205781
  },
  {
    "status": 0,
    "executed": 1586206381
  },
  {
    "status": 1,
    "executed": 1586206686
  },
  {
    "status": 1,
    "executed": 1586206981
  },
  {
    "status": 0,
    "executed": 1586207281
  },
  {
    "status": 1,
    "executed": 1586213581
  },
  {
    "status": 1,
    "executed": 1586213881
  },
  {
    "status": 1,
    "executed": 1586214181
  },
  {
    "status": 1,
    "executed": 1586214481
  },
  {
    "status": 0,
    "executed": 1586214781
  },
  {
    "status": 1,
    "executed": 1586356381
  },
  {
    "status": 1,
    "executed": 1586356681
  },
  {
    "status": 1,
    "executed": 1586356981
  }
]

I don’t see anything hugely sensitive in the full event, so it’s here if needed with a couple of annotations changed:

{
  "timestamp": 1586357581,
  "entity": {
    "entity_class": "proxy",
    "system": {
      "network": {
        "interfaces": null
      }
    },
    "subscriptions": [
      "entity:atl-csw-001",
      "entity:atl-csw-001"
    ],
    "last_seen": 0,
    "deregister": false,
    "deregistration": {},
    "metadata": {
      "name": "atl-csw-001",
      "namespace": "default"
    },
    "sensu_agent_version": ""
  },
  "check": {
    "command": "LibreNMS: {\"condition\":\"AND\",\"rules\":[{\"id\":\"ports.ifHighSpeed\",\"field\":\"ports.ifHighSpeed\",\"type\":\"string\",\"input\":\"text\",\"operator\":\"less\",\"value\":\"`ports.ifHighSpeed_prev`\"}],\"valid\":true}",
    "handlers": [],
    "high_flap_threshold": 0,
    "interval": 300,
    "low_flap_threshold": 0,
    "publish": false,
    "runtime_assets": null,
    "subscriptions": [],
    "proxy_entity_name": "atl-csw-001.inf.vjn",
    "check_hooks": null,
    "stdin": false,
    "subdue": null,
    "ttl": 0,
    "timeout": 0,
    "round_robin": false,
    "executed": 1586357581,
    "history": [
      {
        "status": 1,
        "executed": 1586203981
      },
      {
        "status": 1,
        "executed": 1586204281
      },
      {
        "status": 1,
        "executed": 1586204581
      },
      {
        "status": 1,
        "executed": 1586204881
      },
      {
        "status": 1,
        "executed": 1586205181
      },
      {
        "status": 1,
        "executed": 1586205481
      },
      {
        "status": 0,
        "executed": 1586205781
      },
      {
        "status": 0,
        "executed": 1586206381
      },
      {
        "status": 1,
        "executed": 1586206686
      },
      {
        "status": 1,
        "executed": 1586206981
      },
      {
        "status": 0,
        "executed": 1586207281
      },
      {
        "status": 1,
        "executed": 1586213581
      },
      {
        "status": 1,
        "executed": 1586213881
      },
      {
        "status": 1,
        "executed": 1586214181
      },
      {
        "status": 1,
        "executed": 1586214481
      },
      {
        "status": 0,
        "executed": 1586214781
      },
      {
        "status": 1,
        "executed": 1586356381
      },
      {
        "status": 1,
        "executed": 1586356681
      },
      {
        "status": 1,
        "executed": 1586356981
      },
      {
        "status": 1,
        "executed": 1586357281
      },
      {
        "status": 1,
        "executed": 1586357581
      }
    ],
    "issued": 1586357581,
    "output": "Alert for device atl-csw-001 - Interface speed reduced",
    "state": "failing",
    "status": 1,
    "total_state_change": 29,
    "last_ok": 1586214781,
    "occurrences": 5,
    "occurrences_watermark": 5,
    "output_metric_format": "",
    "output_metric_handlers": null,
    "env_vars": null,
    "metadata": {
      "name": "lnms-interface-speed-reduced",
      "namespace": "default",
      "annotations": {
        "location": "Somewhere",
        "description": "Core Switch",
        "acknowledged": "false",
        "generated-by": "LibreNMS",
        "librenms-rule-id": "59",
        "librenms-device-id": "114"
      }
    }
  },
  "metadata": {
    "namespace": "default"
  }
}

Thanks, so based on the history there… let’s see if I can match your experience with the UI problem.

Based on the history in the event, is the UI currently showing the correct information concerning the length of the current incident?

Assuming this is the entire history of the event (Sensu only keeps the last 20 entries)… it looks like the event history in the UI starts to look correct from Monday, April 6, 2020 8:53:01 PM UTC, once there are two status-0 entries in the history, so that there is a last_ok to work from.

It looks like there’s a 10-minute period from Monday, April 6, 2020 8:43:01 PM UTC to Monday, April 6, 2020 8:53:01 PM UTC where the UI’s start of incident could be wrong… assuming this is the entire history of this event.

Maybe there’s a breakdown in the logic generating status-0 Sensu events when LibreNMS clears an event? The Sensu event history array records up to the last 20 updates to that particular event, and if you were expecting a status 0 to have arrived since 1586214781, that would indicate the status-0 events aren’t making it into Sensu as you expect.

Maybe the LibreNMS status-0 events are malformed somehow, using a different check or entity name in the data payload? If that were true, they’d show up as a different event in Sensu. Sensu catalogs events by entity name and check name, so a small typo in either one in the generated status-0 event would make Sensu record it as a separate event. It’s a good idea to review your list of events currently in a status-0 state and see if something like that is happening, e.g. with the one-liner below.
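
For example, building on the sensuctl + jq one-liner above, something like this (the exact jq filter is just an illustration) would list the entity/check pair of every event currently reporting status 0:

# list entity/check names of all current status-0 events (illustrative jq filter)
sensuctl event list --format json | jq '.[] | select(.check.status == 0) | {entity: .entity.metadata.name, check: .check.metadata.name}'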

In the trace above I hadn’t resolved the incident so it’s correct that there isn’t a final “ok” event. But the incident started today, not 2 days ago.

I think it’s just down to terminology - the UI treats the value of “last_ok” as the time an incident started, while the API docs define it as the last time an “ok” event was seen.
As you noted, for events generated by the Sensu agent that’s 100% correct on both sides. Things get wonky when event sources only generate failures and recovery notices.

With the old version of my code, because I can’t update the last_ok value when an incident starts (from LibreNMS’ perspective), Sensu thinks the incident has been open since the time the last ok was received - which is the time the last incident was closed for that check/device tuple, or the Unix epoch for brand-new incidents.

In the current version of my code I explicitly send an “ok” to update the last_ok before sending the real event - that looks like this:

{
  "timestamp": 1586376781,
  "entity": {
    "entity_class": "proxy",
    "system": {
      "network": {
        "interfaces": null
      }
    },
    "subscriptions": [
      "entity:lho-esw-001",
      "entity:lho-esw-001"
    ],
    "last_seen": 0,
    "deregister": false,
    "deregistration": {},
    "metadata": {
      "name": "lho-esw-001",
      "namespace": "default"
    },
    "sensu_agent_version": ""
  },
  "check": {
    "command": "LibreNMS: {\"condition\":\"AND\",\"rules\":[{\"id\":\"ports.ifHighSpeed\",\"field\":\"ports.ifHighSpeed\",\"type\":\"string\",\"input\":\"text\",\"operator\":\"less\",\"value\":\"`ports.ifHighSpeed_prev`\"}],\"valid\":true}",
    "handlers": [],
    "high_flap_threshold": 0,
    "interval": 300,
    "low_flap_threshold": 0,
    "publish": false,
    "runtime_assets": null,
    "subscriptions": [],
    "proxy_entity_name": "lho-esw-001",
    "check_hooks": null,
    "stdin": false,
    "subdue": null,
    "ttl": 0,
    "timeout": 0,
    "round_robin": false,
    "executed": 1586376781,
    "history": [
      {
        "status": 1, << incident refreshed
        "executed": 1586206566
      },
      {
        "status": 1, << incident refreshed
        "executed": 1586206861
      },
      {
        "status": 1, << incident refreshed
        "executed": 1586207161
      },
      {
        "status": 0, << incident ended
        "executed": 1586207461
      },
      {
        "status": 0, << additional event sent to update last ok
        "executed": 1586214066
      },
      {
        "status": 1, << incident started
        "executed": 1586214071
      },
      {
        "status": 1, << incident refreshed
        "executed": 1586214361
      },
      {
        "status": 0, << incident ended
        "executed": 1586214661
      },
      {
        "status": 0, << additional event sent to update last ok
        "executed": 1586375881
      },
      {
        "status": 1, << incident started
        "executed": 1586375886
      },
      {
        "status": 1, << incident refreshed
        "executed": 1586376181
      },
      {
        "status": 1, << incident refreshed
        "executed": 1586376481
      },
      {
        "status": 0,  << incident ended
        "executed": 1586376781
      }
    ],
    "issued": 1586376781,
    "output": "Device lho-esw-001 recovered from Interface speed reduced",
    "state": "passing",
    "status": 0,
    "total_state_change": 0,
    "last_ok": 1586376781,
    "occurrences": 1,
    "occurrences_watermark": 3,
    "output_metric_format": "",
    "output_metric_handlers": null,
    "env_vars": null,
    "metadata": {
      "name": "lnms-interface-speed-reduced",
      "namespace": "default",
      "annotations": {
        "generated-by": "LibreNMS",
        "librenms-rule-id": "59",
        "librenms-device-id": "300",
        "location": "Somewhere",
        "description": "Edge Switch",
        "acknowledged": "false"
      }
    }
  },
  "metadata": {
    "namespace": "default"
  }
}

Which works as expected; the only hardship on my side is a little added complexity in sending two events to the socket rather than one when an incident starts.
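
For anyone following along, a minimal sketch of that double-send, assuming the agent HTTP events API on its default port 3031 (the entity/check names and output strings are placeholders; adjust if you’re feeding the TCP/UDP socket on 3030 instead):

# placeholder entity/check names; prime last_ok with a synthetic "ok" first
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"entity": {"metadata": {"name": "lho-esw-001"}}, "check": {"metadata": {"name": "lnms-interface-speed-reduced"}, "status": 0, "output": "priming last_ok"}}' \
  http://127.0.0.1:3031/events

# then send the real failure event
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"entity": {"metadata": {"name": "lho-esw-001"}}, "check": {"metadata": {"name": "lnms-interface-speed-reduced"}, "status": 1, "output": "Alert for device lho-esw-001 - Interface speed reduced"}}' \
  http://127.0.0.1:3031/events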

It’d simplify my code if I could just include the time that the check last succeeded (from LibreNMS’ perspective) in the failure event, but that’s all.

I’m gonna have to create some synthetic events myself to see if I can build a reproducible pattern I can hand to engineering to test against. This feels like a corner-case bug that we only hit with checks that aren’t scheduled on a regular interval.
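
Something like the following loop should let me inject an irregular synthetic history, assuming a local agent with the events API enabled on its default port (the entity/check names are made up):

# replay an irregular status pattern against a hypothetical check/entity
for s in 1 1 0 1 1; do
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"entity\": {\"metadata\": {\"name\": \"synthetic-entity\"}}, \"check\": {\"metadata\": {\"name\": \"synthetic-check\"}, \"status\": $s, \"output\": \"synthetic event, status $s\"}}" \
    http://127.0.0.1:3031/events
  sleep 5
done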

We are facing a similar issue with the last_ok field. We generate events via the events API using the sensu-go client. The events contain “fake” checks whose source is Prometheus alerts. We maintain a check history which we send along with the event, and we send the event every time a Prometheus alert is received via an Alertmanager. This way we have a nice timeline in the Sensu UI.
When we send the event for the very first time, the LastOK in the check is shown as “never”, even if we set the field. As soon as the alert recovers (check status in the event set to 0), the last ok field is set to the executed time.
When the alert triggers again, we manually set the LastOK in the check to a minute before the alert was triggered, but it gets overridden by the last executed that carried status 0.
If I’m not wrong, it is this line of code which causes this behaviour.
It would be very cool if the backend only set LastOK when the incoming field is unset or 0. That way we would be able to pass meaningful values for checks which simply don’t exist or aren’t executed via an agent.
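
For illustration, a rough curl equivalent of what we submit (we actually use the sensu-go client; the entity/check names, the timestamp, and $SENSU_API_KEY are placeholders):

# POST a failing event with last_ok pre-set to the backend events API (placeholders throughout)
curl -s -X POST -H "Authorization: Key $SENSU_API_KEY" -H 'Content-Type: application/json' \
  -d '{"entity": {"entity_class": "proxy", "metadata": {"name": "prom-alert-source", "namespace": "default"}}, "check": {"metadata": {"name": "prom-alert-rule", "namespace": "default"}, "status": 1, "last_ok": 1586376721, "output": "Prometheus alert firing"}}' \
  http://127.0.0.1:8080/api/core/v2/namespaces/default/events

As described above, the last_ok we set here gets overridden by the backend.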
