Bulk Proxy Checks

We’ve got some routers, each with several hundred BGP sessions, and I want a check for every session.

My first attempt at this was a script which looped round the sessions, generated a check definition for each one and pushed them into the Sensu API.
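For anyone following along, that first attempt might look roughly like the sketch below. It only builds the definitions and prints them as JSON. The check naming scheme, the "proxy" subscription, the 300s interval and the session list are all assumptions made for illustration, not what the original script used.

```python
import json


def build_check(router, router_ip, session_ip, interval=300):
    """Return one Sensu Go check definition for a single BGP session."""
    # Hypothetical naming scheme: keep names to letters, digits and dashes.
    safe_ip = session_ip.replace(".", "-").replace(":", "-")
    return {
        "type": "CheckConfig",
        "api_version": "core/v2",
        "metadata": {"name": "bgp-{}-{}".format(router, safe_ip),
                     "namespace": "default"},
        "spec": {
            "command": "check_snmp_cisco_bgp.pl -H {} -C public -2 -P {}".format(
                router_ip, session_ip),
            "interval": interval,
            "publish": True,
            "proxy_entity_name": router,
            "subscriptions": ["proxy"],  # assumption: an agent subscribes to this
        },
    }


if __name__ == "__main__":
    sessions = ["10.0.0.1", "10.0.0.2"]  # hypothetical session IPs
    for ip in sessions:
        print(json.dumps(build_check("myrouter", "1.2.3.4", ip)))
```

The printed definitions would still have to be loaded, for example with sensuctl or the checks API, which is exactly the management overhead described next.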

That works ok (well, except using an interval of 60s for hundreds of checks wasn’t the best idea!), but I feel like there must be a better way.

That’s a lot of nearly identical checks to manage & maintain. It makes things like ‘sensuctl check list’ very long, and it means a lot of checks/events happening: tens per second just for a single entity.

Anyone have a similar scenario or have any advice?

I don’t believe I can use token substitution, as the tokens can only refer to entity attributes?

I’m thinking of having a single check with a script which loops round the sessions (I could have something regularly dump them out to JSON or YAML to make this quick/easy) and performs the check on each.

It could then use the Sensu API to add an event for each item. The check itself could then return an aggregate (“there are 195 sessions up, 5 sessions down”) and alert on that.

That’s still a lot of events (and I’m not sure how the backend/API would cope with having a few hundred suddenly spat at it like that every few mins), but it would at least only be a single check.

Is that sane? - or is there a better way?

I guess I could just create events for sessions which fail, but then the script would have to somehow keep track of the failures and store them so it can send the resolution events later.
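A minimal sketch of that bookkeeping, assuming the failed sessions are kept in a small JSON state file between runs. The file format and the function names are hypothetical.

```python
import json


def load_failed(path):
    """Return the set of sessions recorded as failed by the previous run."""
    try:
        with open(path) as fh:
            return set(json.load(fh))
    except (OSError, ValueError):
        # No state file yet, or unreadable contents: treat as nothing failed.
        return set()


def save_failed(path, failed):
    """Store the sessions that are failing now, for the next run to compare."""
    with open(path, "w") as fh:
        json.dump(sorted(failed), fh)


def plan_events(previous_failed, current_failed):
    """Return (sessions needing a failure event, sessions needing a resolution event)."""
    to_alert = sorted(current_failed)
    to_resolve = sorted(previous_failed - current_failed)
    return to_alert, to_resolve
```

Each run would call load_failed, work out the current failures, send events for both lists from plan_events, and then call save_failed. If the state file is lost, any resolutions for sessions that recovered in the meantime would be missed.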

Any thoughts?

Thanks,

Ian

Hey!

Yeah I think your approach seems reasonable. Let me restate what I heard you say:

Have your check command script read and write a tmpfile holding a simple run counter that increments modulo the number of sessions, and pick the session you want to generate an event for based on the value of that counter. Write the JSON event to the agent’s events API (default port 3031) and let the agent send the event back to the Sensu backend. Build your check using that script on a tight 10-second interval, so the session being checked rotates every 10 seconds. Then build a second check that aggregates the session checks at some other interval.
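The run counter part of that could be sketched like this. The counter file location and the session list are assumptions, and there is no locking, so it relies on only one copy of the check running at a time.

```python
import os
import tempfile


def next_index(counter_path, session_count):
    """Return the session index for this run and store the next one.

    The counter wraps around using modulo, so runs cycle through all sessions.
    """
    try:
        with open(counter_path) as fh:
            current = int(fh.read().strip())
    except (OSError, ValueError):
        current = 0  # first run, or a corrupt counter file
    current %= session_count
    with open(counter_path, "w") as fh:
        fh.write(str((current + 1) % session_count))
    return current


if __name__ == "__main__":
    sessions = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical session IPs
    path = os.path.join(tempfile.gettempdir(), "bgp_run_counter")
    print("checking session", sessions[next_index(path, len(sessions))])
```

At a 10 second interval, a few hundred sessions would take the better part of an hour to cycle through once, which is worth weighing against how quickly a down session needs to be noticed.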

I’m also interested in understanding what you would like to do with the tokens that you don’t feel like you can do right now. There might be something interesting there to think about for future direction.

That’s a good tip about using the agent rather than calling the API on the backend.

What you said sounds much more complicated than I was thinking!

I was thinking I’d have a check, at an interval of a few minutes, which pointed to a script. Said script would loop round each session and run a check, e.g.:

check_snmp_cisco_bgp.pl -H <router ip> -C public -2 -P <session ip>

… and raise an event with the result, as if it had come from a separate check.
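Here is a rough sketch of that loop body: run the plugin for one session, then hand the result to the local agent's events API on port 3031. The per-session check naming is hypothetical, and using proxy_entity_name inside the posted check to attach the event to the router entity is an assumption worth verifying against the agent API docs.

```python
import json
import subprocess
import urllib.request

AGENT_EVENTS_URL = "http://127.0.0.1:3031/events"  # local agent, default API port


def run_plugin(router_ip, session_ip, timeout=10):
    """Run the SNMP plugin for one session; return (exit status, output)."""
    cmd = ["check_snmp_cisco_bgp.pl", "-H", router_ip,
           "-C", "public", "-2", "-P", session_ip]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return proc.returncode, proc.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 3, "UNKNOWN: {}".format(exc)  # plugin missing or too slow


def build_event(router, session_ip, status, output):
    """Build an event payload that looks like it came from its own check."""
    return {
        "check": {
            # Hypothetical naming scheme for the per-session check.
            "metadata": {"name": "bgp-session-" + session_ip.replace(".", "-")},
            "status": status,
            "output": output,
            "proxy_entity_name": router,  # assumption, see note above
        }
    }


def post_event(event, url=AGENT_EVENTS_URL):
    """POST one event to the agent and return the HTTP status code."""
    req = urllib.request.Request(
        url, data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The outer script would call run_plugin for each session IP, then post_event(build_event(...)) with the result.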

Yes, it would be really nice to do this with tokens, but I read they can only substitute values from the entity?

I’ve got a single proxy entity, “myrouter” and I want to make lots (a few hundred) similar checks. I want to check a list of session ips with: check_snmp_cisco_bgp.pl -H 1.2.3.4 -C public -2 -P <SESSION IP HERE>. So having hundreds of checks which are identical except for a different ip feels really messy. Make sense?

I’m not quite sure how you’d supply that list though. It’s almost like a special type of check with the ability to add a list of items to substitute in to the check in turn.

Thanks,

Ian

Nearly equivalent logic. I prefer to break up the looping like that so the check returns quickly, inside a time budget if possible… so if there is an unexpected error, I’m not waiting longer than I have to. It’s not much different really.

So thinking about this more after a little beer…

It might not be too difficult to build a generalized wrapper sensu check command that could use a json string encoded into a check annotation to control the looping and generate the sub events accordingly.
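To make the idea concrete, here is a sketch of how such a wrapper could pull its item list out of a check annotation. It assumes the check is configured so that the event JSON arrives on stdin, and the annotation key "bulk_check/items" is made up for this example.

```python
import json
import sys

ANNOTATION_KEY = "bulk_check/items"  # hypothetical annotation name


def items_from_event(event, key=ANNOTATION_KEY):
    """Return the list of items encoded as a JSON string in a check annotation."""
    check = event.get("check") or {}
    metadata = check.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    items = json.loads(annotations.get(key, "[]"))
    if not isinstance(items, list):
        raise ValueError("annotation {} must hold a JSON list".format(key))
    return items


def main(stream):
    """Read the event from the given stream and walk the annotated items."""
    event = json.load(stream)
    for item in items_from_event(event):
        # A real wrapper would run the sub-check here and emit a sub-event.
        print("would check", item)
    return 0
```

A real check command would end with sys.exit(main(sys.stdin)), and the annotation value would be something like '["10.0.0.1", "10.0.0.2"]'.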

Hmm… interesting idea! - you should drink more of that beer!! :wink:

What would be even better would be the ability to read an external json file with the data - and the filename and what fields to use comes from annotations.

Ian

IMHO, reading an external file is a bit of an anti-pattern for the pub/sub + assets workflow. I want to establish patterns where people can feel confident relying on just the agent + check config to define the operation, so there’s less of a requirement to precook the environment where the check is running.

This helps us create better check templates that people can put into service quickly without having to touch the target environment where the agent is running. Especially important in containerized environments where you don’t necessarily have the ability to recook the running environment without losing state and you need to add a check in a JIT manner.

True, I see your point!

Is there a limit to how much data you could put in to annotations?

uhm… not sure…
But the new Sensu Go plugin SDK has introduced the concept of a keyspace prefix inside the annotations, so even if individual annotations have a string length limit, we can probably construct a pattern using the keyspace prefix idea that works reasonably well for this.

This would also be one of those check commands that reads the event in via stdin to get direct access to the annotations. That’s a less common pattern, so there are probably still some sharp corners to work out.
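One way the keyspace prefix idea could be applied to a long session list is to spread it over several annotations that share a prefix and stitch them back together in the check. This is only a sketch of that pattern, not the SDK's own mechanism, and the prefix and chunk size are hypothetical.

```python
import json

PREFIX = "bulk_check/items/"  # hypothetical keyspace prefix


def chunk_items(items, size, prefix=PREFIX):
    """Split a list across several annotations that share the prefix."""
    return {
        "{}{:04d}".format(prefix, start // size): json.dumps(items[start:start + size])
        for start in range(0, len(items), size)
    }


def merge_items(annotations, prefix=PREFIX):
    """Rebuild the full list from every annotation under the prefix."""
    merged = []
    # Zero-padded suffixes keep the chunks in order when sorted as strings.
    for key in sorted(k for k in annotations if k.startswith(prefix)):
        merged.extend(json.loads(annotations[key]))
    return merged
```

Annotations outside the prefix are ignored by merge_items, so the same check can still carry unrelated annotations.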

hey,
let me back up a little… and ask the more obvious question.

Why not just write a custom check command that does the loop and reports a non-zero status if you reach a threshold of bad connections? That way you have one check.
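The threshold part of that single check could be as small as the sketch below. The warning and critical thresholds are placeholders, and the results dict (session IP to plugin exit status) is assumed to come from whatever loop runs the per-session checks.

```python
def summarise(results, warn=1, crit=5):
    """Turn per-session exit statuses into one check status and message.

    results maps session IP -> plugin exit status (0 means the session is up).
    """
    down = sorted(ip for ip, status in results.items() if status != 0)
    up = len(results) - len(down)
    if len(down) >= crit:
        status = 2  # critical
    elif len(down) >= warn:
        status = 1  # warning
    else:
        status = 0  # ok
    message = "there are {} sessions up, {} sessions down".format(up, len(down))
    if down:
        message += " (" + ", ".join(down) + ")"
    return status, message
```

The check script would print the message and exit with the returned status, so Sensu treats it like any other check result.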

Is there a human or organizational workflow reason why you want each connection to produce its own event? Do you need to remediate via different contacts based on which connection it is, or something similar? Or are you just interested in the aggregate percentage of failed connections?

Because I need a slack channel (and a handler) with every event appearing separately (but running that many separate checks was intensive/problematic).

I then need alerts on the ‘threshold’ to a different channel / set of handlers.

Okay, just making sure there was a workflow need for the separate events.