Bulk Proxy Checks

ichilton · May 16, 2020, 7:14pm

We’ve got some routers which each have several hundred BGP sessions which I want to have checks for.

My first attempt at this was to write a script which looped round these, generated a check definition for each session and poked them in to the Sensu API.

That works ok (well, except using an interval of 60s for hundreds of checks wasn’t the best idea!), but I feel like there must be a better way.

That’s a lot of nearly identical checks to manage & maintain. It makes things like ‘sensuctl check list’ very long, and it means a lot of check executions and events: tens per second just for a single entity.

Anyone have a similar scenario or have any advice?

I don’t believe I can use the token stuff as the tokens can only be entity attributes?

I’m thinking of having a single check with a script which loops round the sessions (I could have something regularly dump them out to json or yaml to make this quick/easy) and performs the check on each.

It could then use the sensu API to add events for each item. The check itself could then return an aggregate (“there are 195 sessions up, 5 sessions down”) and alert on that.

That’s still a lot of events (and I’m not sure how the backend/api would cope with having a few hundred suddenly spat at it like that every few mins), but would at least only be a single check.

Is that sane? - or is there a better way?

I guess I could just create events for sessions which fail - but then the script would have to somehow keep track of them and store that so it can do the resolution events.
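A minimal sketch of that bookkeeping, assuming the failing session IPs are kept in a small JSON state file between runs. The function names and the state file are hypothetical, not part of Sensu:

```python
import json
from pathlib import Path


def load_failed(state_path):
    """Return the set of session IPs recorded as failing on the previous run."""
    path = Path(state_path)
    if not path.exists():
        return set()  # first run: nothing was failing before
    return set(json.loads(path.read_text()))


def save_failed(state_path, failed):
    """Persist the currently failing session IPs for the next run."""
    Path(state_path).write_text(json.dumps(sorted(failed)))


def plan_events(previous_failed, current_failed):
    """Decide which sessions need a failure event and which need a resolution event."""
    return {
        # Failures are re-sent on every run so the events stay fresh.
        "fail": sorted(current_failed),
        # Sessions that were failing last time but are fine now get resolved.
        "resolve": sorted(previous_failed - current_failed),
    }
```

Each run would call load_failed(), work out the current failures, send events according to plan_events(), then call save_failed() with the new set.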

Any thoughts?

Thanks,

Ian

jspaleta · May 18, 2020, 5:33pm

Hey!

Yeah I think your approach seems reasonable. Let me restate what I heard you say:

Have your check command script read and write a tmpfile holding a simple run counter that increments modulo the number of sessions, and pick the session you want to generate an event for based on the value of the run counter. Write the JSON event to the agent’s events API (default port 3031) and let the agent send the event back to the Sensu backend. Build your check using that script on a tight 10 second interval, so the session being checked rotates every 10 seconds. Build a second check aggregating the session checks at some other interval.
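Roughly, that rotation could look like the sketch below. It assumes the agent events API is listening on 127.0.0.1:3031 and accepts a minimal body with just the check name, status and output. The helper names and the counter file are made up for illustration:

```python
import json
import urllib.request
from pathlib import Path

# Agent events API on its default port (assumed to be reachable locally).
AGENT_EVENTS_URL = "http://127.0.0.1:3031/events"


def next_index(counter_path, total):
    """Return the index to check on this run and store the next one, modulo total."""
    path = Path(counter_path)
    try:
        current = int(path.read_text().strip()) % total
    except (FileNotFoundError, ValueError):
        current = 0  # first run or unreadable counter: start from the beginning
    path.write_text(str((current + 1) % total))
    return current


def build_event(check_name, status, output):
    """Build a minimal event body for the agent events API."""
    return {
        "check": {
            "metadata": {"name": check_name},
            "status": status,
            "output": output,
        }
    }


def post_event(event, url=AGENT_EVENTS_URL):
    """POST the event to the local agent, which forwards it to the backend."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

On each 10 second run the script would call next_index() with the number of sessions, check only that one session, and post_event() the result.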

I’m also interested in understanding what you would like to do with the tokens that you don’t feel like you can do right now. There might be something interesting there to think about for future direction.

ichilton · May 18, 2020, 7:59pm

That’s a good tip about using the agent rather than calling the API on the backend.

What you said sounds much more complicated than I was thinking!

I was thinking I’d have a check, at an interval of a few minutes, which pointed to a script. Said script would loop round each session, run a check, e.g.:

check_snmp_cisco_bgp.pl -H <router ip> -C public -2 -P <session ip>

… and raise an event with the result - as if it had come from a separate check.
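As a rough sketch of that loop: the code below runs check_snmp_cisco_bgp.pl once per session and hands each result to a publish() callback, which could post to the Sensu API or the agent. The "bgp-session-" name prefix, the name sanitising and the publish() signature are assumptions for illustration:

```python
import re
import subprocess

PLUGIN = "check_snmp_cisco_bgp.pl"


def build_command(router_ip, session_ip, community="public"):
    """Build the plugin command line for one BGP session."""
    return [PLUGIN, "-H", router_ip, "-C", community, "-2", "-P", session_ip]


def check_name_for(session_ip, prefix="bgp-session-"):
    """Derive a per-session check name; anything outside [A-Za-z0-9_.-] becomes '-'."""
    return prefix + re.sub(r"[^A-Za-z0-9_.\-]", "-", session_ip)


def run_all(router_ip, session_ips, publish, run=subprocess.run, timeout=10):
    """Run the plugin once per session and pass each result to publish()."""
    for session_ip in session_ips:
        try:
            result = run(
                build_command(router_ip, session_ip),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            status, output = result.returncode, result.stdout.strip()
        except subprocess.TimeoutExpired:
            status, output = 3, "plugin timed out"  # 3 = unknown
        except OSError as error:
            status, output = 3, f"could not run plugin: {error}"
        publish(check_name_for(session_ip), status, output)
```

The per-plugin timeout keeps one stuck SNMP query from holding up the rest of the loop.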

Yes, it would be really nice to do this with tokens, but I read they can only substitute values from the entity?

I’ve got a single proxy entity, “myrouter” and I want to make lots (a few hundred) similar checks. I want to check a list of session ips with: check_snmp_cisco_bgp.pl -H 1.2.3.4 -C public -2 -P <SESSION IP HERE>. So having hundreds of checks which are identical except for a different ip feels really messy. Make sense?

I’m not quite sure how you’d supply that list though. It’s almost like a special type of check with the ability to add a list of items to substitute in to the check in turn.

Thanks,

Ian

jspaleta · May 18, 2020, 10:33pm

Nearly equivalent logic. I prefer to break the looping like that so the check returns quickly inside a time budget if possible…so if there is an unexpected error, I’m not waiting longer than I have to. It’s not much different really.

jspaleta · May 20, 2020, 10:09pm

So thinking about this more after a little beer…

It might not be too difficult to build a generalized wrapper sensu check command that could use a json string encoded into a check annotation to control the looping and generate the sub events accordingly.

ichilton · May 20, 2020, 10:35pm

Hmm… interesting idea! - you should drink more of that beer!!

What would be even better would be the ability to read an external json file with the data - and the filename and what fields to use comes from annotations.

Ian

jspaleta · May 20, 2020, 10:54pm

IMHO, reading an external file is a bit of an anti-pattern for the pub/sub+assets workflow. I want to establish patterns where people can feel confident relying on just the agent+check config to define the operation so there’s less of a requirement to precook the running environment where the check is running.

This helps us create better check templates that people can put into service quickly without having to touch the target environment where the agent is running. Especially important in containerized environments where you don’t necessarily have the ability to recook the running environment without losing state and you need to add a check in a JIT manner.

ichilton · May 20, 2020, 11:00pm

True, I see your point!

ichilton · May 20, 2020, 11:00pm

Is there a limit to how much data you could put in to annotations?

jspaleta · May 20, 2020, 11:02pm

uhm… not sure…
But the new sensu go plugin SDK has introduced the concept of a keyspace prefix inside the annotations, so even if individual annotations have a string length limit, we can probably construct a pattern using the keyspace prefix idea that works reasonably well for this.

jspaleta · May 20, 2020, 11:04pm

this would also be one of those check commands that reads in info via stdin to get access to annotations directly. It’s a less common pattern, so there are probably some sharp corners there still to work out.
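A very rough sketch of the reading side, assuming the check is configured to receive its data on stdin, that the JSON has the annotations under check.metadata.annotations, and that each annotation under a chosen prefix holds a JSON-encoded list. The prefix "bulk-check/sessions/" and the function names are hypothetical:

```python
import json
import sys


def items_from_event(event, prefix):
    """Collect items from every check annotation whose key starts with prefix.

    Each matching annotation value is expected to be a JSON-encoded list.
    Keys are processed in sorted order, so the list can be split across several
    annotations if a single value would get too long.
    """
    annotations = event.get("check", {}).get("metadata", {}).get("annotations") or {}
    items = []
    for key in sorted(annotations):
        if key.startswith(prefix):
            items.extend(json.loads(annotations[key]))
    return items


def read_event(stream=sys.stdin):
    """Parse the JSON document the agent writes to the command's stdin."""
    return json.load(stream)
```

The wrapper would then call items_from_event(read_event(), "bulk-check/sessions/") and loop over the result, generating one sub event per item.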

jspaleta · May 21, 2020, 2:06am

hey,
let me back up a little… and ask the more obvious question.

Why not just write the custom check command that does the loop and reports a non zero condition if you reach a threshold of bad connections? That way you have one check.
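Something like this sketch, where the loop has already produced a map of session IP to up/down and the check only has to turn that into an exit status. The warn/crit thresholds and the function name are placeholders:

```python
def summarize(results, warn=1, crit=5):
    """Turn per-session results into a (status, message) pair for a single check.

    results maps each session IP to True (up) or False (down).
    Status follows the usual convention: 0 = OK, 1 = warning, 2 = critical.
    """
    down = sorted(ip for ip, up in results.items() if not up)
    up_count = len(results) - len(down)
    message = f"there are {up_count} sessions up, {len(down)} sessions down"
    if down:
        message += ": " + ", ".join(down)
    if len(down) >= crit:
        return 2, message
    if len(down) >= warn:
        return 1, message
    return 0, message
```

The check command would print the message and exit with the returned status.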

Is there a human or organizational workflow reason why you want each connection to produce its own event? Do you need to remediate via different contacts based on which connection it is or something similar? Or are you just interested in the aggregate percentage of failed connections?

ichilton · May 21, 2020, 9:09pm

Because I need a slack channel (and a handler) with every event appearing separately (but running that many separate checks was intensive/problematic).

I then need alerts on the ‘threshold’ to a different channel / set of handlers.

jspaleta · May 21, 2020, 9:10pm

okay just making sure there was a workflow need for the separate events.
