Migrating from Nagios to Sensu Go - Help me kill Nagios!

I’m working as a sysadmin/DevOps engineer for an enterprise client, and I’m looking for an alternative to our Nagios monitoring.
We implemented Nagios as our monitoring tool because it was the default choice and our engineers already knew how to implement and configure it.
But as we went along, it became painfully obvious that it’s dated and lacking in the integration department.
After many weeks of research, I’ve found that Sensu (Go) connected to InfluxDB would be the most suitable and future-proof solution moving forward.
But to pitch the migration to the team and the business, I must first make sure that all our current use cases will carry over to the new monitoring application.

I was kind of hoping that you veterans could give me some pointers on how to do that :)

Anyhow, our use cases:

  1. Nagios checks using community plugins (in a pull model using CHECK_NRPE and NSClient++). Some checks are written by team members to address more complex issues (in both pull and push models, depending on the scenario).
  2. We use PNP4Nagios to graph our metrics (which I found awful and time-consuming to dig into).
  3. We use Nagstamon to route events/alerts to engineers, since I’m allergic to automatic incident creation in ITSM tools (years of working with SCOM will do that to you). The main purpose is the sound notifications, which are very hard to ignore. Internally we use Microsoft Teams, but our client forbids us from sending events there (due to security compliance requirements). Same story with Slack.
  4. We use Nagios’ out-of-the-box check history to review how long things like backups or updates took, or to correlate whether they could have affected other processes (so basically trend and pattern analysis).

And here are my ideas and struggles on how to migrate/address them with Sensu Go / InfluxDB / Grafana:

  1. As I dug into Sensu Go, I found that the first two requirements can be easily replicated by standard agents for the pull scenarios. Push scenarios can be handled either by taking advantage of the backend REST API (which is brilliant!) or by pushing results to the StatsD component of the agent. So far so good.

  2. The metric collection part is easy thanks to your awesome InfluxDB handler. From there we can use Grafana to visualize and analyze trends.
    I do, however, struggle a bit to figure out how to combine a service check with metric gathering the way Nagios does.
    In the Nagios world, there are Status Information and Performance Data. This lets us have one check that does both: it outputs human-readable messages as well as data we can graph for later analysis.
    So far, I’ve found two ways to replicate that in Sensu:
    a) Outputting the check result as Nagios Performance Data, setting the check’s output_metric_format to nagios_perfdata, and pushing it to InfluxDB via the sensu-influxdb-handler handler
    b) Creating two separate checks, one for alerting and one for metric gathering
    Option a: This worked okay in testing but is kind of awkward. In the Dashboard UI there isn’t any separation between the human-readable information and the performance data, so depending on the length of the output and the number of performance indicators, it can get long and messy and hurt readability. It also limits our options when it comes to existing plugins: I would have to either write a wrapper script to transform the data or create a mutator.
    Option b: One could argue that this kind of defeats the purpose of implementing Sensu in the first place, since everything has to be done twice. A simpler way would probably be to just use Telegraf to collect the data and Grafana Alerts. And if that’s the case, I would have a much more difficult time making the case to kill Nagios and switch over to Sensu.
    Is my way of thinking correct here? Or am I missing something?

  3. Nagstamon is currently only compatible with Sensu Core. This would probably mean that I’d have to write my own desktop agent or web application to meet that requirement.
    Do you guys have any suggestions for a desktop agent we could use instead? Or perhaps there is a way to implement sound notifications in the Sensu Go Dashboard?

  4. As far as I can tell, the Sensu Go backend keeps information about the previous 21 checks, but this isn’t available through the Dashboard, only via an API call.
    Is there any way to export service check results into a database?
    If not, what would be the recommended approach? Using a custom handler to send them to InfluxDB, or something similar?
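
To make the wrapper idea from option a) a bit more concrete, here is a minimal Python sketch of how plugin output could be split into the human-readable Status Information and the Performance Data. It assumes single-line plugin output in the usual `text | label=value[UOM];warn;crit;min;max` layout; the function name `split_plugin_output` is made up for illustration and is not part of Sensu or Nagios.

```python
import re
import shlex

# One perfdata item looks like: label=value[UOM];[warn];[crit];[min];[max]
# Only the label, the numeric value and the unit are captured here.
PERF_RE = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;]*)")


def split_plugin_output(output):
    """Split single-line plugin output into (status_text, metrics).

    metrics maps each perfdata label to a (value, unit) tuple.
    """
    text, _, perfdata = output.partition("|")
    metrics = {}
    # shlex copes with labels wrapped in single quotes, e.g. 'disk /'=42%
    for item in shlex.split(perfdata):
        match = PERF_RE.match(item)
        if match:
            metrics[match.group("label")] = (
                float(match.group("value")),
                match.group("uom"),
            )
    return text.strip(), metrics


sample = "DISK OK - free space: / 3326 MB (56%) | /=2643MB;5948;5958;0;5968 time=0.5s"
status_text, metrics = split_plugin_output(sample)
print(status_text)
print(metrics)
```

A wrapper along these lines could print only the status text for the dashboard and hand the metrics to whatever forwards them, though thresholds and multi-line output are ignored in this sketch.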

Thank you for all suggestions!

Before digging into your issues:
I’ve been trying to find the time to write up my thoughts on a full migration pattern that I think is a good starting point, but it covers so much ground that it’s hard to get traction on it. I’d be really interested if you could talk through some of the patterns you are using to translate Nagios config resources to Sensu Go resources.

  1. One comment on the push model: make sure you take a look at the Sensu Go agent events API as an alternative to the Sensu backend events API. Depending on your network topology, you might find it easier to push into the agent events API.

  2. I think the intent is for option a) to be used: a single check with both status and metric handlers defined. I think you’ve described a bit of a papercut in the UI for metrics ingestion that is probably a good target for a GitHub issue. I’ll file one and see if the engineering team has some thoughts on how to clean that up. Seems to me the way this should work is that the agent should (optionally?) mutate the output and strip the metrics data that it converted.

  3. I wasn’t aware of Nagstamon. I’ll reach out and see what the developers need to work with Sensu Go. Hopefully it’s just read access to the events API endpoint. If so, then getting them up and running with Sensu Go support shouldn’t be difficult. In fact, with RBAC baked in now, it’s a really good option to use Nagstamon securely with a Sensu user role targeted for Nagstamon access.

  4. The Sensu Go and Sensu Classic event models both hold just enough status history to implement flapping detection, similar to Nagios flap detection actually. The Sensu Go event model now includes timestamps as well. If you want to hold onto history long term, then a DB handler would work here. In fact, that might be something you could share as an asset: a simple event handler that ingests an event and writes tagged status information into InfluxDB as a metric.
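
As a rough illustration of the status-history handler idea in point 4, here is a minimal Python sketch that turns a Sensu Go event (the JSON a pipe handler receives on stdin) into one InfluxDB line-protocol record. The measurement name `check_status` and the helper names are assumptions made up for this example, and only a handful of event fields are used; actually shipping the line to InfluxDB is left out on purpose.

```python
import json
import sys


def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def event_to_line(event, measurement="check_status"):
    """Build one line-protocol record from a Sensu Go event dict."""
    check = event["check"]
    tags = {
        "entity": event["entity"]["metadata"]["name"],
        "check": check["metadata"]["name"],
    }
    tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    # status is the check exit code (0 OK, 1 warning, 2 critical);
    # the trailing "i" marks it as an integer field.
    fields = f"status={int(check['status'])}i,duration={float(check.get('duration', 0.0))}"
    # check.executed is in Unix seconds; line protocol defaults to nanoseconds.
    timestamp_ns = int(check["executed"]) * 1_000_000_000
    return f"{measurement},{tag_str} {fields} {timestamp_ns}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read one event from stdin and write the resulting line to stdout."""
    stdout.write(event_to_line(json.load(stdin)) + "\n")
```

To try it as a pipe handler, one would call `main()` at the bottom of the script and replace the write to stdout with a write to the database. Storing the status as a tagged integer field should make it straightforward to chart state changes and durations per entity and check in Grafana.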

Thanks for sharing your migration status!