I’m working as a sysadmin/DevOps engineer for an enterprise client, and I’m looking for an alternative to our Nagios monitoring.
We implemented Nagios as our monitoring tool since it was the default choice and our engineers already knew how to implement and configure it.
But as we went along, it became painfully obvious that it’s dated and lacking in the integration department.
After many weeks of research, I’ve found that Sensu (Go) connected to InfluxDB would be the most suitable and future-proof solution moving forward.
But to pitch the migration to the team and the business, I first have to make sure that all our current use cases will carry over to the new monitoring application.
I was kind of hoping that you veterans could give me some pointers on how to do that : )
Anyhow, our use cases:
- Nagios checks using community plugins (in a pull model using CHECK_NRPE and NSClient++). Some checks are written by team members to address more complex issues (in both pull and push models, depending on the scenario).
- We use PNP4Nagios to graph our metrics (which I find awful and time-consuming to dig into).
- We use Nagstamon to route events/alerts to engineers, since I’m allergic to automatic incident creation in ITSM tools (years of working with SCOM will do that to you). The main purpose is the sound notifications, which are very hard to ignore. Internally we use Microsoft Teams, but our client forbids us from sending events there (due to security compliance requirements). Same story with Slack.
- We use Nagios’ out-of-the-box check history to review how long things like backups or updates took, or to work out whether they could have affected other processes (so basically trend and pattern analysis).
And here are my ideas and struggles on how to migrate/address them with Sensu Go / InfluxDB / Grafana:
As I dug into Sensu Go, I found that the first two requirements can be easily replicated by the standard agents for the pull scenarios. Push scenarios can be done either by taking advantage of the backend REST API (which is brilliant!) or by pushing results to the StatsD component of the agent. So far so good.
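A rough sketch of the two push paths, assuming a Sensu Go agent running locally with its events API on port 3031 and its StatsD listener on UDP port 8125 (both are the defaults as far as I know, so verify them against your agent config). Note that it posts to the agent's local events API rather than the backend REST API mentioned above, which would additionally need authentication. The check and metric names are made up for illustration.

```python
import json
import socket
import urllib.request

# Assumed defaults for a local Sensu Go agent; adjust to your setup.
AGENT_EVENTS_URL = "http://127.0.0.1:3031/events"
STATSD_ADDR = ("127.0.0.1", 8125)


def build_event(check_name, status, output):
    """Build a minimal check-result payload (0=OK, 1=WARNING, 2=CRITICAL)."""
    return {
        "check": {
            "metadata": {"name": check_name},
            "status": status,
            "output": output,
        }
    }


def push_event(event, url=AGENT_EVENTS_URL):
    """POST the event to the agent events API and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def statsd_gauge(name, value):
    """Format a gauge in the plain StatsD line format: <name>:<value>|g."""
    return f"{name}:{value}|g"


def push_statsd(line, addr=STATSD_ADDR):
    """Send one StatsD line over UDP (fire and forget, no delivery check)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), addr)


if __name__ == "__main__":
    # Hypothetical backup job reporting its own result (push model).
    event = build_event("nightly-backup", 0, "Backup finished in 1240s")
    print(json.dumps(event, indent=2))
    print(statsd_gauge("backup.duration_seconds", 1240))
    # push_event(event)  # uncomment when an agent is listening
```

The event carries the alerting state, while the StatsD line carries only a number, so a job that needs both a status and a graph would send both.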
The metric collection part is easy thanks to your awesome InfluxDB handler. From there we can use Grafana to visualize and analyze trends.
I do, however, struggle a bit to figure out how to combine a service check with metric gathering the way Nagios does.
In the Nagios world, there is something called Status Information and Performance Data. This lets us have one check that does both: output human-readable messages, and output data that lets us do graphing for later analysis.
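To illustrate the split, here is a minimal sketch that separates one line of plugin output into its Status Information and Performance Data parts. It assumes the usual `text | label=value[UOM];warn;crit;min;max` layout, handles only a single output line, and ignores the thresholds; the function name is my own.

```python
import re
import shlex

# Numeric value followed by an optional unit of measurement (s, %, MB, ...).
VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def parse_plugin_output(line):
    """Split one line of plugin output into (status_text, metrics).

    metrics maps each perfdata label to a (value, unit) tuple.
    """
    status_text, _, perfdata = line.partition("|")
    metrics = {}
    # shlex copes with labels quoted in single quotes, e.g. 'disk /'=50%
    for token in shlex.split(perfdata):
        label, sep, rest = token.partition("=")
        if not sep:
            continue  # not a label=value pair, skip it
        # Only the first ';'-separated item is the value; the rest are thresholds.
        match = VALUE_RE.match(rest.split(";")[0])
        if match:
            metrics[label] = (float(match.group(1)), match.group(2))
    return status_text.strip(), metrics


if __name__ == "__main__":
    sample = "DISK OK - free space: / 3326 MB (56%) | /=2643MB;5948;5958;0;5968"
    text, metrics = parse_plugin_output(sample)
    print(text)
    print(metrics)
```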
So far, I’ve found two ways to replicate that in Sensu:
a) Outputting the check result as Nagios Performance Data, setting the check’s output_metric_format to nagios_perfdata, and pushing it to InfluxDB via the sensu-influxdb-handler handler
b) Creating two separate checks, one for alerting, and one for metric gathering
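For option a, the check definition would look roughly like the sketch below, written as a small Python script that prints the resource as JSON. The check name, plugin command, subscription, and the handler name `influxdb` are placeholders I made up; `output_metric_format` and `output_metric_handlers` are the check attributes I mean.

```python
import json


def perfdata_check(name, command, subscriptions, interval=60, namespace="default"):
    """Return a Sensu Go CheckConfig resource as a plain dict."""
    return {
        "type": "CheckConfig",
        "api_version": "core/v2",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "command": command,
            "subscriptions": subscriptions,
            "interval": interval,
            "publish": True,
            # Extract metrics from the Nagios-style perfdata in the output ...
            "output_metric_format": "nagios_perfdata",
            # ... and hand the extracted metrics to this handler
            # (assumed to be a handler resource named "influxdb").
            "output_metric_handlers": ["influxdb"],
        },
    }


if __name__ == "__main__":
    check = perfdata_check(
        "check-disk",                      # hypothetical check name
        "check_disk -w 20% -c 10% -p /",   # hypothetical plugin command
        ["linux"],                         # hypothetical subscription
    )
    print(json.dumps(check, indent=2))
```

The printed JSON should be usable with `sensuctl create -f`, though I have only tried definitions of this shape in a test setup.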
Option a: worked okay in testing but is kind of awkward. In the Dashboard UI there isn’t any separation between the human-readable information and the performance data, so depending on the length of the output and the number of performance indicators, it can get long and messy and hurt readability. This also limits our options when it comes to existing plugins: I would either have to write a wrapper script to transform the data or create a mutator.
Option b: One could argue that this kind of defeats the purpose of implementing Sensu in the first place, since everything has to be done twice. A simpler way would probably be to just use Telegraf to collect the data and Grafana Alerts. And if that’s the case, I would have a much more difficult time making the case to kill Nagios and switch over to Sensu.
Is my way of thinking correct here? Or am I missing something?
The Nagstamon software is currently only compatible with Sensu Core. This would probably mean that I would have to write my own desktop agent or web application to meet that requirement.
Do you guys have any suggestions for a desktop agent we could use instead? Or perhaps there is a way to implement sound notifications in the Sensu Go Dashboard?
As far as I can tell, the Sensu Go backend keeps information about the 21 previous check results, but this isn’t available through the Dashboard, only via an API call.
Is there any way to export service check results into a database?
If not, what would be the recommended approach? Using a custom handler to send them to InfluxDB, or something similar?
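To make the custom handler idea concrete, here is a rough sketch of what I have in mind: a pipe handler that reads the event JSON from stdin and turns the check result into one line of InfluxDB line protocol. The measurement name `sensu_check_result` and the choice of tags and fields are my own assumptions, and actually writing the line to InfluxDB is left out.

```python
import json
import sys


def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_field_string(value):
    """Escape backslashes and double quotes inside a string field value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def event_to_line(event, measurement="sensu_check_result"):
    """Convert one Sensu Go event (as a dict) to a line-protocol string."""
    check = event["check"]
    tags = {
        "entity": event["entity"]["metadata"]["name"],
        "check": check["metadata"]["name"],
    }
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    # Collapse all whitespace (including newlines) so the output stays on one line.
    output = " ".join(check.get("output", "").split())
    fields = [
        f"status={int(check['status'])}i",
        f"duration={float(check.get('duration', 0.0))}",
        f'output="{escape_field_string(output)}"',
    ]
    # 'executed' is a Unix timestamp in seconds; line protocol defaults to ns.
    timestamp_ns = int(check["executed"]) * 1_000_000_000
    return f"{measurement},{tag_part} {','.join(fields)} {timestamp_ns}"


def handle_stdin(stream=sys.stdin):
    """Entry point for use as a pipe handler: event JSON arrives on stdin."""
    print(event_to_line(json.load(stream)))
```

A real handler script would end with a call to `handle_stdin()` and would send the line to InfluxDB instead of printing it. Storing `duration` and `executed` for every run is what would let us answer the "how long did the backup take" questions in Grafana.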
Thank you for all suggestions!