Free Monitoring Solutions: Trouble

Prior blogs in this series have described the complexity of sourcing, deploying, configuring, and operating free monitoring solutions. We’ve discussed the challenge of developing the guru-level in-house expertise required to run “kit of parts” monitoring in demanding, large-scale enterprise environments. And we’ve examined some of the negative impacts this can have on careers: taking talented, driven, fast-learning folks away from more strategic and profitable work.

Now we’re looking at time to value, and at the value of time.

In real IT environments, monitoring is a “high touch” undertaking. Every time you add something new to your hybrid IT estate, you need to figure out how to monitor it and then get it monitored. Every time you change or update something, you need to make sure you haven’t broken its monitoring in the process. Monitoring reports, direct alerts, and integrations to issue-tracking and collaboration systems all trigger activity: from low-intensity, solo maintenance tasks (e.g., replacing failing hard drives in an array or cluster) to all-hands-on-deck fire drills.

All these tasks and processes take time. Time is money. And lost time is lost opportunity. So it’s a good thing when monitoring helps you do things fast, fast, fast, with minimal errors -- in biz-speak, when “time to value” is short.

In monitoring, a lot of time -- maybe the greatest fraction -- gets spent ingesting information: obtaining and verifying configuration data for systems in place, establishing processes for compiling and storing new and changed configurations as the IT estate evolves and grows, then creating means to automate all or part of the process of getting new infrastructure monitored.

Sadly most free monitoring solutions don’t provide mature, fully-baked, sleekly-productized answers to the data-ingestion and “get things monitored fast” problems.

Instead, they offer tools and points of entry. Often, these are technically elegant, flexible, and coding-friendly -- making them intellectually attractive to technophiles (i.e., to most people who choose IT as a career) and compliant with the philosophy and culture of open source, which seeks to avoid forcing detailed, “opinionated,” end-to-end solutions on users (more on open source in an upcoming blog).

Unfortunately, getting benefit from unopinionated tools itself takes time. Before work with the most elemental of free monitoring solutions can be accelerated, IT folks need to learn the mechanics of a core engine well enough to source and integrate first-order productivity aids (e.g., WebUIs, CLIs), then wrap these further in adopted or (more likely) home-grown automation for faster, more self-assured use at scale.

A free monitoring engine, like Nagios Core, has many such relatively “unfinished” edges. For example:

CGIs for accessing major functions. Nagios Core provides what -- these days -- is considered a fairly old-school user-facing API, comprising common gateway interface back-end functions that can be called by web pages (which you need to source+integrate or compose yourself), command-line APIs (ditto), and/or via a RESTful interface (ditto again).

Every one of these interface tools is required for what most enterprise IT organizations would think of as “normal operations.” Without a consistent, simple to use, visually clear and informative webUI, you can’t easily see what’s going on in your datacenter, and have no foundation for building dashboards or other tools for accelerated visualization. Without CLI or REST interfaces (plus, most likely, additional custom integration layers), you can’t integrate with a CMDB (e.g., ServiceNow) to extract configuration data, or (as is increasingly common) use the monitoring system as a “single source of truth,” to insert new configuration records. You can’t consume output from “autodiscovery” tools,

Heavy use of verbose configuration files. Nagios is configured using human-readable text files -- making it possible to search, modify, and auto-generate with standard Linux command-line utilities, BASH scripts, Python, Ansible/Puppet/Chef and/or other tools. You can modify, store and retrieve them from repositories. You can version-control them: making and testing changes, then roll back troubled configs. Awesome!

Unfortunately, they’re complicated. Making what seem like small changes may require modifying multiple, interdependent files or could have widespread unintended consequences. Just getting a new Linux server monitored with NRPE under Nagios is a bear. Steps on the remote host. Creation of user accounts. Installation of software dependencies (like gcc?). Compile-in-place of components (oh, that’s why the C compiler). Local testing. Then more steps on the monitoring server: inheriting from config templates, adding service definitions to lists, touching multiple files, more testing.

Once you’ve built the tooling, moreover, it becomes yours to maintain: a form of technical debt, partly or wholly unique to your implementation. Because the configuration and its tooling comprise custom software, it may never be fully understood by colleagues. What, after all, would be their motivation, so long as the solution works right now? Result: as we discussed in our previous blog, when an expert in free monitoring tools leaves the company, their prior work stands a very good chance of being abandoned. Back to the drawing board.

What’s the cost of moving slowly? A lot. According to Forbes, adopting a digital-first business strategy (i.e., “becoming a technology company instead of a <your industry here> company”) makes organizations up to 34% more profitable. Revenue per employee for highly tech-forward organizations (e.g., Facebook, Google) can range from around $200K to over $1 million USD per year. So the opportunity cost of shifting just one seasoned DevOps pro into non-profit-generating activities (e.g., creating and babysitting a free monitoring platform) is enormous.

The table below, drawn from the experience of Opsview’s Customer Success team, should let you estimate the cost of implementing free monitoring, as opposed to a mature, fully-supported, completely “productized” monitoring solution.

Time to Value Comparison
Activity	Free Monitoring	Mature, Supported Monitoring
Plan, source, and deploy	Order of months	Order of hours to days
Create data ingestion plan	Order of days to weeks	Order of days
Develop and test tooling for: Data ingestion Discovery Data normalization Data insertion	Order of weeks to months	Already built. Order of days to adapt, and can be achieved easily with APIs and standardized interfaces.
Develop and validate monitoring plans for: On-premises infrastructure Business services Cloud estates, etc.	Order of weeks to months	Order of hours to days
Develop and test code to automate monitoring of all standard assets and services	Order of weeks to months	Already built. Order of days to adapt
Get everything monitored	Order of weeks	Order of days
Develop and validate integrations with external business infrastructure and key ITOM applications	Order of weeks to months	Already built. Order of hours to apply
Develop and validate “quality of life” add-ons, e.g., dashboards, complex reports, deep analytics, etc.	Order of months to years and may never be truly completed.	Much tooling is already built or templated. Order of days to weeks to produce fully-custom from templates. Basic dashboards can be available in minutes