Free Monitoring Solutions: Hostage Situation
In prior blogs in this series, we’ve talked about the complexity of engineering a platform for enterprise IT monitoring out of free components. You need to select an engine, source a complete kit of parts and integrate them (including back-end databases, alternative and supplementary webUIs, reporting tools, APIs, plugins, configuration and automation tools, integrations to other enterprise IT solutions), configure and test service checks, etc. You may need to equip your system to observe more than one datacenter location or provide visibility into both on-premises and cloud IT estates (or indeed, multiples of both). You’ll need to take steps to ensure high availability, data integrity, and failure recovery.
Most important, you’ll need to learn how to use your system efficiently: to scope out problems, triage and solve them quickly, and keep business services online and compliant with SLOs/SLAs -- doing all this with (the usual) limited headcount, limited budget, and without giving yourself and your team nervous breakdowns. All without support.
This puts huge pressure on IT operators, and inevitably, this pressure gets passed on to the organizations for which they work.
Pressure-points include:
Need to maintain non-strategic technical expertise in-house. Up to a point, it can be argued that monitoring (like automation) is a critical skill for IT operations and DevOps. Knowing how to monitor things makes teams better at developing new applications, delivering them on time, keeping them running, and generally being agile.
But there’s a limit to everything. In software development, the goal of most businesses is to consume, rather than create, fundamental technology. You don’t want to build a complete, general-purpose HTTP server in Node.js, because you have NGINX for front-end processing and the Express framework to simplify web data-processing tasks. Likewise with monitoring: you certainly can (and many do) take on the challenge of assembling a solution from the community’s kit of parts and running it day to day, but doing so requires developing and maintaining deep knowledge -- both about individual components and about exactly how your team has wired them together. Deep knowledge, and significant effort.
Most of this knowledge and effort isn’t relevant to advancing your organization’s business goals. That’s a problem, both for the business and, potentially, for you.
Technical effort expended on non-strategic tasks. The opportunity cost of spending DevOps/IT time non-strategically is high, and rising rapidly. As Google articulates in its SRE books, the goal of SREs is not to spend time maintaining and fixing things, and that includes maintaining critical-path business applications. On the contrary, the goal of SRE is to create apps that use automation to deploy, scale, monitor, fix, and manage themselves. While most organizations are currently far from achieving this goal, those that aren’t working towards it risk falling far behind the competitive curve.
Bottlenecks are bad. It can be (briefly) nice to be the local authority on critical tech and systems. But the “guru’s goal” should always be to share knowledge quickly and widely, eliminating themselves as a bottleneck and potential single point of failure. If you’re the only person who understands the monitoring system (because you built it), that makes you the blocker in getting any new assets monitored, major releases deployed, critical updates performed, and new capabilities provided to end users. When things break, chances are good you’ll be called upon to help fix anything non-trivial. This can slow your company down, burn you out, and ultimately take a toll on your career as well.
Staff turnover can become a crisis. When home-grown monitoring experts move on, organizations can face a real crisis in trying to replace their highly idiosyncratic institutional knowledge. On the surface, it might seem that the popularity of certain free monitoring solutions would make replacing experts fairly easy. But while it may be possible to find and hire people with expertise in your free monitoring engine, each component and subsystem integrated around it expands the skill set required, as do all the (possibly non-standard) tweaks and adaptations made to enable the solution to monitor specific infrastructure and applications. It’s no coincidence that losing a monitoring expert is often the trigger that forces an organization to replace its entire monitoring solution.