How to Build a Bullet-Proof* On-Prem Monitoring Server

Setting aside the debate of Cloud-based vs on-premises for your IT monitoring system, we’ll discuss the steps you can take to build a resilient monitoring server in your data center.

There are many considerations when installing a monitoring server locally. The steps below aim to improve the chances that your monitoring server stays available during a service outage at your organization.

Name Resolution
Add local hosts file entries for your database server and other monitoring servers, covering both forward and reverse name resolution. You can also add entries for key infrastructure such as core routers and critical app servers. Beware, though: if any of those addresses change, you will need to update your hosts file. Another option is to use IP addresses rather than hostnames to avoid relying on your DNS infrastructure, although this has the disadvantage of being harder to maintain and makes it easier to end up monitoring the ‘wrong’ device.
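Because hosts file entries go stale when addresses change, it can help to check them against a list you maintain. The following is a minimal sketch, not a definitive tool: the function names, the sample hostnames, and the idea of an "expected" inventory are all assumptions made for illustration.

```python
import ipaddress


def parse_hosts(text):
    """Parse hosts-file text into a {hostname: ip} mapping (first entry wins)."""
    forward = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        try:
            ip = str(ipaddress.ip_address(fields[0]))
        except ValueError:
            continue  # skip lines that do not start with a valid address
        for name in fields[1:]:
            forward.setdefault(name.lower(), ip)
    return forward


def missing_or_stale(forward, expected):
    """Return {hostname: (expected_ip, found_ip_or_None)} for entries that differ."""
    problems = {}
    for name, ip in expected.items():
        found = forward.get(name.lower())
        if found != str(ipaddress.ip_address(ip)):
            problems[name] = (ip, found)
    return problems


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own critical hosts.
    expected = {"db01.example.internal": "10.0.0.5", "core-rtr1": "10.0.0.1"}
    with open("/etc/hosts") as f:  # path assumes a Unix-like system
        print(missing_or_stale(parse_hosts(f.read()), expected))
```

Running a check like this on a schedule gives early warning that the hosts file has drifted from the addresses you expect.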
Shared Infrastructure
For a resilient monitoring server, local storage is much preferred. Avoid SAN or other shared storage, so that a storage outage does not take the monitoring server down with it; use a local RAID array, HDD, SSD, hybrid drive, or anything else locally installed.
Platform
If possible, use a physical server; it can be a lifesaver when mysterious and unexplained outages occur with the virtualization platform. This request may seem at odds with the goals of your IT organization, but the cost savings from virtualizing your monitoring server can be far outweighed by the impact of a single undetected outage. If you are required to use a virtual server, you can still add some protection against failure. Options include OS-level HA across multiple physical hosts, application-level redundancy such as DR sync, and database copies.
Database
Using a shared database can lead to unintended consequences: patching, updates, and upgrades carried out in support of other applications on the same database can disrupt monitoring. Much like virtualization, other issues can arise from noisy neighbors that use more than their fair share of resources. A database dedicated to the monitoring application is preferred.
Local Collector
When the network fails, a local collector installed in your monitored environment can continue to process events from agents or agentless sources, save them to a buffer, and forward them when connectivity is restored. Some local collectors can also perform actions, send notifications, and run fix-it scripts while disconnected from the mothership.
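The buffering behavior described above is often called store-and-forward. Here is a minimal sketch of the idea, assuming a hypothetical `send` callable that raises `ConnectionError` when the central server is unreachable; the class name and the buffer limit are illustrative, and a real collector would normally persist its buffer to disk.

```python
from collections import deque


class BufferedCollector:
    """Hold events in memory while the uplink is down, then forward in order."""

    def __init__(self, send, max_events=10000):
        self._send = send
        # Bounded buffer: once full, the oldest events are dropped.
        self._buffer = deque(maxlen=max_events)

    def submit(self, event):
        """Queue an event and try to forward everything that is pending."""
        self._buffer.append(event)
        self.flush()

    def flush(self):
        """Forward buffered events oldest first; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False  # still offline; keep events for the next attempt
            self._buffer.popleft()  # remove only after a successful send
        return True

    def pending(self):
        return len(self._buffer)
```

An event is removed from the buffer only after it has been sent, so a failure in the middle of a flush does not lose data. The bounded buffer is a deliberate trade-off: during a long outage it sacrifices the oldest events rather than exhausting memory.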
Backups
In addition to centralized backup software solutions, make sure you have copies of any configurations, including users, dashboards, searches, etc. You can save these on a shared drive, but a properly paranoid monitoring administrator will also save them to other infosec-approved destinations.
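Saving the same export to several destinations is easy to script. The sketch below assumes you have already exported your configuration to a single file; the function names and destination paths are hypothetical, and how you produce the export depends on your monitoring product.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_to_destinations(source, destinations):
    """Copy source into each destination directory and verify each copy.

    Returns {destination: True/False}; a failure at one destination
    does not stop the copies to the others.
    """
    source = Path(source)
    expected = sha256_of(source)
    results = {}
    for dest in destinations:
        dest_dir = Path(dest)
        try:
            dest_dir.mkdir(parents=True, exist_ok=True)
            target = dest_dir / source.name
            shutil.copy2(source, target)
            results[str(dest_dir)] = sha256_of(target) == expected
        except OSError:
            results[str(dest_dir)] = False  # e.g. share unavailable
    return results
```

For example, `copy_to_destinations("monitoring-export.json", ["/mnt/shared/backups", "/mnt/usb/backups"])` would report per destination whether a verified copy exists, which is useful when one of the targets is itself affected by the outage.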
Local User Accounts
LDAP and AD can fail and prevent you and your users from accessing the monitoring consoles. A good compromise is to have a backup set of locally authenticated accounts you and your users can revert to when shared authentication services fail.
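The fallback logic can be sketched as follows. This is an illustration under stated assumptions, not how any particular monitoring product implements it: `directory_auth` is a hypothetical callable that wraps your LDAP or AD check and raises `ConnectionError` when the service is unreachable, and `local_accounts` is a hypothetical mapping of usernames to a salt and a password hash.

```python
import hashlib
import hmac


def hash_password(password, salt, iterations=200_000):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)


def authenticate(username, password, directory_auth, local_accounts):
    """Try the directory first; use local accounts only if it is unreachable."""
    try:
        # A reachable directory is authoritative, even when it says no.
        return directory_auth(username, password)
    except ConnectionError:
        record = local_accounts.get(username)
        if record is None:
            return False
        salt, stored = record
        # Constant-time comparison avoids leaking how much of the hash matched.
        return hmac.compare_digest(hash_password(password, salt), stored)
```

Note that the local accounts are consulted only when the directory cannot be reached. If the directory answers and rejects the credentials, the login fails, so the backup accounts do not become a way around central access control.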