By: Chris Franco >> In the past, organizations built IT monitoring solutions based on knowing if a device was working. Is the router up or down? Can traffic flow through it? What does the current capacity look like?
That information was then used to triage a problem as it occurred. To make matters worse, these problems were often reported by customers after the fact. We can do better than this, but how do we take data and turn it into a solution?
To get to this new stage of IT monitoring, we need to be able to preempt problems, learn to combine relevant data to solve bigger problems, and use that data to make business decisions. The goal is to move IT monitoring from a reactive, response-based system to a proactive system that can help align with or lead business decisions.
How can we intuitively predict a problem with only the available data from currently working systems? First, monitor and alarm on predicted future capacity metrics. With this step in place, we can predict when resources such as storage or networking will run over capacity before customers are affected.
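As a rough illustration of alarming on predicted capacity, the sketch below fits a straight line to recent usage samples and estimates how many days remain before a resource fills up. It is a minimal sketch that assumes growth is roughly linear; the function names, the sample data, and the 30-day horizon are hypothetical and are not taken from any particular monitoring product.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days: Sequence[float], used_gb: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Days from the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking (no predicted exhaustion).
    Assumption: linear growth, which real workloads may not follow.
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - days[-1])


def should_alarm(days: Sequence[float], used_gb: Sequence[float],
                 capacity_gb: float, horizon_days: float = 30.0) -> bool:
    """Alarm when predicted exhaustion falls inside the planning horizon."""
    remaining = days_until_full(days, used_gb, capacity_gb)
    return remaining is not None and remaining <= horizon_days


if __name__ == "__main__":
    # Hypothetical volume growing 10 GB per day toward a 200 GB limit.
    sample_days = [0, 1, 2, 3, 4]
    sample_used = [100, 110, 120, 130, 140]
    print(days_until_full(sample_days, sample_used, 200))
    print(should_alarm(sample_days, sample_used, 200))
```

The horizon is the main tuning choice here: it should be at least as long as the time it takes to order, provision, or rebalance the resource in question.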
Next, alarm on activity outside of normal ranges. Personally, I use 1 to 2 standard deviations from the normal range. Using this data, we can investigate the reason for the abnormal reading to determine if it is indicative of a future problem. These are two things that have been a part of the Zenoss platform for a long time and should be part of any modern monitoring solution.
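The standard-deviation check can be sketched in a few lines. This is an illustrative example only, not how Zenoss implements it: the function name, the sample CPU readings, and the default of 2 standard deviations are assumptions chosen to match the range described above.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float,
                 num_stdevs: float = 2.0) -> bool:
    """Return True when value is more than num_stdevs sample standard
    deviations away from the mean of the historical readings.

    Assumption: history is a representative window of "normal" readings
    and contains at least two samples.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > num_stdevs * spread


if __name__ == "__main__":
    # Hypothetical CPU utilization percentages from a normal period.
    cpu_history = [50, 52, 48, 51, 49]
    print(is_anomalous(cpu_history, 55))  # well outside the usual band
    print(is_anomalous(cpu_history, 51))  # inside the usual band
```

A narrower band (1 standard deviation) catches more deviations at the cost of more alarms to investigate, so the multiplier is worth tuning per metric.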
Finally, we have to understand how to take individual data points and combine them into pieces of a larger puzzle. At Zenoss, this has been core to our product for a long time, and it is what shifts the value of the product from monitoring alone to decreasing incident resolution times. A single device that is down or over capacity could potentially harm many other devices and business systems. Understanding that relationship can collapse a flood of alarms into a single alarm for the direct cause of the incident.
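One simple way to picture this alarm reduction is with a dependency map: if a device is alarming and something it depends on is also alarming, the upstream device is the more likely cause. The sketch below is a hypothetical illustration under that assumption, not a description of the Zenoss implementation. The device names and the `depends_on` structure are invented for the example, and the dependency map is assumed to contain no cycles.

```python
from typing import Dict, List, Set


def upstream_of(device: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Every device that `device` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(device, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def root_causes(alarming: Set[str],
                depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only alarming devices with no alarming upstream dependency."""
    return {
        device for device in alarming
        if not (upstream_of(device, depends_on) & alarming)
    }


if __name__ == "__main__":
    # Hypothetical topology: two servers behind one switch behind one router.
    topology = {
        "web01": ["switch-a"],
        "db01": ["switch-a"],
        "switch-a": ["router-1"],
        "router-1": [],
    }
    active_alarms = {"web01", "db01", "switch-a"}
    print(root_causes(active_alarms, topology))
```

In this example three alarms reduce to one, pointing at the switch. The accuracy of the result depends entirely on how well the dependency map reflects the real environment.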
This combined view needs to incorporate application performance metrics, log data, monitoring and many other data points to create a holistic understanding of the environment. However, one must let the business model dictate which data is necessary to create that holistic view. For example, I have seen engineering firms combine data from individual HPC workloads with data from the individual machines during the time each job was running to determine whether workloads were being oversubscribed by the job scheduler. One such company was able to free up hundreds of hours a month by decreasing the minimum required specs to load a job. I have also seen MSPs use data collected per customer to anticipate each customer's needs and plan accordingly.
This leads nicely into the real goal of the new IT monitoring paradigm: making and aligning with business decisions. An MSP that can predict and prepare for customer capacity is more likely to succeed than one that cannot. That same MSP could offer customers the ability to see the monitoring of their devices for an additional fee. Software engineering firms that can predict the footprint of their new product based on the usage of the engineers still creating it can be better prepared for the needs of the future. Helpdesks and IT administrators can decrease downtime by predicting what could become a problem. The IT focus will move from responding to customer problems to working on things that affect the business. When there are problems, the time to resolution will be significantly shorter than before, allowing staff to get back to more important projects. Modern monitoring needs to make use of the vast amounts of data at its disposal to predict potential problems before they affect users, proactively determine the cause of current issues, and solve a larger business problem.