Most organizations track a data center’s “Mean Time to Resolve” (MTTR), which is the measure of the average time it takes to detect and resolve a problem. IT Operations, for their part, can only control a portion of the overall MTTR interval, but their ability to efficiently and precisely identify the root cause is by far the most important driver of delivering a shortened MTTR.
Figure 1 decomposes MTTR into its four major time segments:
- MTTI (Mean Time to Identify) is the time it takes to detect a problem, whether that be from user complaints or monitoring tools.
- MTTK (Mean Time to Know) is the time it takes to prioritize problems and identify their root cause amongst various secondary symptoms.
- MTTF (Mean Time to Fix) is the time it takes to resolve and deploy a solution.
- MTTV (Mean Time to Verify) is the time it takes to confirm that the solution has indeed resolved the problem.
Figure 1: Breakdown of MTTR into component intervals
Imprecise / incorrect root cause analysis significantly extends MTTK intervals and overhead.
According to Forrester, MTTI and MTTV are often accomplished in a few minutes. Once a problem’s cause is known, MTTF is also relatively quick, taking from a few minutes to a few hours. By contrast, MTTK consumes the vast majority of the MTTR interval and can often take several hours or days. The obvious conclusion is that focusing on MTTK is the key to making meaningful improvements in MTTR.
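To make the arithmetic concrete, here is a minimal Python sketch that treats MTTR as the sum of its four segments and compares the effect of halving MTTK with halving MTTF. The `IncidentTimes` class and every duration in it are assumptions made up for illustration. They are chosen only to fall inside the ranges described above and are not measured data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IncidentTimes:
    """Average duration of each MTTR segment, in minutes.

    The figures used below are made-up illustrations, not measured data.
    """
    mtti: float  # time to identify
    mttk: float  # time to know (triage, isolation, diagnosis)
    mttf: float  # time to fix
    mttv: float  # time to verify

    def mttr(self) -> float:
        """MTTR is the sum of its four segments."""
        return self.mtti + self.mttk + self.mttf + self.mttv

    def share(self, segment: str) -> float:
        """Fraction of MTTR spent in the named segment, e.g. 'mttk'."""
        return getattr(self, segment) / self.mttr()


# Hypothetical incident profile: minutes for MTTI and MTTV, one hour for MTTF,
# and eight hours for MTTK.
baseline = IncidentTimes(mtti=5, mttk=480, mttf=60, mttv=5)

# Compare halving MTTK against halving MTTF.
faster_diagnosis = replace(baseline, mttk=baseline.mttk / 2)
faster_fix = replace(baseline, mttf=baseline.mttf / 2)

print(f"Baseline MTTR: {baseline.mttr():.0f} min "
      f"(MTTK share: {baseline.share('mttk'):.0%})")
print(f"Halve MTTK:    {faster_diagnosis.mttr():.0f} min")
print(f"Halve MTTF:    {faster_fix.mttr():.0f} min")
```

With these assumed numbers, halving MTTK removes 240 minutes from the incident, while halving MTTF removes only 30.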
Let's break down MTTK further into its key milestones of problem triage, isolation, and diagnosis. Triage is a critical aspect of MTTK because it ensures that IT Ops works first on the problems with the highest business impact. IT should never be caught repainting a sinking lifeboat instead of plugging the holes that are letting in water.
Often multiple errors and events are detected all at once, making it hard to separate secondary symptoms from the original cause. During isolation, the goal is to identify the infrastructure team that owns the source of the problem for further diagnosis. Correct owner identification eliminates substantial delays due to redundant investigations, convening “war rooms,” or passing ownership around like a hot potato across multiple teams.
How can IT Ops improve MTTK efficiency?
An example problem and two different analysis scenarios can help illustrate the value of improving MTTK efficiency.
IT Operations receives a large number of seemingly unrelated events. Among those notifications are two events for different servers, Server-1 and Server-2. Both events report problems sending information to another server. Unknown to IT Ops, Server-1 is a print server managing a departmental printer. Server-2 is the company’s E-commerce application server, which is failing to finalize sales transactions in the purchase orders database.
Inefficient, but common problem resolution workflow:
Both events affect servers and look related. With little more information beyond the event notifications, the Ops team starts investigating the first reported problem with Server-1.
[footnote]“Enhance Service Visibility to Reduce MTTR” by Glenn O’Donnell with Robert Whiteley, Doug Washburn, John Rakowski, and Alex Crumb. Forrester (June 30, 2011).[/footnote]
- IT Ops assigns the problem to the network team. An hour later, the network team determines it is not a network problem and sends it to the application team.
- The application team’s analysis proves the application code and server are both fine. Since the application manages printers, they bounce the problem to the printer team.
- Finally, the printer team identifies that a printer is paused because it is out of ink and can’t accept any more documents for printing.
In this scenario, MTTK was significantly extended because ownership was passed among three different teams, each running a separate investigation. Additionally, lacking the business context for the two servers, the Ops team began working on the less important problem first instead of the E-commerce service, which is now costing the business revenue and potential customers.
Zenoss Service Impact Root Cause Analysis Difference:
Let’s contrast this with the solution used by Zenoss Service Dynamics customers, who can attack the same MTTK situation and handle it much more effectively.
The same set of events, including the two server events, is generated as before. Because Zenoss Service Impact identifies that those server problems are affecting two different business services, it also generates a set of service impact events for each.
Figure 2: Service Impact Event Example
With the service-level events identifying the business context of each device, IT Ops can appropriately prioritize working on the E-commerce site first. What’s more, when they look at the E-commerce service event’s added root cause analysis (RCA) details, they can quickly determine that the most probable cause was a related database crash.
Similarly, for the printer service event RCA, Zenoss identifies the out-of-ink printer status event as the most likely cause for the print service outage.
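The idea behind service-level events can be sketched in a few lines of Python. This is not Zenoss Service Impact's algorithm or API. It is a simplified illustration that assumes a hand-written dependency map, made-up priorities, and invented node names such as `purchase-orders-db` and `dept-printer`. Its heuristic flags as the probable cause any failing node that has no other failing node beneath it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical dependency model: each node maps to the nodes it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "E-commerce": ["Server-2"],
    "Departmental printing": ["Server-1"],
    "Server-2": ["purchase-orders-db"],
    "Server-1": ["dept-printer"],
    "purchase-orders-db": [],
    "dept-printer": [],
}

# Hypothetical business priority per service (lower number = more urgent).
SERVICE_PRIORITY: Dict[str, int] = {"E-commerce": 1, "Departmental printing": 3}


@dataclass(frozen=True)
class ServiceImpact:
    service: str
    priority: int
    probable_causes: Tuple[str, ...]


def failing_under(node: str, failing: Dict[str, str]) -> List[str]:
    """Return every failing node reachable from `node`, including `node` itself."""
    found: List[str] = []
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in failing:
            found.append(current)
        stack.extend(DEPENDS_ON.get(current, []))
    return found


def probable_causes(service: str, failing: Dict[str, str]) -> Tuple[str, ...]:
    """Failing nodes under `service` that have no other failing node beneath them."""
    return tuple(sorted(
        node for node in failing_under(service, failing)
        if failing_under(node, failing) == [node]
    ))


def service_impacts(failing: Dict[str, str]) -> List[ServiceImpact]:
    """Build one impact record per affected service, most urgent first."""
    impacts = []
    for service, priority in SERVICE_PRIORITY.items():
        causes = probable_causes(service, failing)
        if causes:
            impacts.append(ServiceImpact(service, priority, causes))
    return sorted(impacts, key=lambda impact: impact.priority)


# Device-level events from the example (message text is illustrative).
events = {
    "Server-1": "cannot send data to remote host",
    "Server-2": "cannot send data to remote host",
    "purchase-orders-db": "database crashed",
    "dept-printer": "paused: out of ink",
}

impacts = service_impacts(events)
for impact in impacts:
    print(impact.priority, impact.service, "<-", ", ".join(impact.probable_causes))
```

Under these assumptions the E-commerce service is listed first with the database as its probable cause, followed by the print service with the printer as its probable cause. The two server events are treated as secondary symptoms.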
In both cases, using Zenoss Service Impact dramatically reduced MTTK, along with the overall MTTR of the incidents.
Service events generated by Zenoss Service Impact provide root cause information that enables operations personnel, with no prior knowledge of the application infrastructure, to accurately target the appropriate departmental resources to take ownership of an issue. In the example, IT Ops avoided the costly delays and overhead associated with bouncing the problem across the network, application, and printer teams.
In addition, the team assigned to further diagnose and fix the problem was provided with valuable, time-saving information including: business service context, priority, root cause analysis, and all related device event details. This rich information empowers support teams to accelerate diagnosis and further reduce MTTK. Ultimately, it results in faster resolution times and reduced effort.