Earlier this week, I found myself at my nearby Starbucks, enjoying a Tall Caramel Macchiato and, perhaps even more, the free WiFi brought to me by AT&T. Now, I will have you know this is not typical for me. Usually, I am huddled up in my home office, oblivious to my surroundings. Outside of my trips to the restroom or water breaks, my folks don’t see much of me. But that day, thanks to a major outage at our large national ISP, I was out there, literally smelling the coffee, fraternizing with other home office employees who were there for the same reason.
To the ISP’s credit, within 2 minutes of the outage they were in crisis response mode - they had stopped taking support calls and already had a recording in place confirming that it was indeed an outage and that they expected to have it fixed within 4 hours. Unfortunately, as it turned out, 4 hours was a well-intentioned but optimistic guess. The outage lasted more like 6!
As I sipped my Macchiato, which by the way was as good as ever, my trained IT mind started to drift to issues beyond the work I was there to do. No, I was not looking to solve world hunger, but just thinking about what could have gone wrong at my provider. To be sure, I do not have the foggiest idea of what caused this failure, whether it was a hardware issue, a software issue, or both. But my mind did not let that be a deterrent.
So what could have caused their service to go down, I wondered. Certainly they would have monitoring tools in place that would have helped them react quickly to this issue, if not avoid it completely. Undoubtedly the Service Owner would have SLAs in place with her providers and OLAs within her own company to ensure that the underpinning infrastructure is committed to at least 99.9% availability. The agreements would have gone to great lengths to define accountability and repercussions in an uncontroverted manner. And yet, here I was, sitting with no connectivity at my home office.
In a zone now, I continued my analysis - what does 99.9% component-level availability really mean? Well, it means that every year a component (server, storage array, etc.) can be malfunctioning for up to 8.76 hours (0.1% * 365 * 24) and still be within its performance warranty. Now, most modern services are supported by multiple components. For our discussion, let’s take a service that depends on 6 components: if each one uses up its full allowance and the outages do not overlap, the allowable downtime rockets up to 52.56 hours (0.1% * 365 * 24 * 6). Our true service uptime just slid down to 99.4%. That drop is not trivial, yet I was reminded of a recent survey from Forrester Research indicating that most companies were struggling to get even this far! Of the 160+ companies surveyed, only 38% said that they were achieving 99% or more uptime. Almost 40% admitted to a number of 97% or less; in other words, they were experiencing annual downtime of over 260 hours. And this is for their critical business applications! Given that each hour of downtime costs an average of $100,000, one can only imagine how this could impact the financial statement of an organization.
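For anyone who wants to check the back-of-the-napkin math, here is a minimal Python sketch. The function names are my own, and it rests on two simplifying assumptions: the service needs all of its components up at once, and either the outages never overlap (the worst case) or the components fail independently (the usual textbook model). Both land at roughly 99.4% for six components at 99.9% each.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float, components: int = 1) -> float:
    """Worst-case yearly downtime, assuming every component uses its
    full downtime allowance and the outages never overlap."""
    return (1.0 - availability) * HOURS_PER_YEAR * components


def serial_availability(availability: float, components: int) -> float:
    """Availability of a service that needs all components up at once,
    assuming the components fail independently of each other."""
    return availability ** components


single = allowed_downtime_hours(0.999)               # one component
chain = allowed_downtime_hours(0.999, components=6)  # six components
service_uptime = 1.0 - chain / HOURS_PER_YEAR        # worst-case uptime

print(f"One component:  {single:.2f} hours/year")
print(f"Six components: {chain:.2f} hours/year")
print(f"Worst-case service uptime:  {service_uptime:.2%}")
print(f"Independent-failure uptime: {serial_availability(0.999, 6):.2%}")
print(f"Downtime at 97% uptime:     {allowed_downtime_hours(0.97):.1f} hours/year")
```

The same helper covers the survey figures: plug in 0.97 and you get the annual downtime behind that "97% or less" group.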
Knowing all this, I could have felt bad for my provider, but I didn’t, for right then I was just a very unhappy customer. All I cared about was the fact that my provider was NOT providing me my mission-critical service, and that was significantly hampering my ability to do my work. With each passing minute, the MTTR for my provider was creeping up, as was the likelihood of my shopping for an alternate provider as soon as the service was restored.