Most IT organizations like to think of themselves as continuously improving. The best ones constantly invest in building new skills, deploying new infrastructure, acquiring new tools, creating new processes, or even tuning what they already have in order to wring more efficiency and productivity out of their environments. Many are migrating to the cloud, as cloud service providers (or CSPs) cheerily advertise three, four or even five 9s of availability (e.g., 99.999% uptime). And many IT practitioners take victory laps when they remove troublesome legacy hardware, applications or tools from their data centers in order to simplify IT operations and head off future outages.
So I was a bit shocked to read that, in a recent survey, the portion of respondents who reported an outage to the Uptime Institute (a division of 451 Research) actually INCREASED by 24% in the past year. You can see their report here. Kind of throws the value of our entire practice into question, doesn’t it?
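To put those advertised 9s in perspective, here is a quick back-of-the-envelope sketch of how much downtime each availability level actually permits in a year. The function name and the 365-day year are my own simplifying assumptions, not anything taken from a provider's SLA, and real SLAs often measure availability per month and carve out exclusions.

```python
# Rough yearly downtime budget for a given availability target.
# Assumption: a 365-day year (leap years ignored) and availability
# measured over the whole year, which real SLAs may not do.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, that a target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Three, four and five 9s, as mentioned above.
for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Five 9s works out to a little over five minutes a year, which is why a single multi-hour outage like the ones described below blows through that budget many times over.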
Basic Infrastructure Issues Cause Outages
Things get even more worrisome when you dig into the causes of these outages. IT organizations have increasingly deployed infrastructure under the assumption that failure is inevitable, so application infrastructure should be designed with failure tolerance as a first-order requirement. This is why modern data centers use virtualization, containerization and cloud orchestration technologies to abstract hardware from applications, so that applications can be seamlessly relocated and restarted when performance issues occur.
Unfortunately, these techniques get you exactly nowhere when you have a power failure or a network outage, which was the case in 61% of outages reported publicly since 2016. Loss of power, in particular, was the single biggest cause of failures reported to the Uptime Institute for their 2018 report, affecting a third of the organizations surveyed.
“This doesn’t matter to me,” some of you might be thinking. “I was smart and moved all my critical workloads to the cloud, so I don’t have to worry about this stuff.” Think again. This same report showed that 31% of outages reported were caused by a third party: a colo, hosting provider or public cloud platform.
Many of us were affected by major outages experienced by Southwest, Delta and British Airways within a single 12-month span beginning in the summer of 2016 (caused by a router failure, a power outage and an electrical surge, respectively). Need more? Here are some recent examples — all from this year.
- Bad weather caused a power outage at an Equinix data center in Virginia — knocking out service for some AWS customers, including Alexa, Slack and Atlassian. (Lest you think the Equinix team didn’t train for storm-related power interruptions, feel free to read their blog from 2015 fatefully titled, “Disasters or Duds, Equinix is Always Ready When Storms Hit”.)
- Cloud services provider iomart suffered a major network outage that cut off access for scores of U.K. businesses for over 12 hours. The outage was caused by a farmer who cut through a fiber-optic cable while digging a trench.
- A power disconnection at National Australia Bank knocked out a mainframe, affecting ATMs, payment processing and online banking for five hours.
- Storage services in Azure’s northern Europe data center were down for 11 hours due to elevated outside temperatures and a rise in humidity. That elevated temperature? 64°F (18°C).
Turn the Ship Around
These numbers are clearly moving in the wrong direction. How do we reverse this trend?
There are no easy answers, partly because so few organizations have done meaningful studies to understand the systemic issues that caused their outages in the first place, along with how much those outages actually cost them (only 43% of those surveyed by Uptime Institute completed such an analysis). But here are two thoughts to keep in mind as you try to steer your IT organization toward a more idyllic life without outages.
- Watch out for technology underinvestment.
Digital transformation isn’t just a buzzword anymore; it’s a new way of life for IT organizations in every geography and industry. But when the technologies you use for your infrastructure are rapidly changing, it can be easy to find yourself underinvesting in the legacy technology assets you still need to keep running smoothly. It’s critically important to maintain vigilance around the health and currency of your existing on-premises infrastructure, even if you’re moving to the cloud.
- Maintain complete visibility, even on IT assets you don’t own.
At the risk of stating the obvious, just because you’re migrating applications to the cloud doesn’t mean you can abdicate your responsibility to see and track the health of all of your infrastructure, whether it’s hosted by a cloud provider or “left behind” in your own data center. Make sure that any technology transition plan includes elements to maintain visibility into your entire hybrid IT infrastructure environment, so you’ll get early warnings when an infrastructure issue is about to impact a critical service your users depend on.
At Zenoss, we’ve focused over a decade of investments on making sure our clients maintain complete visibility over their legacy infrastructure, their resources in the cloud, and everything in between. This gives them an early warning system that predicts issues and, with the right integrations, enables IT operations teams to rapidly resolve problems before any angry texts or tweets are sent. Zenoss Cloud, released earlier this year, takes these capabilities to the next level with a SaaS platform that scales to any environment you care to throw at it, enabling your team to focus on the service delivery projects that matter. For more, or to see a demo, please reach out!