The Cloud is a Magical Place...To Fail
I’ve never liked the public cloud. My poor co-workers are frequently subjected to my rants about how public cloud reduces qualified and intelligent IT ops/engineers/analysts to powerless infrastructure end users whose only recourse in an outage is to sit on hold with their public cloud provider’s support hotline. “Your call is very important to us … Please continue to hold … You are 73rd in line … Your expected wait time is 98 minutes …”
Even though I’m right (I’m always right), I’m also wrong. It doesn’t have to be this way! You can design your new app to use the public cloud and still be resilient to a provider’s failures.
How To Design Around a Public Cloud Failure
Automation saves time here: if you’re going to do it once, make sure you can do it again many times. Once you’ve designed the infrastructure and underlying software stack, be smart about where you deploy it. That means choosing cloud providers that do not share any of the following with one another:
- Software Stack. If every provider runs the same software stack (such as OpenStack), a bug that your software triggers in one of them can take out all of them at once.
- Data Center. The power went out. And so did your redundancy.
- Connectivity. Both of those data centers are peered with Ashburn, Virginia? Bad idea.
- Geography. The tornado ate my application!
- Politics. Though not so much an issue in the U.S., if you have to contend with countries that like firewalling providers, you may want to leverage several providers in several countries.
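The checklist above can be turned into a quick sanity check before you commit to a set of providers. What follows is only a minimal sketch: the `Provider` fields and the `find_conflicts` helper are hypothetical names made up for illustration, and you would have to fill in the values yourself from each provider's documentation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Provider:
    """Hypothetical record of the failure domains a provider sits in."""
    name: str
    software_stack: str
    data_center: str
    peering_point: str
    region: str
    country: str


# One field per item in the checklist: stack, data center,
# connectivity, geography, politics.
SHARED_RISK_FIELDS = (
    "software_stack",
    "data_center",
    "peering_point",
    "region",
    "country",
)


def shared_risks(a: Provider, b: Provider) -> list[str]:
    """Return the failure domains that two providers have in common."""
    return [f for f in SHARED_RISK_FIELDS if getattr(a, f) == getattr(b, f)]


def find_conflicts(providers: list[Provider]) -> dict[tuple[str, str], list[str]]:
    """Map each pair of providers to the risks they share (pairs with none are omitted)."""
    conflicts = {}
    for a, b in combinations(providers, 2):
        overlap = shared_risks(a, b)
        if overlap:
            conflicts[(a.name, b.name)] = overlap
    return conflicts
```

An empty result from `find_conflicts` means no two providers on your list share a stack, data center, peering point, region, or country, at least as far as the data you entered is accurate.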
Now that you’ve selected two to three providers, configure DNS, load balancing, and failover across them so that traffic moves away from a failed provider and your app stays up.
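To make the failover part concrete, here is a rough sketch of the decision a failover mechanism has to make: walk the providers in priority order and send traffic to the first one that passes a health check. The function names and the idea of a plain HTTP health URL are assumptions for illustration; in practice your DNS or load balancing service would run a check like this for you.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_health_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, DNS failure, non-2xx status, bad URL:
        # all of these count as "unhealthy" for failover purposes.
        return False


def pick_active_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None
```

Passing the health check in as a parameter keeps the selection logic testable without touching the network, and lets you swap in a deeper check (a login, a database query) later.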
That's Too Easy — What's the Catch?
Aah, you are perceptive! The catch is that your end users can still lose their Internet connections, resulting in a perceived outage. When you’re deploying hosted CRM, ERP, or similar, you need to ensure your users can always access it. By providing two Internet connections, with BGP at large offices or load balancing with failover at small offices, you can ensure that your users don’t call you to tell you your app is down.
I usually don’t write about Zenoss Service Dynamics, but in this case I’m going to, because there’s a piece you’re going to need. Adding redundancy with the public cloud and multihoming your offices creates more points of failure (though none of them is a single point of failure any more). The best way to ensure that multiple points of failure don’t become an outage is Zenoss Service Impact. Service Impact doesn’t just tell you whether a service is up or down; it can also handle degraded applications gracefully, ensuring that you get the events you need to keep everything up and in the green.
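If the difference between "down" and "degraded" seems abstract, the following generic sketch may help. It is not Zenoss code and makes no claim about how Service Impact works internally; the `State` enum and both functions are hypothetical, and the roll-up rule (a redundancy group is degraded when only some of its members are up) is an assumption chosen to keep the example small.

```python
from enum import Enum
from typing import Mapping, Sequence


class State(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def group_state(members_up: Sequence[bool]) -> State:
    """Roll up one redundancy group, e.g. the cloud providers or an office's uplinks."""
    if not any(members_up):
        return State.DOWN      # no member left (or the group is empty): real outage
    if all(members_up):
        return State.UP
    return State.DEGRADED      # still serving, but redundancy has been lost


def service_state(groups: Mapping[str, Sequence[bool]]) -> State:
    """Roll up the whole service: the worst group state wins."""
    states = [group_state(members) for members in groups.values()]
    if State.DOWN in states:
        return State.DOWN
    if State.DEGRADED in states:
        return State.DEGRADED
    return State.UP
```

With two providers and two office uplinks, losing one provider gives `State.DEGRADED` rather than `State.DOWN`: users are unaffected, but that is exactly the event you want to hear about before the second provider fails.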