Architecting your environment for high availability means designing an architecture where single points of failure are identified and mitigated. In most IT infrastructures, services are designed with multiple redundant components, so several failures must occur before you arrive at a single point of failure: the point where, if the last remaining component in the redundant configuration fails, the entire service stops working.
Should I Panic or Not?
Of course, in large, highly redundant IT infrastructures, all of the redundant components can generate a lot of events, alerts, and noise. This makes it hard for you to know how concerned you should be over any one alert. How do you know when to worry?
For example, when you see an alert telling you that one of the servers in your web server farm failed, what exactly does this mean? Is the failure related to only one of your ten web servers, and the rest are still up and running? If your alert is intelligent enough to provide this context, you don’t have to hit the panic button. You know that since just one out of ten redundant components has failed, you have time to resolve the issue, and your service is not at risk.
However, what if the alert about your web servers means that your last redundant web server has failed, so that only one web server is left standing and you are one failure away from a critical, very visible service outage? Are your alerts intelligent enough to provide this much context, so that when an alert comes through you know whether it’s “business as usual” or “time to hit the panic button”?
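To make "an alert with context" a little more concrete, here is a minimal Python sketch. It is not Zenoss code. The function name `describe_pool_alert` and the wording of the labels are assumptions invented for this illustration. The idea is simply that the alert states how many redundant members have failed and how many remain, so the reader can tell routine noise from real risk.

```python
def describe_pool_alert(pool_name: str, total: int, healthy: int) -> str:
    """Build an alert line that states how much redundancy is left.

    Hypothetical helper for illustration only; not part of any product API.
    """
    if not 0 <= healthy <= total:
        raise ValueError("healthy must be between 0 and total")

    failed = total - healthy
    if healthy == 0:
        level = "OUTAGE"
    elif healthy == 1 and total > 1:
        # Only one member left: the next failure takes the service down.
        level = "PANIC: single point of failure"
    elif failed > 0:
        level = "business as usual: reduced redundancy"
    else:
        level = "OK"
    return f"{pool_name}: {failed} of {total} failed, {healthy} remaining [{level}]"


if __name__ == "__main__":
    print(describe_pool_alert("web farm", total=10, healthy=9))
    print(describe_pool_alert("web farm", total=10, healthy=1))
```

With ten web servers, losing one produces the "business as usual" label, while being down to the last server produces the "PANIC" label.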
Hey, Wait – I Think I’m Already Protected
Of course, sometimes the technologies already in your IT infrastructure have built-in capabilities that make it easy to monitor for single points of failure. For example, two of the most popular virtualization solutions, VMware and XenServer, have built-in configuration policies that can help you identify when you’re in a reduced redundancy state. Cisco switches also have built-in capabilities that let you monitor for single points of failure and alert you when you no longer have enough redundancy and are at risk of a service outage.
If you are using Zenoss to monitor VMware or XenServer hosts, Zenoss leverages the host-failure detection that VMware and XenServer already provide, helping you detect and prevent single points of failure in virtualized environments out of the box. If you are using Zenoss to monitor Cisco networking equipment, Zenoss likewise leverages the single point of failure capabilities built in by Cisco. For example, Zenoss knows that a port channel on Cisco networking equipment is really down only when you have lost all of its members. If you lose one member of the port channel, it is degraded. Zenoss applies this knowledge automatically when you monitor Cisco networking equipment.
But Am I Really Protected in All Cases?
However, while it’s great to monitor and alert on single points of failure using the built-in technologies provided by Cisco, VMware, and XenServer, this is only one piece of the pie. Single points of failure don’t limit themselves to virtualized clusters and ports on Cisco networking devices; they are something you want to avoid pretty much anywhere in your infrastructure.
With Zenoss Service Impact, you can configure services in Zenoss for your key infrastructure components to generate alerts if x% of the components that the service depends on fail.
For example, assume you have four different uplinks to the Internet from your data center through various Tier 3 providers. With Zenoss, you could take the four edge routers and put them in a service. You can then configure the service using Service Impact policies to say that your Internet is degraded if one of the edge routers goes down, that it is degraded and has also reached a single point of failure when three of the four edge routers are down, and that it is completely down if all four edge routers are down.
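The edge router example can be written down as a small threshold table. The Python sketch below is only an illustration of the "x% of components" idea and is not how Service Impact policies are defined in Zenoss. The names `ServiceState`, `EDGE_ROUTER_POLICY`, and `evaluate_policy` are hypothetical, and the 25/75/100 percentages are my translation of "one, three, or four of four routers down."

```python
from enum import Enum
from typing import List, Tuple


class ServiceState(Enum):
    UP = "up"
    DEGRADED = "degraded"
    AT_RISK = "degraded, single point of failure"
    DOWN = "down"


# Hypothetical policy: (minimum percent of members down, resulting state),
# ordered from most severe to least severe.
EDGE_ROUTER_POLICY: List[Tuple[int, ServiceState]] = [
    (100, ServiceState.DOWN),     # all four routers down
    (75, ServiceState.AT_RISK),   # three of four down: one router left
    (25, ServiceState.DEGRADED),  # at least one of four down
]


def evaluate_policy(
    members_down: int,
    members_total: int,
    policy: List[Tuple[int, ServiceState]] = EDGE_ROUTER_POLICY,
) -> ServiceState:
    """Return the first (most severe) state whose threshold is met."""
    if members_total <= 0 or not 0 <= members_down <= members_total:
        raise ValueError("invalid member counts")
    for percent, state in policy:
        # Integer comparison avoids floating-point rounding on the percentage.
        if members_down * 100 >= percent * members_total:
            return state
    return ServiceState.UP


if __name__ == "__main__":
    for down in range(5):
        print(down, "down ->", evaluate_policy(down, 4).value)
```

Keeping the thresholds in a data table rather than in `if` statements means the same function can be reused for a service with a different number of members or different risk tolerances.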
Creating Services and Monitoring and Alerting on Single Points of Failure
Creating services designed to warn you about single points of failure so you can prevent outages is a pretty straightforward process in Zenoss.
To create a service that will proactively alert you when your redundant components have failed to the point where you now have a single point of failure, complete the following steps:
- In the Zenoss Console, in the Services view, create a service and add the appropriate elements to the service. Because Zenoss is a unified monitoring service, include all relevant elements (application, network, server, and storage) in your service as appropriate.
- For each element in your service, define a policy for the state of the service if a certain performance or availability trigger is met, including triggers that generate alerts when elements in the service reach a single point of failure threshold.
- Configure appropriate notifications for the service, so that when the multiple redundant components that support the service have failed to the point where you now have a single point of failure, Zenoss generates a service alert that lets you know you need to immediately remediate the single point of failure before you have a service outage.
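The three steps above (group elements into a service, attach a policy, wire up notifications) can be sketched in a few lines of Python. This is a conceptual model only and does not use the Zenoss console or any Zenoss API. The `Service` class, the `simple_policy` function, and the state names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


def simple_policy(up: int, total: int) -> str:
    """Hypothetical policy mapping member counts to a service state."""
    if up == 0:
        return "down"
    if up == 1 and total > 1:
        return "single point of failure"
    if up < total:
        return "degraded"
    return "up"


@dataclass
class Service:
    name: str
    policy: Callable[[int, int], str] = simple_policy
    notify: Callable[[str], None] = print
    elements: Dict[str, bool] = field(default_factory=dict)  # element -> is up
    last_state: str = "up"

    def add_element(self, element: str) -> None:
        """Step 1: add an element to the service (assumed up initially)."""
        self.elements[element] = True

    def report(self, element: str, is_up: bool) -> None:
        """Record an element status, then apply steps 2 and 3."""
        self.elements[element] = is_up
        up = sum(self.elements.values())
        state = self.policy(up, len(self.elements))  # step 2: policy
        if state != self.last_state:                 # step 3: notification
            self.notify(f"{self.name}: {self.last_state} -> {state}")
            self.last_state = state


if __name__ == "__main__":
    svc = Service("internet")
    for router in ("edge-1", "edge-2", "edge-3", "edge-4"):
        svc.add_element(router)
    for router in ("edge-1", "edge-2", "edge-3"):
        svc.report(router, False)
```

Notifying only when the service state changes, rather than on every element event, is the design choice that cuts down on noise: the second router failure in the usage example changes nothing at the service level, so it sends nothing.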
By using Service Impact and creating services with impact policies in Zenoss, when key components of the underlying infrastructure that support a service fail, you have the context you need to know how the service is affected.
For more information about how to create services in Zenoss, see the Service Dynamics Impact and Event Management Installation and Administration guide.
Knowing Your Risks with Unified Monitoring
Receiving proactive alerts when IT services have crossed a single point of failure threshold, but before the service has actually failed, is one of the benefits of using a unified monitoring platform. This is because unified monitoring platforms discover and understand the relationships between various infrastructure components and how they relate to delivering a service across all of the elements in the service: server, network, storage, and applications.
For example, a unified platform can understand that four network interfaces are part of a single port channel. If events come in that indicate that two of those network interfaces are down, a unified platform can calculate the overall business risk because it knows what host that port channel is connected to, what VMs are running on that host, and what services those VMs are delivering. With this knowledge, the administrator can address the issue as needed before it ever impacts services.
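The relationship chain in this example (port channel to host to VMs to services) is essentially a dependency graph. The Python sketch below walks such a graph to list everything downstream of a component. It illustrates the concept only and says nothing about how Zenoss stores or traverses its model. The component names and the `SUPPORTS` map are made up.

```python
from collections import deque
from typing import Dict, List

# Hypothetical dependency map: each key supports the items in its list.
SUPPORTS: Dict[str, List[str]] = {
    "port-channel-1": ["host-a"],
    "host-a": ["vm-1", "vm-2"],
    "vm-1": ["web-store"],
    "vm-2": ["web-store", "billing"],
}


def impacted(component: str, supports: Dict[str, List[str]] = SUPPORTS) -> List[str]:
    """Breadth-first walk returning everything downstream of component."""
    seen: List[str] = []
    queue = deque(supports.get(component, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reached through another path
        seen.append(node)
        queue.extend(supports.get(node, []))
    return seen


if __name__ == "__main__":
    # If the port channel loses redundancy, these are the things at risk.
    print(impacted("port-channel-1"))
```

Combining a walk like this with a redundancy check on the port channel members is what lets an event about two failed interfaces be reported as a risk to named services rather than as two isolated interface alarms.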
Share This Tip!
Given the number of infrastructure components needed to support your business-critical services today – whether VM clusters, networking devices, or storage hardware – identifying single points of failure before they become failures is pretty valuable!
If you’ve found this article helpful, feel free to share it with others on LinkedIn, Twitter, Google+ or Facebook, or follow our blog to get the latest news and information from Zenoss.
If you are new to Zenoss and would like to learn more about how others are using the Zenoss Service Dynamics Unified IT Operations platform today to improve their monitoring efficiency and productivity and avoid outages, check out our Four Profiles in Unified Monitoring Success white paper.
If you would like to learn more about additional best practices for managing virtualized environments, check out our Best Practices for Managing Virtualized Environments white paper.