Large organizations that want to scale aggressively need nimble IT departments. With DevOps and site reliability engineering (SRE) methodologies, IT teams can improve the agility, availability and performance of the applications and services in their infrastructure. For those new to both concepts, here is a primer on how DevOps and SRE can work together to evolve IT operations.
What is DevOps?
DevOps is a set of practices that facilitates collaboration between development and operations teams. The term DevOps was coined in 2009 by Patrick Debois. Today, DevOps teams use new tool sets to rapidly build, test and deploy, with the goal of delivering services to market faster.
What is SRE?
SRE is the discipline responsible for availability, performance, latency, monitoring, troubleshooting and capacity planning. The term SRE (coined by Ben Treynor, who founded Google’s Site Reliability Team) has been around since 2003, making it even older than DevOps. Both DevOps and SRE rely on automation to orchestrate infrastructure management across multiple teams.
Cut Through Silos With DevOps
Large enterprises usually have a complicated organizational structure, with many teams working in silos. Each pulls the product in a different direction without communicating with the others. As a result, IT teams can lose sight of the big picture, which leads to deployment issues and high costs. The primary objective of DevOps is to break down these silos and improve alignment between teams.
Apply SRE Principles for Monitoring
Monitoring is a significant engineering endeavor. SREs strongly favor building service-level objectives (SLOs) and service-level agreements (SLAs) on small groups of related, easily understood service-level indicators (SLIs). (Learn more about SRE basics in our previous blog post.) For example, Google SREs like to perform deeply introspective monitoring of target systems grouped by application. Viewing related metrics from all the systems that support an application lets them identify root causes with less ambiguity. SRE practices are also about enabling everyone across the organization to use the same tools and techniques, which in turn creates a sense of shared ownership.
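To make the relationship between service-level indicators and service-level objectives more concrete, here is a minimal Python sketch. The names (`Objective`, `availability_sli`, `latency_sli`, `meets`) and the target values are illustrative assumptions, not part of any particular monitoring product, and real systems typically compute these indicators over rolling time windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """A hypothetical service-level objective for one indicator."""
    name: str
    target: float  # minimum acceptable indicator value, as a fraction in [0, 1]


def availability_sli(successful: int, total: int) -> float:
    """Indicator: fraction of requests that succeeded."""
    # With no traffic there is nothing to count against the service.
    return successful / total if total else 1.0


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Indicator: fraction of requests served within threshold_ms."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets(objective: Objective, sli: float) -> bool:
    """An objective is met when the measured indicator reaches its target."""
    return sli >= objective.target
```

For instance, `meets(Objective("availability", 0.999), availability_sli(9_995, 10_000))` evaluates to `True`, because 99.95% of requests succeeded against an assumed 99.9% target. Keeping each objective tied to one simple indicator is what makes the result easy to explain to every team that shares the service.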
Most SREs try to keep alerting rules as simple as possible, without complex dependency hierarchies, since complexity makes alerts harder to use. Alerts that react to anomalous patterns are an exception. Some best practices for applying SRE principles include:
- Improving the whole life cycle of services through deployment, operation and refinement
- Maintaining services by measuring and monitoring availability, latency and overall system health
- Scaling systems through automation that improves reliability and velocity
- Practicing sustainable incident response
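As a rough illustration of simple alerting rules applied to metrics grouped by application, consider the Python sketch below. The rule names, metric names and thresholds are hypothetical, and the flat list of independent rules is just one way to avoid dependency hierarchies, not a prescribed SRE implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    """One flat rule: a metric name plus a condition on its value."""
    name: str
    metric: str
    fires: Callable[[float], bool]


# Hypothetical thresholds; each rule is independent of every other rule.
RULES = [
    AlertRule("low-availability", "availability", lambda v: v < 0.999),
    AlertRule("high-latency", "p99_latency_ms", lambda v: v > 500),
]


def evaluate(
    metrics_by_app: dict[str, dict[str, float]],
    rules: list[AlertRule] = RULES,
) -> list[tuple[str, str]]:
    """Return (application, rule name) pairs for every rule that fires."""
    alerts = []
    for app, metrics in metrics_by_app.items():
        for rule in rules:
            value = metrics.get(rule.metric)
            # Skip metrics an application does not report.
            if value is not None and rule.fires(value):
                alerts.append((app, rule.name))
    return alerts
```

Calling `evaluate` with one dictionary of metrics per application yields a list such as `[("checkout", "low-availability")]`, so each alert already names the application it belongs to.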
How Are SREs Evolving?
Even though SRE has been around since 2003, the IT world is drastically different now. Cloud-based infrastructure has brought microservices into mainstream IT and, following the principles of lean organizations, DevOps teams are often small and focused, functioning as collective SREs. This results in more collaboration and less conflict between development and operations. While vendor- and cloud-based tools provide basic information about their respective domains, collecting this information into a single source creates context across disparate infrastructure and provides a common tool for troubleshooting complex problems. This combination allows IT Ops to manage legacy and multicloud applications while implementing SRE practices.