
Understanding the importance of reliability – Understanding Key Performance Indicators (KPIs) for Your Production Service-2
In summary, software reliability is not just a technical concern; it has wide-reaching implications for user satisfaction, business success, and even legal and financial aspects. Therefore, investing in ensuring the reliability of software is a prudent and strategic decision for organizations.
Historically, running and managing software in production was the job of the Ops team, and most organizations still work this way. The Ops team comprises system administrators (SysAdmins) who deal with the day-to-day issues of running software in production. They implement scaling and fault tolerance, patch and upgrade software, work on support tickets, and keep the systems running so that the application functions well.
We’ve all experienced the divide between Dev and Ops teams, each with its own goals, rules, and priorities. Often, they find themselves at odds because what benefits Dev (software changes and rapid releases) creates challenges for Ops (stability and reliability).
However, the emergence of DevOps has changed this dynamic. In the words of Andrew Shafer and Patrick Debois, DevOps is a culture and practice in software engineering aimed at bridging the gap between software development and operations.
Looking at DevOps from an Ops perspective, Google came up with site reliability engineering (SRE) as an approach that embodies DevOps principles. It encourages shared ownership, the use of common tools and practices, and a commitment to learning from failures to prevent recurring issues. The primary objective is to develop and maintain a dependable application without sacrificing the speed of delivery, a balance that was once thought contradictory (that is, creating better software faster).
SRE starts from a novel question: what would happen if we allowed software engineers to run the production environment? Google devised the following approach for running its Ops team.
For Google, an ideal candidate for joining the SRE team should exhibit two key characteristics:
- Firstly, they quickly become bored with manual tasks and seek opportunities to automate them
- Secondly, they possess the requisite skills to develop software solutions, even when faced with complex challenges
Additionally, SREs should share an academic and intellectual background with the broader development organization. Essentially, SRE work, traditionally within the purview of operations teams, is carried out by engineers with strong software expertise. This strategy hinges on the natural inclination and capability of these engineers to design and implement automation solutions, thus reducing reliance on manual labor.
By design, SRE teams maintain a strong engineering focus. Without continuous engineering efforts, the operational workload escalates, necessitating an expansion of the team to manage the increasing demands. In contrast, a conventional operations-centric group scales in direct proportion to the growth of the service. If the services they support thrive, operational demands surge with increased traffic, compelling the hiring of additional personnel to perform repetitive tasks.
To avert this scenario, the team responsible for service management must incorporate coding into their responsibilities; otherwise, they risk becoming overwhelmed.
Accordingly, Google establishes a 50% upper limit on the aggregate “Ops” work allocated to all SREs, encompassing activities such as handling tickets, on-call duties, and manual tasks. This constraint guarantees that SRE teams allocate a substantial portion of their schedules to enhancing the stability and functionality of the service. While this limit serves as an upper bound, the ideal outcome is that, over time, SREs carry minimal operational loads and primarily engage in development endeavors as the service evolves to a self-sustaining state. Google’s objective is to create systems that are not merely automated but inherently self-regulating. However, practical considerations such as scaling and introducing new features continually challenge SREs.
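The 50% cap can be checked with simple arithmetic over a team's time log. The following is a minimal sketch, assuming hours are recorded per activity; the activity names, the `ops_fraction` helper, and the sample numbers are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: the activity names and hours below are assumptions,
# not data from any real team.
OPS_CAP = 0.50  # upper limit on the share of time spent on "Ops" work

# Activities counted as operational work in this example.
OPS_ACTIVITIES = {"tickets", "on_call", "manual_tasks"}


def ops_fraction(hours_by_activity, ops_activities=OPS_ACTIVITIES):
    """Return the share of total logged hours spent on operational work."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(
        hours
        for activity, hours in hours_by_activity.items()
        if activity in ops_activities
    )
    return ops / total


# Aggregate hours for a hypothetical team over one week.
team_week = {"tickets": 8, "on_call": 10, "manual_tasks": 6, "engineering": 16}

fraction = ops_fraction(team_week)
over_cap = fraction > OPS_CAP
print(f"Ops share: {fraction:.0%}, over cap: {over_cap}")
```

In this example, 24 of 40 logged hours are operational, so the team is above the cap and would need to shift effort toward engineering work such as automation.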
SREs are meticulous in their approach, relying on measurable metrics to track progress toward specific goals. For instance, stating that a website is running slowly is vague and unhelpful in an engineering context. However, declaring that the 95th percentile of response time has exceeded the service-level objective (SLO) by 10% provides precise information. SREs also focus on reducing repetitive tasks, known as toil, by automating them to prevent burnout. Now, let’s look at some of the key SRE performance indicators.
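To make the response-time statement concrete, the sketch below computes a 95th percentile from a list of latency samples and expresses how far it sits above an SLO target. It uses the nearest-rank percentile definition, which is one of several common conventions; the sample latencies, the 300 ms target, and the function names are assumptions made for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to avoid floating-point rounding in the rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_overshoot(samples, slo_ms, pct=95):
    """Return how far the observed percentile exceeds the SLO target,
    as a fraction of the target (negative when within the SLO)."""
    observed = percentile(samples, pct)
    return (observed - slo_ms) / slo_ms


# Hypothetical response times in milliseconds.
latencies_ms = [
    120, 135, 150, 160, 180, 190, 200, 210, 220, 230,
    240, 250, 260, 270, 280, 290, 300, 310, 330, 400,
]
SLO_MS = 300  # assumed target for the 95th percentile

p95 = percentile(latencies_ms, 95)
overshoot = slo_overshoot(latencies_ms, SLO_MS)
print(f"p95 = {p95} ms, {overshoot:.0%} over the {SLO_MS} ms SLO")
```

With these samples the 95th percentile is 330 ms, which is 10% above the assumed 300 ms target, the kind of precise statement the paragraph above describes.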