Running distributed applications in production – Understanding Key Performance Indicators (KPIs) for Your Production Service

So far, we’ve been discussing KPIs for running an application in production, taking inspiration from SRE principles. Now, let’s bring these ideas together and see how to run a distributed application in production.

A distributed, microservices-based application is inherently different from a monolith. While managing a monolith revolves around ensuring all operational aspects of a single application, the complexity increases manyfold with microservices. Therefore, we should take a different approach to managing them.

From the perspective of SRE, running a distributed application in production entails focusing on ensuring the application’s reliability, scalability, and performance. Here’s how SREs approach this task:

  • SLOs: SREs begin by defining clear SLOs that outline the desired level of reliability for the distributed application. SLOs specify the acceptable levels of latency, error rates, and availability. These SLOs are crucial in guiding the team’s efforts and in determining whether the system is meeting its reliability goals.
  • SLIs: SREs establish SLIs, which are quantifiable metrics used to measure the reliability of the application. These metrics could include response times, error rates, and other performance indicators. SLIs provide a concrete way to assess whether the application meets its SLOs (see the availability SLI sketch after this list).
  • Error budgets: Error budgets are a key concept in SRE. They represent the permissible amount of downtime or errors that can occur before the SLOs are violated. SREs use error budgets to strike a balance between reliability and innovation. If the error budget is exhausted, it may necessitate a focus on stability and reliability over introducing new features (a worked example follows this list).
  • Monitoring and alerting: SREs implement robust monitoring and alerting systems to continuously track the application’s performance and health. They set up alerts based on SLIs and SLOs, enabling them to respond proactively to incidents or deviations from desired performance levels (a burn-rate alerting sketch follows this list). In the realm of distributed applications, a service mesh such as Istio or Linkerd can help here: it lets you visualize the parts of your application through a single pane of glass and makes monitoring and alerting much easier.
  • Capacity planning: SREs ensure that the infrastructure supporting the distributed application can handle the expected load and traffic. They perform capacity planning exercises to scale resources as needed, preventing performance bottlenecks during traffic spikes (a simple headroom calculation follows this list). With modern public cloud platforms, automating scaling with traffic is easier than ever, especially for distributed applications.
  • Automated remediation: Automation is a cornerstone of SRE practices. SREs develop automated systems for incident response and remediation. This includes auto-scaling, self-healing mechanisms, and automated rollback procedures to minimize downtime.
  • Chaos engineering: SREs often employ chaos engineering practices to deliberately introduce controlled failures into the system. This helps identify weaknesses and vulnerabilities in the distributed application, allowing for proactive mitigation of potential issues (a toy fault-injection sketch follows this list). Some of the most popular chaos engineering tools are Chaos Monkey, Gremlin, Chaos Toolkit, ChaosBlade, Pumba, Toxiproxy, and Chaos Mesh.
  • On-call and incident management: SREs maintain on-call rotations to ensure 24/7 coverage. They follow well-defined incident management processes to resolve issues quickly and learn from incidents to prevent recurrence. Much of an SRE team’s development backlog comes from this process: they learn from failures and automate the repeatable tasks those failures expose.
  • Continuous improvement: SRE is a culture of continuous improvement. SRE teams regularly conduct post-incident reviews (PIRs) and root cause analyses (RCAs) to identify areas for enhancement. Lessons learned from incidents are used to refine SLOs and improve the overall reliability of the application.
  • Documentation and knowledge sharing: SREs document best practices, runbooks, and operational procedures. They emphasize knowledge sharing across teams to ensure that expertise is not siloed and that all team members can effectively manage and troubleshoot the distributed application. They also aim to automate the runbooks to ensure that manual processes are kept to a minimum.
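
To make the SLI and SLO concepts concrete, here is a minimal Python sketch that computes an availability SLI from request counts and checks it against an SLO target. The request counts and the 99.9% target are hypothetical; in practice, these figures would come from your monitoring system:

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # No traffic means no failures to count against the SLO
    return good_requests / total_requests

SLO_TARGET = 0.999  # 99.9% availability over the measurement window

# Hypothetical request counts pulled from a monitoring system
sli = availability_sli(good_requests=998_740, total_requests=999_620)
print(f"SLI: {sli:.5f}")                  # SLI: 0.99912
print(f"Meets SLO: {sli >= SLO_TARGET}")  # Meets SLO: True
```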
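
The following sketch shows the arithmetic behind an error budget: a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of permissible downtime. The observed downtime figure is a hypothetical input:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Permissible downtime in the window before the SLO is violated."""
    return (1.0 - slo_target) * window_minutes

budget = error_budget_minutes(0.999, MINUTES_PER_30_DAYS)
downtime_so_far = 18.5  # Minutes of downtime this window (hypothetical)

print(f"Error budget: {budget:.1f} minutes per 30 days")   # 43.2 minutes
print(f"Budget consumed: {downtime_so_far / budget:.0%}")  # 43%
```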
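
For alerting, a common pattern is to page on how fast the error budget is being burned rather than on raw error counts. Below is a rough sketch of a multi-window burn-rate check; the 14.4 threshold follows commonly cited guidance from the Google SRE Workbook, and the error rates are hypothetical inputs your monitoring stack would supply:

```python
SLO_TARGET = 0.999
BUDGET_FRACTION = 1.0 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_rate: float) -> float:
    """How fast the error budget is being consumed relative to plan.

    A burn rate of 1.0 spends the budget exactly over the SLO window.
    """
    return error_rate / BUDGET_FRACTION

def should_page(error_rate_1h: float, error_rate_5m: float,
                threshold: float = 14.4) -> bool:
    # Page only when both a long and a short window burn fast: the long
    # window filters out noise, while the short window confirms the
    # problem is still ongoing.
    return (burn_rate(error_rate_1h) > threshold
            and burn_rate(error_rate_5m) > threshold)

print(should_page(error_rate_1h=0.02, error_rate_5m=0.03))  # True
```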
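
As a back-of-the-envelope illustration of capacity planning, the sketch below estimates the replica count needed to absorb a traffic spike. The peak traffic, per-replica throughput, and spike multiplier are all hypothetical figures you would derive from load testing:

```python
import math

def replicas_needed(peak_rps: float, rps_per_replica: float,
                    spike_multiplier: float = 2.0) -> int:
    """Replicas required to serve peak traffic times a spike multiplier."""
    return math.ceil(peak_rps * spike_multiplier / rps_per_replica)

# Hypothetical: 1,200 requests/s at peak, 150 requests/s per replica
print(replicas_needed(peak_rps=1_200, rps_per_replica=150))  # 16
```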
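
Finally, to illustrate the idea behind chaos engineering (not a replacement for the dedicated tools listed above), here is a toy fault-injection decorator that randomly fails a fraction of calls to a hypothetical downstream function, exposing weak error handling before a real outage does:

```python
import functools
import random

def inject_failures(failure_rate: float = 0.1):
    """Randomly fail a fraction of calls to surface weak error handling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise RuntimeError("chaos: injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_failures(failure_rate=0.2)  # Fail roughly one call in five
def fetch_inventory():
    return {"sku-123": 42}  # Stand-in for a hypothetical downstream call
```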

In summary, SRE’s approach to running a distributed application in production focuses on reliability, automation, and continuous improvement. It sets clear goals, establishes metrics for measurement, and employs proactive monitoring and incident management practices to deliver a highly available and performant service to end users.

Summary

This chapter covered SRE and the KPIs for running our service in production. We started by understanding software reliability and examined how to manage an application in production using SRE. We discussed the three crucial parameters that guide SREs: SLI, SLO, and SLA. We also explored error budgets and their importance in introducing changes within the system. Then, we looked at software disaster recovery and how RPO and RTO define how complex or costly our disaster recovery measures will be. Finally, we looked at how DevOps or SRE teams use these concepts to manage a distributed application in production.

In the next chapter, we will put what we’ve learned to practical use and explore how to manage all these aspects using a service mesh called Istio.
