SLOs – Understanding Key Performance Indicators (KPIs) for Your Production Service

SLOs

Google’s definition of SLOs states that they “establish a target level for the reliability of your service.” They specify the percentage of compliance with SLIs required to consider your site reliable. SLOs are formulated by combining one or more SLIs.

For instance, suppose an SLI requires the 95th-percentile request latency, measured over the last 15 minutes, to stay below 500ms. A 99% SLO would then require that SLI to be met 99% of the time.
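
The latency example above can be made concrete with a short Python sketch. It assumes latency samples have already been grouped into 15-minute windows, uses a nearest-rank 95th percentile, and treats SLO compliance as the fraction of windows in which the SLI held. The function names and the windowing scheme are illustrative assumptions, not the API of any particular monitoring tool.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sli_met(latencies_ms, threshold_ms=500.0):
    """SLI for one window: p95 latency stays below the threshold."""
    return p95(latencies_ms) < threshold_ms


def slo_compliance(windows, threshold_ms=500.0):
    """Fraction of windows (each a list of latencies) in which the SLI held."""
    if not windows:
        raise ValueError("need at least one window")
    good = sum(1 for window in windows if sli_met(window, threshold_ms))
    return good / len(windows)


def slo_met(windows, target=0.99, threshold_ms=500.0):
    """True when the SLI held in at least `target` of the windows."""
    return slo_compliance(windows, threshold_ms) >= target


# Hypothetical data: 99 healthy windows and one slow window.
windows = [[120.0] * 20 for _ in range(99)] + [[900.0] * 20]
print(slo_compliance(windows), slo_met(windows))
```

In a real system the windows would come from a metrics store, and a time-based SLO like this one is only one option; counting good requests against total requests is another common way to define compliance.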

While every organization aims for 100% reliability, setting a 100% SLO is not a practical goal. A system built to a 100% SLO tends to be costly and technically complex, and that level of reliability is more than most applications need for their users to find them acceptable.

In the realm of software services and systems, the pursuit of 100% availability is generally misguided because users cannot feel any practical distinction between a system that is 100% available and one that is 99.999% available. Multiple intermediary systems exist between the user and the service, such as their personal computer, home Wi-Fi, Internet Service Provider (ISP), and the power grid, and these collectively exhibit availability far lower than 99.999%. Consequently, the negligible difference between 99.999% and 100% availability becomes indistinguishable amidst the background noise of other sources of unavailability. Thus, investing substantial effort to attain that last 0.001% availability yields no noticeable benefit to the end user.
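
The arithmetic behind this argument can be sketched in a few lines of Python. The per-component availabilities below (computer, Wi-Fi, ISP, power grid) are made-up illustrative values, not measurements, and the sketch assumes the components fail independently and sit in series, so their availabilities multiply.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Yearly downtime implied by an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def end_to_end_availability(availabilities):
    """Availability of components in series, assuming independent failures."""
    return math.prod(availabilities)


# Hypothetical availabilities for what sits between the user and the service.
user_side = [0.999, 0.999, 0.995, 0.9999]  # computer, Wi-Fi, ISP, power grid

perceived_at_100 = end_to_end_availability(user_side + [1.0])
perceived_at_five_nines = end_to_end_availability(user_side + [0.99999])

print(downtime_minutes_per_year(0.99999))
print(perceived_at_100 - perceived_at_five_nines)
```

Five nines allows roughly 5.26 minutes of downtime per year, and under these assumed numbers the difference the user can actually observe between a 100% and a 99.999% service is smaller than 0.001 percentage points, far below the unavailability contributed by the user's own path to the service.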

In light of this understanding, a question arises: if 100% is an inappropriate reliability target, what constitutes the right reliability target for a system? Interestingly, this is not a technical inquiry but rather a product-related one, necessitating consideration of the following factors:

  • User satisfaction: Determining the level of availability that aligns with user contentment, considering their typical usage patterns and expectations
  • Alternatives: Evaluating which alternatives are open to users who are dissatisfied with the product’s current level of availability
  • User behavior: Examining how users’ utilization of the product changes at different availability levels

Moreover, a completely reliable application leaves no room for the introduction of new features, as any new addition has the potential to disrupt the existing service. Therefore, some margin for error must be built into your SLO.
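
This margin for error is commonly called an error budget. As a minimal sketch, assuming a time-based SLO over a 30-day window, the budget is simply the share of the window in which the service is allowed to be unreliable. The function names and the simple release rule are hypothetical illustrations, not a prescribed policy.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unreliability in the window for a given SLO target."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining_minutes(slo_target, downtime_minutes, window_days=30):
    """Budget left after subtracting downtime already spent in the window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def can_ship_risky_change(slo_target, downtime_minutes, window_days=30):
    """Hypothetical rule: ship risky changes only while budget remains."""
    return budget_remaining_minutes(slo_target, downtime_minutes, window_days) > 0


print(error_budget_minutes(0.999))          # 99.9% SLO over 30 days
print(can_ship_risky_change(0.999, 50.0))   # 50 minutes already spent
```

A 99.9% SLO over 30 days leaves about 43.2 minutes of budget, while a 100% SLO leaves none, which is why it leaves no room for changes that might fail.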

SLOs represent internal objectives that require consensus among the team and internal stakeholders, including developers, product managers, SREs, and CTOs. They necessitate the commitment of the entire organization. Unlike a contractual commitment, a missed SLO carries no formal penalty.

For example, a customer cannot claim damages if an SLO is not met, although a miss may cause dissatisfaction within organizational leadership. This does not mean that missing an SLO should be consequence-free. Falling short of an SLO typically leads to fewer changes and reduced feature development, because the miss signals a decline in quality that calls for greater emphasis on development and testing.

SLOs should be realistic, with the team actively working to meet them. They should align with the customer experience, ensuring that if the service complies with the SLO, customers do not perceive any service quality issues. If performance falls below the defined SLOs, it may affect the customer experience, but not to the extent that customers raise support tickets.

Some organizations implement two types of SLOs: achievable and aspirational. The achievable SLO represents a target the entire team should reach, while the aspirational SLO sets a higher goal and is part of an ongoing improvement process.
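
One way to represent the two tiers is a small value object that holds both targets and reports which of them a measured compliance figure satisfies. This is only a sketch: the class name, field names, and result labels are hypothetical, and the example targets are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredSLO:
    achievable: float    # target the whole team is expected to reach
    aspirational: float  # higher target used to drive ongoing improvement

    def __post_init__(self):
        # The aspirational target must be at least as strict as the achievable one.
        if not 0.0 < self.achievable <= self.aspirational <= 1.0:
            raise ValueError("expected 0 < achievable <= aspirational <= 1")

    def classify(self, measured):
        """Report which tier a measured compliance fraction satisfies."""
        if measured >= self.aspirational:
            return "aspirational met"
        if measured >= self.achievable:
            return "achievable met"
        return "below achievable"


latency_slo = TieredSLO(achievable=0.99, aspirational=0.999)
print(latency_slo.classify(0.995))
```

Keeping both targets in one object makes it harder for the aspirational goal to drift below the achievable one as the targets are revised.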
