SLAs – Understanding Key Performance Indicators (KPIs) for Your Production Service

SLAs

According to Google, SLAs are “formal or implicit agreements with your users that outline the repercussions of meeting (or failing to meet) the contained SLOs.”

These agreements are more structured than SLOs and represent business-level commitments made to customers, specifying the actions that will be taken if the organization fails to fulfill the SLA. SLAs can be either explicit or implicit. An explicit SLA entails well-defined consequences, often in the form of service credits, if the expected reliability is not achieved. The cost of missing an implicit SLA is measured in potential damage to the organization’s reputation and the likelihood of customers switching to alternatives.

SLAs are typically set at a level that is just sufficient to prevent customers from seeking alternatives, and consequently, they tend to be looser than SLOs. For instance, for the request latency SLI, the SLO might be defined at 300ms, while the SLA could be set at 500ms. This distinction arises from the fact that SLOs are internal reliability targets, whereas SLAs are external commitments. A team that meets the SLO automatically satisfies the SLA, and the gap between the two gives the organization a buffer in case of unexpected failures.
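The relationship between the two thresholds can be sketched in a few lines of Python. This is only an illustration that uses the 300ms and 500ms values from the example above; the names `SLO_LATENCY_MS`, `SLA_LATENCY_MS`, and `classify_latency` are hypothetical and do not come from any particular monitoring tool.

```python
SLO_LATENCY_MS = 300  # internal target (value from the example above)
SLA_LATENCY_MS = 500  # external commitment (value from the example above)


def classify_latency(latency_ms: float) -> str:
    """Report which commitments a measured latency SLI value satisfies."""
    if latency_ms <= SLO_LATENCY_MS:
        return "within SLO and SLA"
    if latency_ms <= SLA_LATENCY_MS:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical measurements, one for each zone.
for value in (250, 420, 650):
    print(value, "ms ->", classify_latency(value))
```

The middle branch is the buffer: the team has missed its internal target, but no customer-facing commitment has been broken yet.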

To understand the correlation between SLIs, SLOs, and SLAs, let’s look at the following figure:

Figure 14.2 – SLIs, SLOs, and SLAs

This figure shows how customer experience changes with the level of latency. If we keep the latency SLO at 300ms and meet it, everything is good! Between 300ms and 500ms, the customer starts experiencing some degradation in performance, but not enough for them to lose their cool and start raising support tickets. Therefore, keeping the SLA at 500ms is a good strategy. As soon as we cross the 500ms threshold, unhappiness sets in, and the customer starts raising support tickets for service slowness. If things cross the 10s mark, it is a cause for worry for your Ops team; everything is burning at this stage.

However, as we know, the wording of SLOs is slightly different from what we imagine here. Saying that we have an SLO for 300ms latency does not mean anything on its own, because it does not say how the latency is measured or how often the target must be met. A realistic SLO for an SLI that mandates request latency to remain below 300ms, measured at the 95th percentile over the last 15 minutes, would be to meet the SLI x% of the time. What should that x be? Should it be 99%, or should it be 95%? How do we decide this number? Well, for that, we’ll have to look at error budgets.
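To make "meet the SLI x% of the time" concrete, here is a minimal sketch that computes the 95th percentile latency for each 15-minute window and then the fraction of windows that stay below 300ms. The helper names `p95` and `slo_compliance` and the sample data are assumptions made for illustration; the nearest-rank percentile method is one common choice, and real monitoring systems may compute percentiles differently.

```python
import math


def p95(samples):
    """95th percentile of a non-empty list, using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def slo_compliance(windows, threshold_ms=300):
    """Fraction of windows whose p95 latency is below the threshold."""
    good = sum(1 for window in windows if p95(window) < threshold_ms)
    return good / len(windows)


# Hypothetical data: each inner list holds the request latencies (ms)
# observed during one 15-minute window.
windows = [
    [120, 150, 180, 210, 250],
    [130, 160, 200, 240, 280],
    [140, 190, 260, 310, 900],
    [110, 140, 170, 220, 290],
]

compliance = slo_compliance(windows)
print(f"SLI met in {compliance:.0%} of windows")
```

With this sample data, three of the four windows pass, so the SLI is met 75% of the time. Whether that is acceptable depends entirely on the value chosen for x.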
