
Error budgets – Understanding Key Performance Indicators (KPIs) for Your Production Service
As defined by Liz Fong-Jones and Seth Vargo, error budgets represent “a quantitative measure shared between product and SRE teams to balance innovation and stability.”
In simpler terms, an error budget quantifies the level of risk that can be taken to introduce new features, conduct service maintenance, perform routine enhancements, manage network and infrastructure disruptions, and respond to unforeseen situations. Typically, the monitoring system measures the uptime of your service, while SLOs establish the target you aim to achieve. The gap between measured uptime and the SLO target is the error budget you still have available: the time you can spend deploying new releases and absorbing failures before the SLO is breached.
This is precisely why a 100% SLO is not usually set initially. Error budgets serve the crucial purpose of helping teams strike a balance between innovation and reliability. The rationale behind error budgets lies in the SRE perspective that failures are a natural and expected part of system operations. Consequently, whenever a new change is introduced into production, there is an inherent risk of disrupting the service. Therefore, a higher error budget allows for introducing more features:
Error Budget = 100% - SLO
For instance, if your SLO is 99%, your error budget would be 1%. If you calculate this over a month, assuming 30 days/month and 24 hours/day, you will have a 7.2-hour error budget to allocate for maintenance or other activities. For a 99.9% SLO, the error budget would be 43.2 minutes per month, and for a 99.99% SLO, it would be 4.32 minutes per month. You can refer to the following figure for more details:
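As a quick check on the arithmetic above, here is a minimal Python sketch that applies the formula Error Budget = 100% - SLO to a 30-day month. The function names `error_budget` and `remaining_budget` and the 30-day default window are illustrative assumptions; they do not come from any particular monitoring tool.

```python
from datetime import timedelta

# Assumption: a "month" is 30 days of 24 hours, as in the text above.
DEFAULT_WINDOW = timedelta(days=30)


def error_budget(slo_percent: float, window: timedelta = DEFAULT_WINDOW) -> timedelta:
    """Return the allowed downtime in `window`, using Error Budget = 100% - SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("SLO must be a percentage between 0 and 100")
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


def remaining_budget(
    slo_percent: float,
    downtime_so_far: timedelta,
    window: timedelta = DEFAULT_WINDOW,
) -> timedelta:
    """Return the budget left after subtracting measured downtime.

    A negative result means the SLO has already been breached for this window.
    """
    return error_budget(slo_percent, window) - downtime_so_far


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = error_budget(slo).total_seconds() / 60
        print(f"SLO {slo}% -> {minutes:.2f} minutes per 30 days")

    # Hypothetical example: 30 minutes of downtime already measured this month.
    left = remaining_budget(99.9, timedelta(minutes=30))
    print(f"Remaining at 99.9% after 30 min of downtime: {left}")
```

Run as a script, this reproduces the figures quoted above: 432 minutes (7.2 hours) for 99%, 43.2 minutes for 99.9%, and 4.32 minutes for 99.99%. The `remaining_budget` helper shows how measured downtime from a monitoring system could be compared against the SLO target to decide whether there is still room for a risky release.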
These periods represent actual downtime, but if your services have redundancy, high availability measures, and disaster recovery plans in place, you can potentially extend these durations because the service remains operational while you patch or address issues with one server.
Now, whether you want to keep on adding 9s within your SLO or aim for a lower number would depend on your end users, business criticality, and availability requirements. A higher SLO is more costly and requires more resources than a lower SLO. However, sometimes, just architecting your