I have worked as an engineer for more than a decade, and one thing I’ve seen everyone struggle with is managing reliability expectations for their service. This is particularly challenging for data streaming platforms. To that end, I am starting a series that I’m calling “Site Reliability Engineering for Streaming Platforms”.
This post is the first in that series. In it, I want to provide a refresher on some common terminology used in Site Reliability Engineering that seems to confound a fair number of us.
SLI (Service Level Indicator)
Indicators help measure the reliability of a service. These are metrics that show how a service is doing; availability and latency are commonly used service level indicators. They are metrics that your customers care about and that directly impact how they interact with your service. Indicators are usually second order effects and symptoms of a problem rather than the problem itself. For example, service downtime is a second order effect of high CPU. Users don’t care about your service’s CPU usage in and of itself. As a service owner you may alert on high CPU so you can mitigate the issue, but you would not put it forward as an indicator of how your service is doing.
Good SLIs are:
- ✅ Things your users care about
- ✅ Intrinsically connected to your service’s performance; in other words, impacted by your service’s performance and hence a reflection of it
Let us look at some possible metrics and why each may or may not be a good SLI. (A short code sketch of how typical SLIs are computed follows this list.)
Availability
- ✅ Users care about the probability with which their requests will be served.
- ✅ Aspects of the service’s performance impacts its availability so it can be controlled through actions by service owners
Downtime
- ✅ Directly impacts customers. If the service is down, it’s not serving requests
- ✅ Usually impacted by issues within the service*
Latency
- ✅ Depending on the type of service, users may have a baseline expectation since this may dictate how long they should wait for a response before retrying or giving up.
- ✅ In most cases it is impacted by the service’s performance. There may be some situations where external factors have an impact, for example external data sinks not being able to handle incoming throughput, leading to back pressure and hence impacting latency (story for another day)
CPU
- 🔴 To users of a service, CPU metrics mean nothing
- 🔴 High CPU is not a reflection of the service’s performance as users experience it. Rather, the reverse is true: high CPU may have an impact on availability or latency. For this reason it is not a good SLI
Requests/second
- 🔴 Users typically do not care about the overall requests being made to the service per second
- 🔴 Is not impacted by a service’s performance; it is driven by user demand. Therefore not a good SLI
Data Loss
- 🔴 For stateless services, availability is often used in lieu of data loss. Data loss is relevant to data pipelines. However, users often do not care about the magnitude of data loss. The question users ask is not “how much data loss?” but “is there any data loss?”. This is often codified in terms of “at least once”, “at most once” or “exactly once” delivery guarantees. This is a long topic in and of itself — story for another day.
- ✅ Data loss is often impacted by the service’s performance and hence can be controlled by service owners. In some situations it is reliant on external dependencies, in which case a baseline expectation can be set.
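Most good SLIs boil down to the same shape: a ratio of good events to total events over some window. Here is a minimal sketch of that pattern in Python; the `RequestRecord` type and the sample data are hypothetical stand-ins for whatever your metrics pipeline actually emits.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    """Hypothetical per-request record; in practice this comes from your metrics pipeline."""
    succeeded: bool         # did the request get a successful response?
    latency_seconds: float  # how long the response took

def availability_sli(requests: list[RequestRecord]) -> float:
    """Availability as a ratio: successful requests / total requests."""
    return sum(r.succeeded for r in requests) / len(requests)

def latency_sli(requests: list[RequestRecord], threshold_seconds: float) -> float:
    """Latency as a ratio: requests answered within the threshold / total requests."""
    return sum(r.latency_seconds <= threshold_seconds for r in requests) / len(requests)

# Four illustrative requests: one failed, one slow.
window = [
    RequestRecord(succeeded=True, latency_seconds=0.3),
    RequestRecord(succeeded=True, latency_seconds=6.0),
    RequestRecord(succeeded=False, latency_seconds=0.4),
    RequestRecord(succeeded=True, latency_seconds=0.2),
]
print(f"availability SLI: {availability_sli(window):.0%}")     # 75%
print(f"latency SLI (<= 5s): {latency_sli(window, 5.0):.0%}")  # 75%
```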
SLO (Service Level Objective)
Once you have identified your SLIs, plot their past performance over a relatively long period of time, say 30 days. Now draw a line through the data to mark an acceptable threshold for each SLI. Let’s say your SLI is latency and you drew a line at the 0.5 second mark. Based on past performance over 30 days, requests were handled by your service with a latency of 0.5 seconds about 90% of the time. So you frame your service’s latency SLO like so: “Over a period of 30 days, 90% of the requests will get a response within 0.5 seconds.”

This is a great start! However, after speaking to your customers you realize that they are ok, nay happy, with a 5 second latency at least 90% of the time. So you bump that threshold line to the 5 second mark and notice that in the last 30 days almost 99% of the requests were served under 5 seconds.

This is great! By upping the latency threshold to 5 seconds while keeping the probability at 90%, you’re giving yourself leeway to make mistakes in the future. This leeway lets you keep operational costs low, since your goalpost is a little closer, and it also gives you room to take risks, since you now have some runway for mistakes. With this adjustment your new latency SLO will read: “Over a period of 30 days, 90% of the requests will get a response within 5 seconds.”
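Here is a rough sketch of that threshold exercise, with made-up latency numbers shaped like the example above (90% of requests under 0.5 seconds, 99% under 5 seconds); in practice the `latencies_seconds` list would be pulled from your metrics store.

```python
def achieved_ratio(latencies_seconds: list[float], threshold_seconds: float) -> float:
    """Fraction of past requests that met the candidate latency threshold."""
    met = sum(1 for latency in latencies_seconds if latency <= threshold_seconds)
    return met / len(latencies_seconds)

# Made-up 30-day sample shaped like the example: out of 100 requests,
# 90 finished under 0.5s, 9 more finished under 5s, and 1 took longer.
latencies_seconds = [0.3] * 90 + [2.0] * 9 + [8.0]

customer_target = 0.90  # customers are happy with 90%
for threshold in (0.5, 5.0):
    achieved = achieved_ratio(latencies_seconds, threshold)
    leeway = achieved - customer_target
    print(f"<= {threshold}s: {achieved:.0%} achieved, "
          f"{leeway:+.0%} leeway over the {customer_target:.0%} objective")
```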
This leeway is called the error budget.
Error Budget
You can very rarely, in fact never, get 100% availability for your service. Shit happens to distributed systems. You may be able to get 99% availability pretty reliably, and your customers may be happy with 90%. As a result, there is no reason to promise 99% availability when 90% is enough. This runway is your error budget.
Error Budget = 100% - SLO
Let’s say your availability SLO is 90%, then your error budget will be 10%.
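To make that concrete, here is a small worked example under assumed numbers: a 90% availability SLO over a 30-day window and an illustrative volume of 1,000,000 requests in that window.

```python
slo = 0.90                      # availability SLO over the window
window_days = 30
requests_in_window = 1_000_000  # assumed traffic, purely illustrative

error_budget = 1.0 - slo        # = 10% of the window

# The budget expressed in failed requests (if availability is measured per request)...
budget_failed_requests = error_budget * requests_in_window
# ...or in minutes of full downtime (if availability is measured as uptime).
budget_downtime_minutes = error_budget * window_days * 24 * 60

print(f"error budget: {error_budget:.0%}")                                         # 10%
print(f"= up to {budget_failed_requests:,.0f} failed requests in the window")      # 100,000
print(f"= roughly {budget_downtime_minutes:,.0f} minutes of downtime in 30 days")  # 4,320
```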
Why the error budget is important:
- Lowers operational cost by making errors less expensive — If achieving your SLO means round the clock monitoring and error handling by your Ops team, you’re spending a lot of resources.
- More runway for innovation — Adding new features to an existing service often comes with migrations. Downtimes can happen during this time, not to mention human errors. Having a good error budget gives you the room to innovate without worrying about an SLO miss.
- Past does not dictate the future — This is true for stock markets and also true for your service. Your service traffic may grow in the future. You may acquire new users. Heck your application may go viral! Having more room for errors and downtimes is good in terms of being prepared for future unknowns.
“Under promise, over deliver”
SLA (Service Level Agreement)
An SLA is basically an SLO with a penalty for missing it. For example, I could tell my customers: “If less than 90% of requests get latency under 5 seconds, I am obligated to give you a discount.” Most of the time you don’t need to specify an SLA unless you are a SaaS provider or otherwise charging for your service. SLAs are drawn up by a company’s legal team and codified into the contract that customers and service providers sign. In most cases an SLO is sufficient.
Burn Rate
You have your SLIs, SLOs and SLAs in place. Congratulations! Now it’s go time. Monitoring key metrics and setting up alerts to detect and mitigate issues is bread and butter for any production system. But that’s not all: you also need to alert on SLO burn rate. But what is burn rate anyway?
Burn rate is the rate at which a service consumes its error budget.
Let’s say the error budget for your availability SLO is 10%. This means that over a 30 day period your service is allowed to fail at most 10% of requests without an SLO miss. However, it’s the 15th of the month and 8% of your requests have already failed. You only have a runway of 2% before you miss your SLO, and you still have 15 days to go (a rough sketch of this arithmetic follows the list below). What type of actions can you take to avoid an SLO miss?
- Pause new features and migrations
- Re-provision infrastructure components to allow higher traffic — if applicable. For example: adding more server instances.
- Inform customer if applicable so they’re prepared for an SLO miss
- Add gate-keeping for new use cases. This is important for multi-tenant systems where a single customer’s use case can have wide-reaching impact.
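Here is the burn-rate arithmetic behind the example above, as a rough sketch: a 10% error budget over a 30-day window, with 8% of it already consumed by day 15. Burn rate here is the share of budget consumed divided by the share of the window elapsed, so anything above 1.0x means you will run out of budget before the window ends.

```python
error_budget = 0.10   # from the 90% availability SLO
window_days = 30
elapsed_days = 15
errors_so_far = 0.08  # failed requests so far, as a share of the window's traffic

budget_consumed = errors_so_far / error_budget  # 0.8 -> 80% of the budget is gone
window_elapsed = elapsed_days / window_days     # 0.5 -> only halfway through the window
burn_rate = budget_consumed / window_elapsed    # 1.6 -> consuming budget 1.6x too fast

print(f"budget consumed: {budget_consumed:.0%}, window elapsed: {window_elapsed:.0%}")
print(f"burn rate: {burn_rate:.1f}x")
if burn_rate > 1.0:
    # At the current pace, the entire budget is exhausted on this day of the window.
    exhausted_on_day = window_days / burn_rate
    print(f"at this pace the budget runs out around day {exhausted_on_day:.0f}, "
          f"before the {window_days}-day window ends")
```

In practice you would not compute this by hand mid-month; you would alert when the burn rate stays above some multiple of 1.0x over a short window, so the issue is caught while there is still budget left.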
That is all, folks! Let me know if you found this useful, and whether you would like me to drill down into any specific topic.

