Category: SRE

  • “Perilwork” and the cost of toil


Once, curious about the hours we lose to drudgery, I ran a survey to measure what we call “toil”. Toil is the slow, unglamorous work: the repetitive, invisible, uncelebrated tasks that keep the wheels turning but drain the spirit. In my data engineering group, the work splits cleanly in two. There are those who tend to the pipes and girders and those who craft tools for non-data engineers to use — tools that conceal the messy labyrinth beneath. It was no surprise, then, when the first group — the builders of the bones — reported spending, on average, half their days in toil, with some caught in it for 90% of their time. In career talks and side conversations, these same engineers puzzled over how to make their labor visible in a world that values only the obvious: cost savings, infrastructure speed, tidy dashboards of metrics. And I saw it then, how toil had crept in like an invasive weed taking over the garden. It needed constant tending, and the tending only grew with scale. What doesn’t get measured doesn’t get fixed. The question then becomes — how do we measure it?

I finally settle at my freshly cleaned desk. The coffee’s gone cold. A second glass of water waits on a coaster. The laptop hums, warm, ready. I sink heavily into the leather chair, fingers hesitating over Command + T, hovering on the task I’ve managed to avoid all month. One week left in the quarter — it’s now or miss the OKR. It’s a simple change to migrate to a new protocol. Two lines of Terraform, maybe less. Two lines of change that can break all the ETL jobs if anything goes wrong. I open a pull request, carefully outline the steps for deploying and testing, and send it off for review. But my mind’s already slipping away. There’s that new idea I’ve been turning over, waiting to be sketched out. Steve needs help with a ticket I could solve in ten minutes. Anything but this. I try to picture the neat little line in my promo packet: “Helped achieve team goals by migrating services to the new protocol.” I hear my manager’s voice: “But what was YOUR impact?” It loops in my head, distracting, dragging me sideways. I push the PR, merge, deploy, test... done. Grab the leash and take the dog for a walk. The late afternoon light makes the sidewalk glow, and for a moment it’s all pleasant. Then the phone buzzes in my pocket. I pull it out fast, heat rising in my neck, ears burning. An incident. Was it my change that did it? In my mind, the tidy line in my promo packet vanishes. I see my manager’s face again: “You should have been careful…”

The Google SRE book defines toil as work that is manual, repetitive, automatable, and tactical, that grows at an O(n) rate with the service, and that offers no enduring value. That list — last updated in 2017 — still holds up, but I’d argue it’s incomplete. To it, I’d add a new category: high-risk, low- or no-reward tasks. Let’s call it perilwork. You’ll know perilwork when you see it. A dragging weight in your gut when the request comes in. A rational, well-earned fear of making a mistake. A quiet, knowing cynicism toward so-called “blameless postmortems”. It’s the kind of work no one volunteers for, but everyone has to do. Luckily, perilwork is also the easiest kind of toil to reason about when assessing prioritization and impact — the cost of getting it wrong is too high to ignore. SLA breaches. Brand damage. Revenue loss.

In the medical field, perilwork has another name: “effort-reward imbalance”, and its impact on patient safety has been studied extensively. One of the suggested mitigations is higher rewards for toil tasks, to restore the balance of effort and reward. This may also be why, during my time at Google, SREs were paid more. They also had the best-stocked bar on the Google campus. As of 2022, Google also paid a stipend for on-call rotations. This takes the sting out of grunt work. Most companies, though, still treat on-call as just another part of the job. And for infrastructure teams, on-call is only one source of toil. Migrations, upgrades, deployments — these make up a significant portion of perilwork. The most effective way to address it isn’t just to reward it, but to reduce it: to automate, to standardize, to chip away at the risk until what remains is manageable, predictable. Lower the peril, ease the stress.

What might that look like in practice? Imagine every system carrying a number, something we will call its peril potential — a score between 0 and 100 that reflects the chance something might break. Problems rarely show up when everything’s calm; they tend to arrive when change rolls through. The peril potential would act as a simple signal, an early warning. When the number starts to climb, it’s a cue to shift focus toward the maintenance work we often postpone for lack of perceived impact. Tackling those tasks at the right moment lowers the chances of incidents and eases the weight of invisible work on engineers. It’s a way to steady systems and reduce the quiet, grinding stress that builds up over time. Each system would start with a peril score of 0, recalculated after every SLA breach, incident, security event, or major alert tied to a change. The exact thresholds? That’s a judgment call. It would depend on your service tier, your team size, the strength of your tooling, and how easily you can automate away risk. Each organization would have to decide what “too risky” looks like for them.
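To make the idea concrete, here is a minimal sketch of what a peril-potential tracker could look like, assuming you feed it the triggering events listed above. The event weights, the remediation credit, and the 60-point threshold are illustrative assumptions, not a prescription — as noted, the exact numbers are a judgment call for each team.

```python
# A minimal sketch of a "peril potential" tracker. Every system starts at 0;
# the score is recalculated after each triggering event and clamped to 0-100.
# The weights and threshold below are illustrative assumptions.
from dataclasses import dataclass, field

EVENT_WEIGHTS = {          # hypothetical weights per event type
    "sla_breach": 25,
    "incident": 20,
    "security_event": 30,
    "major_alert": 10,
}

@dataclass
class System:
    name: str
    peril: int = 0
    history: list = field(default_factory=list)

    def record_event(self, kind: str) -> int:
        """Recalculate the score after an SLA breach, incident, etc."""
        self.peril = min(100, self.peril + EVENT_WEIGHTS.get(kind, 5))
        self.history.append((kind, self.peril))
        return self.peril

    def record_remediation(self, points: int) -> int:
        """Completing deferred maintenance chips the score back down."""
        self.peril = max(0, self.peril - points)
        self.history.append(("remediation", self.peril))
        return self.peril

    def needs_attention(self, threshold: int = 60) -> bool:
        """Signal that postponed maintenance should move up the queue."""
        return self.peril >= threshold

# Usage
etl = System("billing-etl")
etl.record_event("major_alert")
etl.record_event("incident")
etl.record_event("sla_breach")
print(etl.peril, etl.needs_attention())   # 55 False (with these illustrative weights)
```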

Of course, peril scores alone won’t clear your backlog. An astute reader like you might ask — what about the toil itself? How do we decide which pieces are worth tackling? For that, start by digging into your JIRA backlog. Look for the P2 and P3 postmortem follow-ups, the ones gathering dust quarter after quarter, always deprioritized because the immediate impact wasn’t obvious or the return on investment seemed questionable. After all, how risky could a two-line Terraform change be? Or that canary deployment we never fully automated? Or that brittle CI/CD pipeline no one quite trusts? Those are your starting points. Why? Because we already know — from the last incident, the last outage, the postmortem someone quietly typed up — that those fixes would have made a difference. The only reason they’ve stayed untouched is that no one had a way to measure their value. Peril potential gives you that odometer. It surfaces the invisible, lets you track the risk you’re chipping away at, and turns overlooked toil into clear, measurable progress. A small, steady way to make an outsized impact.
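If your backlog lives in JIRA, a query along these lines can surface those dusty follow-ups. This is a rough sketch against JIRA’s REST search API; the instance URL, the “postmortem-followup” label, the P2/P3 priority names, and the environment-variable credentials are all assumptions about how your project is set up, so adjust the JQL to match your own conventions.

```python
# Rough sketch: list long-deferred postmortem follow-ups from a JIRA backlog,
# oldest first. Label and priority names are assumptions about your project.
import os
import requests

JIRA_URL = "https://yourcompany.atlassian.net"           # hypothetical instance
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

JQL = (
    'labels = "postmortem-followup" '
    "AND priority in (P2, P3) "
    "AND statusCategory != Done "
    "ORDER BY created ASC"                                # dustiest tickets first
)

resp = requests.get(
    f"{JIRA_URL}/rest/api/2/search",
    params={"jql": JQL, "fields": "summary,priority,created", "maxResults": 50},
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()

for issue in resp.json()["issues"]:
    f = issue["fields"]
    print(f'{issue["key"]:<12} {f["priority"]["name"]:<4} '
          f'{f["created"][:10]}  {f["summary"]}')
```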

    Invisible work has always been the quiet backbone of our systems, and toil — especially perilwork — is where risk hides in plain sight. We can’t eliminate it entirely, but we can get smarter about when and how we tackle it. A simple, transparent measure like peril potential turns gut instinct into data, giving teams a way to prioritize the small, unglamorous fixes before they turn into costly problems. It offers engineers a way to make their impact visible, to reduce stress, and to chip away at risk in a way that scales. And while no metric is perfect, having even a rough signal is better than navigating blind. Start where you are. Pick a threshold. Surface the neglected tasks. You’ll be surprised how quickly the garden starts to clear.

  • Site Reliability Engineering – SLA, SLO and SLIs


I have worked as an engineer for more than a decade, and one thing I’ve seen everyone struggle with is how to manage reliability expectations for their service. This is particularly challenging for data streaming platforms. To that end, I am starting a series that I’m calling “Site Reliability Engineering for Streaming Platforms”.

This post is the first in that series. In it, I want to provide a refresher on some common terminology used in Site Reliability Engineering that seems to confound a fair number of us.

SLI (Service Level Indicator)

Indicators help measure the reliability of a service. These are metrics that indicate how a service is doing. Availability and latency are commonly used service level indicators. They are metrics that your customers care about and that directly impact how they interact with your service. Indicators are usually second-order effects: symptoms of a problem rather than the problem itself. For example, service downtime is a second-order effect of high CPU. Users don’t care about your service’s CPU usage in and of itself. As a service owner you may alert on high CPU so you can mitigate the issue, but you would not put it forward as an indicator of how your service is doing.

    Good SLIs are:

    • ✅ Things your users care about
• ✅ Intrinsically connected to your service’s performance; in other words, impacted by it and hence a reflection of it.

Let’s look at some possible metrics and why each may or may not be a good SLI.

    Availability

    • ✅ Users care about the probability with which their requests will be served.
• ✅ Aspects of the service’s performance impact its availability, so it can be controlled through the actions of service owners

    Downtime

• ✅ Directly impacts customers. If the service is down, it’s not serving requests
• ✅ Usually impacted by issues within the service

    Latency

    • ✅ Depending on the type of service, users may have a baseline expectation since this may dictate how long they should wait for a response before retrying or giving up.
• ✅ In most cases it is impacted by the service’s performance. There may be situations where external factors have an impact. For example: external data sinks that cannot handle the incoming throughput lead to back pressure and hence higher latency (a story for another day)

    CPU

    • 🔴 To users of a service, CPU metrics mean nothing
• 🔴 High CPU is a cause rather than a symptom: it may impact availability or latency, but it is not itself a reflection of how the service performs for users. For that reason it is not a good SLI

    Requests/second

    • 🔴 Users typically do not care about the overall requests being made to the service per second
• 🔴 It is driven by incoming demand rather than by the service’s performance, and is therefore not a good SLI

    Data Loss

    • 🔴 For stateless services, availability is often used in lieu of data loss. Data loss is relevant to data pipelines. However, users often do not care about the magnitude of data loss. The question users ask is not “how much data loss?” but “is there any data loss?”. This is often codified in terms of “at least once”, “at most once” or “exactly once” delivery guarantees. This is a long topic in and of itself — story for another day.
• ✅ Data loss is often impacted by the service’s performance and hence can be controlled by service owners. In some situations it depends on external dependencies, in which case a baseline expectation can be set.

Once you have identified your SLIs, plot their past performance over a relatively long period of time, say 30 days. Now draw a line through the plot to mark an acceptable threshold for each SLI. Let’s say your SLI is latency and you drew the line at the 0.5-second mark. Based on past performance over 30 days, your service handled about 90% of requests within 0.5 seconds. So you frame your service’s latency SLO like so: “Over a period of 30 days, 90% of requests will get a response within 0.5 seconds.”

This is a great start! However, after speaking to your customers you realize that they are OK with, nay happy with, a 5-second latency at least 90% of the time. So you bump the threshold line to the 5-second mark and notice that over the last 30 days almost 99% of requests were served within 5 seconds.

This is great! By raising the latency threshold to 5 seconds while keeping the probability at 90%, you’re giving yourself leeway to make mistakes in the future. That leeway keeps operational costs low, since the goalpost is easier to reach, and gives you room to take risks. With this adjustment, your latency SLO reads: “Over a period of 30 days, 90% of requests will get a response within 5 seconds.”
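To make that exercise concrete, here is a minimal sketch of the threshold check, assuming you can export per-request latencies for the window. The latencies below are synthetic, generated so the split roughly matches the walkthrough; with real data you would simply read off your own percentages.

```python
# A minimal sketch of the SLO exercise above: given 30 days of per-request
# latencies (in seconds), check what fraction fell within each candidate
# threshold. The data is synthetic, tuned to roughly match the ~90% / ~99%
# split described in the walkthrough.
import random

random.seed(42)
latencies = [random.lognormvariate(mu=-3.5, sigma=2.2) for _ in range(100_000)]

def fraction_within(latencies, threshold_s):
    """Share of requests answered within `threshold_s` seconds."""
    return sum(l <= threshold_s for l in latencies) / len(latencies)

for threshold in (0.5, 5.0):
    print(f"{100 * fraction_within(latencies, threshold):5.1f}% of requests "
          f"were served within {threshold} s")
# Promising 90% within 5 s while observing ~99% leaves you room for error.
```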

This leeway is called the Error Budget.

    Error Budget

You can very rarely, in fact never, get 100% availability for your service. Shit happens to distributed systems. You may be able to hit 99% availability pretty reliably, and your customers may be happy with 90%. There is no reason to promise 99% availability when 90% is enough. That runway is your error budget.

Error Budget = 100% - SLO

Let’s say your availability SLO is 90%; your error budget is then 10%.
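A quick bit of arithmetic shows what that budget buys you over a 30-day window. This is a small illustrative sketch; the one-million-requests traffic figure is made up.

```python
# Error budget = 100% - SLO. What a 90% availability SLO over a 30-day
# window allows, in wall-clock time and in requests. The traffic volume
# is an arbitrary example.
slo = 0.90
window_days = 30
requests_in_window = 1_000_000

error_budget = 1.0 - slo                               # 10%
allowed_downtime_hours = error_budget * window_days * 24
allowed_failed_requests = error_budget * requests_in_window

print(f"Error budget: {error_budget:.0%}")
print(f"Allowed downtime: {allowed_downtime_hours:.0f} hours over {window_days} days")
print(f"Allowed failed requests: {allowed_failed_requests:,.0f} of {requests_in_window:,}")
```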
Why is the error budget important?

• Lowers operational cost by making errors less expensive — If achieving your SLO means round-the-clock monitoring and error handling by your Ops team, you’re spending a lot of resources.
• More runway for innovation — Adding new features to an existing service often comes with migrations. Downtime can happen during migrations, not to mention human errors. A healthy error budget gives you room to innovate without worrying about an SLO miss.
• The past does not dictate the future — This is true for stock markets and also true for your service. Your traffic may grow, you may acquire new users, heck, your application may go viral! Having more room for errors and downtime keeps you prepared for future unknowns.

    “Under promise, over deliver”

SLA (Service Level Agreement)

An SLA is basically an SLO with a penalty for missing it. For example, I could tell my customers: “If fewer than 90% of requests are served within 5 seconds, I am obligated to give you a discount.” Most of the time you don’t need to specify an SLA unless you are a SaaS provider or otherwise charging for your service. SLAs are drawn up by a company’s legal team and codified into the contract that customers and service providers sign. In most cases an SLO is sufficient.

    Burn Rate

You have your SLIs, SLOs and SLAs in place. Congratulations! Now it’s go time. Monitoring key metrics and setting up alerts to detect and mitigate issues is bread and butter for any production system. But that’s not all: you also need to alert on SLO burn rate. But what is burn rate anyway?

    Burn rate is the rate at which a service consumes its error budget.

Let’s say the error budget for your availability SLO is 10%. This means that over a 30-day period your service is allowed to fail at most 10% of requests without an SLO miss. However, it’s the 15th of the month and 8% of your requests have already failed. You only have a runway of 2% before you miss your SLO, and you still have 15 days to go. What actions can you take to avoid an SLO miss?

    • Pause new features and migrations
    • Re-provision infrastructure components to allow higher traffic — if applicable. For example: adding more server instances.
• Inform customers, if applicable, so they’re prepared for an SLO miss
• Add gate-keeping for new use cases. This is important for multi-tenant systems, where a single customer’s use case can have wide-reaching impact.
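To make the arithmetic in the example above concrete, here is a minimal sketch that expresses burn rate relative to the steady pace that would exactly exhaust the budget at the end of the window. The function and variable names are mine, not a standard API.

```python
# Burn-rate arithmetic for the example above: a 10% error budget, 15 days
# into a 30-day window, with 8% of requests already failed.
def burn_rate(budget_used_frac, window_elapsed_frac):
    """Consumption relative to a steady, on-target pace.
    1.0 means landing exactly on the SLO; above 1.0 means trouble."""
    return budget_used_frac / window_elapsed_frac

error_budget = 0.10                 # 100% - 90% availability SLO
bad_fraction = 0.08                 # 8% of requests have failed so far
days_elapsed, window_days = 15, 30

budget_used = bad_fraction / error_budget                    # 0.8 -> 80% of budget gone
rate = burn_rate(budget_used, days_elapsed / window_days)    # 0.8 / 0.5 = 1.6

days_until_exhausted = (1.0 - budget_used) / (budget_used / days_elapsed)
print(f"Budget consumed: {budget_used:.0%}, burn rate: {rate:.1f}x")
print(f"At this pace the budget is gone in ~{days_until_exhausted:.1f} days, "
      f"with {window_days - days_elapsed} days still left in the window")
```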

That’s all, folks! Let me know if you found this useful, and whether you’d like me to drill down into any specific topic.
