“Perilwork” and the cost of toil

Once, curious about the hours we lose to drudgery, I ran a survey to measure what we call “toil”. Toil is the slow, unglamorous work: the repetitive, invisible, uncelebrated tasks that keep the wheels turning but drain the spirit. In my data engineering group, the work splits cleanly in two. There are those who tend to the pipes and girders, and those who craft tools for non-data engineers to use, tools that conceal the messy labyrinth beneath. It was no surprise, then, when the first group, the builders of the bones, reported spending, on average, half their days in toil, with some caught in it for 90% of their time. In career talks and side conversations, these same engineers puzzled over how to make their labor visible in a world that values only the obvious: cost savings, infrastructure speed, tidy dashboards of metrics. And I saw it then, how toil had crept in like an invasive weed taking over the garden. It needed constant tending, and the tending only grew with scale. What doesn’t get measured doesn’t get fixed. The question then becomes: how do we measure it?

I finally settle at my freshly cleaned desk. The coffee’s gone cold. A second glass of water waits on a coaster. The laptop hums, warm, ready. I sink heavily into the leather chair, fingers hesitating over Command + T, hovering on the task I’ve managed to avoid all month. One week left in the quarter: it’s now or miss the OKR. It’s a simple change to migrate to a new protocol. Two lines of Terraform, maybe fewer. Two lines of change that can break all the ETL jobs if anything goes wrong. I open a pull request, carefully outline the steps for deploying and testing, and send it off for review. But my mind’s already slipping away. There’s that new idea I’ve been turning over, waiting to be sketched out. Steve needs help with a ticket I could solve in ten minutes. Anything but this. I try to picture the neat little line in my promo packet: “Helped achieve team goals by migrating services to the new protocol.” I hear my manager’s voice: “But what was YOUR impact?” It loops in my head, distracting, dragging me sideways. I push the PR, merge, deploy, test... done. Grab the leash and take the dog for a walk. The late afternoon light makes the sidewalk glow, and for a moment it’s all pleasant. Then the phone buzzes in my pocket. I pull it out fast. Heat rises in my neck, my ears burn. An incident. Was it my change that did it? In my mind, the tidy line in my promo packet vanishes. I see my manager’s face again: “You should have been careful…”

The Google SRE book defines toil as work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that grows at an O(n) rate as the service grows. That list, last updated in 2017, still holds up, but I’d argue it’s incomplete. To it, I’d add a new category: high-risk, low- or no-reward tasks. Let’s call it perilwork. You’ll know perilwork when you see it. A dragging weight in your gut when the request comes in. A rational, well-earned fear of making a mistake. A quiet, knowing cynicism toward so-called “blameless postmortems”. It’s the kind of work no one volunteers for, but everyone has to do. Luckily, perilwork is also the easiest kind of toil to reason about when assessing prioritization and impact: the cost of getting it wrong is too high to ignore. SLA breaches. Brand damage. Revenue loss.

In the medical field, perilwork has another name: “effort-reward imbalance”. Its impact on patient safety has been extensively studied. One suggested mitigation is to reward toil tasks more highly, restoring the balance between effort and reward. This may also be why, during my time at Google, SREs were paid more. They also had the best-stocked bar on the Google campus. As of 2022, Google also paid a stipend for on-call rotations. This takes the sting out of grunt work. Most companies, though, still treat on-call as just another part of the job. And for infrastructure teams, on-call is only one source of toil. Migrations, upgrades, deployments: these make up a significant portion of perilwork. The most effective way to address it isn’t just to reward it, but to reduce it: to automate, to standardize, to chip away at the risk until what remains is manageable, predictable. Lower the peril, ease the stress.

What might that look like in practice? Imagine every system carrying a number, something we will call its peril potential: a score between 0 and 100 that reflects the chance that something might break. Problems rarely show up when everything’s calm; they tend to arrive when change rolls through. This peril potential would act as a simple signal, an early warning. When the number starts to climb, it’s a cue to shift focus toward the maintenance work we often postpone for lack of perceived impact. Tackling that work at the right moment lowers the chances of incidents and eases the weight of invisible work on engineers. It’s a way to steady systems and reduce the quiet, grinding stress that builds up over time. Each system would start with a peril score of 0, recalculated after every SLA breach, incident, security event, or major alert tied to a change. The exact thresholds? That’s a judgment call. It would depend on your service tier, your team size, the strength of your tooling, and how easily you can automate away risk. Each organization would have to decide what “too risky” looks like for them.
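To make the idea concrete, here is a minimal sketch in Python of how such a score could be kept. Everything specific in it is an assumption for illustration: the `System` class, the `EVENT_WEIGHTS` values, the simple capped sum, and the default threshold of 50 are all hypothetical, and each organization would pick its own.

```python
from dataclasses import dataclass, field

# Hypothetical weights: points added for each change-related event.
# These numbers are illustrative assumptions, not recommendations.
EVENT_WEIGHTS = {
    "sla_breach": 30,
    "incident": 25,
    "security_event": 35,
    "major_alert": 10,
}


@dataclass
class System:
    """A system whose peril potential starts at 0 and rises with events."""

    name: str
    events: list = field(default_factory=list)

    def record(self, event_type: str) -> int:
        """Log a change-related event and return the recalculated score."""
        if event_type not in EVENT_WEIGHTS:
            raise ValueError(f"unknown event type: {event_type}")
        self.events.append(event_type)
        return self.peril_potential()

    def peril_potential(self) -> int:
        """Sum the weights of all recorded events, capped at 100."""
        return min(100, sum(EVENT_WEIGHTS[e] for e in self.events))


def needs_maintenance(system: System, threshold: int = 50) -> bool:
    """True once the score reaches the (organization-specific) threshold."""
    return system.peril_potential() >= threshold


etl = System("etl-scheduler")   # hypothetical system name
etl.record("major_alert")       # score is now 10
etl.record("incident")          # score is now 35
etl.record("sla_breach")        # score is now 65
print(etl.name, etl.peril_potential(), needs_maintenance(etl))
```

A capped sum is only one possible choice. A real version would probably also lower the score as fixes land, which this sketch leaves out.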

Of course, peril scores alone won’t clear your backlog. An astute reader like you might ask: what about the toil itself? How do we decide which pieces are worth tackling? For that, start by digging into your Jira backlog. Look for the P2 and P3 postmortem follow-ups, the ones gathering dust quarter after quarter, always deprioritized because the immediate impact wasn’t obvious or the return on investment seemed questionable. After all, how risky could a two-line Terraform change be? Or that canary deployment we never fully automated. Or that brittle CI/CD pipeline no one quite trusts. Those are your starting points. Why? Because we already know, from the last incident, the last outage, the postmortem someone quietly typed up, that those fixes would have made a difference. The only reason they’ve stayed untouched is that no one had a way to measure their value. Peril potential gives you that odometer. It surfaces the invisible, lets you track the risk you’re chipping away at, and turns overlooked toil into clear, measurable progress. A small, steady way to make an outsized impact.
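The backlog pass described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a Jira integration: the `Ticket` fields, the `triage` function, the threshold of 50, and the ordering rule (riskiest system first, then oldest ticket) are all hypothetical, and the peril scores are assumed to arrive as a plain dictionary keyed by system name.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """A backlog item; all fields are hypothetical, not a Jira schema."""

    key: str
    system: str
    priority: str              # e.g. "P2"
    postmortem_followup: bool
    quarters_open: int


def triage(tickets, peril_by_system, threshold=50):
    """Pick P2/P3 postmortem follow-ups on systems at or above the threshold.

    Results are ordered by the system's peril score (highest first),
    then by how many quarters the ticket has been open (oldest first).
    """
    candidates = [
        t for t in tickets
        if t.postmortem_followup
        and t.priority in {"P2", "P3"}
        and peril_by_system.get(t.system, 0) >= threshold
    ]
    return sorted(
        candidates,
        key=lambda t: (-peril_by_system.get(t.system, 0), -t.quarters_open),
    )
```

Called with a list of tickets and something like `{"etl": 65, "dashboards": 20}`, it returns only the dusty follow-ups on the ETL system, oldest first.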

Invisible work has always been the quiet backbone of our systems, and toil — especially perilwork — is where risk hides in plain sight. We can’t eliminate it entirely, but we can get smarter about when and how we tackle it. A simple, transparent measure like peril potential turns gut instinct into data, giving teams a way to prioritize the small, unglamorous fixes before they turn into costly problems. It offers engineers a way to make their impact visible, to reduce stress, and to chip away at risk in a way that scales. And while no metric is perfect, having even a rough signal is better than navigating blind. Start where you are. Pick a threshold. Surface the neglected tasks. You’ll be surprised how quickly the garden starts to clear.