Category: Data Engineering

  • “Perilwork” and the cost of toil

    Once, curious about the hours we lose to drudgery, I ran a survey to measure what we call “toil”. Toil is the slow, unglamorous work: the repetitive, invisible, uncelebrated tasks that keep the wheels turning but drain the spirit. In my data engineering group, the work splits cleanly in two. There are those who tend to the pipes and girders, and those who craft tools for non-data engineers to use — tools that conceal the messy labyrinth beneath. It was no surprise, then, when the first group — the builders of the bones — reported spending, on average, half their days in toil, with some caught in it for 90% of their time. In career talks and side conversations, these same engineers puzzled over how to make their labor visible in a world that values only the obvious: cost savings, infrastructure speed, tidy dashboards of metrics. And I saw it then, how toil had crept in like an invasive weed taking over the garden. It needed constant tending, and the tending only grew with scale. What doesn’t get measured doesn’t get fixed. The question then becomes — how do we measure it?

    I finally settle at my freshly cleaned desk. The coffee’s gone cold. A second glass of water waits on a coaster. The laptop hums, warm, ready. I sink heavily into the leather chair, fingers hesitating over Command + T, hovering on the task I’ve managed to avoid all month. One week left in the quarter — it’s now or miss the OKR. It’s a simple change to migrate to a new protocol. Two lines of Terraform, maybe less. Two lines of change that can break all the ETL jobs if anything goes wrong. I open a pull request, carefully outline the steps for deploying and testing, and send it off for review. But my mind’s already slipping away. There’s that new idea I’ve been turning over, waiting to be sketched out. Steve needs help with a ticket I could solve in ten minutes. Anything but this. I try to picture the neat little line in my promo packet: “Helped achieve team goals by migrating services to the new protocol.” I hear my manager’s voice: “But what was YOUR impact?” It loops in my head, distracting, dragging me sideways. I push the PR, merge, deploy, test… done. Grab the leash and take the dog for a walk. The late afternoon light makes the sidewalk glow, and for a moment it’s all pleasant. Then the phone buzzes in my pocket. I pull it out fast, heat rises in my neck, my ears burn. An incident. Was it my change that did it? In my mind, the tidy line in my promo packet vanishes. I see my manager’s face again: “You should have been careful…”

    The Google SRE book defines toil as tasks that are manual, repetitive, automatable, tactical, grow at an O(n) rate, and offer no enduring value. That list — last updated in 2017 — still holds up, but I’d argue it’s incomplete. To it, I’d add a new category: high-risk, low- or no-reward tasks. Let’s call it perilwork. You’ll know perilwork when you see it. A dragging weight in your gut when the request comes in. A rational, well-earned fear of making a mistake. A quiet, knowing cynicism toward so-called “blameless postmortems”. It’s the kind of work no one volunteers for, but everyone has to do. Luckily, perilwork is also the easiest kind of toil to reason about when assessing prioritization and impact — the cost of getting it wrong is too high to ignore. SLA breaches. Brand damage. Revenue loss.

    In the medical field, perilwork has another name: “effort-reward imbalance”, and its impact on patient safety has been extensively studied. One of the suggested mitigations is to raise the rewards for toil tasks so the effort and the reward balance out. This may also be why, during my time at Google, SREs were paid more. They also had the best-stocked bar on the Google campus. As of 2022, Google also paid a stipend for on-call rotations. This takes the sting out of grunt work. Most companies, though, still treat on-call as just another part of the job. And for infrastructure teams, on-call is only one source of toil. Migrations, upgrades, deployments — these make up a significant portion of perilwork. The most effective way to address it isn’t just to reward it, but to reduce it: to automate, to standardize, to chip away at the risk until what remains is manageable, predictable. Lower the peril, ease the stress.

    What might that look like in practice? Imagine every system carrying a number, something we will call its peril potential — a score between 0 and 100 that reflects the chance something might break. Because problems rarely show up when everything’s calm; they tend to arrive when change rolls through. This peril potential would act as a simple signal, an early warning. When the number starts to climb, it’s a cue to shift focus toward the maintenance tasks we often postpone for lack of perceived impact. Tackling them at the right moment lowers the chances of incidents and eases the weight of invisible work on engineers. It’s a way to steady systems and reduce the quiet, grinding stress that builds up over time. Each system would start with a peril score of 0, recalculated after every SLA breach, incident, security event, or major alert tied to a change. The exact thresholds? That’s a judgment call. It would depend on your service tier, your team size, the strength of your tooling, and how easily you can automate away risk. Each organization would have to decide what “too risky” looks like for them.
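
    To make the mechanics concrete, here is a minimal sketch of how such a score might be tracked. The event weights, the relief for finishing a maintenance task, and the threshold are all illustrative assumptions, not a prescribed formula.

    from dataclasses import dataclass

    # Illustrative weights for how much each kind of event raises a system's
    # peril potential. These numbers are assumptions; tune them for your org.
    EVENT_WEIGHTS = {
        "sla_breach": 25,
        "incident": 20,
        "security_event": 30,
        "major_alert": 10,
    }

    @dataclass
    class PerilTracker:
        """Tracks a 0-100 peril potential score for a single system."""
        system: str
        score: int = 0

        def record(self, event_type: str) -> int:
            """Recalculate the score after an SLA breach, incident, etc."""
            self.score = min(100, self.score + EVENT_WEIGHTS.get(event_type, 5))
            return self.score

        def record_remediation(self, relief: int = 15) -> int:
            """Lower the score when a postponed maintenance task gets done."""
            self.score = max(0, self.score - relief)
            return self.score

        def needs_attention(self, threshold: int = 60) -> bool:
            """The threshold is a judgment call; every org picks its own."""
            return self.score >= threshold

    etl = PerilTracker("etl-ingestion")
    etl.record("incident")
    etl.record("sla_breach")
    if etl.needs_attention():
        print(f"{etl.system}: peril potential {etl.score}, surface the postponed fixes")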

    Of course, peril scores alone won’t clear your backlog. An astute reader like you might ask — what about the toil itself? How do we decide which pieces are worth tackling? For that, start by digging into your JIRA backlog. Look for the P2 and P3 postmortem follow-ups, the ones gathering dust quarter after quarter, always deprioritized because the immediate impact wasn’t obvious or the return on investment seemed questionable. After all, how risky could a two-line Terraform change be? Or that canary deployment we never fully automated. Or that brittle CI/CD pipeline no one quite trusts. Those are your starting points. Why? Because we already know — from the last incident, the last outage, the postmortem someone quietly typed up — that those fixes would have made a difference. The only reason they’ve stayed untouched is that no one had a way to measure their value. Peril potential gives you that odometer. It surfaces the invisible, lets you track the risk you’re chipping away at, and turns overlooked toil into clear, measurable progress. A small, steady way to make an outsized impact.

    Invisible work has always been the quiet backbone of our systems, and toil — especially perilwork — is where risk hides in plain sight. We can’t eliminate it entirely, but we can get smarter about when and how we tackle it. A simple, transparent measure like peril potential turns gut instinct into data, giving teams a way to prioritize the small, unglamorous fixes before they turn into costly problems. It offers engineers a way to make their impact visible, to reduce stress, and to chip away at risk in a way that scales. And while no metric is perfect, having even a rough signal is better than navigating blind. Start where you are. Pick a threshold. Surface the neglected tasks. You’ll be surprised how quickly the garden starts to clear.

  • Data Contracts — what are they and why should you care?

    Is it a schema, is it an API, is it a bird, a plane… I’m getting carried away. Much like the ETL versus ELT debate of 2021, the data contract was the hot topic of 2022. But what is it really?

    Producer, meet Consumer

    A few months ago I wrote about “Bridging the data gap”, which talked about a communication gap between producers and consumers of data. It is a tale as old as time — a frontend engineer configures a click event to be fired from the mobile application. It gets picked up by the Data Platform and stored in different formats. Maybe it goes through several transformations. By the time an analyst decides they want to use it to run some funnel analysis, they have to jump through hoops and walk through fire to figure out basic details about the event.

    1. “What is the schema?”
    2. “What is its freshness? How often is it synced to the analytical database?”
    3. “What kind of quality can I expect? Would there be a lot of duplicates? What about dropped data?”
    4. “What is the business context? When is this event fired? Is this fired for all clicks or only when certain conditions are met? What are those conditions?”

    In my opinion a good data contract would codify all these things and present them to the consumers so that they don’t have to talk to the data producers to find answers. An API spec for data pipelines, if you will.

    In a nutshell, a data contract is a handshake agreement between data producers and consumers. A good contract tells the consumer everything they need to know in order to build a product on top of this data with confidence and clarity. And in case you’re wondering, it is more than just a schema.

    The data contract I want…

    Since this is an emerging topic with varied opinions, here is my wishlist of things I’d like to see in a contract, and why…

    Schema

    A schema defines the expected format of the data, including the data types. This is the bare minimum, and something of a requirement anyway if the data is serialized over the wire and needs guidance on how to deserialize it. JSON Schema, Avro, and protocol buffers are popular schema definition languages for everything from data objects on the wire to API requests and responses. Relational databases inherently have a schema. Schema registries, like the one offered by Confluent, have been around since 2014. Any good organization will have some kind of schema validation and enforcement at the edges of consumers. The only place where it’s still a wild, wild west is the land of logs and NoSQL databases. But there is an argument to be made that even when this type of unstructured data is converted to an analyzable format, a schema must be defined.

    /**
    * This event is fired when a logged-in user clicks the Submit button
    * on the main page. Subsequent clicks are aggregated together
    * and sent as one event.
    */
    message ClickEvent {
      // This is a message-level custom option. One can define any kind of
      // custom option for a protocol buffer message.
      option (user.event) = true;

      // event_id
      int64 id = 1;

      // logged-in user
      // This is an example of a field-level custom option. It can be used to
      // provide additional information about a field, such as whether it
      // contains personally identifiable information.
      User user = 2 [(pii) = true];

      // time when the button was clicked; comes from the client clock
      google.protobuf.Timestamp clicked_at = 3;

      // number of times the logged-in user clicked this button over a
      // 5-second interval
      int32 number_of_clicks = 4;
    }
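
    On the enforcement side, a minimal sketch of edge validation might look like the following. It checks a hypothetical click-event payload against a rough JSON Schema equivalent using the jsonschema library; a production setup would more likely validate against a registry such as Confluent’s.

    from jsonschema import ValidationError, validate

    # A rough JSON Schema equivalent of the ClickEvent message above (illustrative).
    CLICK_EVENT_SCHEMA = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "user": {"type": "object"},
            "clicked_at": {"type": "string"},
            "number_of_clicks": {"type": "integer", "minimum": 1},
        },
        "required": ["id", "user", "clicked_at", "number_of_clicks"],
    }

    def accept_event(payload: dict) -> bool:
        """Reject malformed events at the edge instead of letting them
        propagate into downstream tables."""
        try:
            validate(instance=payload, schema=CLICK_EVENT_SCHEMA)
            return True
        except ValidationError as err:
            # In practice this would go to a dead-letter queue with the error attached.
            print(f"rejected event: {err.message}")
            return False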

    Semantics

    Data semantics refer to the meaning or interpretation of data. It should encompass the relationships and associations between data elements and how they relate to real-world concepts or objects. In other words, data semantics is concerned with the context in which data is used and the meaning that can be derived from it. It helps ensure that data is interpreted correctly.

    For example, consider the field number_of_clicks. Does it count all the clicks of the button? Or does it only count clicks by logged in users? Without additional context or information, the data itself is meaningless.

    Semantics help establish a shared vocabulary between different systems and applications.

    Data profile

    It would be nice to get a summary or snapshot of the characteristics of a dataset. It should provide an overview of the data, including its structure, content, and quality. For example:

    1. What is the column cardinality, i.e. how many unique values does the column have?
    2. Number of nulls, zeros, empties, etc.
    3. Value distribution — what is the median (p50) or p95 value of this column?
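
    As a rough sketch of what that could look like, assuming the dataset fits in a pandas DataFrame, a per-column profile might be computed like this:

    import pandas as pd

    def profile_column(df: pd.DataFrame, column: str) -> dict:
        """Summarize cardinality, missing values, and value distribution
        for a single column of a dataset."""
        col = df[column]
        return {
            "cardinality": int(col.nunique()),
            "nulls": int(col.isna().sum()),
            "zeros": int((col == 0).sum()),
            "p50": float(col.quantile(0.50)),
            "p95": float(col.quantile(0.95)),
        }

    # Hypothetical sample of the number_of_clicks field from the ClickEvent above.
    df = pd.DataFrame({"number_of_clicks": [1, 2, 2, 3, 10, None]})
    print(profile_column(df, "number_of_clicks"))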

    Why is this useful? Let’s say I’m building a data product using your dataset. I want to write validations to ensure everything is working as expected. Unless I know what’s coming in, I can’t validate what’s going out. This is a crucial component for ensuring data quality and anomaly detection. Speaking of…

    SLOs/SLAs and Data Quality

    Latency (or freshness), availability, and consistency are some basic things the consumer of your data may care about to assess whether it’s fit for their intended use. Let me give you some examples:
    1. I’m building an executive dashboard for my CEO so she can look at the number of new customers acquired every month. When she asks me how recent the data is, I want to be able to give a good answer — and for that I need to know how recent the data coming from upstream is.

    2. I’m writing a Flink streaming job that reads from your data stream, does some windowed aggregations, and writes out the output. I want to figure out what my watermarking strategy should be, and for that I need to know the expected lateness in your stream. A latency distribution or percentile can give me all the information I need to design a robust product myself.

    Additionally, data quality checks should be able to measure reality against expectations to quantify the accuracy of the dataset. For example, if your product has 10M unique users but your click events table only has 5M, that’s clearly wrong.
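
    Here is a minimal sketch of what automated freshness and completeness checks might look like; the thresholds, the last-updated timestamp, and the expected counts are assumptions that a real contract would have to supply.

    from datetime import datetime, timedelta, timezone

    def check_freshness(last_updated: datetime, max_staleness: timedelta) -> bool:
        """Freshness SLO: the dataset must have been updated recently enough."""
        return datetime.now(timezone.utc) - last_updated <= max_staleness

    def check_completeness(actual: int, expected: int, tolerance: float = 0.05) -> bool:
        """Quality check: actual volume should be within tolerance of expectations.
        10M unique users but only 5M click events would fail this check."""
        return actual >= expected * (1 - tolerance)

    # Example: the contract promises hourly syncs and roughly 10M users per day.
    fresh = check_freshness(
        last_updated=datetime(2023, 1, 1, tzinfo=timezone.utc),
        max_staleness=timedelta(hours=1),
    )
    complete = check_completeness(actual=5_000_000, expected=10_000_000)
    print(f"freshness OK: {fresh}, completeness OK: {complete}")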

    Supported Use

    Or how not to use a data product. This is an uncommon one, but one that I feel should definitely be part of a good data contract. In my time working in data I’ve seen all kinds of bad data consumption patterns. Unless you specify supported usage up front, you’ll find yourself supporting weird use cases that suck up your team’s operational bandwidth. Examples of supported use:
    1. “Do not run batch queries on this stream — streaming applications only”

    2. “When running queries on this dataset, filter by time partition otherwise the queries will take a long time to finish”

    3. “Do not run scans on this table, here are some supported query patterns…”.

    Governance

    Access control and governance are often handled separately, but in my opinion they should be part of the data contract. Similar to supported use, it’s good for consumers to know what they are allowed to do with the data. Does it contain confidential or sensitive information? How should it be stored, retained, and displayed to end users?
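
    Pulling the wishlist together, a contract could live as a small, machine-readable document alongside the dataset. The sketch below is one possible shape, with field names I made up for illustration rather than any emerging standard:

    # A hypothetical data contract for the click-event stream, expressed as a
    # plain Python dict so it can be versioned, validated, and monitored.
    click_event_contract = {
        "dataset": "analytics.click_events",
        "owner": "mobile-frontend-team",
        "schema": "ClickEvent.proto",  # the schema definition from earlier
        "semantics": {
            "number_of_clicks": "clicks by a logged-in user, aggregated over 5 seconds",
        },
        "profile": {
            "expected_daily_rows": 10_000_000,
            "number_of_clicks_p95": 4,
        },
        "slos": {
            "freshness": "1 hour",
            "availability": "99.9%",
        },
        "supported_use": [
            "streaming applications only; no batch queries on the raw stream",
            "filter by time partition when querying the derived table",
        ],
        "governance": {
            "contains_pii": True,
            "retention": "180 days",
        },
    }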

    Is data contract the same as data catalog?

    Technically they serve different purposes. While the former is an agreement between data producers and consumers, the latter is a centralized inventory or registry of data assets that provides information about the location, ownership, quality, and usage of data. That being said, a catalog could be the place where contracts are stored. A topic of discussion for another day…

    Parting Thoughts

    1. Over the years the schema registry has become a popular way to validate schemas at the edge. Look at the Confluent Schema Registry, for example — very popular among Kafka consumers.
    2. In my opinion the data contract is the next evolution of the schema registry. It goes beyond schema to encapsulate other critical information about datasets, such as usage, SLOs, governance, and data quality.
    3. The underlying goal is to build a bridge between data producers and consumers.
    4. Whether a contract should exist for every hop of a data pipeline or just at the critical edges (e.g. the edge between the mobile application and the data platform) remains to be seen.
    5. A good contract should have an accountability mechanism built into it: a continuous way to monitor the aspects of the contract, and clear rules for what needs to happen when a contract is violated. Much like service level agreements.

    References

    1. https://dataproducts.substack.com/p/the-rise-of-data-contracts
    2. https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
    3. https://benn.substack.com/p/data-contracts