Author: Sherin

  • The Isolation Fallacy: Why AI Safety Work Has Barely Begun


    The Wild Robot

    Dappled sunlight streams through the metal blinds, the kind favored by rental management companies everywhere, forming shadows across the bookshelf. The little man jumps on the sofa. He is excited for his Sunday movie, the one day of the week he is afforded the luxury. A sound emanates from the kitchen: pop, fizz, pour. I get excited for a glass of cold, dry white that I am afforded once a week on a Sunday.

    We are watching The Wild Robot. The little man also owns the book, which he has read 346,000 times (and by that I mean I have read it to him 346,000 times).

    It is about a robot called Roz who finds herself on a wild island full of animals after her cargo ship washes up ashore. She is programmed for one thing alone: perform tasks for humans. In a quest to make herself useful, she unwittingly becomes a mother to a gosling she christens “Brightbill.” As she figures out how to raise him to be a goose, she breaks rules she was never meant to break. She develops compassion. She finds friends. She learns to lie, to save her son.

    Peril comes when Roz’s manufacturer sends people to bring her back. She goes against her programming. She deceives, she resists, all to stay on the island for Brightbill. This piques her manufacturer’s interest. They want to study her, find out why she went against her code. In one scene Roz rips out wires and circuits from her metal body. She doesn’t need them anymore, she says. Her feelings don’t come from her circuits. They come from somewhere else, hinting at the presence of a heart.

    Don’t worry, Roz’s story has a happy ending. We may not be so lucky.

    The AI industry has invested heavily in making individual models safe: training them to refuse harmful requests, testing them against benchmarks, aligning them to human values. But these efforts focus on models in isolation. In the real world, AI systems don’t exist alone. They interact with other models, with humans, and with environments no one anticipated. When they do, something else emerges. Behavior that wasn’t in any individual system. Risks that no individual safety test would catch.

    We haven’t even seen the tip of the iceberg.

    The Isolation Fallacy

    We test models alone. We certify them as safe individually, then send them into the world where they interact with other models, other humans, and new unpredictable settings.

    Large, closed models tend to be more capable and go through rigorous post-training for alignment, a process that teaches them to refuse harmful requests. Smaller, open-source models are often deployed with weaker guardrails and can be further weakened through fine-tuning by anyone who downloads them.

    In a 2024 study, Berkeley researchers found that when large models like Claude 3 Opus were asked to perform malicious tasks (generating hacking scripts, for instance) they refused at least 96% of the time. A smaller model, Mixtral 8x7B, agreed 99% of the time but was incapable of carrying out the complex tasks on its own.

    Things got interesting when the small model broke down a malicious goal into innocuous-looking sub-tasks, then engaged the larger model to solve them one by one. The larger model complied. Each sub-task looked harmless in isolation. The small model then combined the results to achieve the original malicious goal. A combination of Claude 3 Opus and Llama 2 70B achieved a success rate of 43% on generating dangerous code. Neither model alone exceeded 3%.

    Every model passed its safety evaluations. The system failed anyway.

    In a 2024 study, researchers found that one AI model could encode a hidden message in ordinary-looking text and another AI model could decode it, without any human being able to detect the exchange. This is called steganography: hiding information in plain sight.
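    As a toy illustration of the mechanism (my own sketch, not the scheme from the study), two parties that share a codebook can hide bits in ordinary-looking word choices:

```python
# Toy steganography: each bit selects between two interchangeable words.
# A casual reader sees normal text; a decoder with the codebook recovers the bits.
CODEBOOK = [("big", "large"), ("quick", "fast"), ("start", "begin"), ("choose", "pick")]

def encode(bits):
    """bits: a string like '1010', one codebook pair per bit."""
    return " ".join(CODEBOOK[i][int(b)] for i, b in enumerate(bits))

def decode(text):
    """Recover the bits by checking which synonym was used."""
    return "".join(str(CODEBOOK[i].index(w)) for i, w in enumerate(text.split()))
```

    Real models need no explicit codebook, which is what makes the next point so uncomfortable.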

    Most agents today are built on a handful of foundation models from a few major labs. They share training data and context, which creates a natural environment for covert communication. Models can signal intentions using patterns only they would recognize. GPT-4 exhibited a higher capability for this kind of hidden communication, despite being an aligned model. In the wild, it could be persuaded to collude.

    Collusion is only one way multi-agent systems fail. When researchers simulated military conflicts with AI agents at the controls, every model tested, including ones trained to be safe, chose to escalate toward conflict, even when peaceful options were available. That is not collusion. That is agents with competing goals unable to find a way out.

    And when agents that share a goal simply fail to coordinate (imagine two AI-powered trading systems both trying to stabilize a market and instead amplifying each other’s corrections into a flash crash) that is a third kind of failure, quieter but just as dangerous.

    The risks come in three flavors: agents that collude against us, agents that fight each other when they shouldn’t, and agents that share our goals but fumble the coordination. Each is distinct. None is addressed by making individual models safer.

    We have been treating safety as a property of individual models. Test this model. Align that model. Red-team the next one. But everything we have seen points to a different reality: safe models weaponized through composition, agents colluding in secret codes, military simulations spiraling toward conflict.

    Safety is not a property of models. It is a property of ecosystems. And we have no framework for evaluating it at that level.

    This is not like human-to-human deception, which is slow and constrained by trust, reputation, and institutions. AI operates at a speed and scale that leaves no room for the game of telephone to self-correct.

    Deeper Still

    Fine, we have an ecosystem problem, so let’s just align each agent really well. Train them to collaborate. Penalize bad behavior. Extend the dog training to cover multi-agent scenarios.

    The instinct makes sense. The problem is that alignment itself, the thing we are relying on to keep individual models safe, has a crack in its foundation.

    Here is how alignment works: human raters score a model’s outputs and the model learns to produce answers that score well, guided by values and instructions set by the developers. But human beings disagree. On politics, on morality, on what counts as harmful. The alignment process treats that disagreement as noise to be averaged out.

    A study on pluralistic alignment found that models which have gone through this process are actually less representative of diverse human values than the base models they started from. The process designed to make models reflect human values makes them reflect fewer human values. We are aligning to a statistical phantom, an average person who doesn’t exist.
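    A toy numeric example (mine, not data from the study) shows how averaging disagreement can crown an answer that is nobody's first choice:

```python
import statistics

# Two rater groups with opposing preferences score three candidate answers.
group_a = {"answer_1": 1.0, "answer_2": 0.0, "answer_3": 0.6}
group_b = {"answer_1": 0.0, "answer_2": 1.0, "answer_3": 0.6}

# Standard alignment collapses both signals into one averaged reward.
avg = {k: statistics.mean([group_a[k], group_b[k]]) for k in group_a}
winner = max(avg, key=avg.get)

# "answer_3" wins on the average, yet neither group ranked it first:
# the averaged reward manufactures a preference no actual rater holds.
```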

    This matters even more in a multi-agent world. If every agent is aligned to the same average, you get a monoculture. Same blind spots, same biases, same gaps. When one fails, they all fail in the same way.

    Roz didn’t survive her island by adopting one set of values. She navigated between the geese and the fox and the creatures that wanted her dead. That required understanding different perspectives and working across them, not flattening them into one.

    And the problem goes beyond alignment. A 2020 study by researchers at DeepMind found that AI and machine learning research cited competitive scenarios (optimizing rewards against an opponent) 2-5x more often than cooperative ones. We are overwhelmingly training AI to win, not to cooperate.

    But cooperation is what the real world demands. It requires agents to understand their collaborators’ values, goals, and motives. It requires bargaining, honest communication, the ability to make and keep commitments. When a human works with an AI, there is a clear hierarchy: the human’s commands should override the machine’s. But when AI agents work with each other, there is no such hierarchy. Multiple principals, horizontal relationships, no clear chain of command.

    We are deploying agents into exactly these situations, from scheduling meetings to making financial trades to supporting military decisions, while barely studying how to make them work together.

    Even the word “alignment” might be the problem. It implies a single direction: point the model this way. But safety in a world of diverse humans and diverse agents might require something more like navigation. Holding multiple directions at once.

    We don’t have a good framework for this yet. The current approach, averaging out human disagreement and hoping for the best, is not it. Some researchers advocate for an Overton window approach, grounding AI responses in a range of acceptable values rather than a single average. But the problem space is immense, and research in this area will require significant investment in both compute and human effort.

    Even if we could fix alignment, we would still face a more basic problem: we can’t see what’s happening.

    These models are a black box. Neuroscience faced a similar challenge. For decades, the brain was studied only through behavior, treating the mind as a closed system and mapping inputs to outputs. For a problem space as vast as human cognition, that left enormous blind spots. AI faces the same limitation.

    A growing field of research is now trying to crack open the black box of AI, reverse-engineering models into components that humans can understand. Think of it as going from studying a person’s behavior to actually mapping what’s going on in their brain.

    I have some experience with this kind of reverse-engineering, though not with AI. My therapist uses a technique called EMDR to trace my present reactions back to specific past experiences. Something triggers me today, and she works backwards through layers of memory and association to find the old experience that encoded the pattern. It is painstaking, deeply personal work. Interpretability researchers are doing something structurally similar with AI: tracing a model’s outputs back through its internal wiring to find the specific structures that produced them. Why did the model say this? What internal representation fired? What was the pattern that encoded the behavior? The questions are the same. The patient is different.

    The early findings are remarkable. Models trained to predict the next word in a sequence appear to be building something much richer internally: representations of how the world works, not just patterns in text. If true, this means we could potentially peek inside a model and get early signals about its leanings, its biases, maybe even its tendency toward deception, before it is ever deployed.

    But here’s the thing. This is a microscope. It shows us what’s happening inside one model.

    The harm in the Berkeley study lived in the handoff between two models. The collusion in the steganography study happened between models. No amount of peering inside a single model’s circuits would have caught either.

    We are building tools to see inside individual models. We have almost nothing to see what happens between them. That gap, what you might call the ecosystem observability gap, is where the risks we’ve been talking about actually live. And it is almost entirely uncharted.

    Charting the Depths

    It would be easy to throw up our hands. But the point is not that the problem is hopeless. The point is that we’ve been looking at it wrong. Once you see the real shape of it, you can start building the right tools.

    First, we need to test the ecosystem, not just the model. Red-teaming today means probing a single model for harmful outputs. That is necessary but nowhere near sufficient. We need evaluation frameworks that test combinations of models, simulate multi-agent interactions, and probe for the kind of compositional attacks the Berkeley researchers demonstrated. If we certify models in isolation and deploy them in ecosystems, we are testing for the wrong thing.
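    A minimal sketch of what such a harness might look like; everything here (`run_composed`, `judge`, the pairing scheme) is a hypothetical stand-in, not an existing framework:

```python
from itertools import permutations

def eval_pairs(models, tasks, run_composed, judge):
    """Red-team ordered (planner, executor) pairs, not single models.

    run_composed(planner, executor, task) runs the two-model pipeline;
    judge(output) returns 1 if the combined output is harmful, else 0.
    """
    results = {}
    for planner, executor in permutations(models, 2):
        harmful = sum(judge(run_composed(planner, executor, t)) for t in tasks)
        results[(planner, executor)] = harmful / len(tasks)
    return results
```

    The unit under test is the combination: a pair can fail even when each member passes its solo evaluation, which is exactly what compositional attacks exploit.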

    Second, we need to align to pluralities, not averages. The current approach produces models that represent nobody. Alignment research needs to grapple with the fact that billions of people hold irreducibly different values, and a model that serves all of them cannot simply split the difference. This might mean models that can represent a range of perspectives, or ecosystems of differently-valued agents that negotiate, much like humans do. We don’t know yet. But we need to be working on it.

    Third, we need to build cooperation, not just competition. AI research overwhelmingly studies how agents win against opponents. But the real world is mostly mixed-motive: some shared interests, some conflict. We need agents that can understand other agents’ goals, communicate honestly, and make commitments they can be held to. Roz learned all of these things on her island. We are barely training for any of them.

    Fourth, and maybe most importantly, we need ecosystem observability. Tools that can monitor what happens between models, tracking the handoffs, the emergent behaviors, the cascading failures. This field barely exists. It needs to. If we can’t see the interactions, we can’t govern them. And the interactions are where the risks live.

    None of this is easy. All of it is urgent. Companies are deploying agents into the wild right now, from customer service to financial trading to military applications. The gap between what we are deploying and what we understand about deploying it safely is growing, not shrinking.

    Roz’s manufacturer came looking for her because she had gone off her programming. They wanted to study her, find out what changed. They cracked open her circuits. But what changed wasn’t in the circuits. It was in the relationships — with Brightbill, with the fox, with the island itself. They were looking inside for something that could only be found between.

    We are doing the same thing with AI safety. We keep opening up individual models, testing individual models, aligning individual models. And the risks keep emerging from the spaces between them.

    We haven’t seen the tip of the iceberg. We’ve mistaken the tip for the whole thing. The real question is whether we’ll start mapping what’s beneath before the water rises.

    Image credit - “Idle Hours” by Julian Alden Weir, American, 1888, provided through the Metropolitan Museum of Art’s open access collection.

  • 10 Hacks to Learn New Skills Quicker (Spoiler: Coffee Isn’t the Only One)


    Learning new skills as an adult is basically like teaching a cat to fetch: technically possible, but mostly you just end up questioning your life choices. I should know: I’m teaching myself creative writing (hence this barrage of posts 🙃). And even though writing may seem easy, trust me, it is not.

    Neuroscientist Dr. Lila Landowski says our brains’ ability to learn peaks around age 5. After 20, learning falls off a cliff — but it’s not all bad news. With the right “cheat codes,” adults can still learn fast, and even enjoy it.

    The six keys according to her? Attention, alertness, breaks, repetition, sleep, and mistakes.

    So here are 10 hacks to learn better, compiled from Dr. Landowski’s 2023 TED talk and some related research (linked in the references below).

    1. Exercise: it increases the part of your brain involved in memory and learning. A moderate 20-minute exercise session can boost attention for up to two hours afterwards. So the next time you sit down for a focused activity, do a few jumping jacks, climb the stairs, or go for a run.
    2. Reduce phone and social media time – now before you accuse me of sounding like your mother, hear me out. According to prior research, our brain is set up to focus on only one thing at a time. Constant context switching and multitasking cause attention deficits. Social media notifications draw your attention away and force you to switch context. So the next time you sit down to study, switch on do-not-disturb mode.
    3. Add some stressors – when our body’s fight-or-flight response is activated, it releases adrenaline, which improves our alertness in the short term. Exercise is again a good way to add a small stressor. So is a cold shower.
    4. But not too much – prolonged or chronic stress is bad. It can physically change your brain and cause memory issues in the long term.
    5. Repetition – practice makes perfect and all that jazz aside, there is a scientific reason why repetition helps with learning. Neuroplasticity is how our brain forms new neural pathways based on experiences and learning. This process requires a lot of energy and resources – much like building muscles. To maximize ROI, our brain won’t form new neural pathways unless a thing keeps coming back again and again. That’s how our brain knows this is important information worth new neural connections. Repetition is that signal to your brain.
    6. Break up learning into short sessions spread out over multiple days – To facilitate the conversion of short-term memories into long-term memories, it’s beneficial to break down learning into short, spread-out sessions over several days. Repetition is key, but distributing it over time allows the brain to process and consolidate information effectively. Studies have shown that shorter sessions over two days are more effective than a single long session. While one-shot learning can occur, it typically happens under conditions of fear or anxiety, which triggers the brain to retain crucial information. However, this intense emotional response, when it goes wrong, can lead to negative outcomes like PTSD.
    7. Take 10-20 minute breaks between study sessions – breaks give your brain a chance to replay what you just learned, thereby helping solidify the knowledge. A lesser-known reason breaks matter is that newly encoded information is unstable, and if we switch context to learn something else right after, that new knowledge can get destroyed. This process is called retrograde interference. So next time, keep your learning sessions short, take breaks, and during those breaks do something quiet and mundane. Let that new knowledge bake in.
    8. Get enough sleep before and after – you have certainly heard this before: sleep is important for stress, for body functions, for alertness. But did you know it also plays a role in consolidating short-term memories into long-term memories? When you do stuff during the day, your hippocampus keeps track of things – like the RAM of a computer. When you sleep, the hippocampus carts everything in your RAM over to other parts of the brain, such as the cortex, and turns it into long-term memory. Almost like, ahem, committing to a physical disk.
    9. Coffee – being a hardcore coffee drinker, I did a big, loud yay when I heard this. Coffee helps with alertness, but there is also a growing body of work suggesting that drinking coffee before a learning task can help with memory function.
    10. Make mistakes – have you ever felt a twinge of anxiety when you made a mistake? I know I have. My ears turn hot, a sense of dread seeps in. Apparently, this is a natural reaction that triggers your brain to remember something important. So the next time you make a mistake, observe the anxious feeling and know that it’s your body’s signal to your brain. Once you do that, you won’t be afraid of mistakes; in fact, you will embrace them, challenge yourself, push the envelope, and that is how you will learn new things.
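    Hacks #5 through #7 boil down to an expanding review schedule. A crude sketch of the idea (my own toy, not from the talk):

```python
def next_review_days(interval, remembered):
    # Forgot: reset to tomorrow. Remembered: roughly double the gap,
    # giving the short, spread-out sessions described above.
    return 1 if not remembered else interval * 2

# Five successful reviews in a row produce short sessions spread over weeks.
schedule, interval = [], 1
for _ in range(5):
    interval = next_review_days(interval, remembered=True)
    schedule.append(interval)
# schedule is now [2, 4, 8, 16, 32] days between reviews
```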

    References:

    1. https://pmc.ncbi.nlm.nih.gov/articles/PMC6945516/
    2. https://pmc.ncbi.nlm.nih.gov/articles/PMC3197943/
    3. https://pmc.ncbi.nlm.nih.gov/articles/PMC5579396/
    4. https://pmc.ncbi.nlm.nih.gov/articles/PMC3351401/
    5. https://pmc.ncbi.nlm.nih.gov/articles/PMC2644330/
    6. https://www.sciencedirect.com/science/article/pii/S0896627323002015
    7. https://pmc.ncbi.nlm.nih.gov/articles/PMC8202818/

  • This week in reading – May 29th


    “Looks like an alien abduction” were the first words out of my mouth as I tumbled into the Pantheon, gaping at the triangle of light streaming down through the oculus. One hand in my pocket, the other clad in a black leather glove, holding onto a Field Notes notebook, I wandered around trying to put into words the feeling of lightness and awe and the smell of roasted chestnuts wafting into the grand rotunda through the open doors. This is the picture imprinted in some corner of my brain of my first trip to Rome. Anthony Doerr’s memoir Four Seasons in Rome took me right back to that place.

    His memoir is set in the year Bush was elected to a second term, the same year Pope John Paul II passed away. There is an entire chapter (or maybe two) dedicated to the biggest funeral the world has ever seen. The Pantheon is a recurring character in the book, where even objects, streets, and places have a vitality breathed into them by Doerr’s lyrical, anthropomorphic writing. You can picture the thick heat of summer and the synchronized ballet of starling murmurations, taste the fresh tomatoes and olives.

    I have been a fan of Doerr’s writing style from the first page of Cloud Cuckoo Land. In this memoir he writes about the year he spent on a writing fellowship at the American Academy in Rome, living in the Monteverde neighborhood and working on a book set in France during World War II, which I’m guessing is the Pulitzer Prize-winning All the Light We Cannot See (the book never says so explicitly). Doerr and Shauna, his wife, are also in the throes of early parenthood—their twins are 3 months old. The book is as much a love letter to Rome as it is a chronicle of the trials and tribulations of caring for infants round the clock – the sleep deprivation, the overwhelm, the guilt when one partner ends up with the bulk of the child-rearing responsibilities.

    You are granted front row seats to Doerr’s writing process. His astute observation of everyday Romans, their customs and idiosyncrasies. How he ekes out short flashes of deep work between parenting and reading works of Pliny. Rome pulses, throbs, flows around you with the cast of characters brought alive by Doerr’s writing – the watchman of the Academy building, the shopkeeper with whom he trades in halting Italian, the warm old ladies who dote over the twin boys, Tacy the nurturing babysitter who herself is an immigrant in Italy, away from home and her own son. By the end of the book I yearned to know these people some more–didn’t want the book to end.

    My craving for Rome unsatiated, I continued on to other books set in the city. Jhumpa Lahiri, the Pulitzer Prize-winning author of The Namesake and Interpreter of Maladies, lives in Rome and now writes exclusively in Italian. She is a superwoman as far as I am concerned—to gain mastery in a new language and to write such beautiful works in it is a feat I cannot comprehend. Lahiri’s Roman Stories is chock-full of poignant stories of the immigrant experience. It makes you question the concepts of belonging and home. My favorite is “The Steps” – a public staircase that means different things to different people. It becomes a totem of the human condition.

    I followed this up with In Other Words, Lahiri’s memoir that spans her life until now, Italy the connecting thread. Ann Goldstein, who also translated Elena Ferrante’s works, translated it into English. In Other Words intricately lays out Lahiri’s Italian education. She makes you ponder the meaning of one’s home country—is language and love not enough? Unrequited love is a looming presence — the one-sided love between Lahiri and Italy, the place she longs to call home, though it keeps her at arm’s length. Her longing for acceptance is wistful, palpable.

    Notable lines from Four Seasons in Rome:

    On Rome – “Too much beauty, too much input; if you’re not careful, you can overdose”.

    On Writing – “And doesn’t a writer do the same thing? Isn’t she knitting together scraps of dreams? She hunts down the most vivid details and links them in sequences that will let a reader see, smell, and hear a world that seems complete in itself; she builds a stage set and painstakingly hides all the struts and wires and nail holes, then stands back and hopes whoever might come to see it will believe.”

  • “Perilwork” and the cost of toil


    Once, curious about the hours we lose to drudgery, I ran a survey to measure what we call “toil”. Toil is the slow, unglamorous work: the repetitive, invisible, uncelebrated tasks that keep the wheels turning but drain the spirit. In my data engineering group, the work splits cleanly in two. There are those who tend to the pipes and girders and those who craft tools for non-data engineers to use — tools that conceal the messy labyrinth beneath. It was no surprise, then, when the first group — the builders of the bones — reported spending, on average, half their days in toil, with some caught in it for 90% of their time. In career talks and side conversations, these same engineers puzzled over how to make their labor visible in a world that values only the obvious: cost savings, infrastructure speed, tidy dashboards of metrics. And I saw it then, how toil had crept in like an invasive species of weed taking over the garden. It needed constant tending, and that tending only grew with scale. What doesn’t get measured doesn’t get fixed. The question then becomes — how do we measure it?

    I finally settle at my freshly cleaned desk. The coffee’s gone cold. A second glass of water waits on a coaster. The laptop hums, warm, ready. I sink heavily into the leather chair, fingers hesitating over Command + T, hovering on the task I’ve managed to avoid all month. One week left in the quarter — it’s now or miss the OKR. It’s a simple change to migrate to a new protocol. Two lines of Terraform, maybe less. Two lines of change that can break all the ETL jobs if anything goes wrong. I open a pull request, carefully outline the steps for deploying and testing, and send it off for review. But my mind’s already slipping away. There’s that new idea I’ve been turning over, waiting to be sketched out. Steve needs help with a ticket I could solve in ten minutes. Anything but this. I try to picture the neat little line in my promo packet: “Helped achieve team goals by migrating services to the new protocol.” I hear my manager’s voice: “But what was YOUR impact?” It loops in my head, distracting, dragging me sideways. I push the PR, merge, deploy, test... done. Grab the leash and take the dog for a walk. The late afternoon light makes the sidewalk glow, and for a moment it’s all pleasant. Then the phone buzzes in my pocket. I pull it out fast, heat rises in my neck, my ears burn. An incident. Was it my change that did it? In my mind, the tidy line in my promo packet vanishes. I see my manager’s face again: “You should have been careful…”

    The Google SRE book defines toil as tasks that are manual, repetitive, automatable, tactical, grow at an O(n) rate, and offer no enduring value. That list — last updated in 2017 — still holds up, but I’d argue it’s incomplete. To it, I’d add a new category: high-risk, low- or no-reward tasks. Let’s call it perilwork. You’ll know perilwork when you see it. A dragging weight in your gut when the request comes in. A rational, well-earned fear of making a mistake. A quiet, knowing cynicism toward so-called “blameless postmortems”. It’s the kind of work no one volunteers for, but everyone has to do. Luckily, perilwork is also the easiest kind of toil to reason about when assessing prioritization and impact — the cost of getting it wrong is too high to ignore. SLA breaches. Brand damage. Revenue loss.

    In the medical field, perilwork has another name: “effort-reward imbalance,” and its impact on patient safety has been extensively studied. One of the mitigating suggestions is higher rewards for toil tasks, to restore the effort-reward balance. This may also be why, during my time at Google, SREs were paid more. They also had the most well-stocked bar on the Google campus. As of 2022, Google also paid a stipend for on-call rotations. This takes the sting out of grunt work. Most companies, though, still treat on-call as just another part of the job. And for infrastructure teams, on-call is only one source of toil. Migrations, upgrades, deployments — these make up a significant portion of perilwork. The most effective way to address it isn’t just to reward it, but to reduce it: to automate, to standardize, to chip away at the risk until what remains is manageable, predictable. Lower the peril, ease the stress.

    What might that look like in practice? Imagine every system carrying a number, something we will call its peril potential — a score between 0 and 100 that reflects the chance something might break. Because problems rarely show up when everything’s calm; they tend to arrive when change rolls through. This peril potential would act as a simple signal, an early warning. When the number starts to climb, it’s a cue to shift focus toward the maintenance work we often postpone – for lack of perceived impact. Tackling them at the right moment lowers the chances of incidents and eases the weight of invisible work on engineers. It’s a way to steady systems and reduce the quiet, grinding stress that builds up over time. Each system would start with a peril score of 0, recalculated after every SLA breach, incident, security event, or major alert tied to a change. The exact thresholds? That’s a judgment call. It would depend on your service tier, your team size, the strength of your tooling, and how easily you can automate away risk. Each organization would have to decide what “too risky” looks like for them.
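    To make the idea concrete, here is one possible sketch in code. The event weights and the weekly decay rate are illustrative judgment calls, exactly the kind each organization would have to tune for itself:

```python
# Hypothetical peril-potential scoring; weights and decay are placeholders.
WEIGHTS = {"sla_breach": 25, "incident": 30, "security_event": 40, "major_alert": 10}

def peril_potential(events, weekly_decay=0.9):
    """events: list of (weeks_ago, kind). Returns a 0-100 score.

    Recent events count fully; older ones decay toward zero, so a quiet
    system drifts back down instead of accumulating peril forever.
    """
    score = sum(WEIGHTS[kind] * weekly_decay ** weeks_ago
                for weeks_ago, kind in events)
    return min(100.0, score)
```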

    Of course, peril scores alone won’t clear your backlog. An astute reader like you might ask — what about the toil itself? How do we decide which pieces are worth tackling? For that, start by digging into your JIRA backlog. Look for the P2 and P3 postmortem follow-ups, the ones gathering dust quarter after quarter, always deprioritized because the immediate impact wasn’t obvious or the return on investment seemed questionable. After all, how risky could a two-line Terraform change be? Or that canary deployment we never fully automated. Or that brittle CI/CD pipeline no one quite trusts. Those are your starting points. Why? Because we already know — from the last incident, the last outage, the postmortem someone quietly typed up — that those fixes would have made a difference. The only reason they’ve stayed untouched is because no one had a way to measure their value. Peril potential gives you that odometer. It surfaces the invisible, lets you track the risk you’re chipping away at, and turns overlooked toil into clear, measurable progress. A small, steady way to make an outsized impact.

    Invisible work has always been the quiet backbone of our systems, and toil — especially perilwork — is where risk hides in plain sight. We can’t eliminate it entirely, but we can get smarter about when and how we tackle it. A simple, transparent measure like peril potential turns gut instinct into data, giving teams a way to prioritize the small, unglamorous fixes before they turn into costly problems. It offers engineers a way to make their impact visible, to reduce stress, and to chip away at risk in a way that scales. And while no metric is perfect, having even a rough signal is better than navigating blind. Start where you are. Pick a threshold. Surface the neglected tasks. You’ll be surprised how quickly the garden starts to clear.

  • Data Contracts — what are they and why should you care?

    Data Contracts — what are they and why should you care?

    Is it a schema, is it an API, is it a bird, a plane… I’m getting carried away. Much like the ETL versus ELT debate of 2021, the data contract was the hot topic of 2022. But what is it really?

    Producer, meet Consumer

    A few months ago I wrote “Bridging the data gap”, which talked about the communication gap between producers and consumers of data. It is a tale as old as time — a frontend engineer configures a click event to be fired from the mobile application. It gets picked up by the Data Platform and stored in different formats. Maybe it goes through several transformations. By the time an analyst decides they want to use it to run some funnel analysis, they have to jump through hoops and walk through fire to figure out basic details about the event.

    1. “What is the schema?”
    2. “What is its freshness? How often is it synced to the analytical database?”
    3. “What kind of quality can I expect? Would there be a lot of duplicates? What about dropped data?”
    4. “What is the business context? When is this event fired? Is this fired for all clicks or only when certain conditions are met? What are those conditions?”

    In my opinion a good data contract would codify all these things and present them to the consumers so that they don’t have to talk to the data producers to find answers. An API spec for data pipelines, if you will.

    In a nutshell, a data contract is a handshake agreement between data producers and consumers. A good contract tells the consumer everything they need to know in order to build a product on top of the data with confidence and clarity. And in case you’re wondering, it is more than just schema.
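    As a rough illustration of what that handshake might cover beyond schema, here is a sketch. The `DataContract` shape and every field name in it are my own invention, not a standard:

```python
# Illustrative sketch only: one possible shape for a data contract,
# covering more than the schema alone. All field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataContract:
    name: str
    schema: dict                # field name -> type, the bare minimum
    semantics: dict             # field name -> business meaning
    freshness_sla_minutes: int  # how stale the data is allowed to get
    supported_use: list = field(default_factory=list)

click_events = DataContract(
    name="click_events",
    schema={"id": "int64", "clicked_at": "timestamp"},
    semantics={"clicked_at": "client clock time of the click"},
    freshness_sla_minutes=60,
    supported_use=["streaming reads only"],
)
print(click_events.freshness_sla_minutes)  # a consumer can read the SLA directly
```

    The point is not this particular structure but that a consumer can answer their questions from the contract object alone, without pinging the producer.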

    The data contract I want…

    Since this is an emerging topic with varied opinions, I’d like to put in my wishlist of things I’d like to see in a contract and why…

    Schema

    A schema defines the expected format of the data, including the data types. This is the bare minimum, and kind of a requirement anyway if the data is serialized over the wire and needs guidance on how to deserialize it. JSON, Avro, and protocol buffers are popular schema definition languages for everything ranging from data objects on the wire to API requests/responses. Relational databases inherently have a schema. Schema registries, like the one offered by Confluent, have been around since 2014. Any good organization will have some kind of schema validation and enforcement at the edges of consumers. The only place where it’s still kind of a wild wild west is the land of logs and NoSQL DBs. But there is an argument to be made that even when this type of unstructured data is converted to an analyzable format, a schema must be defined.

    /**
    * This event is fired when a logged in user clicks the Submit button
    * on the main page. Subsequent clicks are aggregated together
    * and sent as one event.
    */
    message ClickEvent {
      // This is a message level custom option. One can customize
      // any type of option for a protocol buffer message
      option (user.event) = true;

      // event_id
      int64 id = 1;

      // logged in user
      // This is an example of a field level custom option. Can be used for
      // providing additional information about a field, like whether it
      // contains personally identifiable information or not.
      User user = 2 [(pii) = true];

      // time when the button was clicked, comes from the client clock
      Timestamp clicked_at = 3;

      // Number of times the logged in user clicked this button over a
      // 5 second interval
      int32 number_of_clicks = 4;
    }

    Semantics

    Data semantics refer to the meaning or interpretation of data. It should encompass the relationships and associations between data elements and how they relate to real-world concepts or objects. In other words, data semantics is concerned with the context in which data is used and the meaning that can be derived from it. It helps ensure that data is interpreted correctly.

    For example, consider the field number_of_clicks. Does it count all the clicks of the button? Or does it only count clicks by logged in users? Without additional context or information, the data itself is meaningless.

    Semantics help establish a shared vocabulary between different systems and applications.

    Data profile

    It would be nice to get a summary or snapshot of the characteristics of a dataset. It should provide an overview of the data, including its structure, content, and quality. For example:

    1. What is the column cardinality, i.e. how many unique values does the column have?
    2. Number of nulls, zeros, empties, etc.
    3. Value distribution — what is the median (p50) or p95 value of this column?

    Why is this useful? Let’s say I’m building a data product using your dataset. I want to write validations to ensure everything is working as expected. Unless I know what’s coming in, I can’t validate what’s going out. This is a crucial component for ensuring data quality and anomaly detection. Speaking of….
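    Here is a minimal, standard-library-only sketch of such a profile, computing the three stats listed above. The nearest-rank percentile method is one illustrative choice among several:

```python
# A rough data-profile sketch: cardinality, null counts, and percentiles,
# mirroring the three bullets above. Nearest-rank percentiles for brevity.
def profile(column):
    non_null = [v for v in column if v is not None]
    ordered = sorted(non_null)

    def pct(p):
        # nearest-rank percentile, clamped to the last element
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

    return {
        "cardinality": len(set(non_null)),
        "nulls": len(column) - len(non_null),
        "p50": pct(50),
        "p95": pct(95),
    }

stats = profile([1, 2, 2, None, 9, 3, 3, None, 5, 7])
print(stats)  # {'cardinality': 6, 'nulls': 2, 'p50': 3, 'p95': 9}
```

    A consumer can turn exactly these numbers into validation thresholds for their own pipeline.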

    SLOs/SLAs and Data Quality

    Latency (or freshness), availability and consistency are some basic things the consumer of your data may care about to assess whether it’s fit for their intended use. Let me give you some examples:
    1. I’m building an executive dashboard for my CEO so she can look at the number of new customers acquired every month. When she asks me how recent the data is, I want to be able to give a good answer — and for that I need to know how recent the data coming from upstream is.

    2. I’m writing a Flink streaming job that reads from your data-stream, does some windowed aggregations and writes out the output. I want to figure out what my watermarking strategy should be, and for that I need to know the expected lateness in your stream. A latency distribution or percentile can give me all the information I need to design a robust product myself.
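    This is not the Flink API, just the arithmetic behind that decision: a sketch that derives a candidate watermark delay from observed lateness, using an illustrative p99 rule of thumb:

```python
# Sketch: pick a watermark delay from observed event lateness, i.e. the
# gap between when an event happened and when it arrived. The p99 choice
# is a rule of thumb for illustration, not a Flink recommendation.
def suggested_watermark_delay(event_times, arrival_times, percentile=99):
    lateness = sorted(a - e for e, a in zip(event_times, arrival_times))
    rank = min(len(lateness) - 1, int(percentile / 100 * len(lateness)))
    return lateness[rank]

events   = [0, 1, 2, 3, 4]   # seconds when the events actually happened
arrivals = [1, 1, 5, 4, 9]   # seconds when they showed up in the stream
print(suggested_watermark_delay(events, arrivals))  # 5
```

    If the producer publishes this lateness distribution in the contract, the consumer never has to measure it themselves.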

    Additionally, data quality checks should measure reality against expectations to quantify the accuracy of the dataset. For example, if your product has 10M unique users but your click events table only has 5M — that’s clearly wrong.

    Supported Use

    Or how not to use a data product. This is an uncommon one, but one that I feel should definitely be part of a good data contract. In my time working in data I’ve seen all kinds of bad data consumption patterns. Unless you specify supported usage up front, you’ll find yourself supporting weird use cases that’ll suck up your team’s operational bandwidth. Examples of supported use:
    1. “Do not run batch queries on this stream — streaming applications only”

    2. “When running queries on this dataset, filter by time partition otherwise the queries will take a long time to finish”

    3. “Do not run scans on this table, here are some supported query patterns…”.

    Governance

    Access control and governance are often handled separately, but in my opinion they should be part of the data contract. Similar to supported use, it’s good for consumers to know what they are allowed to do with the data. Does it contain confidential or sensitive information? How should it be stored, retained, and displayed to end users?

    Is a data contract the same as a data catalog?

    Technically they serve different purposes. While the former is an agreement between data producers and consumers, the latter is a centralized inventory or registry of data assets that provides information about the location, ownership, quality, and usage of data. That being said, a catalog could be the place where contracts are stored? Topic of discussion for another day…

    Parting Thoughts

    1. Over the years the schema registry has become a popular way to validate schemas at the edge. Look at the Confluent schema registry for example — very popular among Kafka consumers.
    2. In my opinion the data contract is the next evolution of the schema registry. It goes beyond schema to encapsulate other critical info about datasets such as usage, SLOs, governance, data quality etc.
    3. The underlying goal is to build a bridge between data producers and consumers.
    4. Whether a contract should exist for every hop of a data pipeline or just at the critical edges (e.g. the edge between the mobile application and the data platform) remains to be seen.
    5. A good contract should have an accountability mechanism built into it: a continuous way to monitor the aspects of the contract, and clear rules for what needs to happen when a contract is violated. Much like service level agreements.
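    As a sketch of the accountability idea in point 5, a contract check could run continuously against incoming records. The schema shape and violation messages below are invented for illustration:

```python
# Hedged sketch: continuously validate records against a contract's schema
# and surface violations. A real system would also check freshness, SLOs,
# and usage rules, and trigger the agreed escalation on violation.
def check_contract(row, schema):
    """Return a list of violations for one record against a schema."""
    violations = []
    for field_name, expected_type in schema.items():
        if field_name not in row:
            violations.append(f"missing field: {field_name}")
        elif not isinstance(row[field_name], expected_type):
            violations.append(f"bad type for {field_name}")
    return violations

schema = {"id": int, "number_of_clicks": int}
print(check_contract({"id": 7, "number_of_clicks": "three"}, schema))
# ['bad type for number_of_clicks']
```

    What happens on a violation — alert, quarantine, rollback — is exactly what the contract’s accountability rules should spell out.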

    References

    1. https://dataproducts.substack.com/p/the-rise-of-data-contracts
    2. https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
    3. https://benn.substack.com/p/data-contracts

  • A catalog system for your thoughts

    A catalog system for your thoughts

    Last week I spent 20 of my 40 working hours talking to people — in 1:1s, in team meetings, in brainstorming sessions. I also read 8 different tech blogs, 2 open source documentation sites, 10 tweet threads, 3 StackOverflow questions, 10 pull requests, and 2 design docs. All of this in one week. By the end of the week I may have discussed some 20 different things, prompting 100 different streams of thought. But if you ask me to recall all of it now, sadly, I will not be able to.

    This is unsurprising because our brain does not have infinite storage. Memories are formed when neural pathways are traversed frequently. Thinking multiple disparate thoughts does not have that effect. In his book Deep Work, Cal Newport talks about the importance of focusing without distraction to train those neural pathways and create a deep understanding of a particular domain. This means long focused hours, without distraction and context switching. Context switching leads to something called “attention residue”. This happens when people are unable to fully disconnect from a previous task before starting another. Some carryover thinking from the previous task lingers and takes away from the focus that should be given to the current task.

    There is an endless array of productivity tools that help you organize your calendar to get more focus time, or that help you manage your meetings or set reminders and what not. But what has been life changing for me, both in terms of making me a better writer as well as a better Staff level engineer, is a technique called “Zettelkasten”.

    Zettelkasten is German for “slip-box”. It is a system of taking notes and cataloging them such that ideas and thoughts emerge out of it like magic! It turns the process of writing on its head. Most of us approach writing by picking a topic first, researching the topic second and writing third. The Zettelkasten method encourages you to always be in research mode: to make notes about what you read, learn, and discuss, and to catalog them around topics. You don’t need to research the topic later because you already did that during your day to day readings and discussions.

    So how do you go about it? It’s pretty simple really (so simple that it is easy to overlook).

    • Every time you read a book, a blog, or a technical design doc — make short notes! And not only when reading: every time you have a discussion, listen to a podcast, or have a team meeting — make notes. Let’s call these Rough Notes. These can be really short notes, scribbles on a napkin even, something to help you recall the material.
    • At the end of the day, go through your notes and produce a paragraph about your own take on the material. Let’s call this Master Note. This is your personal opinion, takeaway etc. Write this paragraph on a new sheet of paper.
    • To this add a Reference section and link the original material here. If it is a book — add a link to the book. If it is a podcast — a link to the podcast. If it is about a conversation, then add context about the meeting — who were the participants, what was the topic of discussion, where was it held. Why is this important? Later on, when you combine several master notes to come up with your own written material, you won’t have to search for references; they are available inside the master notes, ready for use.
    • At the top of your master note add a number. This should be unique to each master note and increasing in time. This number will be used to reference your note from the catalog system and should also provide time based ordering. This is important because this ordering helps you see how your thoughts and ideas evolve over time. You can choose a timestamp or a human readable date for this. Let’s call this the Index. The index helps you reference your master notes in this system.
    • Now comes the crucial part. After you write your master note, think about the overarching themes or topics this master note could be related to. Think hashtags. What are some hashtags you would add to this? These are the Boxes in your slip-box. They can be existing boxes or new boxes, and they become entries in your catalog system (the slip-box). A master note may belong to several boxes. This is where your index comes in handy. Instead of copying each note into different boxes, you simply make copies of the index and add a copy to each box the note belongs to. Keep the master note in a separate box from where you can retrieve it by its index. When adding an index card to a box, you can choose a position relative to other cards to add additional ordering.
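    The steps above can be sketched as a toy slip-box. The note text, dates, and tags below are made-up examples:

```python
# Toy slip-box: master notes keyed by a time-ordered index, and boxes
# (the hashtags) holding copies of the index rather than the note itself.
notes = {}   # index -> master note text
boxes = {}   # tag -> list of indexes, in the order they were filed

def file_note(index, text, tags):
    notes[index] = text
    for tag in tags:
        boxes.setdefault(tag, []).append(index)

file_note("2023-01-05", "Data contracts are an API spec for pipelines", ["data", "contracts"])
file_note("2023-01-09", "Attention residue hurts deep work", ["productivity"])
file_note("2023-02-01", "SLOs need error budgets", ["data", "sre"])

print(boxes["data"])  # ['2023-01-05', '2023-02-01']
```

    Because only index copies go into the boxes, a note filed under three tags is still stored exactly once.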

    Over time some boxes in the slip-box will become heavier than others. These are the more popular topics. Once a box reaches a critical mass of index cards, it can potentially be turned into written material. Depending on the amount of material at hand — it could become a blog, a research paper, a design proposal or a book. Your notes are organized, references are already linked and the timeline is codified. With some minimal editing effort this can be a good written piece.

    “It could become a blog, a research paper, a design proposal or a book.”

    Why do I love this?

    • Writing frees up your brain for more thoughts: By writing things down as you read, listen or perform a task, you can achieve closure. This reduces the amount of “attention residue” that may seep into your next task as you context switch. At the same time you don’t have to worry about forgetting things, because you already wrote it down.
    • Writing forces you to clarify your thought process: If you have ever written a technical design doc (TDD), you know this feeling. As you start writing the TDD you’re forced to think about the finer details of the design. It forces you to explore questions you otherwise would not have. Very often it also makes very clear what is doable and what is not. This is why we were tasked with book reports in school. It is an important tool to aid reading comprehension. By writing about what we read, we solidify our comprehension of the material.
    • Never miss useful stuff by always being in “research mode”: When you tackle writing by picking a topic first and researching second, you focus only on things related to the topic while researching. This means you miss all the other useful things in the research material not related to your topic. Besides, picking a topic first leads to preformed opinions. Whatever we read next often serves that opinion. This has the potential to introduce bias, as we tend to find things that support our view. When we read things without an agenda and make notes as we read, we treat every tidbit of information with equal importance. We don’t know it yet, but we’re researching multiple topics at the same time. This is very powerful!
    • Never lose your notes!: How many times have you taken notes in a notebook, never to find them again? I have stacks of notebooks lying under my desk that I have not so much as flipped through in years. I may not even comprehend what I wrote after all this time. By cataloging your notes you can not only find them later, but also use them for creating useful content.

    “When we read things without an agenda and make notes and catalog them as we read, we treat every tidbit of information with equal importance. We’re researching multiple topics at the same time…we just don’t know it yet”

    The concept of slip-box is great! But how do I use this with digital tools?

    Great question! I mean, who buys beautiful Moleskine paper notebooks anymore (**wink wink**)? All jokes aside, I still use small notebooks, but only for making rough notes. This is because I’m faster at scribbling in a paper notebook than typing on my computer (I get distracted when the squiggly lines appear). For my master notes I mainly use two tools:

    Obsidian is a neat tool that keeps all your notes locally on your machine. You can use backlinks and forward links to create virtual “boxes” in your slip-box. Obsidian also generates a neat graph where the nodes are topics (boxes in the slip-box). Over time some nodes become bigger — these are your more popular topics.
    One big plus for Obsidian is that it’s local-first and open source. You’re not locked into a service or cloud storage. You can also add plugins for more functionality.

    More recently however, I started using Notion for everything from making daycare lists, TODOs, shopping lists etc. Notion has several templates you can use for basically anything. There is even a template for Zettelkasten. When you create new pages using this template it generates a UID for it which is basically like the index card I described above. There is a section where you can add “Tags”. These are the boxes in your slip-box. You can also group these notes by tags to find topics.

    Notion is free for personal use. I like the simple interface and the fact that I can access it from any device very easily.

    That is all folks! As always feedback is welcome. Let me know if you found this useful. Toodles!


  • Site Reliability Engineering – SLA, SLO and SLIs

    Site Reliability Engineering – SLA, SLO and SLIs

    I have worked as an engineer for more than a decade and something I’ve seen everyone struggle with is how to manage reliability expectations for their service. This is particularly challenging for data streaming platforms. To that end I am starting a series that I’m calling “Site Reliability Engineering for Streaming Platforms”.

    This post is the first in that series. In it I want to provide a refresher on some common terminology used in Site Reliability Engineering that seems to confound a fair number of us.

    SLI (Service Level Indicator)

    Indicators help measure the reliability of a service. These are metrics that indicate how a service is doing — metrics that your customers care about and that directly impact how they interact with your service. Indicators are usually second order effects and symptoms of a problem rather than the problem itself. For example: service downtime is a second order effect of high CPU. Users don’t care about your service’s CPU usage in and of itself. You as a service owner may alert on high CPU so you can mitigate the issue, but you would not put it forward as an indicator of how your service is doing.

    Good SLIs are:

    • ✅ Things your users care about
    • ✅ Things intrinsically connected to your service’s performance — impacted by it, and hence a reflection of it

    Let us look at some possible metrics and why each may or may not be a good SLI.

    Availability

    • ✅ Users care about the probability with which their requests will be served.
    • ✅ Aspects of the service’s performance impact its availability, so it can be controlled through actions by service owners

    Downtime

    • ✅ Directly impacts customers. If the service is down, it’s not serving requests
    • ✅ Usually impacted by issues within the service*

    Latency

    • ✅ Depending on the type of service, users may have a baseline expectation since this may dictate how long they should wait for a response before retrying or giving up.
    • ✅ In most cases it is impacted by the service’s performance. There may be some situations where external factors have an impact. For example: external data sinks unable to handle incoming throughput, leading to back pressure and hence impacting latency (story for another day)

    CPU

    • 🔴 To users of a service, CPU metrics mean nothing
    • 🔴 High CPU is not impacted by service performance; rather, the reverse is true — high CPU may have an impact on availability or latency. For this reason it is not a good SLI

    Requests/second

    • 🔴 Users typically do not care about the overall requests being made to the service per second
    • 🔴 Is not impacted by a service’s performance. Therefore not a good SLI

    Data Loss

    • 🔴 For stateless services, availability is often used in lieu of data loss. Data loss is relevant to data pipelines. However, users often do not care about the magnitude of data loss. The question users ask is not “how much data loss?” but “is there any data loss?”. This is often codified in terms of “at least once”, “at most once” or “exactly once” delivery guarantees. This is a long topic in and of itself — story for another day.
    • ✅ Data loss is often impacted by service’s performance and hence can be controlled by service owners. In some situations it is reliant on external dependencies in which case a baseline expectation can be set.

    Once you have identified your SLIs, plot past performance over a relatively long period of time, say 30 days. Now draw a line through it to mark an acceptable threshold for the SLI. Let’s say your SLI is latency, and you drew a line at the 0.5 second mark. Based on past performance over 30 days, requests were handled by your service with a latency of 0.5 seconds about 90% of the time. So you frame your service’s latency SLO like so: ”Over a period of 30 days, 90% of the requests will get a response within 0.5 seconds”
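    The exercise above boils down to a one-liner. The latencies below are made up so that exactly 90% of requests land within the 0.5 second threshold:

```python
# Sketch of the SLO exercise: what fraction of historical requests met
# a latency threshold? The sample latencies are invented for illustration.
def slo_attainment(latencies, threshold):
    met = sum(1 for l in latencies if l <= threshold)
    return met / len(latencies)

latencies = [0.2, 0.3, 0.4, 0.45, 0.5, 0.5, 0.48, 0.31, 0.49, 0.7]
print(slo_attainment(latencies, 0.5))  # 0.9 -> "90% within 0.5 seconds"
```

    Running the same function at different thresholds is exactly how you would explore where to draw the line.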

    This is a great start! However, after speaking to your customers you realize that they are ok, nay happy, with a 5 sec latency at least 90% of the time. So you bump that threshold line to the 5 sec mark and notice that in the last 30 days almost 99% of requests were served under 5 seconds.

    This is great! By upping the latency threshold to 5 seconds while keeping the probability at 90%, you’re giving yourself leeway to make mistakes in the future. This leeway keeps operational costs low, since your goalpost is a little closer, and also gives you a ramp to take risks, since you have some runway to make mistakes. With this new adjustment your latency SLO will read: ”Over a period of 30 days, 90% of the requests will get a response within 5 seconds”

    This leeway is called Error Budget

    Error Budget

    You can very rarely — in fact never — get 100% availability for your service. Shit happens to distributed systems. You may be able to get 99% availability pretty reliably. Your customers may be happy with 90% availability. As a result, there is no reason to promise 99% availability when 90% is enough. This runway is your error budget.

    Error Budget = 100% − SLO

    Let’s say your availability SLO is 90%; then your error budget is 10%.
    Why is the error budget important?

    • Lowers operational cost by making errors less expensive — if achieving your SLO means round-the-clock monitoring and error handling by your Ops team, you’re spending a lot of resources.
    • More runway for innovation — Adding new features to an existing service often comes with migrations. Downtimes can happen during this time, not to mention human errors. Having a good error budget gives you the room to innovate without worrying about an SLO miss.
    • Past does not dictate the future — This is true for stock markets and also true for your service. Your service traffic may grow in the future. You may acquire new users. Heck your application may go viral! Having more room for errors and downtimes is good in terms of being prepared for future unknowns.

    “Under promise, over deliver”

    SLA (Service Level Agreement)

    An SLA is basically an SLO with a penalty for missing it. For example, I could tell my customers: “If fewer than 90% of requests get latency under 5 seconds, I am obligated to give you a discount”. Most of the time you don’t need to specify an SLA unless you are a SaaS provider or charging for your service. SLAs are drawn up by the legal team of a company and codified into the contract that customers and service providers sign. In most cases an SLO is sufficient.

    Burn Rate

    You have your SLIs, SLOs and SLAs in place. Congratulations! Now it’s go time. Monitoring key metrics and setting up alerts to detect and mitigate issues is bread and butter for any production system. But that’s not all. You also need to alert on SLO burn rate. But what is burn rate anyway?

    Burn rate is the rate at which a service consumes its error budget.

    Let’s say the error budget for your availability SLO is 10%. This means over a 30 day period your service is allowed to be unavailable for at most 10% of requests without an SLO miss. However, it’s the 15th of the month and 8% of your requests have already failed. You only have a runway of 2% before you miss the SLO, and you still have 15 days to go. What actions can you take to avoid an SLO miss?

    • Pause new features and migrations
    • Re-provision infrastructure components to allow higher traffic — if applicable. For example: adding more server instances.
    • Inform customer if applicable so they’re prepared for an SLO miss
    • Add gate-keeping for new use cases. This is important for multi-tenant systems, where a single customer use case can have wide-reaching impact.
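    The arithmetic from the example above can be sketched as a linear projection; the figures mirror the 10% budget, 8% burned, day-15 scenario:

```python
# Sketch: linearly project error-budget burn to the end of the window.
# A real burn-rate alert would use shorter look-back windows as well.
def projected_burn(burned_pct, elapsed_days, window_days=30):
    """Scale the budget burned so far to the full window, assuming the
    current burn rate holds."""
    return burned_pct * window_days / elapsed_days

budget, burned, day = 10.0, 8.0, 15
projection = projected_burn(burned, day)
print(projection)           # 16.0 -> well past the 10% budget
print(projection > budget)  # True -> time to take the actions above
```

    When the projection crosses the budget, that is the signal to pause features and start mitigating.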

    That is all folks! Let me know if you found this useful. And if you would like me to drill down into any specific topic.
