Author: Sherin

• 10 Hacks to Learn New Skills Quicker (Spoiler: Coffee Isn’t the Only One)


Learning new skills as an adult is basically like teaching a cat to fetch: technically possible, but mostly you just end up questioning your life choices. I should know, I’m teaching myself creative writing (hence this barrage of posts 🙃). And even though writing may seem easy, trust me, it is not.

Neuroscientist Dr. Lila Landowski says our brain’s capacity to learn peaks around age 5. After 20, learning falls off a cliff — but it’s not all bad news. With the right “cheat codes,” adults can still learn fast, and even enjoy it.

    The six keys according to her? Attention, alertness, breaks, repetition, sleep, and mistakes.

So here are 10 hacks to learn better, compiled from Dr. Landowski’s 2023 TED talk and some related research (linked below).

1. Exercise – it increases the parts of your brain involved in memory and learning. A moderate 20-minute exercise session can boost attention for up to two hours afterwards. So the next time you sit down for a focused activity, do a few jumping jacks, climb the stairs or go for a run.
2. Reduce phone and social media time – now before you accuse me of sounding like your mother, hear me out. According to prior research, our brain is set up to focus on only one thing at a time. Constant context switching and multi-tasking cause attention deficits. Social media notifications draw your attention away and force you to switch context. So the next time you sit down to study, turn on Do Not Disturb.
3. Add some stressors – When our body’s fight or flight response is activated, it releases adrenaline, which improves our alertness in the short term. Exercise is again a good way to add a small stressor. So is a cold shower.
    4. But not too much – prolonged stress or chronic stress is bad. It can physically change your brain and cause memory issues in the long term.
5. Repetition – practice makes perfect and all that jazz aside, there is a scientific reason why repetition helps with learning. Neuroplasticity is how our brain forms new neural pathways based on experiences and learning. This process requires a lot of energy and resources – much like building muscle. To maximize ROI, our brain won’t form new neural pathways unless a thing keeps coming back again and again. That’s how our brain knows this is important information worth new neural connections. Repetition is that signal to your brain.
6. Break up learning into short sessions spread out over multiple days – This helps convert short-term memories into long-term memories. Repetition is key, but distributing it over time allows the brain to process and consolidate information effectively. Studies have shown that shorter sessions over two days are more effective than a single long session. One-shot learning can occur, but it typically happens under conditions of fear or anxiety, which trigger the brain to retain crucial information. When that intense emotional response goes wrong, it can lead to negative outcomes like PTSD.
7. Take 10–20 minute breaks between study sessions – Breaks give your brain a chance to replay what you just learned, helping solidify the knowledge. Another, lesser-known reason breaks matter: newly encoded information is unstable, and if we switch context to learn something else right away, that new knowledge can get destroyed. This process is called retrograde interference. So next time, keep your learning sessions short, take breaks, and during those breaks do something quiet and mundane. Let that new knowledge bake in.
8. Get enough sleep before and after – You have certainly heard this before: sleep is important for stress, for body functions, for alertness. But did you know it also plays a role in consolidating short-term memories into long-term memories? When you do stuff during the day, your hippocampus keeps track of things – like the RAM of a computer. When you sleep, the hippocampus carts everything in your RAM over to other parts of the brain – the cortex and friends – and turns it into long-term memory. Almost like, ahem, committing to disk.
9. Coffee – being a hardcore coffee drinker, I did a big, loud yay when I heard this. Coffee helps with alertness, but there is also a growing body of work suggesting that drinking coffee before a learning task can help with memory functions.
10. Make mistakes – have you ever felt a twinge of anxiety when you made a mistake? I know I have. My ears turn hot, a sense of dread seeps in. Apparently, this is a natural reaction that primes your brain to remember something important. So the next time you make a mistake, observe the anxious feeling and know that it’s your body’s signal to your brain. Once you do that, you won’t be afraid of mistakes; in fact, you will embrace them, challenge yourself and push the envelope, and that is how you will learn new things.

    References:

    1. https://pmc.ncbi.nlm.nih.gov/articles/PMC6945516/
    2. https://pmc.ncbi.nlm.nih.gov/articles/PMC3197943/
    3. https://pmc.ncbi.nlm.nih.gov/articles/PMC5579396/
    4. https://pmc.ncbi.nlm.nih.gov/articles/PMC3351401/
    5. https://pmc.ncbi.nlm.nih.gov/articles/PMC2644330/
    6. https://www.sciencedirect.com/science/article/pii/S0896627323002015
    7. https://pmc.ncbi.nlm.nih.gov/articles/PMC8202818/

  • This week in reading – May 29th


“Looks like an alien abduction” – those were the first words out of my mouth as I tumbled into the Pantheon, gaping at the triangle of light streaming down through the oculus. One hand in my pocket, the other clad in a black leather glove and holding a Field Notes notebook, I wandered around trying to put into words the feeling of lightness and awe, and the smell of roasted chestnuts wafting into the grand rotunda through the open doors. This is the picture imprinted in some corner of my brain of my first trip to Rome. Anthony Doerr’s memoir Four Seasons in Rome took me right back to that place.

His memoir is set in the year Bush was elected to a second term, the same year Pope John Paul II passed away. There is an entire chapter (or maybe two) dedicated to the biggest funeral the world has ever seen. The Pantheon is a recurring character in the book, where even objects, streets and places have a vitality breathed into them by Doerr’s lyrical, anthropomorphic writing. You can picture the thick heat of summer and the synchronized ballet of starling murmurations, taste the fresh tomatoes and olives.

I have been a fan of Doerr’s writing style since the first page of Cloud Cuckoo Land. In this memoir he writes about the year he spent on a writing fellowship at the American Academy in Rome, living in the Monteverde neighborhood and working on a book set in France during World War II, which I’m guessing is the Pulitzer Prize-winning All the Light We Cannot See (it is never stated explicitly). Doerr and Shauna, his wife, are also in the throes of early parenthood—their twins are 3 months old. The book is as much a love letter to Rome as it is a chronicle of the trials and tribulations of caring for infants round the clock – the sleep deprivation, the overwhelm, the guilt when one partner ends up with the bulk of the child-rearing responsibilities.

You are granted front-row seats to Doerr’s writing process: his astute observations of everyday Romans, their customs and idiosyncrasies; how he ekes out short flashes of deep work between parenting and reading the works of Pliny. Rome pulses, throbs and flows around you through the cast of characters brought alive by Doerr’s writing – the watchman of the Academy building, the shopkeeper with whom he trades in halting Italian, the warm old ladies who dote on the twin boys, Tacy the nurturing babysitter who is herself an immigrant in Italy, away from home and her own son. By the end of the book I yearned to know these people some more; I didn’t want it to end.

My craving for Rome unsatiated, I continued on to other books set in the city. Jhumpa Lahiri, the Pulitzer Prize-winning author of The Namesake and Interpreter of Maladies, lives in Rome and now writes exclusively in Italian. She is a superwoman as far as I am concerned—to gain mastery of a new language and write such beautiful works in it is a feat I cannot comprehend. Lahiri’s Roman Stories is chock-full of poignant stories of the immigrant experience. It makes you question the concepts of belonging and home. My favorite is “The Steps” – a public staircase that means different things to different people. It becomes a totem of the human condition.

I followed this up with In Other Words, Lahiri’s memoir that spans her life until now – Italy the connecting thread. Ann Goldstein, who also translated Elena Ferrante’s works, translated it into English. In Other Words intricately lays out Lahiri’s Italian education. She makes you ponder the meaning of one’s home country—are language and love not enough? Unrequited love is a looming presence — the one-sided love between Lahiri and Italy, the place she longs to call home, though it keeps her at arm’s length. Her longing for acceptance is wistful, palpable.

    Notable lines from Four Seasons in Rome:

    On Rome – “Too much beauty, too much input; if you’re not careful, you can overdose”.

    On Writing – “And doesn’t a writer do the same thing? Isn’t she knitting together scraps of dreams? She hunts down the most vivid details and links them in sequences that will let a reader see, smell, and hear a world that seems complete in itself; she builds a stage set and painstakingly hides all the struts and wires and nail holes, then stands back and hopes whoever might come to see it will believe.”

  • “Perilwork” and the cost of toil


Once, curious about the hours we lose to drudgery, I ran a survey to measure what we call “toil”. Toil is the slow, unglamorous work: the repetitive, invisible, uncelebrated tasks that keep the wheels turning but drain the spirit. In my data engineering group, the work splits cleanly in two. There are those who tend to the pipes and girders and those who craft tools for non-data engineers to use — tools that conceal the messy labyrinth beneath. It was no surprise, then, when the first group — the builders of the bones — reported spending, on average, half their days in toil, with some caught in it for 90% of their time. In career talks and side conversations, these same engineers puzzled over how to make their labor visible in a world that values only the obvious: cost savings, infrastructure speed, tidy dashboards of metrics. And I saw it then, how toil had crept in like an invasive weed taking over the garden. It needed constant tending, and that tending only increased with scale. What doesn’t get measured doesn’t get fixed. The question then becomes — how do we measure it?

    I finally settle at my freshly cleaned desk. The coffee’s gone cold. A second glass of water waits on a coaster. The laptop hums, warm, ready. I sink heavily into the leather chair, fingers hesitating over Command + T, hovering on the task I’ve managed to avoid all month. One week left in the quarter — it’s now or miss the OKR. It’s a simple change to migrate to a new protocol. Two lines of Terraform, maybe less. Two lines of change that can break all the ETL jobs if anything goes wrong. I open a pull request, carefully outline the steps for deploying and testing, and send it off for review. But my mind’s already slipping away. There’s that new idea I’ve been turning over, waiting to be sketched out. Steve needs help with a ticket I could solve in ten minutes. Anything but this. I try to picture the neat little line in my promo packet: “Helped achieve team goals by migrating services to the new protocol.” I hear my manager’s voice: “But what was YOUR impact?” It loops in my head, distracting, dragging me sideways. I push the PR, merge, deploy, test..done. Grab the leash and take the dog for a walk. The late afternoon light makes the sidewalk glow, and for a moment it’s all pleasant. Then the phone buzzes in my pocket. I pull it out fast, heat rises in my neck, my ears burn. An incident. Was it my change that did it? In my mind, the tidy line in my promo packet vanishes. I see my manager’s face again: “You should have been careful…”

    The Google SRE book defines toil as tasks that are manual, repetitive, automatable, tactical, grow at an O(n) rate, and offer no enduring value. That list — last updated in 2017 — still holds up, but I’d argue it’s incomplete. To it, I’d add a new category: high-risk, low- or no-reward tasks. Let’s call it perilwork. You’ll know perilwork when you see it. A dragging weight in your gut when the request comes in. A rational, well-earned fear of making a mistake. A quiet, knowing cynicism toward so-called “blameless postmortems”. It’s the kind of work no one volunteers for, but everyone has to do. Luckily, perilwork is also the easiest kind of toil to reason about when assessing prioritization and impact — the cost of getting it wrong is too high to ignore. SLA breaches. Brand damage. Revenue loss.

In the medical field, perilwork has another name: “effort-reward imbalance”, and its impact on patient safety has been extensively studied. One of the suggested mitigations is higher rewards for toil tasks, to restore the effort-reward balance. This may also be why, during my time at Google, SREs were paid more. They also had the most well-stocked bar on the Google campus. As of 2022, Google also paid a stipend for on-call rotations. This takes the sting out of grunt work. Most companies, though, still treat on-call as just another part of the job. And for infrastructure teams, on-call is only one source of toil. Migrations, upgrades, deployments — these make up a significant portion of perilwork. The most effective way to address it isn’t just to reward it, but to reduce it: to automate, to standardize, to chip away at the risk until what remains is manageable, predictable. Lower the peril, ease the stress.

What might that look like in practice? Imagine every system carrying a number, something we will call its peril potential — a score between 0 and 100 that reflects the chance something might break. Because problems rarely show up when everything’s calm; they tend to arrive when change rolls through. This peril potential would act as a simple signal, an early warning. When the number starts to climb, it’s a cue to shift focus toward the maintenance work we often postpone for lack of perceived impact. Tackling those tasks at the right moment lowers the chances of incidents and eases the weight of invisible work on engineers. It’s a way to steady systems and reduce the quiet, grinding stress that builds up over time. Each system would start with a peril score of 0, recalculated after every SLA breach, incident, security event, or major alert tied to a change. The exact thresholds? That’s a judgment call. They would depend on your service tier, your team size, the strength of your tooling, and how easily you can automate away risk. Each organization would have to decide what “too risky” looks like for them.
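For what it’s worth, here is a minimal sketch of how that bookkeeping could work. The event weights, the threshold, and the “relief” value are my own illustrative assumptions, not anything prescribed above; tune them to your own risk tolerance.

# Hypothetical sketch of a per-system "peril potential" score (0-100).
# The weights below are illustrative assumptions, not prescriptions.
EVENT_WEIGHTS = {
    "sla_breach": 25,
    "incident": 15,
    "security_event": 30,
    "major_alert": 5,
}

class PerilTracker:
    def __init__(self):
        self.score = 0  # every system starts at 0

    def record(self, event_type: str) -> int:
        # Recalculate after an SLA breach, incident, security event,
        # or major alert tied to a change.
        self.score = min(100, self.score + EVENT_WEIGHTS.get(event_type, 0))
        return self.score

    def maintenance_done(self, relief: int = 20) -> int:
        # Completing the deferred maintenance chips the score back down.
        self.score = max(0, self.score - relief)
        return self.score

tracker = PerilTracker()
tracker.record("sla_breach")
if tracker.record("security_event") > 50:  # "too risky" is a per-org judgment call
    print("Time to prioritize the postponed maintenance work")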

    Of course, peril scores alone won’t clear your backlog. An astute reader like you might ask — what about the toil itself? How do we decide which pieces are worth tackling? For that, start by digging into your JIRA backlog. Look for the P2 and P3 postmortem follow-ups, the ones gathering dust quarter after quarter, always deprioritized because the immediate impact wasn’t obvious or the return on investment seemed questionable. After all, how risky could a two-line Terraform change be? Or that canary deployment we never fully automated. Or that brittle CI/CD pipeline no one quite trusts. Those are your starting points. Why? Because we already know — from the last incident, the last outage, the postmortem someone quietly typed up — that those fixes would have made a difference. The only reason they’ve stayed untouched is because no one had a way to measure their value. Peril potential gives you that odometer. It surfaces the invisible, lets you track the risk you’re chipping away at, and turns overlooked toil into clear, measurable progress. A small, steady way to make an outsized impact.

    Invisible work has always been the quiet backbone of our systems, and toil — especially perilwork — is where risk hides in plain sight. We can’t eliminate it entirely, but we can get smarter about when and how we tackle it. A simple, transparent measure like peril potential turns gut instinct into data, giving teams a way to prioritize the small, unglamorous fixes before they turn into costly problems. It offers engineers a way to make their impact visible, to reduce stress, and to chip away at risk in a way that scales. And while no metric is perfect, having even a rough signal is better than navigating blind. Start where you are. Pick a threshold. Surface the neglected tasks. You’ll be surprised how quickly the garden starts to clear.

• Data Contracts — what are they and why should you care?


Is it a schema, is it an API, is it a bird, a plane… I’m getting carried away. Much like the ETL versus ELT debate of 2021, data contracts were the hot topic of 2022. But what are they, really?

    Producer, meet Consumer

A few months ago I wrote about “Bridging the data gap”, which talked about the communication gap between producers and consumers of data. It is a tale as old as time — a frontend engineer configures a click event to be fired from the mobile application. It gets picked up by the Data Platform and stored in different formats. Maybe it goes through several transformations. By the time an analyst decides they want to use it to run some funnel analysis, they have to jump through hoops and walk through fire to figure out basic details about the event:

    1. “What is the schema?”
    2. “What is its freshness? How often is it synced to the analytical database?”
    3. “What kind of quality can I expect? Would there be a lot of duplicates? What about dropped data?”
4. “What is the business context? When is this event fired? Is this fired for all clicks or only when certain conditions are met? What are those conditions?”

In my opinion, a good data contract would codify all these things and present them to consumers so that they don’t have to talk to the data producers to find answers. An API spec for data pipelines, if you will.

In a nutshell, a data contract is a handshake agreement between data producers and consumers. A good contract tells the consumer everything they need to know in order to build a product on top of the data with confidence and clarity. And in case you’re wondering, it is more than just a schema.

    The data contract I want…

Since this is an emerging topic with varied opinions, here is my wishlist of things I’d like to see in a contract, and why…

    Schema

A schema defines the expected format of the data, including the data types. This is the bare minimum, and kind of a requirement anyway if the data is serialized over the wire and needs guidance on how to deserialize it. Avro, protocol buffers and JSON Schema are popular schema definition languages for everything ranging from data objects on the wire to API requests and responses. Relational databases inherently have a schema. Schema registries, like the one offered by Confluent, have been around since 2014. Any good organization will have some kind of schema validation and enforcement at the edges. The only place where it’s still the wild west is in the land of logs and NoSQL databases. But there is an argument to be made that even when this type of unstructured data is converted into an analyzable format, a schema must be defined. For example, a click event schema in protocol buffers might look like this:

syntax = "proto3";

import "google/protobuf/timestamp.proto";

/**
 * This event is fired when a logged-in user clicks the Submit button
 * on the main page. Subsequent clicks are aggregated together
 * and sent as one event.
 */
message ClickEvent {
  // This is a message-level custom option. One can define any kind of
  // custom option for a protocol buffer message (this assumes a
  // `user.event` option has been declared elsewhere).
  option (user.event) = true;

  // event_id
  int64 id = 1;

  // Logged-in user.
  // This is an example of a field-level custom option. It can be used to
  // provide additional information about a field, such as whether it
  // contains personally identifiable information (assumes a custom `pii`
  // option is defined elsewhere).
  User user = 2 [(pii) = true];

  // Time when the button was clicked; comes from the client clock.
  google.protobuf.Timestamp clicked_at = 3;

  // Number of times the logged-in user clicked this button over a
  // 5-second interval.
  int32 number_of_clicks = 4;
}

    Semantics

    Data semantics refer to the meaning or interpretation of data. It should encompass the relationships and associations between data elements and how they relate to real-world concepts or objects. In other words, data semantics is concerned with the context in which data is used and the meaning that can be derived from it. It helps ensure that data is interpreted correctly.

    For example, consider the field number_of_clicks. Does it count all the clicks of the button? Or does it only count clicks by logged in users? Without additional context or information, the data itself is meaningless.

    Semantics help establish a shared vocabulary between different systems and applications.

    Data profile

It would be nice to get a summary or snapshot of the characteristics of a dataset. It should provide an overview of the data, including its structure, content, and quality. For example:

1. What is the column cardinality, i.e., how many unique values does the column have?
2. Number of nulls, zeros, empties, etc.
3. Value distribution — what is the median (p50) or p95 value of this column?
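For illustration, here is a rough sketch of how such a profile could be computed with pandas; the column names and the helper itself are made up for this example.

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # Summarize cardinality, nulls/zeros and value distribution per column.
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "cardinality": s.nunique(),      # number of unique values
            "nulls": int(s.isna().sum()),
            "zeros": int((s == 0).sum()),
            "p50": s.quantile(0.50) if numeric else None,
            "p95": s.quantile(0.95) if numeric else None,
        })
    return pd.DataFrame(rows)

# A made-up click events table, just to show the output shape.
events = pd.DataFrame({"user_id": [1, 2, 2, None], "number_of_clicks": [1, 5, 0, 3]})
print(profile(events))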

Why is this useful? Let’s say I’m building a data product using your dataset. I want to write validations to ensure everything is working as expected. Unless I know what’s coming in, I can’t validate what’s going out. This is a crucial component for ensuring data quality and anomaly detection. Speaking of….

    SLOs/SLAs and Data Quality

Latency (or freshness), availability and consistency are some basic things the consumer of your data may care about when assessing whether it’s fit for their intended use. Let me give you some examples:
1. I’m building an executive dashboard for my CEO so she can look at the number of new customers acquired every month. When she asks me how recent the data is, I want to be able to give a good answer — and for that I need to know how recent the data coming from upstream is.

    2. I’m writing a Flink streaming job that reads from your data-stream, does some windowed aggregations and writes out the output. I want to figure out what my watermarking strategy should be and for that I need to know expected lateness in your stream. A latency distribution or percentile can give me all the information I need to design a robust product myself.

Additionally, data quality checks should measure reality against expectations to quantify the accuracy of the dataset. For example, if your product has 10M unique users but your click events table only has 5M, that’s clearly wrong.
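As a toy sketch of what such a check might look like (the thresholds, field names and helper are assumptions for illustration, not part of any real contract):

# Toy sketch: validate a dataset against the expectations a contract might state.
contract = {
    "max_staleness_minutes": 60,        # freshness promise
    "expected_unique_users": 10_000_000,
    "min_coverage_ratio": 0.95,         # how much of reality the table must reflect
}

def check_quality(staleness_minutes: float, unique_users_in_table: int) -> list[str]:
    violations = []
    if staleness_minutes > contract["max_staleness_minutes"]:
        violations.append(f"data is {staleness_minutes:.0f} minutes stale")
    coverage = unique_users_in_table / contract["expected_unique_users"]
    if coverage < contract["min_coverage_ratio"]:
        violations.append(f"only {coverage:.0%} of expected users are present")
    return violations

# 5M users in the click events table when the product has 10M: clearly wrong.
print(check_quality(staleness_minutes=30, unique_users_in_table=5_000_000))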

    Supported Use

    Or how not to use a data product. This is an uncommon one but one that I feel should definitely be a part of a good data contract. In my time working in data I’ve seen all kinds of bad data consumption patterns. Unless you specify supported usage up front you’ll find yourself supporting weird use cases that’ll suck up your team’s operational bandwidth. Examples of supported use:
    1. “Do not run batch queries on this stream — streaming applications only”

    2. “When running queries on this dataset, filter by time partition otherwise the queries will take a long time to finish”

    3. “Do not run scans on this table, here are some supported query patterns…”.

    Governance

Access control and governance are often handled separately, but in my opinion they should be part of the data contract. Similar to supported use, it’s good for consumers to know what they are allowed to do with the data. Does it contain confidential or sensitive information? How should it be stored, retained, and displayed to end users?
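Pulling the wishlist together, here is a sketch of what a contract could look like as a single object. The field names and values are my own invention, not a standard; think of it as one possible shape, however you end up serializing it.

from dataclasses import dataclass

@dataclass
class DataContract:
    """Hypothetical shape of a data contract covering the sections above."""
    name: str
    schema: str                  # e.g. path to the .proto or Avro definition
    semantics: dict[str, str]    # field -> business meaning
    profile: dict[str, dict]     # field -> cardinality, null %, p50/p95, ...
    slos: dict[str, str]         # freshness, availability, consistency
    supported_use: list[str]
    governance: dict[str, str]   # PII handling, retention, access

click_event_contract = DataContract(
    name="click_event",
    schema="schemas/click_event.proto",
    semantics={"number_of_clicks": "clicks by a logged-in user over a 5-second window"},
    profile={"number_of_clicks": {"p50": 1, "p95": 4, "null_pct": 0.0}},
    slos={"freshness": "synced to the warehouse within 1 hour, 99% of days"},
    supported_use=["streaming reads only", "filter by time partition"],
    governance={"pii": "user field is PII; mask before display", "retention": "90 days"},
)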

Is a data contract the same as a data catalog?

Technically they serve different purposes. While the former is an agreement between data producers and consumers, the latter is a centralized inventory or registry of data assets that provides information about the location, ownership, quality, and usage of data. That being said, a catalog could be the place where contracts are stored. A topic of discussion for another day.

    Parting Thoughts

1. Over the years the schema registry has become a popular way to validate schemas at the edge. Look at the Confluent schema registry, for example — very popular among Kafka consumers.
2. In my opinion, the data contract is the next evolution of the schema registry. It goes beyond schema to encapsulate other critical information about datasets such as usage, SLOs, governance and data quality.
3. The underlying goal is to build a bridge between data producers and consumers.
4. Whether a contract should exist for every hop of a data pipeline or just at the critical edges (e.g., between the mobile application and the data platform) remains to be seen.
5. A good contract should have an accountability mechanism built into it: a continuous way to monitor the aspects of the contract and clear rules for what needs to happen when the contract is violated. Much like service level agreements.

    References

    1. https://dataproducts.substack.com/p/the-rise-of-data-contracts
    2. https://mlops.community/an-engineers-guide-to-data-contracts-pt-1/
    3. https://benn.substack.com/p/data-contracts

  • A catalog system for your thoughts


Last week I spent 20 of my 40 working hours talking to people — in 1:1s, in team meetings, in brainstorming sessions. I also read 8 different tech blogs, 2 sets of open-source documentation, 10 tweet threads, 3 StackOverflow questions, 10 pull requests and 2 design docs. All of this in one week. By the end of the week I may have discussed some 20 different things, prompting 100 different streams of thought. But if you ask me to recall all of it now, sadly, I will not be able to.

This is unsurprising, because our brain does not have infinite storage. Memories are formed when neural pathways are traversed frequently; thinking many disparate thoughts does not have that effect. In his book Deep Work, Cal Newport talks about the importance of focusing without distraction to train those neural pathways and build deep understanding in a particular domain. This means long, focused hours without distraction and context switching. Context switching leads to something called “attention residue”, which happens when people are unable to fully disconnect from a previous task before starting another. Some carryover thinking from the previous task lingers and takes away from the focus that should be given to the current task.

There is an endless array of productivity tools that help you organize your calendar to get more focus time, manage your meetings, set reminders and whatnot. But what has been life-changing for me, both in terms of making me a better writer and a better Staff-level engineer, is a technique called “Zettelkasten”.

    Zettelkasten is German for “slip-box”. It is a system of taking notes and cataloging them such that ideas and thoughts emerge out of it like magic! It turns the process of writing on its head. Most of us approach writing by picking a topic first, researching the topic second and writing third. The Zettelkasten method encourages you to always be in research mode. To make notes about what you read, learn, discuss and cataloging it around topics. You don’t need to research the topic because you already did that during your day to day readings and discussions.

So how do you go about it? It’s pretty simple really (so simple that it is easy to overlook):

• Every time you read a book, a blog, a technical design doc — make short notes! And not only when reading: every time you have a discussion, listen to a podcast, or have a team meeting — make notes. Let’s call these Rough Notes. These can be really short notes, scribbles on a napkin even, something to help you recall the material.
    • At the end of the day, go through your notes and produce a paragraph about your own take on the material. Let’s call this Master Note. This is your personal opinion, takeaway etc. Write this paragraph on a new sheet of paper.
• To this, add a Reference section and link the original material here. If it is a book — add a link to the book. If it is a podcast — a link to the podcast. If it is about a conversation, add context about the meeting — who the participants were, what the topic of discussion was, where it was held. Why is this important? Later, when you combine several master notes into your own written material, you won’t have to hunt for references; they are already inside the master notes, ready for use.
• At the top of your master note add a number. This should be unique to each master note and increase over time. It will be used to reference your note from the catalog system and also provides time-based ordering, which matters because it lets you see how your thoughts and ideas evolve over time. You can choose a timestamp or a human-readable date for this. Let’s call this the Index.
• Now comes the crucial part. After you write your master note, think about the overarching themes or topics it could be related to. Think hashtags. What are some hashtags you would add to it? These are the Boxes in your slip-box. They can be existing boxes or new ones, and they become entries in your catalog system (the slip-box). A master note may belong to several boxes. This is where your index comes in handy. Instead of copying the whole note into each box, you simply make copies of the index card and add one to each box the note belongs to. Keep the master note itself in a separate box from which you can retrieve it by its index. When adding an index card to a box, you can choose a position relative to other cards to impose additional ordering. (If it helps, there is a small sketch of this structure right after this list.)
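Here is that sketch: the slip-box modeled as a tiny data structure, purely for illustration; the paper (or Obsidian/Notion) version works exactly the same way.

from dataclasses import dataclass

@dataclass
class MasterNote:
    index: str             # unique and time-ordered, e.g. "2023-05-29-01"
    text: str              # your own take on the material
    references: list[str]  # links, or context about the meeting

# Boxes (topics/hashtags) hold copies of index cards; the master notes
# themselves live in one place, keyed by index.
boxes: dict[str, list[str]] = {}
master_notes: dict[str, MasterNote] = {}

def file_note(note: MasterNote, topics: list[str]) -> None:
    master_notes[note.index] = note
    for topic in topics:
        boxes.setdefault(topic, []).append(note.index)  # a copy of the index card

note = MasterNote(
    index="2023-05-29-01",
    text="Attention residue is why context switching hurts deep work.",
    references=["Deep Work, Cal Newport"],
)
file_note(note, ["productivity", "writing"])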

    Over time some boxes in the slip-box will become heavier than others. These are the more popular topics. Once a box reaches a critical mass of index cards, it can be potentially turned into a written material. Depending on the amount of material at hand — it could become a blog, a research paper, a design proposal or a book. Your notes are organized, references are already linked and timeline is codified. With some minimal editing effort this can be a good written piece.

    “It could become a blog, a research paper, a design proposal or a book.”

Why I love this

    • Writing frees up your brain for more thoughts: By writing things down as you read, listen or perform a task, you can achieve closure. This reduces the amount of “attention residue” that may seep into your next task as you context switch. At the same time you don’t have to worry about forgetting things, because you already wrote it down.
    • Writing forces you to clarify your thought process: If you have ever written a technical design doc(TDD), you can understand this feeling. As you start writing the TDD you’re forced to think about finer details of the design. It forces you to explore questions you otherwise would not have. Very often it also makes very clear what is doable and what is not. This is why we were tasked with book reports in school. It is an important tool to aid reading comprehension. By writing what we read we solidify our comprehension of the material.
• Never miss useful stuff by always being in “research mode”: When you tackle writing by picking a topic first and researching second, you focus only on things related to the topic while researching. This means you end up missing all the other useful things in the research material that aren’t related to your topic. Besides, picking a topic first leads to preformed opinions, and whatever we read next often serves that opinion. This has the potential to introduce bias, as we tend to find things that support our view. When we read without an agenda and make notes as we read, we treat every tidbit of information with equal importance. We don’t know it yet, but we’re researching multiple topics at the same time. This is very powerful!
• Never lose your notes!: How many times have you taken notes in a notebook never to find them again? I have stacks of notebooks lying under my desk that I haven’t so much as flipped through in years. I may not even comprehend what I wrote after all this time. By cataloging your notes you can not only find them later, but also use them to create useful content.

    “When we read things without an agenda and make notes and catalog them as we read, we treat every tidbit of information with equal importance. We’re researching multiple topics at the same time…we just don’t know it yet”

    The concept of slip-box is great! But how do I use this with digital tools?

Great question! I mean, who buys beautiful Moleskine paper notebooks anymore (wink wink)? All jokes aside, I still use small notebooks, but only for rough notes. This is because I’m faster at scribbling in a paper notebook than typing on my computer (I get distracted when the squiggly lines appear). For my master notes I mainly use two tools:

Obsidian is a neat tool that keeps all your notes locally on your machine. You can use backlinks and forward links to create virtual “boxes” in your slip-box. Obsidian also generates a neat graph where the nodes are topics (boxes in the slip-box). Over time some nodes become bigger — these are your more popular topics.
One big plus for Obsidian is that it’s local-first and open source. You’re not locked into a service or cloud storage. You can also add plugins for more functionality.

    More recently however, I started using Notion for everything from making daycare lists, TODOs, shopping lists etc. Notion has several templates you can use for basically anything. There is even a template for Zettelkasten. When you create new pages using this template it generates a UID for it which is basically like the index card I described above. There is a section where you can add “Tags”. These are the boxes in your slip-box. You can also group these notes by tags to find topics.

    Notion is free for personal use. I like the simple interface and the fact that I can access it from any device very easily.

    That is all folks! As always feedback is welcome. Let me know if you found this useful. Toodles!


  • Site Reliability Engineering – SLA, SLO and SLIs


    I have worked as an engineer for more than a decade and something I’ve seen everyone struggle with is how to manage reliability expectations for their service. This is particularly challenging for data streaming platforms. To that end I am starting a series that I’m calling “Site Reliability Engineering for Streaming Platforms”.

This post is the first in that series. In it I want to provide a refresher on some common terminology used in Site Reliability Engineering that seems to confound a fair number of us.

SLI (Service Level Indicator)

Indicators help measure the reliability of a service. These are metrics that indicate how a service is doing: metrics that your customers care about and that directly impact how they interact with your service. Indicators are usually second-order effects and symptoms of a problem rather than the problem itself. For example, service downtime is a second-order effect of high CPU. Users don’t care about your service’s CPU usage in and of itself. You, as a service owner, may alert on high CPU so you can mitigate the issue, but you would not put it forward as an indicator of how your service is doing.

    Good SLIs are:

    • ✅ Things your users care about
• ✅ Intrinsically connected to your service’s performance. In other words, they are impacted by your service’s performance and hence a reflection of it.

Let us look at some possible metrics and why each may or may not be a good SLI.

    Availability

    • ✅ Users care about the probability with which their requests will be served.
    • ✅ Aspects of the service’s performance impacts its availability so it can be controlled through actions by service owners

    Downtime

• ✅ Directly impacts customers. If the service is down, it’s not serving requests
• ✅ Usually impacted by issues within the service

    Latency

    • ✅ Depending on the type of service, users may have a baseline expectation since this may dictate how long they should wait for a response before retrying or giving up.
• ✅ In most cases it is impacted by the service’s performance. There may be some situations where external factors have an impact. For example: external data sinks unable to handle incoming throughput, leading to back pressure and hence impacting latency (a story for another day)

    CPU

    • 🔴 To users of a service, CPU metrics mean nothing
• 🔴 High CPU is not a downstream reflection of your service’s performance; rather, the reverse is true: high CPU may have an impact on availability or latency. For this reason it is not a good SLI

    Requests/second

    • 🔴 Users typically do not care about the overall requests being made to the service per second
    • 🔴 Is not impacted by a service’s performance. Therefore not a good SLI

    Data Loss

    • 🔴 For stateless services, availability is often used in lieu of data loss. Data loss is relevant to data pipelines. However, users often do not care about the magnitude of data loss. The question users ask is not “how much data loss?” but “is there any data loss?”. This is often codified in terms of “at least once”, “at most once” or “exactly once” delivery guarantees. This is a long topic in and of itself — story for another day.
    • ✅ Data loss is often impacted by service’s performance and hence can be controlled by service owners. In some situations it is reliant on external dependencies in which case a baseline expectation can be set.

Once you have identified your SLIs, plot past performance over a relatively long period of time, say 30 days. Now draw a line through it to mark an acceptable threshold for this SLI. Let’s say your SLI is latency, and you draw a line at the 0.5-second mark. Based on past performance over 30 days, your service handled requests within 0.5 seconds about 90% of the time. So you frame your service’s latency SLO like so: ”Over a period of 30 days, 90% of requests will get a response within 0.5 seconds.”

This is a great start! However, after speaking to your customers you realize that they are OK, nay happy, with a 5-second latency at least 90% of the time. So you bump that threshold line to the 5-second mark and notice that over the last 30 days almost 99% of requests were served under 5 seconds.

This is great! By upping the latency threshold to 5 seconds while keeping the probability at 90%, you’re giving yourself leeway to make mistakes in the future. This leeway lets you keep operational costs low, since your goalpost is a little closer, and gives you a ramp to take risks, since you have some runway to make mistakes. With this adjustment your latency SLO will read: ”Over a period of 30 days, 90% of requests will get a response within 5 seconds”
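A quick back-of-the-envelope sketch of that exercise; the latency samples here are placeholders standing in for whatever your metrics store returns.

# Assume `latencies` holds 30 days of per-request latencies in seconds,
# pulled from your metrics store; the values below are synthetic placeholders.
latencies = [0.2, 0.4, 1.1, 4.8, 0.3, 7.2, 0.9]  # ...and many more

def pct_within(samples: list[float], threshold_s: float) -> float:
    """Share of requests (in %) answered within the threshold."""
    return 100.0 * sum(1 for s in samples if s <= threshold_s) / len(samples)

print(pct_within(latencies, 0.5))  # the original 0.5-second line (~90% in the example)
print(pct_within(latencies, 5.0))  # the relaxed 5-second line (~99% in the example)
# The gap between what you can achieve (99%) and what you promise (90%)
# is the leeway discussed next.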

This leeway is called the Error Budget.

    Error Budget

You can very rarely, in fact never, get 100% availability for your service. Shit happens to distributed systems. You may be able to get 99% availability pretty reliably. Your customers may be happy with 90% availability. As a result, there is no reason to promise 99% availability when 90% is enough. This runway is your error budget.

Error Budget = 100 - SLO

    Let’s say your availability SLO is 90%, then your error budget will be 10%.
    Why is error budget important:

    • Lowers operational cost by making errors less expensive — If achieving your SLO means round the clock monitoring and error handling by your Ops team, you’re spending a lot of resources.
    • More runway for innovation — Adding new features to an existing service often comes with migrations. Downtimes can happen during this time, not to mention human errors. Having a good error budget gives you the room to innovate without worrying about an SLO miss.
    • Past does not dictate the future — This is true for stock markets and also true for your service. Your service traffic may grow in the future. You may acquire new users. Heck your application may go viral! Having more room for errors and downtimes is good in terms of being prepared for future unknowns.

    “Under promise, over deliver”

SLA (Service Level Agreement)

An SLA is basically an SLO with a penalty for missing it. For example, I could tell my customers: “If less than 90% of requests see latency under 5 seconds, I am obligated to give you a discount.” Most of the time you don’t need to specify an SLA unless you are a SaaS provider or charging for your service. SLAs are drawn up by a company’s legal team and codified into the contract that customers and service providers sign. In most cases an SLO is sufficient.

    Burn Rate

You have your SLIs, SLOs and SLAs in place. Congratulations! Now it’s go time. Monitoring key metrics and setting up alerts to detect and mitigate issues is bread and butter for any production system. But that’s not all. You also need to alert on SLO burn rate. But what is burn rate anyway?

    Burn rate is the rate at which a service consumes its error budget.

Let’s say the error budget for your availability SLO is 10%. This means that over a 30-day period your service is allowed to fail at most 10% of requests without missing the SLO. However, it’s the 15th of the month and 8% of your requests have already hit 404. You only have a runway of 2% before you miss the SLO, and you still have 15 days to go. What type of actions can you take to avoid an SLO miss?

    • Pause new features and migrations
    • Re-provision infrastructure components to allow higher traffic — if applicable. For example: adding more server instances.
    • Inform customer if applicable so they’re prepared for an SLO miss
    • Add gate-keeping for new use case. This is important for multi-tenant systems where a customer use case can have wide reaching impact.
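To put some arithmetic behind the example above, here is a small sketch; the helper and its numbers simply restate the scenario (90% SLO, 8% failed by day 15) and are not a standard alerting formula.

def burn_status(slo_pct: float, failed_pct_so_far: float, day: int, window_days: int = 30):
    """How fast is the error budget being consumed, and will it last the window?"""
    error_budget = 100.0 - slo_pct           # e.g. 100 - 90 = 10%
    burn_rate = failed_pct_so_far / day      # budget consumed per day so far
    projected = burn_rate * window_days      # total consumption if the rate holds
    remaining = error_budget - failed_pct_so_far
    return burn_rate, remaining, projected > error_budget

# 15th of the month, 8% of requests have already failed against a 90% SLO.
rate, runway, will_miss = burn_status(slo_pct=90.0, failed_pct_so_far=8.0, day=15)
print(f"Burning {rate:.2f}% per day, {runway:.1f}% budget left, miss projected: {will_miss}")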

    That is all folks! Let me know if you found this useful. And if you would like me to drill down into any specific topic.
