Is it a schema, is it an API, is it a bird, a plane… I’m getting carried away. Much like the ETL versus ELT debate of 2021, the data contract was the hot topic of 2022. But what is it, really?
Producer, meet Consumer

A few months ago I wrote “Bridging the data gap”, which talked about the communication gap between producers and consumers of data. It is a tale as old as time — a frontend engineer configures a click event to be fired from the mobile application. It gets picked up by the Data Platform and stored in different formats. Maybe it goes through several transformations. By the time an analyst decides they want to use it to run some funnel analysis, they have to jump through hoops and walk through fire to figure out basic details about the event:
- “What is the schema?”
- “What is its freshness? How often is it synced to the analytical database?”
- “What kind of quality can I expect? Would there be a lot of duplicates? What about dropped data?”
- “What is the business context? When is this event fired? Is this fired for all clicks or only when certain conditions are met? What are those conditions?”
In my opinion, a good data contract would codify all of these things and present them to consumers so that they don’t have to talk to the data producers to find answers. An API spec for data pipelines, if you will.

“In a nutshell, a data contract is a handshake agreement between data producers and consumers. A good contract tells the consumer everything they need to know in order to build a product on top of this data with confidence and clarity. And in case you’re wondering, it is more than just a schema.”
The data contract I want…
Since this is an emerging topic with varied opinions, here is my wishlist of things I’d like to see in a contract, and why…
Schema
A schema defines the expected format of the data, including the data types. This is the bare minimum, and kind of a requirement anyway if the data is serialized over the wire and needs guidance on how to deserialize it. JSON Schema, Avro, and protocol buffers are popular schema definition languages for everything from data objects on the wire to API requests and responses. Relational databases inherently have a schema. Schema registries, like the one offered by Confluent, have been around since 2014. Any good organization will have some kind of schema validation and enforcement at the consumer edges. The only place where it’s still kind of a wild wild west is in the land of logs and NoSQL DBs. But there is an argument to be made that even when this type of unstructured data is converted to an analyzable format, a schema must be defined.
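For example, here is what a contract-friendly schema might look like as a protocol buffer definition, with business context captured in comments and additional metadata in custom options: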
// Note: assumes the custom options (user.event) and (pii) are defined
// elsewhere as extensions of MessageOptions/FieldOptions, and that the
// User message is defined in another imported file.
syntax = "proto3";

import "google/protobuf/timestamp.proto";

/**
 * This event is fired when a logged-in user clicks the Submit button
 * on the main page. Subsequent clicks are aggregated together
 * and sent as one event.
 */
message ClickEvent {
  // This is a message-level custom option. One can define
  // any kind of option for a protocol buffer message.
  option (user.event) = true;

  // event_id
  int64 id = 1;

  // Logged-in user.
  // This is an example of a field-level custom option. It can be used to
  // provide additional information about a field, like whether it contains
  // personally identifiable information or not.
  User user = 2 [(pii) = true];

  // Time when the button was clicked; comes from the client clock.
  google.protobuf.Timestamp clicked_at = 3;

  // Number of times the logged-in user clicked this button over a
  // 5-second interval.
  int32 number_of_clicks = 4;
}
Semantics
Data semantics refers to the meaning or interpretation of data: the relationships and associations between data elements, and how they relate to real-world concepts or objects. In other words, data semantics is concerned with the context in which data is used and the meaning that can be derived from it. It helps ensure that data is interpreted correctly.
For example, consider the field number_of_clicks. Does it count all the clicks of the button? Or does it only count clicks by logged in users? Without additional context or information, the data itself is meaningless.
Semantics help establish a shared vocabulary between different systems and applications.
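One lightweight way to codify semantics, sketched below in Python with hypothetical definitions, is a glossary that travels with the dataset instead of living in the producer's head:

# Hypothetical glossary entries. The point is that each definition is
# written down once and shared by producer and consumer alike.
FIELD_SEMANTICS = {
    "number_of_clicks": (
        "Count of Submit-button clicks by a single logged-in user, "
        "aggregated over a 5-second window. Anonymous clicks are excluded."
    ),
    "clicked_at": "Client-side timestamp of the first click in the window.",
}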
Data profile
It would be nice to get a summary or snapshot of the characteristics of a dataset. It should provide an overview of the data, including its structure, content, and quality. For example:
- What is the column cardinality, i.e., how many unique values does the column have?
- Number of nulls, zeros, empties, etc.
- Value distribution — what is the median (p50) or p95 value of this column?
Why is this useful? Let’s say I’m building a data product using your dataset. I want to write validations to ensure everything is working as expected. Unless I know what’s coming in, I can’t validate what’s going out. This is a crucial component for ensuring data quality and anomaly detection. Speaking of….
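To make that concrete, here is a minimal profiling sketch in Python using pandas. The dataset and column names are made up for illustration:

import pandas as pd

# Hypothetical sample of the click events dataset discussed above.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, None],
    "number_of_clicks": [1, 4, 0, 2, 7],
})

profile = {
    "row_count": len(df),
    # Column cardinality: how many unique values each column has.
    "cardinality": df.nunique().to_dict(),
    # Nulls and zeros per column.
    "null_counts": df.isna().sum().to_dict(),
    "zero_counts": (df == 0).sum().to_dict(),
    # Value distribution: median (p50) and p95 of a numeric column.
    "clicks_p50": df["number_of_clicks"].median(),
    "clicks_p95": df["number_of_clicks"].quantile(0.95),
}
print(profile)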
SLOs/SLAs and Data Quality
Latency (or freshness), availability, and consistency are some basic things the consumers of your data may care about when assessing whether it’s fit for their intended use. Let me give you some examples:
1. I’m building an executive dashboard for my CEO so she can look at the number of new customers acquired every month. When she asks me how recent the data is, I want to be able to give a good answer — and for that I need to know how recent the data coming from upstream is.
2. I’m writing a Flink streaming job that reads from your data stream, does some windowed aggregations, and writes out the output. I want to figure out what my watermarking strategy should be, and for that I need to know the expected lateness in your stream. A latency distribution or percentile can give me all the information I need to design a robust product myself.
Additionally, data quality checks should measure reality against expectations to quantify the accuracy of the dataset. For example, if your product has 10M unique users but your click events table only has 5M — that’s clearly wrong.
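As a rough sketch, here is what monitoring those two expectations could look like in Python; the SLO values and thresholds are hypothetical:

from datetime import datetime, timedelta, timezone

# Hypothetical contract expectations for the click events dataset.
FRESHNESS_SLO = timedelta(minutes=15)   # target end-to-end lag
EXPECTED_UNIQUE_USERS = 10_000_000      # known product-wide unique users
TOLERANCE = 0.05                        # allow 5% drift before alerting

def check_freshness(latest_event_time: datetime) -> bool:
    """True if the newest record is within the freshness SLO."""
    lag = datetime.now(timezone.utc) - latest_event_time
    return lag <= FRESHNESS_SLO

def check_user_count(observed_unique_users: int) -> bool:
    """Measure reality against expectations: a 10M-user product whose
    click table only shows 5M unique users is clearly wrong."""
    drift = abs(observed_unique_users - EXPECTED_UNIQUE_USERS)
    return drift / EXPECTED_UNIQUE_USERS <= TOLERANCE

print(check_user_count(5_000_000))   # False: half the users are missing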
Supported Use

Or how not to use a data product. This is an uncommon one, but one that I feel should definitely be part of a good data contract. In my time working in data I’ve seen all kinds of bad data consumption patterns. Unless you specify supported usage up front, you’ll find yourself supporting weird use cases that’ll suck up your team’s operational bandwidth. Examples of supported use (with a small validation sketch after the list):
1. “Do not run batch queries on this stream — streaming applications only”
2. “When running queries on this dataset, filter by time partition otherwise the queries will take a long time to finish”
3. “Do not run scans on this table, here are some supported query patterns…”.
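A supported-use rule can even be checked mechanically. Below is a deliberately naive sketch of a validator for rule #2, assuming a hypothetical time partition column named ds:

def violates_partition_filter_rule(sql: str) -> bool:
    """Naive check for rule #2: queries against this dataset should
    filter by the time partition column (assumed to be 'ds' here)."""
    normalized = sql.lower()
    return "where" not in normalized or "ds" not in normalized

# An unfiltered scan violates the rule; a partition-filtered query does not.
assert violates_partition_filter_rule("SELECT * FROM click_events")
assert not violates_partition_filter_rule(
    "SELECT count(*) FROM click_events WHERE ds = '2023-01-01'")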
Governance
Access control and governance are often handled separately, but in my opinion they should be part of the data contract. Similar to supported use, it’s good for consumers to know what they are allowed to do with the data. Does it contain confidential or sensitive information? How should it be stored, retained, and displayed to end users?
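Pulling the sections above together, one could imagine the contract declared in a machine-readable form. The shape below is purely hypothetical (there is no established standard for it), but it shows how governance and supported use could sit alongside schema, semantics, and SLOs:

# A hypothetical, declarative shape for a data contract. None of these
# field names come from an existing standard; they simply mirror the
# sections discussed in this post.
click_event_contract = {
    "schema": "ClickEvent",  # points at the protobuf definition above
    "semantics": {
        "number_of_clicks": "Submit-button clicks by a logged-in user, "
                            "aggregated over a 5-second window",
    },
    "slos": {"freshness_p95": "15m", "availability": "99.9%"},
    "supported_use": [
        "Streaming applications only; no batch queries",
        "Always filter by the time partition",
    ],
    "governance": {
        "contains_pii": True,      # the user field is tagged pii=true
        "retention": "90 days",    # hypothetical retention policy
        "access": "restricted",    # who may read this data
    },
}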
Is a data contract the same as a data catalog?
Technically, they serve different purposes. While the former is an agreement between data producers and consumers, the latter is a centralized inventory or registry of data assets that provides information about the location, ownership, quality, and usage of data. That being said, a catalog could be the place where contracts are stored? A topic of discussion for another day…
Parting Thoughts
- Over the years, the schema registry has become a popular way to validate schemas at the edge. Look at the Confluent Schema Registry, for example — very popular among Kafka users.
- In my opinion, the data contract is the next evolution of the schema registry. It goes beyond schema to encapsulate other critical information about datasets, such as usage, SLOs, governance, and data quality.
- The underlying goal is to build a bridge between data producers and consumers.
- Whether a contract should exist for every hop of a data pipeline or just at the critical edges (e.g., the edge between the mobile application and the data platform) remains to be seen.
- A good contract should have an accountability mechanism built into it: a continuous way to monitor the aspects of the contract, and clear rules for what needs to happen when the contract is violated. Much like service level agreements.

