Data Contracts for Data Engineers: Getting Started in 2026
Learn what data contracts are, why they matter in 2026, and how to implement them in your pipelines with YAML specs and PySpark validation. Includes a complete contract template, enforcement code, CI/CD testing patterns, and a tool comparison — no expensive tooling required.
Your dashboard broke again. Not because your pipeline failed — it ran perfectly. The problem was upstream: someone renamed a column in the source system, and your pipeline happily ingested the new schema without anyone knowing.
This is exactly the problem data contracts solve.
Data contracts are the hottest topic in data engineering right now, and for good reason. They shift quality enforcement left — to the producers — instead of leaving consumers to discover broken data after the damage is done.
This guide covers what data contracts actually are, why they matter, and how to start implementing them in your pipelines today — no expensive tooling required.
What Is a Data Contract?
A data contract is a formal agreement between a data producer and its consumers about what a dataset promises. It defines:
- Schema — column names, types, nullability
- Freshness — how often the data is updated, and the SLA for staleness
- Volume — expected row counts or ranges
- Semantics — what the data actually means (business definitions)
- Quality rules — validation constraints (ranges, enums, uniqueness)
- Ownership — who is responsible when something breaks
Think of it like an API contract. When you build a REST API, you define the request/response schema, error codes, and rate limits. Consumers build against that contract. If you change it, you version it and communicate the change.
Data contracts apply the same principle to datasets.
Why Data Contracts Matter Now
Three trends are making contracts essential in 2026:
1. Pipelines are more complex than ever. A typical data platform has dozens of tables flowing through bronze → silver → gold layers. One schema change at the source can cascade failures through every downstream table, dashboard, and ML model.
2. Data quality failures are getting more expensive. With companies embedding data into AI models and automated decision-making, bad data doesn't just produce a wrong chart — it produces wrong decisions at scale.
3. The "shift left" movement. The traditional approach — consumers running quality checks after ingestion — catches problems too late. By the time you detect a null spike in your gold table, the bad data has already propagated through your entire pipeline. Contracts enforce quality at the source, before bad data ever enters your system.
Anatomy of a Data Contract
Here's what a practical data contract looks like. No special tooling needed — this is a YAML file you keep in your repo:
```yaml
# contracts/orders.yaml
contract:
  name: orders
  version: 2.1
  owner: data-eng-team
  description: "Customer orders from the transactional database"
  schema:
    - name: order_id
      type: BIGINT
      nullable: false
      unique: true
      description: "Primary key, auto-incremented"
    - name: customer_id
      type: BIGINT
      nullable: false
      description: "FK to customers table"
    - name: order_date
      type: DATE
      nullable: false
      valid_range: ["2020-01-01", "today"]
    - name: amount
      type: DECIMAL(10,2)
      nullable: false
      valid_range: [0.01, 999999.99]
    - name: status
      type: STRING
      nullable: false
      allowed_values: ["pending", "shipped", "delivered", "cancelled"]
    - name: updated_at
      type: TIMESTAMP
      nullable: false
  sla:
    freshness: 4h
    volume_min: 500
    volume_max: 50000
  contacts:
    slack_channel: "#data-eng-alerts"
    on_call: "data-eng-team@company.com"
```
The key insight: this contract is both human-readable documentation AND machine-enforceable validation. The same file serves as your data dictionary and your automated quality gate.
📄 Enforce quality beyond contracts
Contracts define what data should look like. The Data Quality Monitoring Playbook gives you the full enforcement framework — SQL checks, PySpark validation, alerting, and threshold tuning. $4.99
Enforcing Contracts in Your Pipeline
A contract is worthless if nobody checks it. Here's how to enforce contracts at different stages:
Pre-write validation (producer side)
This is where contracts have the most impact. The producer validates data against the contract BEFORE writing to the target table:
```python
import yaml
from pyspark.sql import functions as F


def load_contract(path):
    with open(path, 'r') as f:
        return yaml.safe_load(f)['contract']


def validate_contract(df, contract):
    """Validate a DataFrame against a data contract. Returns a list of violations."""
    violations = []
    total = df.count()

    # Schema check: verify every column the contract requires exists in the DataFrame
    # (type strings are captured here if you want to add type comparison later)
    expected_cols = {col['name']: col for col in contract['schema']}
    actual_cols = {f.name: str(f.dataType) for f in df.schema.fields}

    for col_name, col_spec in expected_cols.items():
        if col_name not in actual_cols:
            violations.append(f"SCHEMA: Missing column '{col_name}'")
            continue

        # Nullability check
        if not col_spec.get('nullable', True):
            null_count = df.filter(F.col(col_name).isNull()).count()
            if null_count > 0:
                violations.append(
                    f"NULL: '{col_name}' has {null_count} nulls "
                    f"({round(null_count / total * 100, 2)}%) but nullable=false"
                )

        # Uniqueness check
        if col_spec.get('unique', False):
            dup_count = total - df.select(col_name).distinct().count()
            if dup_count > 0:
                violations.append(
                    f"UNIQUE: '{col_name}' has {dup_count} duplicate values"
                )

        # Allowed values check
        if 'allowed_values' in col_spec:
            invalid = df.filter(
                ~F.col(col_name).isin(col_spec['allowed_values'])
            ).count()
            if invalid > 0:
                violations.append(
                    f"ENUM: '{col_name}' has {invalid} values "
                    f"outside {col_spec['allowed_values']}"
                )

    # Volume check
    sla = contract.get('sla', {})
    if 'volume_min' in sla and total < sla['volume_min']:
        violations.append(f"VOLUME: {total} rows below minimum {sla['volume_min']}")
    if 'volume_max' in sla and total > sla['volume_max']:
        violations.append(f"VOLUME: {total} rows above maximum {sla['volume_max']}")

    return violations
```
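The function above covers schema, nulls, uniqueness, enums, and volume, but not the contract's freshness SLA. A minimal sketch of that check, assuming the `freshness` value uses a short-duration format like `4h` or `30m` (as in the example contract); in PySpark you would feed it the result of `F.max('updated_at')`:

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_ts, freshness_sla):
    """Return a violation string if latest_ts is older than the SLA, else None.

    freshness_sla is assumed to be a string like "4h", "30m", or "1d".
    latest_ts must be a timezone-aware datetime.
    """
    units = {"m": "minutes", "h": "hours", "d": "days"}
    amount, unit = int(freshness_sla[:-1]), freshness_sla[-1]
    max_age = timedelta(**{units[unit]: amount})

    age = datetime.now(timezone.utc) - latest_ts
    if age > max_age:
        hours_stale = round(age.total_seconds() / 3600, 1)
        return f"FRESHNESS: latest record is {hours_stale}h old, SLA is {freshness_sla}"
    return None
```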
Using it in your pipeline
```python
contract = load_contract('/repos/contracts/orders.yaml')
violations = validate_contract(transformed_df, contract)

if violations:
    print("CONTRACT VIOLATIONS:")
    for v in violations:
        print(f"  ❌ {v}")

    # Option A: Fail the pipeline (strict mode)
    raise Exception(f"Data contract violated: {len(violations)} issues found")

    # Option B: Log and alert but continue (warn mode)
    # send_slack_alert(violations)
    # log_to_audit_table(violations)
else:
    print("✅ Contract validated — writing to target")
    transformed_df.write.format('delta') \
        .mode('overwrite') \
        .saveAsTable('catalog.silver.orders')
```
This is the same pattern I cover in more detail in my data quality checks guide, but contracts formalize it into a declarative specification rather than ad-hoc checks scattered across notebooks.
Contract Enforcement with SQL
If your team is more SQL-heavy, you can enforce contracts directly in Databricks SQL:
```sql
-- Schema validation: check for unexpected nulls
SELECT
  SUM(CASE WHEN order_id    IS NULL THEN 1 ELSE 0 END) AS null_order_ids,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
  SUM(CASE WHEN amount      IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  SUM(CASE WHEN status      IS NULL THEN 1 ELSE 0 END) AS null_statuses
FROM staging.orders;
-- All should be 0 per contract

-- Enum validation
SELECT DISTINCT status
FROM staging.orders
WHERE status NOT IN ('pending', 'shipped', 'delivered', 'cancelled');
-- Should return 0 rows

-- Volume validation
SELECT COUNT(*) AS row_count
FROM staging.orders
WHERE DATE(updated_at) = CURRENT_DATE();
-- Should be between 500 and 50000 per contract

-- Freshness validation
SELECT
  MAX(updated_at) AS latest_record,
  TIMESTAMPDIFF(HOUR, MAX(updated_at), CURRENT_TIMESTAMP()) AS hours_stale
FROM catalog.silver.orders;
-- hours_stale should be < 4 per contract
```
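For alerting, it helps to roll the checks into a single query that returns one row per violated rule, so zero rows means the contract passed. A sketch of the pattern against the same staging table (`HAVING` without `GROUP BY` filters the single aggregate row, which Databricks SQL supports):

```sql
-- One row per violated check; zero rows means the contract passed
SELECT 'null order_id' AS violation, COUNT(*) AS offending_rows
FROM staging.orders
WHERE order_id IS NULL
HAVING COUNT(*) > 0
UNION ALL
SELECT 'invalid status', COUNT(*)
FROM staging.orders
WHERE status NOT IN ('pending', 'shipped', 'delivered', 'cancelled')
HAVING COUNT(*) > 0
```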
Schema Evolution and Contracts
One question that comes up immediately: how do contracts work with schema evolution in Delta Lake?
The answer: contracts should version alongside schema changes.
```yaml
# When a column is added:
contract:
  name: orders
  version: 2.2  # bumped from 2.1
  changelog:
    - version: 2.2
      date: "2026-03-15"
      changes: "Added 'shipping_method' column (STRING, nullable)"
  schema:
    # ... existing columns ...
    - name: shipping_method
      type: STRING
      nullable: true  # nullable because backfill hasn't run yet
      allowed_values: ["standard", "express", "overnight"]
      added_in: 2.2
```
The workflow:
- Producer wants to add a column
- Producer updates the contract YAML with the new column and bumps the version
- Contract change goes through code review (same as an API change)
- Consumers are notified via the changelog
- Producer deploys the schema change with `mergeSchema = true`
- Existing rows get null for the new column (which is fine because `nullable: true`)
This prevents the scenario that opened this article — unexpected schema changes breaking downstream consumers.
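The version-bump convention in step 2 is easy to automate. A minimal sketch (the function name and the major/minor rule are assumptions, mirroring semantic versioning: additive changes bump minor, breaking changes bump major):

```python
def bump_contract_version(version, change="minor"):
    """Bump a 'major.minor' contract version string.

    Additive changes (e.g. a new nullable column) bump the minor version;
    breaking changes (rename, type change, dropped column) bump the major
    version and reset minor to 0.
    """
    major, minor = (int(part) for part in str(version).split("."))
    if change == "major":
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"
```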
Contract Testing in CI/CD
For teams with mature pipelines, you can run contract tests automatically:
```python
# tests/test_orders_contract.py
import pytest
from contracts.validator import load_contract, validate_contract


def test_orders_contract():
    """Run against a sample dataset in CI."""
    contract = load_contract('contracts/orders.yaml')

    # Load a test fixture or sample from the actual table
    sample_df = spark.table('catalog.silver.orders').limit(10000)

    violations = validate_contract(sample_df, contract)
    assert len(violations) == 0, \
        "Contract violations found:\n" + "\n".join(violations)


def test_contract_schema_matches_table():
    """Verify the contract's schema matches the actual Delta table."""
    contract = load_contract('contracts/orders.yaml')
    table_schema = spark.table('catalog.silver.orders').schema

    contract_cols = {col['name'] for col in contract['schema']}
    table_cols = {f.name for f in table_schema.fields}

    missing_in_table = contract_cols - table_cols
    assert not missing_in_table, \
        f"Columns in contract but not in table: {missing_in_table}"
    # Extra columns in the table are OK (it may carry internal columns
    # like _rescued_data that the contract doesn't cover)
```
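To run these tests on every contract change, trigger a CI job whenever the contracts directory is touched. A hedged sketch for GitHub Actions (the workflow name, paths, and dependency list are assumptions; adapt to your repo layout and CI system):

```yaml
# .github/workflows/contract-tests.yml
name: contract-tests
on:
  pull_request:
    paths:
      - "contracts/**"
      - "tests/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest pyyaml pyspark
      - run: pytest tests/ -v
```

Because the contract change and the test run land in the same pull request, reviewers see exactly which constraints a schema change would break before it merges.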
Tools for Data Contracts
You don't need tools to start — the YAML + validation approach above works today. But as you scale, these tools can help:
| Tool | Approach | Best for |
|---|---|---|
| YAML + custom validation (this guide) | DIY | Teams starting out, full control |
| Soda Core | YAML-based checks, open source | Easy setup, Slack integration, growing teams |
| Great Expectations | Test-driven validation suites | Complex rules, CI/CD integration |
| Delta Live Tables expectations | Built-in Delta constraints | Native Databricks users |
| datacontract CLI | Open source contract spec | Standardized contract format across teams |
For most teams, starting with YAML contracts in your repo and the validation function above is the right move. You can always migrate to a tool later — the contract definitions transfer.
Data contracts touch data governance, architecture, and team processes. If you want the broader context on data mesh, governance frameworks, and organizational design for data teams, Pluralsight has courses that connect the engineering practices to the organizational strategy.
How to Start: 3 Steps
Step 1: Pick your most critical table. The one that breaks dashboards when it has bad data. Write a contract for it using the YAML template above.
Step 2: Add validation to the pipeline. Drop the validate_contract() function into the notebook that writes to that table. Start in warn mode (log + alert but don't fail).
Step 3: Shift to strict mode after 2 weeks. Once you've tuned the thresholds and fixed false positives, flip to strict mode where contract violations fail the pipeline.
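The warn-to-strict transition in steps 2 and 3 is a one-line mode switch if you put the gate behind a small wrapper. A minimal sketch (the `enforce` name and the print-based reporting are assumptions; swap in your alerting):

```python
def enforce(violations, mode="warn"):
    """Gate a pipeline write on contract violations.

    mode="warn"   -> report violations but let the write proceed
    mode="strict" -> raise so the pipeline fails before writing
    Returns True when the write should proceed cleanly.
    """
    if not violations:
        return True
    for v in violations:
        print(f"contract violation: {v}")  # swap for Slack/audit logging
    if mode == "strict":
        raise RuntimeError(f"Data contract violated: {len(violations)} issues")
    return False  # warn mode: write proceeds, but the violations are on record
```

Start every table at `mode="warn"`, and flip the argument to `"strict"` once the thresholds have settled.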
That's it. You don't need to contract every table on day one. Start with one, prove the value, then expand.
Common Gotchas
1. Don't over-specify. A contract that checks 50 things will produce constant false positives and get ignored. Start with the 5 most important constraints: primary key uniqueness, critical column nullability, freshness SLA, volume bounds, and one or two enum validations.
2. Contracts need owners. Every contract should have a clear owner (team or person) who is responsible for updating it when the schema changes and investigating violations. Without ownership, contracts become stale documentation.
3. Version your contracts. Schema changes should bump the contract version. This creates an audit trail and lets consumers pin to a specific version if they need stability.
4. Warn first, fail later. Running contracts in strict mode from day one will break everything. Start in warn mode, tune thresholds for 1-2 weeks, then graduate to strict.
5. Contracts don't replace monitoring. Contracts catch known issues (schema changes, null spikes, enum violations). You still need anomaly detection for unknown issues (distribution drift, sudden volume drops). Use both.
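For the "unknown issues" side, even a crude statistical check catches volume anomalies that fixed contract bounds miss. A minimal sketch, assuming you log daily row counts somewhere you can query (function name and the 3-sigma threshold are illustrative):

```python
from statistics import mean, stdev


def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it deviates from recent history.

    history: list of recent daily row counts (e.g. the last 14 days).
    Returns True when today's count is more than `threshold` standard
    deviations away from the historical mean.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is anomalous
    return abs(today - mu) / sigma > threshold
```

Unlike the contract's static `volume_min`/`volume_max`, this adapts as the table grows, so the two checks complement each other.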
Get the Playbook
Want the complete data quality monitoring framework — including the PySpark validation function, SQL check templates, alerting patterns, and threshold recommendations? Grab the Data Quality Monitoring Playbook on Gumroad for $4.99.
Also check out the rest of the PipelinePulse quick references:
- Delta Table Troubleshooting Checklist ($9)
- Databricks SQL Cheat Sheet ($4.99)
- PySpark Null Handling Cheat Sheet ($4.99)
- PySpark Window Functions Cheat Sheet ($4.99)
- Schema Evolution Quick Reference ($4.99)
Building data pipelines that don't break at 3am? That's what PipelinePulse is about. More guides at pipelinepulse.dev.