SCD Type 2 Implementation in Databricks [Step-by-Step Guide]

How to implement SCD Type 2 in Databricks — change detection with null-safe comparisons, MERGE expiration, version inserts, and a complete reusable PySpark function.


Slowly Changing Dimensions (SCD) Type 2 is how you track the full history of changes in a dimension table. When a customer changes their address, when a product's price updates, when an employee switches departments — SCD Type 2 keeps every version with start and end dates so your historical reports stay accurate.

It's also one of the trickiest patterns to implement correctly in Databricks. Get the MERGE logic wrong and you'll end up with overlapping date ranges, orphaned records, or a table that grows uncontrollably. I've implemented SCD Type 2 on multiple production tables, and in this guide I'll walk you through the exact approach that works — with both SQL and PySpark.


What SCD Type 2 does

In a Type 2 slowly changing dimension, every change creates a new row instead of overwriting the existing one. Each row has:

  • effective_start_date — when this version became active
  • effective_end_date — when this version was superseded (null if current)
  • is_current — flag indicating the active version

Here's what the table looks like after a customer changes their city:

customer_id | name  | city      | effective_start | effective_end | is_current
1001        | alice | new york  | 2025-01-15      | 2026-02-20    | false
1001        | alice | london    | 2026-02-20      | null          | true

Both rows exist. Historical queries can filter on date ranges. Current-state queries filter on is_current = true.
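The mechanics are easy to see in miniature. Here is a plain-Python sketch (no Spark required; the field names mirror the table above) of versioned rows plus the two query styles:

```python
from datetime import date

# Versioned rows for customer 1001, mirroring the table above
rows = [
    {"customer_id": "1001", "name": "alice", "city": "new york",
     "effective_start": date(2025, 1, 15),
     "effective_end": date(2026, 2, 20), "is_current": False},
    {"customer_id": "1001", "name": "alice", "city": "london",
     "effective_start": date(2026, 2, 20),
     "effective_end": None, "is_current": True},
]

def as_of(rows, when):
    """Point-in-time lookup: the version active on a given date."""
    return [r for r in rows
            if r["effective_start"] <= when
            and (r["effective_end"] is None or r["effective_end"] > when)]

current = [r for r in rows if r["is_current"]]   # current-state query
historical = as_of(rows, date(2025, 6, 15))      # mid-2025 snapshot
```

Note the half-open interval: a version is active from its start date up to, but not including, its end date, so the handover day belongs to exactly one version.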


Step 1: Set up the target table

Create your dimension table with the SCD Type 2 columns:

CREATE TABLE IF NOT EXISTS dim_customers (
  surrogate_key BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id STRING,
  name STRING,
  email STRING,
  city STRING,
  tier STRING,
  effective_start_date DATE,
  effective_end_date DATE,
  is_current BOOLEAN
)
USING DELTA

The surrogate_key auto-increments and gives each version a unique identifier. The customer_id is the natural business key — it repeats across versions.


Step 2: Identify what changed

Before running the MERGE, you need to know which source records actually changed. Comparing every tracked column up front means you only create new versions for rows whose data really differs:

CREATE OR REPLACE TEMP VIEW changes AS
SELECT
  s.customer_id,
  s.name,
  s.email,
  s.city,
  s.tier
FROM staging_customers s
LEFT JOIN dim_customers t
  ON s.customer_id = t.customer_id
  AND t.is_current = true
WHERE t.customer_id IS NULL  -- new customers
   OR s.name != t.name
   OR s.email != t.email
   OR s.city != t.city
   OR s.tier != t.tier

This gives you only records that are either new or actually changed. Be careful with nulls here: a plain != comparison returns null (not true) when either side is null, so a change from a value to null, or vice versa, slips through undetected. If any tracked column can be null, use null-safe comparisons:

   OR NOT (s.email <=> t.email)
   OR NOT (s.city <=> t.city)

The <=> operator treats two nulls as equal and a null versus a value as different, so NOT (s.email <=> t.email) flags every real change without inventing one when both sides are null. If you're not familiar with null-safe operators in Spark, check out my complete guide to handling null values in Spark DataFrames.
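A quick way to internalize the difference is to mock SQL's three-valued logic in plain Python (None stands in for NULL; the function names are just for illustration):

```python
def sql_neq(a, b):
    """Mock of SQL's '!=': any NULL operand yields NULL (unknown)."""
    if a is None or b is None:
        return None  # "unknown" -- a WHERE clause treats this as false
    return a != b

def changed_null_safe(a, b):
    """Mock of NOT (a <=> b): two NULLs are equal, NULL vs value differs."""
    return a != b  # Python already treats None == None as True

# value -> NULL: plain != never fires, so the change is silently missed
print(sql_neq("a@b.com", None))            # None -> row filtered out
print(changed_null_safe("a@b.com", None))  # True -> change detected

# both NULL: neither operator flags a spurious change
print(sql_neq(None, None))                 # None
print(changed_null_safe(None, None))       # False
```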


Step 3: The MERGE statement

This is the core of SCD Type 2. The MERGE does two things in one atomic operation:

  1. Expire the old version — set effective_end_date and is_current = false
  2. Insert the new version — with effective_start_date = today and is_current = true

MERGE INTO dim_customers AS target
USING changes AS source
ON target.customer_id = source.customer_id
   AND target.is_current = true

-- Expire the current version
WHEN MATCHED THEN
  UPDATE SET
    target.effective_end_date = current_date(),
    target.is_current = false

-- Insert new customers (no existing record)
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, city, tier,
          effective_start_date, effective_end_date, is_current)
  VALUES (source.customer_id, source.name, source.email, source.city, source.tier,
          current_date(), null, true)

Wait: for changed customers, this MERGE only expires the old row. The WHEN NOT MATCHED branch fires only for brand-new keys, because a changed customer still matches its existing current row. So where does the changed customer's new version come from? After the MERGE, insert it explicitly:

INSERT INTO dim_customers
  (customer_id, name, email, city, tier,
   effective_start_date, effective_end_date, is_current)
SELECT
  customer_id, name, email, city, tier,
  current_date(), null, true
FROM changes
WHERE customer_id IN (
  SELECT customer_id FROM dim_customers
  WHERE is_current = false
    AND effective_end_date = current_date()
)
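To make the two-phase flow concrete, here is the same expire-then-insert cycle sketched over in-memory dicts in plain Python (the helper name and fields are illustrative, not part of the Databricks API):

```python
from datetime import date

def apply_scd2(dim, source, key, tracked, today):
    """Expire changed current rows, then append new versions."""
    current = {r[key]: r for r in dim if r["is_current"]}
    for s in source:
        cur = current.get(s[key])
        if cur and all(cur[c] == s[c] for c in tracked):
            continue                      # unchanged: re-runs are no-ops
        if cur:                           # phase 1: expire the old version
            cur["effective_end"] = today
            cur["is_current"] = False
        dim.append({key: s[key],          # phase 2: insert the new version
                    **{c: s[c] for c in tracked},
                    "effective_start": today, "effective_end": None,
                    "is_current": True})

dim = [{"customer_id": "1001", "city": "new york",
        "effective_start": date(2025, 1, 15),
        "effective_end": None, "is_current": True}]
apply_scd2(dim, [{"customer_id": "1001", "city": "london"}],
           "customer_id", ["city"], date(2026, 2, 20))
```

Running it twice with the same source leaves the table untouched, which is exactly the idempotency property the pitfalls section comes back to.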

📄 MERGE syntax at your fingertips

SCD2 relies heavily on MERGE. Keep the Databricks SQL Cheat Sheet next to your notebook — MERGE, window functions, CTEs, and 50+ other patterns. $4.99

The complete PySpark implementation

Here's the full implementation wrapped in a clean function:

from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_date, lit

def scd_type2_merge(spark, source_df, target_table, business_key, tracked_columns):
    """
    Perform SCD Type 2 merge on a Delta table.
    
    Args:
        spark: SparkSession
        source_df: DataFrame with new/updated records
        target_table: fully qualified Delta table name
        business_key: column name for the natural key (e.g. 'customer_id')
        tracked_columns: list of columns to track for changes
    """
    target = DeltaTable.forName(spark, target_table)
    target_df = spark.table(target_table)
    
    # Step 1: Find changed records
    current_records = target_df.filter("is_current = true")
    
    # Build change detection condition
    change_conditions = [
        ~col(f"source.{c}").eqNullSafe(col(f"target.{c}"))
        for c in tracked_columns
    ]
    change_expr = change_conditions[0]
    for cond in change_conditions[1:]:
        change_expr = change_expr | cond
    
    changes = (
        source_df.alias("source")
        .join(
            current_records.alias("target"),
            col(f"source.{business_key}") == col(f"target.{business_key}"),
            "left"
        )
        .filter(
            col(f"target.{business_key}").isNull() | change_expr
        )
        .select([col(f"source.{c}") for c in source_df.columns])
    )
    
    # Materialize the change set now: it is computed lazily against the
    # target table, which the merge below is about to modify
    changes.cache().count()
    changes.createOrReplaceTempView("_scd2_changes")
    
    # Step 2: Expire old records
    (target.alias("target")
        .merge(
            changes.alias("source"),
            f"target.{business_key} = source.{business_key} AND target.is_current = true"
        )
        .whenMatchedUpdate(set={
            "effective_end_date": "current_date()",
            "is_current": "false"
        })
        .whenNotMatchedInsert(values={
            business_key: f"source.{business_key}",
            **{c: f"source.{c}" for c in tracked_columns},
            "effective_start_date": "current_date()",
            "effective_end_date": "null",
            "is_current": "true"
        })
        .execute()
    )
    
    # Step 3: Insert new versions for changed records
    new_versions = spark.sql(f"""
        SELECT * FROM _scd2_changes
        WHERE {business_key} IN (
            SELECT {business_key} FROM {target_table}
            WHERE is_current = false AND effective_end_date = current_date()
        )
    """)
    
    if new_versions.count() > 0:
        insert_df = new_versions.select(
            *[col(c) for c in tracked_columns],
            col(business_key),
            current_date().alias("effective_start_date"),
            lit(None).cast("date").alias("effective_end_date"),
            lit(True).alias("is_current")
        )
        insert_df.write.mode("append").saveAsTable(target_table)
    
    # Report
    total_changes = changes.count()
    print(f"SCD Type 2 complete: {total_changes} changes processed")

Usage

source = spark.table("staging_customers")

scd_type2_merge(
    spark=spark,
    source_df=source,
    target_table="my_database.dim_customers",
    business_key="customer_id",
    tracked_columns=["name", "email", "city", "tier"]
)

SCD Type 2 sits at the intersection of data modeling and engineering. If you want to go deeper on dimensional modeling theory — star schemas, slowly changing dimensions, fact tables — Pluralsight has comprehensive data warehousing courses that give you the "why" behind the patterns.

Querying SCD Type 2 tables

Get current state

SELECT * FROM dim_customers WHERE is_current = true

Get state at a specific date

SELECT * FROM dim_customers
WHERE effective_start_date <= '2025-06-15'
  AND (effective_end_date > '2025-06-15' OR effective_end_date IS NULL)

Get full change history for a customer

SELECT * FROM dim_customers
WHERE customer_id = '1001'
ORDER BY effective_start_date

Count changes per customer

SELECT customer_id, COUNT(*) - 1 AS times_changed
FROM dim_customers
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY times_changed DESC

Group only by the business key here: grouping by a tracked column like name would split a customer's versions into separate buckets whenever that column changed.

Common pitfalls

Pitfall 1: Running SCD twice in the same day. If your pipeline retries, the same record gets expired and re-inserted with the same dates. Fix: add idempotency checks — skip records where the source matches the current target.
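One way to implement that check is to fingerprint the tracked columns and skip the expire + insert whenever the incoming row's fingerprint matches the current version's. A plain-Python sketch (the helper and its null-encoding scheme are illustrative):

```python
import hashlib

def row_hash(row, tracked):
    """Stable fingerprint of the tracked columns; nulls encoded explicitly."""
    payload = "|".join("__NULL__" if row[c] is None else str(row[c])
                       for c in tracked)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

tracked = ["name", "email", "city", "tier"]
incoming = {"name": "alice", "email": None, "city": "london", "tier": "gold"}
current  = {"name": "alice", "email": None, "city": "london", "tier": "gold"}

if row_hash(incoming, tracked) == row_hash(current, tracked):
    pass  # retry of an already-applied load: skip expire + insert
```

Storing the hash as an extra column on the dimension table also turns the multi-column change check into a single equality comparison in the MERGE condition.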

Pitfall 2: Null columns hiding changes. If email is null in the source and holds a value in the target (or vice versa), a standard != comparison returns null, not true, so the change is silently missed. Fix: use null-safe comparisons (<=>) as shown in Step 2.

Pitfall 3: Table growing too large. Every change creates a new row forever. Fix: archive old versions periodically, or add a retention policy that removes versions older than N years.

Pitfall 4: Forgetting to OPTIMIZE after large loads. SCD Type 2 creates many small files from the MERGE + INSERT pattern. Run OPTIMIZE with Z-ORDER on the business key column after each load.


Key takeaways

SCD Type 2 is a two-step process: expire the old version, then insert the new one. The MERGE handles expiration and new inserts; a follow-up INSERT adds new versions for changed records.

Three things to get right:

  1. Change detection matters. Only create new versions when data actually changed — use null-safe comparisons to avoid false positives.
  2. Make it idempotent. Your pipeline should produce the same result whether it runs once or twice on the same day.
  3. Maintain your table. Run OPTIMIZE + Z-ORDER after SCD loads to keep queries fast.

For the complete SQL syntax reference including MERGE patterns, window functions, and more, grab the Databricks SQL Cheat Sheet — $4.99 for 25+ copy-paste patterns. And if null values are causing false SCD changes, the PySpark Null Handling Cheat Sheet covers every null pattern you'll need.

For more on the MERGE statement itself, see my complete MERGE INTO guide. And for automated quality checks on your dimension tables, check out my data quality checks guide.


Subscribe to PipelinePulse for practical data engineering tutorials. Real production patterns, not textbook theory.