Databricks MERGE INTO: Complete Guide with Real Examples [2026]

Everything you need to know about MERGE INTO in Databricks — basic upserts, conditional updates, soft deletes, and performance tips from production experience.


MERGE INTO is the single most important SQL command in Delta Lake. It handles inserts, updates, and deletes in one atomic operation — the foundation of every incremental pipeline, every SCD implementation, and every upsert pattern you'll ever build in Databricks.

But it's also the command that causes the most subtle bugs. A wrong match condition creates silent duplicates. A missing WHEN clause drops records. A poorly structured MERGE on a large table can run for hours when it should take minutes.

I've written hundreds of MERGE statements in production, and I've made most of these mistakes at least once. This guide covers everything — from basic syntax to advanced patterns — with the real-world gotchas that documentation doesn't warn you about.


Basic syntax

The MERGE statement compares rows between a source and a target table, then takes different actions based on whether a match is found:

MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.name = source.name,
             target.amount = source.amount,
             target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (id, name, amount, updated_at)
  VALUES (source.id, source.name, source.amount, current_timestamp())

This is exactly what "upsert" means: if the row exists in the target (matched on id), update it. If it doesn't exist, insert it.


📄 Keep this syntax handy

MERGE is one of 50+ patterns in the Databricks SQL Cheat Sheet — a printable reference for every SQL operation you'll use in Databricks. $4.99

The match condition: get this right or get duplicates

The ON clause is the most critical part of MERGE. It defines what "same record" means. Get it wrong, and you'll create duplicates silently — your pipeline won't error, your row counts will look fine, and your downstream reports will be wrong.

Common mistake: incomplete match key

-- ❌ WRONG: user_id alone isn't unique if users have multiple events
MERGE INTO events AS target
USING staging AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

If a user has 5 events in the target, a single source row matches all 5, and every one of them gets overwritten with the same values. And if the staging table also contains multiple rows per user, Delta aborts the MERGE with a "multiple source rows matched" error. Either way, a non-unique match key means corrupted data or failed loads.

-- ✅ RIGHT: include all columns that define a unique record
MERGE INTO events AS target
USING staging AS source
ON target.user_id = source.user_id
   AND target.event_date = source.event_date
   AND target.event_type = source.event_type
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
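To make the failure mode concrete, here's a toy upsert in plain Python (hypothetical rows, no Spark) keyed two ways:

```python
# Toy upsert over plain dicts (hypothetical data, no Spark) -- shows why the
# match key must identify a unique record.

def upsert(target_rows, source_rows, key_cols):
    # Index the target by the match key, then let source rows
    # update matching keys or insert new ones.
    merged = {tuple(r[c] for c in key_cols): r for r in target_rows}
    for row in source_rows:
        merged[tuple(row[c] for c in key_cols)] = row
    return list(merged.values())

target = [
    {"user_id": 1, "event_date": "2026-01-01", "event_type": "click", "amount": 10},
    {"user_id": 1, "event_date": "2026-01-01", "event_type": "view",  "amount": 5},
]
source = [{"user_id": 1, "event_date": "2026-01-01", "event_type": "click", "amount": 99}]

full = upsert(target, source, ["user_id", "event_date", "event_type"])
partial = upsert(target, source, ["user_id"])
```

With the full composite key, both target rows survive and only the click event is updated. Keyed on user_id alone, the simulation collapses the two events into one row. A real MERGE fails differently in the details, but the root cause is identical: the key doesn't identify a unique record.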

How to verify your match key

Before writing any MERGE, run this check on your source data:

-- If this returns rows, your match key isn't unique
SELECT user_id, event_date, event_type, COUNT(*) AS cnt
FROM staging
GROUP BY user_id, event_date, event_type
HAVING COUNT(*) > 1

If this returns rows, you need to either add more columns to your match key or deduplicate the source first. I covered this in detail in my guide to fixing duplicate records in Delta tables.


UPDATE SET * vs explicit columns

You'll see two styles in the wild:

UPDATE SET * (update all columns)

WHEN MATCHED THEN UPDATE SET *

This updates every column in the target with the corresponding column from the source. It's concise but dangerous — if the source has a column you don't want to overwrite (like created_at or row_id), it will be overwritten silently.

Explicit column list (safer)

WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.amount = source.amount,
    target.status = source.status,
    target.updated_at = current_timestamp()

This is more verbose but safer — you control exactly which columns get updated. I use this in production for any table where certain columns should never change after initial insert.

INSERT * vs explicit columns

Same principle applies to INSERT:

-- Quick but assumes source and target have identical schemas
WHEN NOT MATCHED THEN INSERT *

-- Safer: explicit column mapping
WHEN NOT MATCHED THEN
  INSERT (id, name, amount, status, created_at, updated_at)
  VALUES (source.id, source.name, source.amount, source.status,
          current_timestamp(), current_timestamp())

Conditional MERGE with WHEN MATCHED AND

You can add conditions to your MATCHED clauses to handle different scenarios:

Only update if the data actually changed

MERGE INTO customers AS target
USING staging AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (
  target.name != source.name
  OR target.email != source.email
  OR target.phone != source.phone
) THEN
  UPDATE SET
    target.name = source.name,
    target.email = source.email,
    target.phone = source.phone,
    target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT *

This skips the update if nothing changed — saving write amplification and transaction log entries. This is a big deal for large tables where most records don't change between loads.
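The same "update only if something changed" check can be sketched in plain Python (toy rows, no Spark). One subtlety worth noting: SQL's != evaluates to NULL when either side is NULL, so a change from NULL to a value slips through the condition above. Python's != compares None like any other value, which matches the null-safe SQL form NOT (target.email <=> source.email):

```python
# Null-safe change detection over plain dict rows (toy data, no Spark),
# the analogue of the WHEN MATCHED AND (...) condition. Python's != treats
# None as an ordinary value, matching SQL's NOT (a <=> b) rather than the
# null-unsafe a != b.

TRACKED_COLS = ["name", "email", "phone"]

def has_changed(target_row, source_row, cols=TRACKED_COLS):
    # True if any tracked column differs, including None -> value changes.
    return any(target_row.get(c) != source_row.get(c) for c in cols)

old = {"name": "Ada", "email": None, "phone": "555-0100"}
new = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}
```

For nullable columns, prefer the null-safe <=> operator in the SQL condition so NULL-to-value changes aren't silently skipped.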

Soft delete pattern

Handle deleted records by marking them instead of physically removing them:

MERGE INTO customers AS target
USING staging AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET target.is_deleted = true,
             target.deleted_at = current_timestamp()

The WHEN NOT MATCHED BY SOURCE clause (available in newer Databricks Runtime versions) fires for rows that exist in the target but not in the source, meaning they were deleted from the source system. Two practical notes: add AND target.is_deleted = false to the clause so rows that are already soft-deleted aren't rewritten on every run, and only use it when the source is a full snapshot. Against an incremental batch it will flag everything missing from that batch as deleted.


MERGE with deduplication

If your source data might contain duplicates, deduplicate it before the MERGE — not after:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Deduplicate source: keep the latest record per business key
window = Window.partitionBy("customer_id") \
               .orderBy(col("updated_at").desc())

staging_deduped = (
    spark.table("staging")
    .withColumn("rn", row_number().over(window))
    .filter("rn = 1")
    .drop("rn")
)

# Create a temp view, then run the MERGE through spark.sql
staging_deduped.createOrReplaceTempView("staging_clean")

spark.sql("""
    MERGE INTO customers AS target
    USING staging_clean AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

This pattern prevents the "multiple matches" problem that causes duplicates in the target.
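The window dedup boils down to "keep the newest row per key". This plain-Python sketch (toy data, no Spark) makes the logic explicit:

```python
# Keep the latest record per business key -- the same logic as
# row_number() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1.

def dedupe_latest(rows, key, order_col):
    latest = {}
    for row in rows:
        k = row[key]
        # Keep this row only if it's newer than what we've seen for the key.
        if k not in latest or row[order_col] > latest[k][order_col]:
            latest[k] = row
    return list(latest.values())

staging = [
    {"customer_id": 1, "name": "Ada",    "updated_at": "2026-01-01"},
    {"customer_id": 1, "name": "Ada L.", "updated_at": "2026-01-02"},
    {"customer_id": 2, "name": "Grace",  "updated_at": "2026-01-01"},
]
```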


PySpark DeltaTable MERGE API

If you prefer PySpark over SQL, Delta Lake has a Python API for MERGE:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "my_database.customers")
source = spark.table("staging_clean")

(target.alias("target")
  .merge(
    source.alias("source"),
    "target.customer_id = source.customer_id"
  )
  .whenMatchedUpdate(set={
    "name": "source.name",
    "email": "source.email",
    "updated_at": "current_timestamp()"
  })
  .whenNotMatchedInsert(values={
    "customer_id": "source.customer_id",
    "name": "source.name",
    "email": "source.email",
    "created_at": "current_timestamp()",
    "updated_at": "current_timestamp()"
  })
  .execute()
)

The PySpark API is useful when you need to build the MERGE dynamically — for example, when column lists come from configuration rather than being hardcoded.
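For instance, the set and values dicts can be generated from a configured column list rather than typed out (the column names here are hypothetical). Only the dict-building is shown; the results would be passed to whenMatchedUpdate and whenNotMatchedInsert exactly as above:

```python
# Build the column maps for DeltaTable.merge from a config-driven list.
# Column names are illustrative; source_alias must match the merge alias.

def build_merge_maps(columns, source_alias="source"):
    update_map = {c: f"{source_alias}.{c}" for c in columns}
    update_map["updated_at"] = "current_timestamp()"   # always refresh
    insert_map = dict(update_map)
    insert_map["created_at"] = "current_timestamp()"   # set only on insert
    return update_map, insert_map

update_map, insert_map = build_merge_maps(["customer_id", "name", "email"])
```

Keeping created_at out of the update map is the programmatic version of the "explicit column list" advice: insert-only columns can never be clobbered by an update.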


Performance optimization

MERGE on large tables can be slow. Here are the techniques I use to keep it fast.

Partition pruning

If your target table is partitioned, always include the partition column in your ON clause or add a filter:

MERGE INTO events AS target
USING staging AS source
ON target.event_date = source.event_date
   AND target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Without event_date in the condition, Databricks has to scan every partition to find matches. With it, only the relevant partitions are scanned.

Narrow your source

Don't MERGE an entire staging table if you only need today's data:

MERGE INTO events AS target
USING (SELECT * FROM staging WHERE load_date = current_date()) AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

OPTIMIZE after MERGE

MERGE rewrites data files in every partition it touches, often leaving behind many small files. Run OPTIMIZE after large MERGE operations to compact them:

MERGE INTO events AS target
USING staging AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Compact the files that MERGE just created
OPTIMIZE events
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (event_id);

New to Spark SQL? DataCamp's Databricks courses cover MERGE, window functions, and SQL optimization from the ground up — worth it if you want structured learning instead of piecing it together from blog posts.

Common patterns

Full refresh with MERGE (idempotent reload)

Reload all data for a specific partition while keeping other partitions untouched. Note that this only upserts: rows deleted from the source remain in the target unless you also add a WHEN NOT MATCHED BY SOURCE clause:

MERGE INTO daily_metrics AS target
USING new_metrics AS source
ON target.metric_date = source.metric_date
   AND target.metric_name = source.metric_name
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Incremental load with change tracking

Only process records that changed since the last run:

MERGE INTO customers AS target
USING (
  -- COALESCE guards the first run: an empty target makes MAX return NULL,
  -- which would otherwise filter out every source row
  SELECT * FROM source_customers
  WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '1900-01-01') FROM customers)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
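One gotcha in this pattern: on the very first run the target is empty and MAX(updated_at) returns NULL, which filters out every source row unless you COALESCE it to an early default. The watermark logic amounts to this plain-Python sketch (toy timestamps, no Spark):

```python
# High-watermark selection for incremental loads: max updated_at already in
# the target, with a fallback for the empty-table first run (the analogue
# of COALESCE(MAX(updated_at), '1900-01-01') in SQL).

def watermark(target_timestamps, default="1900-01-01"):
    # max() with a default handles the empty-target first run.
    return max(target_timestamps, default=default)

def incremental_rows(source_rows, target_timestamps):
    wm = watermark(target_timestamps)
    return [r for r in source_rows if r["updated_at"] > wm]
```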

Quick reference

| Pattern | Use case | Key consideration |
| --- | --- | --- |
| Basic upsert | Insert new, update existing | Ensure match key is truly unique |
| Conditional update | Skip unchanged rows | Reduces write amplification |
| Soft delete | Mark deleted rows | Use WHEN NOT MATCHED BY SOURCE |
| Dedup + MERGE | Source has duplicates | Deduplicate before MERGE, not after |
| Partition-aware | Large partitioned tables | Include partition column in ON clause |

Key takeaways

MERGE INTO is powerful but unforgiving. The three rules I follow on every MERGE statement:

  1. Verify your match key is unique in the source before writing the MERGE. A five-second query saves hours of debugging duplicates.
  2. Use explicit column lists in production instead of SET * and INSERT *. You want to control exactly what gets written.
  3. Run OPTIMIZE after large MERGE operations — MERGE creates small files that degrade query performance over time.

For more on keeping your Delta tables healthy after MERGE operations, check out my guide on OPTIMIZE, Z-ORDER, and VACUUM. If duplicate records are already in your table, here's how to detect and fix them. And if you're running these pipelines on a budget, my VPS pipeline guide shows you how to set up a $6/month server on DigitalOcean for scheduled ETL jobs.


Want all 20 Delta table troubleshooting patterns in one place? Grab the Delta Table Troubleshooting Checklist — copy-paste diagnostic queries for every issue, just $9.

Subscribe to PipelinePulse for practical data engineering tutorials from real production experience. No theory fluff — just working code and hard-won lessons.