Databricks MERGE INTO: Complete Guide with Real Examples [2026]
Everything you need to know about MERGE INTO in Databricks — basic upserts, conditional updates, soft deletes, and performance tips from production experience.
MERGE INTO is the single most important SQL command in Delta Lake. It handles inserts, updates, and deletes in one atomic operation — the foundation of every incremental pipeline, every SCD implementation, and every upsert pattern you'll ever build in Databricks.
But it's also the command that causes the most subtle bugs. A wrong match condition creates silent duplicates. A missing WHEN clause drops records. A poorly structured MERGE on a large table can run for hours when it should take minutes.
I've written hundreds of MERGE statements in production, and I've made most of these mistakes at least once. This guide covers everything — from basic syntax to advanced patterns — with the real-world gotchas that documentation doesn't warn you about.
Basic syntax
The MERGE statement compares rows between a source and a target table, then takes different actions based on whether a match is found:
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name,
target.amount = source.amount,
target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (id, name, amount, updated_at)
VALUES (source.id, source.name, source.amount, current_timestamp())
This is exactly what "upsert" means: if the row exists in the target (matched on id), update it. If it doesn't exist, insert it.
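After every MERGE, Delta records row-level counts (numTargetRowsInserted, numTargetRowsUpdated, numTargetRowsDeleted) in the table history, which you can read with DESCRIBE HISTORY or DeltaTable.history(). I like logging these after each run as a sanity check. Here's a small sketch; `summarize_merge_metrics` is a hypothetical helper of mine, not a Databricks built-in, and it assumes the string-valued operationMetrics dict that Delta writes for MERGE:

```python
def summarize_merge_metrics(op_metrics: dict) -> dict:
    """Condense Delta MERGE operationMetrics (string-valued) into ints.

    In a notebook you would feed it the latest history entry, e.g.:
        row = DeltaTable.forName(spark, "customers").history(1).collect()[0]
        summarize_merge_metrics(row["operationMetrics"])
    """
    keys = {
        "inserted": "numTargetRowsInserted",
        "updated": "numTargetRowsUpdated",
        "deleted": "numTargetRowsDeleted",
    }
    # Metrics come back as strings; missing keys default to 0
    return {name: int(op_metrics.get(col, 0)) for name, col in keys.items()}


# Example with a metrics dict shaped like Delta's output
example = {"numTargetRowsInserted": "120", "numTargetRowsUpdated": "45"}
print(summarize_merge_metrics(example))  # {'inserted': 120, 'updated': 45, 'deleted': 0}
```

If inserted + updated doesn't match what you expected from the batch size, stop and check your match key before the bad data compounds.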
📄 Keep this syntax handy
MERGE is one of 50+ patterns in the Databricks SQL Cheat Sheet — a printable reference for every SQL operation you'll use in Databricks. $4.99
The match condition: get this right or get duplicates
The ON clause is the most critical part of MERGE. It defines what "same record" means. Get it wrong, and you'll create duplicates silently — your pipeline won't error, your row counts will look fine, and your downstream reports will be wrong.
Common mistake: incomplete match key
-- ❌ WRONG: user_id alone isn't unique if users have multiple events
MERGE INTO events AS target
USING staging AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
If a user has 5 events in the staging table, multiple source rows match the same target row, and Delta Lake aborts the MERGE with a "multiple source rows matched and attempted to modify the same target row" error. And even when the source happens to carry one row per user, a new event gets matched and overwrites an existing one instead of being inserted. Either way, your event history is wrong.
-- ✅ RIGHT: include all columns that define a unique record
MERGE INTO events AS target
USING staging AS source
ON target.user_id = source.user_id
AND target.event_date = source.event_date
AND target.event_type = source.event_type
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
How to verify your match key
Before writing any MERGE, run this check on your source data:
-- If this returns rows, your match key isn't unique
SELECT user_id, event_date, event_type, COUNT(*) AS cnt
FROM staging
GROUP BY user_id, event_date, event_type
HAVING COUNT(*) > 1
If this returns rows, you need to either add more columns to your match key or deduplicate the source first. I covered this in detail in my guide to fixing duplicate records in Delta tables.
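The same GROUP BY ... HAVING logic is easy to express in plain Python too, which is handy when you've pulled a small sample of key columns and want to check it in a unit test. This is an illustrative sketch, not any Databricks API; `duplicated_keys` is a made-up helper:

```python
from collections import Counter

def duplicated_keys(rows, key_cols):
    """Return match-key tuples that appear more than once, mirroring
    SELECT keys, COUNT(*) ... GROUP BY keys HAVING COUNT(*) > 1."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

staging_sample = [
    {"user_id": 1, "event_date": "2024-06-01", "event_type": "click"},
    {"user_id": 1, "event_date": "2024-06-01", "event_type": "click"},  # duplicate
    {"user_id": 2, "event_date": "2024-06-01", "event_type": "view"},
]
print(duplicated_keys(staging_sample, ["user_id", "event_date", "event_type"]))
# {(1, '2024-06-01', 'click'): 2}
```

An empty result means the key is unique for that sample; anything else means add columns to the key or dedupe first.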
UPDATE SET * vs explicit columns
You'll see two styles in the wild:
UPDATE SET * (update all columns)
WHEN MATCHED THEN UPDATE SET *
This updates every column in the target with the corresponding column from the source. It's concise but dangerous: if the source carries a column the target should never change after insert (like created_at or row_id), MERGE will overwrite it silently.
Explicit column list (safer)
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.amount = source.amount,
target.status = source.status,
target.updated_at = current_timestamp()
This is more verbose but safer — you control exactly which columns get updated. I use this in production for any table where certain columns should never change after initial insert.
INSERT * vs explicit columns
Same principle applies to INSERT:
-- Quick but assumes source and target have identical schemas
WHEN NOT MATCHED THEN INSERT *
-- Safer: explicit column mapping
WHEN NOT MATCHED THEN
INSERT (id, name, amount, status, created_at, updated_at)
VALUES (source.id, source.name, source.amount, source.status,
current_timestamp(), current_timestamp())
Conditional MERGE with WHEN MATCHED AND
You can add conditions to your MATCHED clauses to handle different scenarios:
Only update if the data actually changed
MERGE INTO customers AS target
USING staging AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (
target.name != source.name
OR target.email != source.email
OR target.phone != source.phone
) THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.phone = source.phone,
target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
INSERT *
This skips the update if nothing changed — saving write amplification and transaction log entries. This is a big deal for large tables where most records don't change between loads.
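One gotcha with the != comparisons above: in SQL, NULL != NULL does not evaluate to true, so a column going from NULL to a value (or back) can slip past the change check. Spark SQL's null-safe equality operator <=> handles that. Here's a hypothetical helper (my own sketch, not a built-in) that generates a null-safe "did anything change" predicate from a column list, which you can drop into WHEN MATCHED AND (...):

```python
def change_condition(columns, target="target", source="source"):
    """Build a null-safe change predicate: NOT (t.c <=> s.c) OR ...

    <=> is Spark SQL's null-safe equality, so NULL-to-value changes
    are detected, unlike plain != comparisons.
    """
    clauses = [f"NOT ({target}.{c} <=> {source}.{c})" for c in columns]
    return " OR ".join(clauses)

print(change_condition(["name", "email"]))
# NOT (target.name <=> source.name) OR NOT (target.email <=> source.email)
```

Generating the predicate also means the column list can live in one place instead of being repeated across the condition and the UPDATE SET clause.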
Soft delete pattern
Handle deleted records by marking them instead of physically removing them:
MERGE INTO customers AS target
USING staging AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET target.is_deleted = true,
target.deleted_at = current_timestamp()
The WHEN NOT MATCHED BY SOURCE clause fires for rows that exist in the target but not in the source, meaning they were deleted from the source system. Two caveats: it requires Databricks Runtime 12.2 LTS or above (Delta Lake 2.3+), and it only makes sense when the source is a full snapshot. Run it against an incremental batch and every target row absent from that batch gets flagged as deleted.
MERGE with deduplication
If your source data might contain duplicates, deduplicate it before the MERGE — not after:
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col
# Deduplicate source: keep the latest record per business key
window = Window.partitionBy("customer_id") \
.orderBy(col("updated_at").desc())
staging_deduped = (
spark.table("staging")
.withColumn("rn", row_number().over(window))
.filter("rn = 1")
.drop("rn")
)
# Create temp view for SQL MERGE
staging_deduped.createOrReplaceTempView("staging_clean")
MERGE INTO customers AS target
USING staging_clean AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
This pattern prevents the "multiple matches" problem that causes duplicates in the target.
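The window-function logic, keep the latest row per business key, is also worth being able to reason about in miniature. Here's the same idea as a few lines of plain Python (illustrative only; in a real pipeline the PySpark version above is what runs):

```python
def latest_per_key(rows, key, order_col):
    """Keep only the most recent row per key, mirroring
    row_number() OVER (PARTITION BY key ORDER BY order_col DESC) = 1."""
    best = {}
    for row in rows:
        k = row[key]
        # Keep the row with the greatest order_col value per key
        if k not in best or row[order_col] > best[k][order_col]:
            best[k] = row
    return list(best.values())

staging = [
    {"customer_id": 1, "updated_at": "2024-06-01", "email": "old@x.com"},
    {"customer_id": 1, "updated_at": "2024-06-02", "email": "new@x.com"},
    {"customer_id": 2, "updated_at": "2024-06-01", "email": "b@x.com"},
]
print(latest_per_key(staging, "customer_id", "updated_at"))
```

Note that ties on updated_at are broken arbitrarily here, exactly as they are with row_number() over a non-unique ordering; add a tiebreaker column if that matters for your data.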
PySpark DeltaTable MERGE API
If you prefer PySpark over SQL, Delta Lake has a Python API for MERGE:
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "my_database.customers")
source = spark.table("staging_clean")
(target.alias("target")
.merge(
source.alias("source"),
"target.customer_id = source.customer_id"
)
.whenMatchedUpdate(set={
"name": "source.name",
"email": "source.email",
"updated_at": "current_timestamp()"
})
.whenNotMatchedInsert(values={
"customer_id": "source.customer_id",
"name": "source.name",
"email": "source.email",
"created_at": "current_timestamp()",
"updated_at": "current_timestamp()"
})
.execute()
)
The PySpark API is useful when you need to build the MERGE dynamically — for example, when column lists come from configuration rather than being hardcoded.
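A config-driven builder might look like the sketch below. The table and column names are made up, and `build_merge_mappings` is my own helper, but the set=/values= dict shapes are what whenMatchedUpdate and whenNotMatchedInsert expect:

```python
def build_merge_mappings(columns):
    """Build the set=/values= dicts for a Delta MERGE from a config list.

    Updates never touch created_at; inserts stamp both audit columns.
    """
    update_set = {c: f"source.{c}" for c in columns}
    update_set["updated_at"] = "current_timestamp()"

    # Inserts get everything updates get, plus created_at
    insert_values = dict(update_set)
    insert_values["created_at"] = "current_timestamp()"
    return update_set, insert_values

update_set, insert_values = build_merge_mappings(["customer_id", "name", "email"])
# Then:
#   target.merge(source.alias("source"), "target.customer_id = source.customer_id") \
#       .whenMatchedUpdate(set=update_set) \
#       .whenNotMatchedInsert(values=insert_values) \
#       .execute()
print(update_set["name"])  # source.name
```

This keeps the column list in one place (a config file, a schema registry, wherever), so adding a column to the pipeline is a one-line change instead of edits in three clauses.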
Performance optimization
MERGE on large tables can be slow. Here are the techniques I use to keep it fast.
Partition pruning
If your target table is partitioned, always include the partition column in your ON clause or add a filter:
MERGE INTO events AS target
USING staging AS source
ON target.event_date = source.event_date
AND target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Without event_date in the condition, Databricks has to scan every partition to find matches. With it, only the relevant partitions are scanned.
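A related trick when a batch only touches a handful of partitions: collect the distinct partition values from the source first and inject them into the ON clause as literals, which guarantees pruning even when the optimizer can't infer it from the join keys. Here's a sketch; the collect step is shown as a comment because it needs a live Spark session, and `partition_filter` is a hypothetical helper, not a Databricks API:

```python
def partition_filter(column, values, target="target"):
    """Build an IN-list predicate over the target's partition column,
    e.g. target.event_date IN ('2024-06-01', '2024-06-02')."""
    literals = ", ".join(f"'{v}'" for v in sorted(values))
    return f"{target}.{column} IN ({literals})"

# In a notebook you would first collect the partitions this batch touches:
#   dates = [r[0] for r in spark.table("staging").select("event_date").distinct().collect()]
dates = ["2024-06-02", "2024-06-01"]
on_clause = "target.event_id = source.event_id AND " + partition_filter("event_date", dates)
print(on_clause)
# target.event_id = source.event_id AND target.event_date IN ('2024-06-01', '2024-06-02')
```

The extra collect is a small job over one column of the staging table, usually a rounding error next to the full-table scan it avoids.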
Narrow your source
Don't MERGE an entire staging table if you only need today's data:
MERGE INTO events AS target
USING (SELECT * FROM staging WHERE load_date = current_date()) AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
OPTIMIZE after MERGE
MERGE creates new small files for every partition it touches. Run OPTIMIZE after large MERGE operations to compact these files:
MERGE INTO events AS target
USING staging AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Compact the files that MERGE just created
OPTIMIZE events
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (event_id);
New to Spark SQL? DataCamp's Databricks courses cover MERGE, window functions, and SQL optimization from the ground up — worth it if you want structured learning instead of piecing it together from blog posts.
Common patterns
Full refresh with MERGE (idempotent reload)
Replace all data for a specific partition while keeping other partitions untouched:
MERGE INTO daily_metrics AS target
USING new_metrics AS source
ON target.metric_date = source.metric_date
AND target.metric_name = source.metric_name
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Incremental load with change tracking
Only process records that changed since the last run:
MERGE INTO customers AS target
USING (
SELECT * FROM source_customers
WHERE updated_at > (SELECT MAX(updated_at) FROM customers)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
One gotcha: on the very first run against an empty target, MAX(updated_at) is NULL, the filter evaluates to NULL for every row, and nothing loads. Wrap the watermark in COALESCE, e.g. COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01'), to handle the initial load.
Quick reference
| Pattern | Use case | Key consideration |
|---|---|---|
| Basic upsert | Insert new, update existing | Ensure match key is truly unique |
| Conditional update | Skip unchanged rows | Reduces write amplification |
| Soft delete | Mark deleted rows | Use WHEN NOT MATCHED BY SOURCE |
| Dedup + MERGE | Source has duplicates | Deduplicate before MERGE, not after |
| Partition-aware | Large partitioned tables | Include partition column in ON clause |
Key takeaways
MERGE INTO is powerful but unforgiving. The three rules I follow on every MERGE statement:
- Verify your match key is unique in the source before writing the MERGE. A five-second query saves hours of debugging duplicates.
- Use explicit column lists in production instead of UPDATE SET * and INSERT *. You want to control exactly what gets written.
- Run OPTIMIZE after large MERGE operations — MERGE creates small files that degrade query performance over time.
For more on keeping your Delta tables healthy after MERGE operations, check out my guide on OPTIMIZE, Z-ORDER, and VACUUM. If duplicate records are already in your table, here's how to detect and fix them. And if you're running these pipelines on a budget, my VPS pipeline guide shows you how to set up a $6/month server on DigitalOcean for scheduled ETL jobs.
Want all 20 Delta table troubleshooting patterns in one place? Grab the Delta Table Troubleshooting Checklist — copy-paste diagnostic queries for every issue, just $9.
Subscribe to PipelinePulse for practical data engineering tutorials from real production experience. No theory fluff — just working code and hard-won lessons.