How to Fix Duplicate Records in Delta Tables [2026 Guide]

Your pipeline looks clean. Your row counts match. But your downstream reports show inflated numbers, and your stakeholders are asking questions you can't answer yet. Sound familiar?

Duplicate records in Delta tables are one of the most common — and sneakiest — data quality issues in production pipelines. They don't throw errors. They don't fail your jobs. They just quietly inflate your numbers until someone notices.

I recently dealt with exactly this problem in a production Databricks workspace. A logging table had been silently accumulating duplicates for weeks because the MERGE key was missing a column. In this guide, I'll walk you through exactly how to detect, diagnose, and fix duplicates — including the specific mistake I made and how I resolved it.


Why duplicates happen in Delta tables

Before you fix duplicates, you need to understand how they get there. In my experience, these are the four most common causes:

1. Incomplete MERGE keys

This is the one that got me. When you use MERGE INTO for upserts, your match condition needs to include every column that defines a unique record. If you're matching on user_id alone but your data is unique on user_id + event_date, you'll get duplicates every time a user has events on multiple dates.

Here's what a broken MERGE looks like:

-- ❌ BROKEN: Missing event_date in the match key
MERGE INTO app_events AS target
USING staging_events AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The problem: user_id alone wasn't unique. The same user could have events on multiple dates, so the match condition couldn't pair each source row with exactly one target row. Each re-run wrote rows that a complete key would have matched and updated in place, and the duplicates accumulated silently.

The fix:

-- ✅ FIXED: Complete composite key
MERGE INTO app_events AS target
USING staging_events AS source
ON target.user_id = source.user_id
   AND target.event_date = source.event_date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

2. Append-mode pipelines without idempotency

If your pipeline uses INSERT INTO or .mode("append") and doesn't check whether data already exists, every re-run creates duplicates. This is especially common when:

  • A scheduled job fails halfway and gets retried
  • You manually re-run a pipeline to backfill data
  • Streaming micro-batches overlap due to late-arriving events
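
To make the re-run failure mode concrete, here's a minimal pure-Python sketch of the semantics — not PySpark or a Delta API, just the logic. `append_batch` and `upsert_batch` are illustrative names I've made up; the "table" is a list for append mode and a dict keyed on the business key for upsert mode:

```python
# Pure-Python sketch of append vs. upsert semantics (illustrative only,
# not a Delta API). The upsert "table" is keyed on (user_id, event_date).

def append_batch(table_rows, batch):
    """Blind append: every re-run adds every row again."""
    table_rows.extend(batch)

def upsert_batch(table_by_key, batch):
    """Keyed upsert: re-runs overwrite instead of duplicating."""
    for row in batch:
        key = (row["user_id"], row["event_date"])
        table_by_key[key] = row  # MATCHED -> update, NOT MATCHED -> insert

batch = [
    {"user_id": 1, "event_date": "2026-01-01", "clicks": 3},
    {"user_id": 1, "event_date": "2026-01-02", "clicks": 5},
]

appended = []
upserted = {}
for _ in range(2):  # simulate a failed-and-retried job running twice
    append_batch(appended, batch)
    upsert_batch(upserted, batch)

print(len(appended))  # 4 -- the retry duplicated every row
print(len(upserted))  # 2 -- the retry was a no-op
```

The retry doubles the append-mode table while the upsert table is unchanged — which is exactly the idempotency property the MERGE-based pipeline gives you.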

3. Schema evolution breaking uniqueness

When Delta Lake adds new nullable columns through schema evolution, your previously unique combination of columns might no longer be unique. A record that was distinct on (id, name, status) might now have duplicate rows that differ only in the new nullable updated_at column.
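
A tiny illustration of that effect in plain Python — once a nullable column joins the row, whole-row distinctness stops collapsing rows that used to be identical:

```python
# Before schema evolution: rows are distinct on (id, name, status).
old_rows = [("1", "alice", "active"), ("1", "alice", "active")]
print(len(set(old_rows)))  # 1 -- whole-row DISTINCT collapses these

# After evolution adds a nullable updated_at, the "same" record can
# appear with different values in the new column.
new_rows = [("1", "alice", "active", None),
            ("1", "alice", "active", "2026-01-05")]
print(len(set(new_rows)))  # 2 -- whole-row DISTINCT no longer dedupes
```

This is why dedup logic should always name the business key columns explicitly rather than relying on SELECT DISTINCT over the full row.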

4. Late-arriving data in streaming pipelines

Structured Streaming with Delta tables can produce duplicates when events arrive out of order or are reprocessed after a checkpoint recovery. If your watermark isn't configured correctly, the same event can land in multiple micro-batches.


How to detect duplicates

Before you can fix anything, you need to know exactly how bad the problem is. Here are the queries I use to diagnose duplicates.

Quick count check

Start with the simplest possible query — compare distinct vs total row counts:

SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT user_id, event_date) AS distinct_rows,
  COUNT(*) - COUNT(DISTINCT user_id, event_date) AS duplicate_rows
FROM app_events

If duplicate_rows is greater than zero, you have a problem.

Find the actual duplicates

Once you know duplicates exist, find them:

SELECT
  user_id,
  event_date,
  COUNT(*) AS occurrence_count
FROM app_events
GROUP BY user_id, event_date
HAVING COUNT(*) > 1
ORDER BY occurrence_count DESC

This tells you which records are duplicated and how many copies exist. In my case, some key combinations had 5-6 copies — one for each pipeline re-run.

PySpark approach with window functions

If you prefer PySpark or need more flexibility, window functions are your friend:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Define what makes a record unique
window_spec = Window.partitionBy("user_id", "event_date") \
                    .orderBy(col("_commit_timestamp").desc())

df_with_row_num = (
    spark.table("app_events")
    .withColumn("row_num", row_number().over(window_spec))
)

# Show only duplicates (row_num > 1)
duplicates = df_with_row_num.filter(col("row_num") > 1)
print(f"Found {duplicates.count()} duplicate rows")
duplicates.show(20)

The _commit_timestamp ordering ensures that if you need to keep one copy, you keep the most recent one. One caveat: _commit_timestamp is only exposed when you read the table's Change Data Feed, not on a plain table read, so if CDF isn't enabled on your table, order by an ingestion or event timestamp column you control instead. The same caveat applies to the SQL cleanup queries later in this guide.

Check if your MERGE key is truly unique

This is the diagnostic I wish I'd run before I started debugging everything else:

-- This should return zero rows if your key is truly unique
SELECT
  user_id,
  COUNT(*) AS cnt
FROM staging_events
GROUP BY user_id
HAVING COUNT(*) > 1

If this returns rows, your MERGE key isn't unique in the source — and that's your root cause.


How to fix duplicates

You've found the duplicates. Now let's remove them. There are three approaches, and which one you choose depends on your situation.

Method 1: Fix the MERGE key (fix the root cause)

If your duplicates are caused by an incomplete MERGE key, fix the key first. This stops new duplicates from being created. Then clean up the existing ones.

-- Step 1: Create a clean deduplicated version
CREATE OR REPLACE TABLE app_events_clean AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, event_date
      ORDER BY _commit_timestamp DESC
    ) AS rn
  FROM app_events
)
WHERE rn = 1;

-- Step 2: Replace the original table
DROP TABLE app_events;
ALTER TABLE app_events_clean RENAME TO app_events;

-- Step 3: Update your MERGE statement to use the complete key
-- (see the fixed MERGE example above)

When to use this: Your MERGE key is wrong and you need to fix the root cause. This is the most common scenario.

Method 2: Deduplication with CREATE OR REPLACE

For a quick one-time cleanup when you know exactly what the unique key should be:

CREATE OR REPLACE TABLE app_events AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, event_date
      ORDER BY _commit_timestamp DESC
    ) AS rn
  FROM app_events
)
WHERE rn = 1

When to use this: Quick cleanup where you want to keep the most recent version of each record. Simple and fast.

Method 3: Delete + re-insert for full table refresh

When the data volume is small enough or you have a reliable source of truth:

-- Truncate and reload from source
TRUNCATE TABLE app_events;

INSERT INTO app_events
SELECT DISTINCT *
FROM raw_events
WHERE event_date >= '2026-01-01'

When to use this: Your source data is clean and you can afford to do a full reload. Works well for smaller tables or when you're doing a one-time fix during a maintenance window.

Decision framework

Not sure which method to use? Here's how I decide:

  • Is the root cause a bad MERGE key? → Method 1 (fix MERGE + deduplicate)
  • Is it a one-time cleanup with no ongoing issue? → Method 2 (CREATE OR REPLACE)
  • Is the table small and you have a clean source? → Method 3 (truncate + reload)
  • Is it a streaming table with checkpoints? → Be very careful. You may need to reset checkpoints after deduplication.

How to prevent duplicates going forward

Fixing duplicates once is fine. Making sure they never come back is better.

Always validate your MERGE key

Before writing any MERGE statement, run this check on your source data:

-- This should return zero rows if your key is truly unique
SELECT merge_key_col1, merge_key_col2, COUNT(*)
FROM source_table
GROUP BY merge_key_col1, merge_key_col2
HAVING COUNT(*) > 1

If it returns rows, your MERGE key isn't unique and you need to either add more columns to the key or deduplicate the source first.
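
If you'd rather run that check as a reusable pre-merge assertion than an ad-hoc query, the logic is just group-and-count. Here's a plain-Python sketch of it (the `assert_unique_key` helper is my name, not a library function — in a real pipeline you'd run the equivalent aggregation in Spark):

```python
from collections import Counter

def assert_unique_key(rows, key_columns):
    """Raise if any key combination appears more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    dupes = {key: n for key, n in counts.items() if n > 1}
    if dupes:
        raise ValueError(f"MERGE key not unique, offending keys: {dupes}")

staging = [
    {"user_id": 1, "event_date": "2026-01-01"},
    {"user_id": 1, "event_date": "2026-01-02"},
]

assert_unique_key(staging, ["user_id", "event_date"])  # passes silently
try:
    assert_unique_key(staging, ["user_id"])  # user_id repeats -> raises
except ValueError as e:
    print(e)
```

Running this before every MERGE turns the "five-second query" into a guardrail the pipeline enforces automatically.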

Add data quality checks to your pipeline

Build automated checks that run after every pipeline execution. At minimum:

-- Check: No duplicates on business key
SELECT
  CASE WHEN COUNT(*) > COUNT(DISTINCT user_id, event_date)
    THEN 'FAIL: Duplicates detected'
    ELSE 'PASS: No duplicates'
  END AS quality_check
FROM app_events

You can integrate this with Great Expectations, dbt tests, or even a simple Databricks notebook that runs after your main pipeline and alerts on failure.

Use watermarks for streaming deduplication

If you're using Structured Streaming, configure watermarks to handle late-arriving data properly:

df_stream = (
    spark.readStream
    .table("raw_events")
    .withWatermark("event_timestamp", "1 hour")
    .dropDuplicatesWithinWatermark(["user_id", "event_date"])
)

The dropDuplicatesWithinWatermark method (added in Spark 3.5) is specifically designed to handle deduplication in streaming contexts without unbounded state growth.

Design idempotent pipelines

The gold standard is making your pipelines idempotent — running them twice produces the same result as running them once. For batch pipelines, this usually means:

  • Use MERGE instead of INSERT for incremental loads
  • Include all business key columns in your MERGE condition
  • Use CREATE OR REPLACE for full refresh tables
  • Add a deduplication step before writing to the target table
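
That last bullet — deduplicate before writing — is the same keep-the-latest logic as the ROW_NUMBER() ... = 1 pattern used earlier. Here's a sketch of that logic in plain Python, keeping one row per key ordered by a load timestamp (the column names like loaded_at are illustrative; in Spark you'd express this with a window function):

```python
def keep_latest(rows, key_columns, order_column):
    """Keep one row per key: the one with the greatest order_column."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in latest or row[order_column] > latest[key][order_column]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"user_id": 1, "event_date": "2026-01-01", "loaded_at": 100, "clicks": 3},
    {"user_id": 1, "event_date": "2026-01-01", "loaded_at": 200, "clicks": 4},
]

deduped = keep_latest(rows, ["user_id", "event_date"], "loaded_at")
print(len(deduped))           # 1 -- one row per business key
print(deduped[0]["clicks"])   # 4 -- the freshest copy won
```

Applied as the final step before the target write, this guarantees the sink never sees two rows for the same business key, no matter how messy the upstream batch was.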

Key takeaways

Duplicate records in Delta tables almost always come down to one of four causes: incomplete MERGE keys, non-idempotent appends, schema evolution, or streaming reprocessing. The fix is usually straightforward once you identify the root cause.

The most important thing I learned from debugging this in production: always validate that your MERGE key is actually unique in the source data before you trust it. A five-second query would have saved me hours of debugging.

If you're dealing with data quality issues beyond just duplicates — null values, schema drift, freshness problems — that's a whole separate topic I'll be covering in upcoming posts. Subscribe to PipelinePulse to get notified when those go live.


Found this useful? Share it with a data engineer who's probably debugging duplicates right now. And if you want the complete Delta Table Troubleshooting Checklist as a downloadable reference, [grab it here on Gumroad] — it covers the 20 most common Delta table issues with ready-to-use SQL queries.