Delta Table OPTIMIZE, Z-ORDER, and VACUUM Explained [2026 Guide]

A practical guide to Delta Lake's three essential maintenance commands, with production-ready scripts and scheduling tips.

Your Delta table works fine when it's small. But as it grows — thousands of small files, months of history piling up — queries slow down, storage costs creep up, and your pipelines start timing out. Sound familiar?

OPTIMIZE, Z-ORDER, and VACUUM are the three maintenance commands every Delta Lake user needs to understand. They solve different problems, they need to run in the right order, and getting them wrong can either waste compute or break your time travel. In this guide, I'll explain what each one does, when to use them, and the exact commands I run in production.


The problem: small files and stale history

Every time you write to a Delta table — whether it's an INSERT, MERGE, UPDATE, or streaming micro-batch — Delta Lake creates new Parquet files. Over time, this leads to two problems:

Too many small files. A table with 10,000 tiny Parquet files is dramatically slower to query than one with 100 properly sized files. Every file adds overhead: each one needs to be opened, its metadata read, and its footer parsed. This is called the "small file problem," and it's one of the most common performance issues in Delta Lake.
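The cost is easy to see with back-of-envelope arithmetic. The 5 ms per-file figure below is an illustrative assumption (open + metadata + footer parse), not a measured Delta Lake constant — the point is that the overhead scales with file count, not data size:

```python
# Back-of-envelope: why 10,000 small files hurt.
# PER_FILE_OVERHEAD_MS is an illustrative assumption, not a measured constant.
PER_FILE_OVERHEAD_MS = 5

def planning_overhead_ms(num_files: int, per_file_ms: float = PER_FILE_OVERHEAD_MS) -> float:
    """Fixed per-file cost paid before any data is actually scanned."""
    return num_files * per_file_ms

# Same total data, two layouts:
small_layout = planning_overhead_ms(10_000)  # 10,000 x ~10 MB files
compact_layout = planning_overhead_ms(100)   # 100 x ~1 GB files

print(f"10,000 small files: ~{small_layout / 1000:.0f} s of per-file overhead")
print(f"100 compacted files: ~{compact_layout / 1000:.1f} s of per-file overhead")
```

Fifty seconds versus half a second — before a single row is read.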

Too much old history. Delta Lake keeps a transaction log of every change. This is what powers time travel — the ability to query older versions of your table. But keeping every version forever means your storage costs grow indefinitely, and listing files in your table's directory gets slower.

OPTIMIZE fixes the first problem. VACUUM fixes the second. Z-ORDER makes the first fix even more effective.


OPTIMIZE: compacting small files

OPTIMIZE rewrites your table's small Parquet files into fewer, larger files. It doesn't change any data — it just reorganizes the physical storage for better read performance.

Basic usage

OPTIMIZE my_database.events_table

This compacts all small files in the table. After running, you'll have fewer, larger files — typically around 1GB each (the target file size is configurable).
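As a rough mental model of what compaction does to your file count, here's a sketch that estimates the post-OPTIMIZE layout from total table size and the ~1 GB target mentioned above (the exact target is configurable and the real output depends on partition boundaries, so treat this as an approximation):

```python
import math

TARGET_FILE_SIZE_BYTES = 1024 ** 3  # ~1 GB default target (configurable)

def files_after_compaction(total_bytes: int,
                           target_bytes: int = TARGET_FILE_SIZE_BYTES) -> int:
    """Rough lower bound on file count once small files are compacted."""
    return max(1, math.ceil(total_bytes / target_bytes))

# A 250 GB table fragmented into 12,000 small files compacts down
# to roughly 250 one-GB files:
print(files_after_compaction(250 * 1024 ** 3))  # 250
```

Going from 12,000 files to ~250 is where the read-time win comes from.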

How much difference does it make?

In my experience, running OPTIMIZE on a table with thousands of small files can improve query performance by 2-10x. The improvement is most dramatic on:

  • Tables that receive frequent small writes (streaming, micro-batch appends)
  • Tables that get many MERGE operations (each MERGE rewrites the files containing matched rows, leaving fragments behind)
  • Tables with wide schemas (more columns = more metadata overhead per file)

OPTIMIZE with partition filtering

For large partitioned tables, you don't need to optimize the entire table every time. Target specific partitions:

-- Only optimize the last 7 days of data
OPTIMIZE my_database.events_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS

This is much faster and cheaper than optimizing the whole table, especially if historical partitions don't change.

When to run OPTIMIZE

  • After bulk loads: Run OPTIMIZE after any large INSERT or MERGE operation that creates many small files
  • On a schedule: For tables with continuous writes, schedule OPTIMIZE to run daily or every few hours during off-peak times
  • Before expensive queries: If you're about to run a heavy analytical query, optimizing the target table first can save more compute than the optimization costs


Z-ORDER: sorting data for faster queries

Z-ORDER is an option you add to OPTIMIZE that physically sorts data within files by specific columns. This makes queries that filter on those columns dramatically faster because Delta Lake can skip entire files that don't contain matching data.

Basic usage

OPTIMIZE my_database.events_table
ZORDER BY (user_id)

How Z-ORDER works

Without Z-ORDER, data is distributed randomly across files. A query filtering on user_id = 123 might need to open every single file in the table because that user's data could be scattered anywhere.

With Z-ORDER on user_id, all records for the same user are co-located in the same files (or nearby files). Now that same query can skip 90%+ of files entirely because it knows from the file-level statistics that user_id = 123 doesn't exist in those files. This is called data skipping and it's the real power of Z-ORDER.
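Here's a toy model of that skipping logic. Each file carries min/max statistics for user_id, much as Delta's transaction log stores column statistics per file (the file names and ranges below are made up for illustration):

```python
# Toy model of file-level data skipping. Each file stores min/max stats
# for user_id; after Z-ORDER, each user's rows land in few files,
# so these ranges are tight and non-overlapping.
files = [
    {"path": "part-000.parquet", "user_id_min": 1,    "user_id_max": 1000},
    {"path": "part-001.parquet", "user_id_min": 1001, "user_id_max": 2000},
    {"path": "part-002.parquet", "user_id_min": 2001, "user_id_max": 3000},
]

def files_to_scan(target_user_id: int) -> list[str]:
    """Only files whose [min, max] range can contain the value are read."""
    return [
        f["path"]
        for f in files
        if f["user_id_min"] <= target_user_id <= f["user_id_max"]
    ]

print(files_to_scan(123))  # ['part-000.parquet'] -- 2 of 3 files skipped
```

Without Z-ORDER, user 123's rows could be scattered across all three files, every range would span the whole ID space, and nothing could be skipped.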

Choosing Z-ORDER columns

Pick columns that appear in your most common WHERE clauses. The best Z-ORDER candidates are:

  • High cardinality columns — like user_id, order_id, session_id (thousands or millions of distinct values)
  • Columns frequently used in filters — whatever your dashboards and queries filter on most
  • Date/timestamp columns — if your table isn't already partitioned by date

You can Z-ORDER by multiple columns, but there are diminishing returns after 2-3 columns:

-- Good: 2 commonly filtered columns
OPTIMIZE my_database.events_table
ZORDER BY (user_id, event_type)

-- Diminishing returns: don't go overboard
-- The more columns you add, the less effective each one becomes
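The diminishing returns fall out of how the Z-order (Morton) curve that gives the command its name is built: the bits of each column are interleaved into one sort key, so every column you add gets a smaller share of the key's high-order bits. This is a conceptual sketch — the production implementation uses range-based interleaving rather than raw bit interleaving:

```python
def morton_interleave(values: list[int], bits: int = 8) -> int:
    """Interleave the bits of several column values into one sort key
    (the 'Z-order' / Morton curve that gives Z-ORDER its name)."""
    key = 0
    for bit in range(bits - 1, -1, -1):  # most significant bit first
        for v in values:
            key = (key << 1) | ((v >> bit) & 1)
    return key

# With 2 columns, each contributes every 2nd bit of the sort key;
# with 4 columns, only every 4th bit -- so each added column has
# less influence on where a row sorts, and locality per column degrades.
print(morton_interleave([3, 5], bits=4))  # 27, i.e. 0b00011011
```

Sorting rows by this key keeps rows with similar column values near each other, which is exactly what makes the min/max file statistics tight.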

Z-ORDER vs partitioning

A common question: should you partition or Z-ORDER?

Partition when a column has low cardinality (a few hundred distinct values at most) and you almost always filter on it. Date columns are the classic example — PARTITIONED BY (event_date).

Z-ORDER when a column has high cardinality (thousands+ distinct values) or when you filter on it often but not always. User IDs, product IDs, and region codes are good Z-ORDER candidates.

You can combine both:

-- Table is partitioned by date, Z-ORDERed by user_id within each partition
OPTIMIZE my_database.events_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (user_id)

This gives you the best of both worlds — partition pruning on date AND data skipping on user_id.


VACUUM: cleaning up old files

VACUUM removes old Parquet files that are no longer referenced by the current version of your table. These are files left behind from previous versions after OPTIMIZE, UPDATE, DELETE, or MERGE operations.

Basic usage

-- Remove files older than the default retention period (7 days)
VACUUM my_database.events_table

Setting the retention period

The default retention period is 7 days, meaning VACUUM only deletes files that have been unreferenced for more than 7 days. You can change this:

-- Remove files older than 30 days (more conservative)
VACUUM my_database.events_table RETAIN 720 HOURS

-- Remove files older than 24 hours (aggressive — be careful)
VACUUM my_database.events_table RETAIN 24 HOURS
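Since VACUUM takes retention in hours but most teams think in days, a tiny helper avoids arithmetic mistakes when generating these statements. The table name here is illustrative:

```python
# Build a VACUUM statement from a retention expressed in days.
# (VACUUM's RETAIN clause expects hours: days * 24.)
def vacuum_statement(table: str, retention_days: float) -> str:
    hours = int(retention_days * 24)
    return f"VACUUM {table} RETAIN {hours} HOURS"

print(vacuum_statement("my_database.events_table", 7))
# VACUUM my_database.events_table RETAIN 168 HOURS
print(vacuum_statement("my_database.events_table", 30))
# VACUUM my_database.events_table RETAIN 720 HOURS
```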

The time travel tradeoff

Here's the critical thing to understand: VACUUM permanently deletes old versions of your data. After running VACUUM with a 7-day retention, you can no longer time travel to any version older than 7 days.

This means:

-- This works if the version is within retention period
SELECT * FROM my_database.events_table VERSION AS OF 42

-- This fails after VACUUM if version 30 is older than retention
SELECT * FROM my_database.events_table VERSION AS OF 30
-- Error: "The data files referenced by version 30 have been deleted"

Dry run first

Always run a dry run before your first VACUUM to see what would be deleted:

-- See what files would be deleted (without actually deleting them)
VACUUM my_database.events_table DRY RUN

This shows you the list of files that would be removed and their total size. It's a good safety check.

Don't set retention below 7 days without understanding the risks

If you try to set retention below 7 days, Databricks will throw an error by default:

-- This will fail with a safety check error
VACUUM my_database.events_table RETAIN 12 HOURS

You can override this safety check, but only do so if you're sure no concurrent readers or writers are accessing old versions:

-- Override the safety check (use with caution)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_database.events_table RETAIN 12 HOURS;
SET spark.databricks.delta.retentionDurationCheck.enabled = true;

I've seen teams accidentally break running pipelines by vacuuming too aggressively. If a long-running query started before the VACUUM, it might try to read files that have been deleted. Stick with 7 days unless you have a specific reason to go shorter.


📋 When OPTIMIZE and VACUUM aren't enough

Table still slow or broken? The Delta Table Troubleshooting Checklist walks through every common failure — transaction log issues, file corruption, MERGE conflicts, and performance problems. $9

The right order: OPTIMIZE first, then VACUUM

Always run OPTIMIZE before VACUUM. Here's why:

  1. OPTIMIZE creates new, compacted files and marks the old small files as no longer needed
  2. VACUUM then cleans up those old files (after the retention period)

If you VACUUM first, you're cleaning up old files but your table still has thousands of small files that haven't been compacted. If you only OPTIMIZE without ever running VACUUM, your storage costs will keep growing because the old files are never removed.

If you want to understand the storage layer more deeply — how Parquet files work, how Delta Lake manages transactions — DataCamp has solid courses on Spark internals and Delta Lake fundamentals.

A complete maintenance script

Here's the maintenance script I run on production tables:

-- Step 1: Compact small files and co-locate data
OPTIMIZE my_database.events_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (user_id);

-- Step 2: Clean up old files beyond the retention period
VACUUM my_database.events_table RETAIN 168 HOURS; -- 7 days

Scheduling maintenance

In Databricks, you can schedule this as a job that runs daily during off-peak hours. Create a notebook with the maintenance commands and attach it to a scheduled workflow:

# maintenance_notebook.py
tables_to_maintain = [
    ("my_database.events_table", "event_date", "user_id"),
    ("my_database.orders_table", "order_date", "customer_id"),
    ("my_database.sessions_table", "session_date", "session_id"),
]

for table, partition_col, zorder_col in tables_to_maintain:
    print(f"Optimizing {table}...")
    spark.sql(f"""
        OPTIMIZE {table}
        WHERE {partition_col} >= current_date() - INTERVAL 7 DAYS
        ZORDER BY ({zorder_col})
    """)
    
    print(f"Vacuuming {table}...")
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
    
    print(f"Done with {table}")

This loops through all your tables and runs maintenance on each one. Simple, effective, and easy to extend as you add new tables.


Quick reference

  • OPTIMIZE — compacts small files into larger ones. Run after bulk writes or on a daily schedule. Typical impact: 2-10x faster reads.
  • Z-ORDER — co-locates data by the specified columns. Run as a clause of OPTIMIZE (it's not a standalone command). Typical impact: large speedups for queries that filter on those columns.
  • VACUUM — deletes old files no longer referenced by the current table version. Run after OPTIMIZE, on a daily or weekly schedule. Typical impact: no change to query speed, lower storage cost.

Key takeaways

OPTIMIZE, Z-ORDER, and VACUUM are not optional for production Delta tables — they're essential maintenance. Without them, your tables accumulate small files, queries slow down, and storage costs grow unchecked.

The recipe is simple: OPTIMIZE + Z-ORDER your most-queried columns, then VACUUM old files. Schedule it daily, and your Delta tables will stay fast and lean.

If you're new to Delta Lake and still getting duplicates or merge issues sorted out, check out my previous guide on how to fix duplicate records in Delta tables. And if you want to deploy your own data pipeline on a budget, here's my step-by-step guide on running a scheduled Python ETL pipeline on a VPS — you can get a production server running for $6/month on DigitalOcean.


Want all 20 Delta table troubleshooting patterns in one place? Grab the Delta Table Troubleshooting Checklist — copy-paste diagnostic queries for every issue, just $9.

Found this useful? Subscribe to PipelinePulse to get notified when new data engineering guides go live. I publish practical, no-fluff tutorials based on real production experience.