How to Debug Failing Databricks Pipelines

A systematic approach to debugging Databricks failures — the 3-bucket rule, error decision trees, and diagnostic commands from production experience.

Your Databricks job failed at 3am. You open the run page to a wall of red text, a stack trace that mentions three different Java exceptions, and no obvious clue about what actually went wrong.

I've been there more times than I'd like to admit. After years of debugging production pipelines at SOCAR, I've developed a systematic approach that gets to the root cause fast — without randomly changing cluster configs and hoping for the best.

This guide covers the most common Databricks failures, how to read the logs that actually matter, and the fixes that work.


📄 Keep this debugging workflow offline

Get the Databricks Debugging Kit — error decision trees, log reading guide, cluster config checklist, and diagnostic commands in a printable PDF. $4.99

The 3-Bucket Rule

Almost every Databricks pipeline failure falls into one of three buckets:

  1. Memory — the job ran out of RAM (OOM errors, killed executors)
  2. Data — the input data is wrong (schema changes, duplicates, nulls, NaN values)
  3. Configuration — the cluster or job settings are wrong (wrong instance type, missing libraries, permissions)

Your first job when debugging is to figure out which bucket you're in. The fix is completely different for each one.
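A first pass at this classification can even be scripted. Here's a minimal sketch — the keyword lists are my own heuristics from the errors covered in this guide, not any official mapping:

```python
# Heuristic triage: map an error message to one of the three buckets.
# Keyword lists are illustrative, not exhaustive.
MEMORY_KEYWORDS = ['outofmemoryerror', 'oom', 'executor lost', 'container killed']
DATA_KEYWORDS = ['analysisexception', 'cannot resolve', 'schema', 'duplicate', 'nan']
CONFIG_KEYWORDS = ['classnotfoundexception', 'no module named', 'permission denied']

def classify_failure(error_message: str) -> str:
    msg = error_message.lower()
    for bucket, keywords in [('memory', MEMORY_KEYWORDS),
                             ('data', DATA_KEYWORDS),
                             ('config', CONFIG_KEYWORDS)]:
        if any(k in msg for k in keywords):
            return bucket
    return 'unknown'

print(classify_failure('java.lang.OutOfMemoryError: Java heap space'))      # memory
print(classify_failure('AnalysisException: cannot resolve column `amt`'))   # data
```

It won't catch everything — the point is that a handful of keywords sorts the vast majority of failures into the right bucket before you've read a single full stack trace.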


Step 1: Read the Right Logs

Most people look at the wrong logs. Here's where to actually find useful information:

Driver logs (start here)

Clusters → Your Cluster → Driver Logs tab

The driver log contains the actual exception that killed your job. Search for these keywords in order:

Exception
Error
OOM
killed
FAILED

The first match is usually your root cause. Everything after it is just cascading failure noise.
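If you've downloaded the driver log, the same keyword scan is easy to script. A minimal sketch — the log text below is a made-up example:

```python
# Find the first line containing any high-signal keyword.
KEYWORDS = ['Exception', 'Error', 'OOM', 'killed', 'FAILED']

def first_signal(log_text: str):
    """Return (line_number, line) of the first matching line, else None."""
    for i, line in enumerate(log_text.splitlines(), start=1):
        if any(k in line for k in KEYWORDS):
            return i, line.strip()
    return None

log = """\
25/01/07 03:02:11 INFO DAGScheduler: Submitting 200 tasks
25/01/07 03:04:55 ERROR Executor: Exception in task 113.0
25/01/07 03:04:56 WARN TaskSetManager: Lost task 113.0
"""
print(first_signal(log))  # (2, '25/01/07 03:04:55 ERROR Executor: Exception in task 113.0')
```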

Spark UI (for performance issues)

Clusters → Your Cluster → Spark UI → Stages tab

Click the failed stage. Look at:

  • Task duration distribution — if one task took 100x longer than the others, you have data skew
  • Shuffle read/write — massive shuffle means you might need to repartition
  • Spill (Memory) / Spill (Disk) — any spill means your executors need more memory or your data needs repartitioning

Event log (for cluster-level issues)

Clusters → Your Cluster → Event Log

Filter for executor lost or container killed. These indicate the cluster itself is struggling — usually an OOM at the executor level that Spark can't recover from.

Job run output

Workflows → Your Job → Run history → Click the failed run

This shows you the notebook output at the point of failure. Often the most readable error message is right here.


Step 2: Match the Error to the Fix

Here's the decision tree I use. Find your error message and follow the path:

SparkException: Job aborted

This is a generic wrapper — the real cause is in the nested exception. Look for the "Caused by:" line.

Caused by: OutOfMemoryError → Memory bucket. See OOM fixes below.

Caused by: FileNotFoundException → A data file was deleted while the job was reading it. Usually caused by a concurrent VACUUM or OPTIMIZE. Fix: increase VACUUM retention or schedule maintenance jobs to not overlap with ETL.

Caused by: AnalysisException → Data bucket. A column name doesn't exist, a type cast failed, or the table path is wrong. The error message usually tells you exactly which column or table is the problem.
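Because the real error is buried in the nested exception, a small helper that pulls out the "Caused by:" chain from a pasted stack trace can save squinting. Pure Python, illustrative only — the trace below is a fabricated example:

```python
import re

def caused_by_chain(stack_trace: str) -> list:
    """Extract every 'Caused by:' line; the last one is usually the root cause."""
    return [m.strip() for m in re.findall(r'Caused by:.*', stack_trace)]

trace = """\
org.apache.spark.SparkException: Job aborted due to stage failure
    at org.apache.spark.scheduler.DAGScheduler...
Caused by: org.apache.spark.SparkException: Task failed while writing rows
    at ...
Caused by: java.lang.OutOfMemoryError: Java heap space
"""
chain = caused_by_chain(trace)
print(chain[-1])  # Caused by: java.lang.OutOfMemoryError: Java heap space
```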

DELTA_MISSING_TRANSACTION_LOG

The Delta transaction log is corrupted or the table path is wrong.

# Step 1: Verify the table exists
spark.sql("DESCRIBE DETAIL catalog.schema.my_table")

# Step 2: Check if _delta_log directory exists
dbutils.fs.ls('/path/to/table/_delta_log/')

# Step 3: If log is corrupted, check VACUUM history
spark.sql("DESCRIBE HISTORY catalog.schema.my_table LIMIT 10")

Most common cause: someone ran VACUUM with too aggressive a retention period and deleted files that active queries still needed. See my guide on VACUUM retention for safe settings.

AnalysisException: cannot resolve column

A column name in your code doesn't match the table's actual schema.

-- Check what columns actually exist
DESCRIBE TABLE catalog.schema.my_table

-- Compare with your code's expected columns
-- Look for: typos, renamed columns, dropped columns

If columns were recently renamed or dropped, someone may have enabled column mapping and made schema changes without updating downstream pipelines. This is why data contracts matter.
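To spot typos and renames fast, diff the columns your code expects against what DESCRIBE TABLE returns. A hedged standard-library sketch — `difflib` suggests the closest match for each missing column, which usually flags the rename:

```python
import difflib

def diff_columns(expected: list, actual: list) -> dict:
    """Report columns the code expects but the table lacks, with close-match hints."""
    missing = [c for c in expected if c not in actual]
    return {
        c: difflib.get_close_matches(c, actual, n=1)  # likely rename/typo candidate
        for c in missing
    }

actual_cols = ['id', 'customer_id', 'order_amount', 'updated_at']  # from DESCRIBE TABLE
expected_cols = ['id', 'customer_id', 'order_amt', 'created_at']   # what the code uses
print(diff_columns(expected_cols, actual_cols))
# {'order_amt': ['order_amount'], 'created_at': ['updated_at']}
```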

java.lang.OutOfMemoryError

The single most common Databricks failure. Three sub-types:

Driver OOM — the driver ran out of memory, usually because of .collect(), .toPandas(), or a large broadcast join.

from pyspark.sql import functions as F

# BAD: pulling all data to driver
all_data = df.collect()  # OOM if df has millions of rows
pdf = df.toPandas()      # Same problem

# GOOD: limit first, or process on executors
sample = df.limit(10000).toPandas()
result = df.agg(F.count('*')).collect()  # small result is fine

Executor OOM — one or more executors ran out of memory during processing. Usually caused by:

  1. A single partition that's too large (data skew)
  2. A large shuffle operation
  3. Not enough memory for the workload

# Check for data skew
from pyspark.sql import functions as F

df.groupBy('partition_key').count().orderBy(F.desc('count')).show(10)
# If one partition has 10x more rows than others, that's your problem

# Fix: add a salt key to distribute the skewed key
df = df.withColumn('salt', F.concat(F.col('skewed_key'), F.lit('_'), (F.rand() * 10).cast('int')))

Metaspace OOM — rare, caused by too many classes loaded. Usually means too many UDFs or a library conflict. Restart the cluster and reduce UDF usage.

MERGE condition produced duplicate rows

Your MERGE source data has duplicates on the merge key. Delta Lake requires the source to match at most one row per target row.

# Find duplicates in your source
source_df.groupBy('merge_key').count().filter('count > 1').show()

# Fix: dedup before MERGE
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('merge_key').orderBy(F.desc('updated_at'))
deduped = source_df.withColumn('rn', F.row_number().over(w)) \
                   .filter(F.col('rn') == 1) \
                   .drop('rn')

I covered this exact issue in detail in my duplicate records in Delta tables guide. It's the #1 reason MERGE operations fail.
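The window-function dedup is just "keep the latest row per key." The same logic in plain Python is a handy mental model, and useful for unit-testing the rule on small samples — the rows below are made up:

```python
def keep_latest(rows: list, key: str, order_col: str) -> list:
    """Keep one row per key: the one with the greatest order_col value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_col] > latest[k][order_col]:
            latest[k] = row
    return list(latest.values())

rows = [
    {'merge_key': 1, 'updated_at': '2025-01-01', 'amount': 10},
    {'merge_key': 1, 'updated_at': '2025-01-03', 'amount': 30},
    {'merge_key': 2, 'updated_at': '2025-01-02', 'amount': 20},
]
print(keep_latest(rows, 'merge_key', 'updated_at'))  # key 1 keeps the Jan 3 row
```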

PSQLException / JDBC errors with NaN

If you're reading from PostgreSQL via JDBC and hit Bad value for type BigDecimal: NaN, the source table has NaN values in numeric columns.

# Fix: push the NULLIF conversion to the JDBC query level
jdbc_df = spark.read.format('jdbc') \
    .option('url', jdbc_url) \
    .option('dbtable', """(
        SELECT id, 
            NULLIF(amount::text, 'NaN')::numeric AS amount,
            NULLIF(balance::text, 'NaN')::numeric AS balance
        FROM transactions
    ) AS t""") \
    .load()

This pushes the NaN-to-null conversion to PostgreSQL before Spark ever sees the data. I've used this exact fix in production — it's the cleanest solution.
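If you can't modify the JDBC query, the same NaN-to-null normalization can happen after the read. Here's the idea sketched on plain dicts — in real pipeline code you'd express this with Spark column functions instead:

```python
import math

def nan_to_none(row: dict) -> dict:
    """Replace float NaN values with None so downstream numeric handling is sane."""
    return {
        k: None if isinstance(v, float) and math.isnan(v) else v
        for k, v in row.items()
    }

row = {'id': 1, 'amount': float('nan'), 'balance': 250.0}
print(nan_to_none(row))  # {'id': 1, 'amount': None, 'balance': 250.0}
```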

Stream stopped with CheckpointError

Streaming checkpoint is corrupted or was moved/deleted.

# Option A: Delete checkpoint and restart from scratch
dbutils.fs.rm('/checkpoints/my_stream/', recurse=True)
# Then restart the stream — it will reprocess from the beginning

# Option B: Start from a specific version using time travel
stream = spark.readStream.format('delta') \
    .option('startingVersion', 100) \
    .table('source_table')

See my time travel guide for more on replaying from specific versions.


Step 3: Cluster Configuration Checklist

Wrong cluster config is a silent killer — your job still runs, but slowly, expensively, or it crashes under load. Here's what to check:

| Setting        | Dev / Test      | Production Batch    | Production Streaming     |
|----------------|-----------------|---------------------|--------------------------|
| Worker type    | Standard_DS3_v2 | Standard_DS4_v2     | Standard_DS4_v2          |
| Min workers    | 1               | 2                   | 2                        |
| Max workers    | 4               | 8-16                | 8                        |
| Autoscale      | Yes             | Yes                 | No (fixed for streaming) |
| Spot instances | Yes             | Yes (with fallback) | No                       |
| Photon         | Optional        | Yes                 | Yes                      |
| Auto-terminate | 10 min          | 30 min              | Never                    |

Key rules:

  • Use job clusters for scheduled jobs, not all-purpose clusters. Job clusters spin down after the task completes, saving 30-50% on costs.
  • Don't use spot instances for streaming. Spot eviction kills your stream mid-batch.
  • Enable Photon for production. The vectorized engine is 2-3x faster for most workloads, which means fewer DBUs.

For a complete cost optimization strategy, see my Databricks Cost Optimization Checklist.

For pipelines that don't need Databricks-scale compute, a simple DigitalOcean VPS at $6/month can handle scheduled Python ETL just fine — see my VPS ETL guide. For leveling up your Databricks debugging skills, Pluralsight has advanced Spark performance tuning courses.

Step 4: Quick Diagnostic Commands

Bookmark these — you'll use them constantly:

-- Table health check
DESCRIBE DETAIL catalog.schema.my_table
-- Look at: numFiles (too many = small file problem), sizeInBytes, lastModified

-- Small file problem?
-- If numFiles > 10x what you'd expect, run OPTIMIZE
OPTIMIZE catalog.schema.my_table

-- Check data freshness
SELECT MAX(updated_at) AS latest, COUNT(*) AS total_rows
FROM catalog.schema.my_table

-- Check for partition skew
SELECT partition_col, COUNT(*) AS row_count
FROM catalog.schema.my_table
GROUP BY partition_col
ORDER BY row_count DESC
LIMIT 10

-- Table version history (what changed recently?)
DESCRIBE HISTORY catalog.schema.my_table LIMIT 20

-- Spark config dump (what's actually set?)
-- In a notebook cell (Python):

# Full Spark config dump
spark.sparkContext.getConf().getAll()

# Check current cluster memory
sc = spark.sparkContext
print(f"Driver memory: {sc._conf.get('spark.driver.memory', 'default')}")
print(f"Executor memory: {sc._conf.get('spark.executor.memory', 'default')}")
print(f"Shuffle partitions: {spark.conf.get('spark.sql.shuffle.partitions')}")

Common Performance Fixes

| Problem                  | Symptom                                    | Fix                                      |
|--------------------------|--------------------------------------------|------------------------------------------|
| Small file problem       | Slow reads, numFiles very high             | OPTIMIZE table; enable autoCompact       |
| Data skew                | One task takes 10x longer than others      | Add salt key or repartition before join  |
| Shuffle spill            | "Disk spill" in Spark UI task metrics      | Increase spark.sql.shuffle.partitions    |
| Driver OOM on collect()  | Driver crashes on .collect() or .toPandas()| Use .limit() or .take(n) instead         |
| Broadcast join too large | BroadcastExchangeExec timeout              | Increase broadcast threshold or disable  |
| Slow MERGE               | MERGE takes hours on large tables          | Add Z-ORDER on merge key columns         |

The Debugging Workflow (Summary)

  1. Check the job run output — often the most readable error
  2. Search driver logs for the first Exception/Error
  3. Classify into bucket — memory, data, or config
  4. Match the error to the fix using the decision tree above
  5. Check Spark UI if it's a performance issue (skew, spill, shuffle)
  6. Run diagnostic commands to inspect table health
  7. Fix and re-run — preferably on a smaller dataset first
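The steps above can be compressed into a single triage sketch: extract the root cause, classify it, and suggest a next action. The keyword heuristics and step texts are my own, illustrative only:

```python
import re

# Keyword heuristics per bucket — illustrative, not exhaustive.
BUCKETS = {
    'memory': ['outofmemoryerror', 'oom', 'executor lost', 'container killed'],
    'data': ['analysisexception', 'cannot resolve', 'duplicate', 'nan', 'schema'],
    'config': ['classnotfoundexception', 'no module named', 'permission denied'],
}
NEXT_STEP = {
    'memory': 'Check Spark UI for skew/spill; reduce collect()/broadcast, or add memory',
    'data': 'Inspect input tables: DESCRIBE TABLE, duplicate and null checks',
    'config': 'Review cluster config: instance type, libraries, permissions',
    'unknown': 'Search driver logs for the first Exception/Error line',
}

def triage(stack_trace: str):
    """Return (root_cause_line, bucket, suggested_next_step) for a pasted stack trace."""
    causes = re.findall(r'Caused by:.*', stack_trace)
    root = causes[-1] if causes else stack_trace.splitlines()[0]
    lowered = root.lower()
    bucket = next((b for b, kws in BUCKETS.items()
                   if any(k in lowered for k in kws)), 'unknown')
    return root.strip(), bucket, NEXT_STEP[bucket]

trace = ("org.apache.spark.SparkException: Job aborted\n"
         "Caused by: java.lang.OutOfMemoryError: Java heap space")
print(triage(trace))
```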

Get the Debugging Kit

Want this entire debugging workflow as a printable PDF — error decision trees, log reading guide, cluster config checklist, and diagnostic commands? Grab the Databricks Debugging Kit on Gumroad for $4.99.


Building data pipelines that don't break at 3am? That's what PipelinePulse is about. More guides at pipelinepulse.dev.