How to Handle Null Values in Spark DataFrames [Complete Tutorial]

Every null-handling pattern for Spark DataFrames — detection, filling, dropping, null-safe joins, and production-ready frameworks with PySpark and SQL examples.


Null values are the silent killers of data pipelines. They don't throw errors when you write them. They don't show up in row counts. They just quietly break your joins, corrupt your aggregations, and produce wrong numbers that nobody notices until a stakeholder asks why the revenue report is off by 30%.

I deal with nulls in Spark DataFrames every week — from source systems that send blanks instead of nulls, to API responses with missing fields, to JDBC reads where database NULLs become Spark nulls. This tutorial covers every null-handling pattern you'll need, with both PySpark and SQL examples.


How nulls behave in Spark (and why they're dangerous)

Before you fix nulls, you need to understand how Spark treats them. The behavior isn't always intuitive.

Nulls in comparisons

Spark follows SQL's three-valued logic. Any comparison involving null returns null — not true, not false:

from pyspark.sql.functions import col, lit

df = spark.createDataFrame([(1, None), (2, "hello"), (3, None)], ["id", "name"])

# This does NOT return rows where name is null
df.filter(col("name") == None).show()  # Returns 0 rows!

# This is the correct way
df.filter(col("name").isNull()).show()  # Returns 2 rows

Nulls in aggregations

Most aggregate functions ignore nulls silently. This can produce misleading results:

-- If 3 out of 10 amounts are null, this averages only the 7 non-null values
SELECT AVG(amount) FROM orders
-- Returns the average of 7 values, not 10

-- To treat nulls as zero:
SELECT AVG(COALESCE(amount, 0)) FROM orders
-- Returns the average of all 10 values, with nulls counted as 0

Nulls in joins

This is where nulls cause the most damage. Null never equals null in a join:

-- Rows where customer_id is null will NEVER match, even if both sides have nulls
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
-- Any order with customer_id = null is silently dropped

If you're losing rows after a join, check for nulls in your join keys first.


📄 All null patterns on one page

Get the PySpark Null Handling Cheat Sheet — every detection, filling, filtering, and null-safe join pattern in a printable PDF. $4.99

Detecting nulls

Before you handle nulls, find them. Here's how to get a complete picture.

Count nulls per column

from pyspark.sql.functions import col, sum as spark_sum, when, count

df = spark.table("orders")

# Count nulls in every column at once
null_counts = df.select([
    spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
])
null_counts.show(truncate=False)

SQL version

SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_ids,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_dates
FROM orders

Null percentage per column

Sometimes raw counts aren't enough — you need to know what percentage of your data is null:

total = df.count()
null_pcts = df.select([
    (spark_sum(when(col(c).isNull(), 1).otherwise(0)) / total * 100)
    .alias(c)
    for c in df.columns
])
null_pcts.show(truncate=False)

Filling nulls

The most common null-handling strategy: replace nulls with a default value.

fillna / COALESCE — replace with a constant

# Fill all null strings with "unknown"
df_filled = df.fillna("unknown", subset=["name", "city"])

# Fill all null numbers with 0
df_filled = df.fillna(0, subset=["amount", "quantity"])

# Fill different columns with different defaults
df_filled = df.fillna({
    "name": "unknown",
    "amount": 0,
    "status": "pending",
    "order_date": "1970-01-01"
})

SQL version

SELECT
  order_id,
  COALESCE(customer_name, 'unknown') AS customer_name,
  COALESCE(amount, 0) AS amount,
  COALESCE(status, 'pending') AS status
FROM orders

Fill with column-level logic

Sometimes the default depends on other columns:

from pyspark.sql.functions import when, col, coalesce

# If shipping_address is null, use billing_address
df = df.withColumn(
    "shipping_address",
    coalesce(col("shipping_address"), col("billing_address"))
)

# Conditional fill based on another column
df = df.withColumn(
    "discount",
    when(col("discount").isNull() & (col("customer_type") == "premium"), 0.10)
    .when(col("discount").isNull(), 0)
    .otherwise(col("discount"))
)

Fill with aggregated values

Replace nulls with the column's mean, median, or mode:

from pyspark.sql.functions import mean

# Calculate the mean
# (collect()[0][0] is None if the column is entirely null;
# guard against that before calling fillna, which rejects None)
avg_amount = df.select(mean("amount")).collect()[0][0]

# Fill nulls with the mean
df_filled = df.fillna(avg_amount, subset=["amount"])

Dropping nulls

Sometimes the right answer is to remove rows with nulls entirely.

Drop rows where any column is null

# Drop rows where ANY column has a null
df_clean = df.dropna()

# Drop rows where specific columns have nulls
df_clean = df.dropna(subset=["order_id", "customer_id"])

Drop rows with a threshold

# Keep rows that have at least 3 non-null values
df_clean = df.dropna(thresh=3)

SQL version

-- Drop rows where critical columns are null
SELECT * FROM orders
WHERE order_id IS NOT NULL
  AND customer_id IS NOT NULL
  AND amount IS NOT NULL

Handling nulls in joins

Nulls in join keys silently drop rows. Here are patterns to prevent that.

Use a null-safe equality operator

Spark has a null-safe equality operator (<=> in SQL, the eqNullSafe method in PySpark) that treats null = null as true:

from pyspark.sql.functions import col

# Standard join: null != null (rows dropped)
result = df1.join(df2, df1.key == df2.key)

# Null-safe join: null == null (rows kept)
result = df1.join(df2, df1.key.eqNullSafe(df2.key))

SQL version

-- Standard: drops rows where key is null on either side
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id

-- Null-safe: keeps rows where both sides are null
SELECT * FROM orders o JOIN customers c ON o.customer_id <=> c.customer_id

Coalesce before joining

Replace nulls with a sentinel value before the join:

from pyspark.sql.functions import coalesce, col, lit

df1 = df1.withColumn("key", coalesce(col("key"), lit("__MISSING__")))
df2 = df2.withColumn("key", coalesce(col("key"), lit("__MISSING__")))
result = df1.join(df2, "key")

Handling nulls from JDBC sources

If you're reading from PostgreSQL, MySQL, or other databases via JDBC, you'll encounter specific null issues.

NaN vs null from PostgreSQL

PostgreSQL can return NaN (Not a Number) for numeric columns, which Spark treats differently from null. I hit this exact issue reading a transactions table where amounts were stored as NaN instead of null:

# NaN values from PostgreSQL won't be caught by isNull()
from pyspark.sql.functions import isnan

# Check for both null AND NaN
df = df.filter(~(col("amount").isNull() | isnan(col("amount"))))

# Or replace NaN with null first, then handle uniformly
from pyspark.sql.functions import when, isnan

df = df.withColumn(
    "amount",
    when(isnan(col("amount")), None).otherwise(col("amount"))
)

Push null handling to the JDBC query

For better performance, handle nulls in the SQL pushdown rather than in Spark:

query = """
(SELECT
    id,
    COALESCE(name, 'unknown') AS name,
    NULLIF(amount::text, 'NaN')::numeric AS amount
FROM transactions) AS t
"""

df = spark.read.jdbc(url=jdbc_url, table=query, properties=props)

This pushes the null/NaN handling to PostgreSQL, reducing the data transferred to Spark.


Null-handling patterns for production pipelines

Pattern 1: Validate then fill

Check for nulls first, log any issues, then fill:

def handle_nulls(df, critical_cols, fill_defaults):
    """Validate critical columns, fill defaults, log issues."""
    
    # Check critical columns (these should never be null)
    for col_name in critical_cols:
        null_count = df.filter(col(col_name).isNull()).count()
        if null_count > 0:
            print(f"WARNING: {null_count} nulls in {col_name}")
    
    # Fill non-critical columns with defaults
    df = df.fillna(fill_defaults)
    
    return df

# Usage
df = handle_nulls(
    df,
    critical_cols=["order_id", "customer_id"],
    fill_defaults={"amount": 0, "status": "unknown", "city": "unknown"}
)

Pattern 2: Quarantine bad records

Instead of dropping or filling, send null records to a separate table for investigation:

# Split into clean and quarantine
clean_df = df.filter(
    col("order_id").isNotNull() &
    col("customer_id").isNotNull()
)

quarantine_df = df.filter(
    col("order_id").isNull() |
    col("customer_id").isNull()
)

# Write both
clean_df.write.mode("append").saveAsTable("orders_clean")
quarantine_df.write.mode("append").saveAsTable("orders_quarantine")

print(f"Clean: {clean_df.count()}, Quarantine: {quarantine_df.count()}")

If null handling is tripping you up, it usually means the PySpark fundamentals need strengthening. DataCamp has a hands-on PySpark track that covers DataFrames, filtering, joins, and the patterns you'll use daily — including null-safe operations.

Quick reference

Operation | PySpark | SQL
--- | --- | ---
Check for null | col("x").isNull() | x IS NULL
Check for not null | col("x").isNotNull() | x IS NOT NULL
Fill with constant | df.fillna(0, ["x"]) | COALESCE(x, 0)
Fill with column | coalesce(col("a"), col("b")) | COALESCE(a, b)
Drop null rows | df.dropna(subset=["x"]) | WHERE x IS NOT NULL
Null-safe join | df1.key.eqNullSafe(df2.key) | a <=> b
Check for NaN | isnan(col("x")) | isnan(x)

Key takeaways

Null values are a data engineering problem, not just a coding problem. The right approach depends on your business context — sometimes a null should be zero, sometimes it should be "unknown", and sometimes it means the row should be dropped entirely.

Three rules I follow in every pipeline:

  1. Detect before you fix. Always count nulls per column before deciding how to handle them. A column that's 95% null is a different problem than one that's 0.5% null.
  2. Be explicit about your strategy. Don't let Spark's default null behavior silently drop rows or skew aggregations. Write the COALESCE or the fillna — make it visible in your code.
  3. Add null checks to your quality framework. Nulls in critical columns should trigger alerts, not silent data corruption.

For building automated quality checks that catch nulls before they reach production, see my data quality checks guide. For fixing the duplicates that often accompany null issues in MERGE operations, check out how to fix duplicate records in Delta tables. And for a complete reference on all 20 common Delta table issues including nulls, grab the Delta Table Troubleshooting Checklist — just $9.


Subscribe to PipelinePulse for weekly data engineering tutorials. No theory fluff — just working code from real production pipelines.