How to Handle Null Values in Spark DataFrames [Complete Tutorial]
Every null-handling pattern for Spark DataFrames — detection, filling, dropping, null-safe joins, and production-ready frameworks with PySpark and SQL examples.
Null values are the silent killers of data pipelines. They don't throw errors when you write them. They don't show up in row counts. They just quietly break your joins, corrupt your aggregations, and produce wrong numbers that nobody notices until a stakeholder asks why the revenue report is off by 30%.
I deal with nulls in Spark DataFrames every week — from source systems that send blanks instead of nulls, to API responses with missing fields, to JDBC reads where database NULLs become Spark nulls. This tutorial covers every null-handling pattern you'll need, with both PySpark and SQL examples.
How nulls behave in Spark (and why they're dangerous)
Before you fix nulls, you need to understand how Spark treats them. The behavior isn't always intuitive.
Nulls in comparisons
Spark follows SQL's three-valued logic. Any comparison involving null returns null — not true, not false:
from pyspark.sql.functions import col
df = spark.createDataFrame([(1, None), (2, "hello"), (3, None)], ["id", "name"])
# This does NOT return rows where name is null
df.filter(col("name") == None).show() # Returns 0 rows!
# This is the correct way
df.filter(col("name").isNull()).show() # Returns 2 rows
Nulls in aggregations
Most aggregate functions ignore nulls silently. This can produce misleading results:
-- If 3 out of 10 amounts are null, this averages only the 7 non-null values
SELECT AVG(amount) FROM orders
-- Returns the average of 7 values, not 10
-- To treat nulls as zero:
SELECT AVG(COALESCE(amount, 0)) FROM orders
-- Returns the average of all 10 values, with nulls counted as 0
Nulls in joins
This is where nulls cause the most damage. Null never equals null in a join:
-- Rows where customer_id is null will NEVER match, even if both sides have nulls
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
-- Any order with customer_id = null is silently dropped
If you're losing rows after a join, check for nulls in your join keys first.
Detecting nulls
Before you handle nulls, find them. Here's how to get a complete picture.
Count nulls per column
from pyspark.sql.functions import col, sum as spark_sum, when, count
df = spark.table("orders")
# Count nulls in every column at once
null_counts = df.select([
    spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
])
null_counts.show(truncate=False)
SQL version
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_ids,
    SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids,
    SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
    SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END) AS null_order_dates
FROM orders
Null percentage per column
Sometimes raw counts aren't enough — you need to know what percentage of your data is null:
total = df.count()
null_pcts = df.select([
    (spark_sum(when(col(c).isNull(), 1).otherwise(0)) / total * 100).alias(c)
    for c in df.columns
])
null_pcts.show(truncate=False)
Filling nulls
The most common null-handling strategy: replace nulls with a default value.
fillna / COALESCE — replace with a constant
# Fill all null strings with "unknown"
df_filled = df.fillna("unknown", subset=["name", "city"])
# Fill all null numbers with 0
df_filled = df.fillna(0, subset=["amount", "quantity"])
# Fill different columns with different defaults
# Note: fillna only fills columns whose type matches the value —
# a true DateType column silently ignores a string default like this,
# so cast explicitly (or use coalesce) for date columns
df_filled = df.fillna({
    "name": "unknown",
    "amount": 0,
    "status": "pending",
    "order_date": "1970-01-01"
})
SQL version
SELECT
    order_id,
    COALESCE(customer_name, 'unknown') AS customer_name,
    COALESCE(amount, 0) AS amount,
    COALESCE(status, 'pending') AS status
FROM orders
Fill with column-level logic
Sometimes the default depends on other columns:
from pyspark.sql.functions import when, col, coalesce
# If shipping_address is null, use billing_address
df = df.withColumn(
    "shipping_address",
    coalesce(col("shipping_address"), col("billing_address"))
)
# Conditional fill based on another column
df = df.withColumn(
    "discount",
    when(col("discount").isNull() & (col("customer_type") == "premium"), 0.10)
    .when(col("discount").isNull(), 0)
    .otherwise(col("discount"))
)
Fill with aggregated values
Replace nulls with the column's mean, median, or mode:
from pyspark.sql.functions import mean
# Calculate the mean
avg_amount = df.select(mean("amount")).collect()[0][0]
# Fill nulls with the mean
df_filled = df.fillna(avg_amount, subset=["amount"])
Dropping nulls
Sometimes the right answer is to remove rows with nulls entirely.
Drop rows where any column is null
# Drop rows where ANY column has a null
df_clean = df.dropna()
# Drop rows where specific columns have nulls
df_clean = df.dropna(subset=["order_id", "customer_id"])
Drop rows with a threshold
# Keep rows that have at least 3 non-null values
df_clean = df.dropna(thresh=3)
SQL version
-- Drop rows where critical columns are null
SELECT * FROM orders
WHERE order_id IS NOT NULL
  AND customer_id IS NOT NULL
  AND amount IS NOT NULL
Handling nulls in joins
Nulls in join keys silently drop rows. Here are patterns to prevent that.
Use a null-safe equality operator
Spark has a null-safe equality operator (<=>) that treats null = null as true:
# Standard join: null != null (rows dropped)
result = df1.join(df2, df1.key == df2.key)
# Null-safe join: null == null (rows kept)
result = df1.join(df2, df1.key.eqNullSafe(df2.key))
SQL version
-- Standard: drops rows where key is null on either side
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id
-- Null-safe: keeps rows where both sides are null
SELECT * FROM orders o JOIN customers c ON o.customer_id <=> c.customer_id
Coalesce before joining
Replace nulls with a sentinel value before the join. Be aware that every sentinel key now matches every other sentinel key, so if both sides have many nulls this can inflate the row count:
from pyspark.sql.functions import coalesce, col, lit
df1 = df1.withColumn("key", coalesce(col("key"), lit("__MISSING__")))
df2 = df2.withColumn("key", coalesce(col("key"), lit("__MISSING__")))
result = df1.join(df2, "key")
Handling nulls from JDBC sources
If you're reading from PostgreSQL, MySQL, or other databases via JDBC, you'll encounter specific null issues.
NaN vs null from PostgreSQL
PostgreSQL can return NaN (Not a Number) for numeric columns, which Spark treats differently from null. I hit this exact issue reading a transactions table where amounts were stored as NaN instead of null:
# NaN values from PostgreSQL won't be caught by isNull()
from pyspark.sql.functions import col, isnan, when
# Check for both null AND NaN
df = df.filter(~(col("amount").isNull() | isnan(col("amount"))))
# Or replace NaN with null first, then handle uniformly
df = df.withColumn(
    "amount",
    when(isnan(col("amount")), None).otherwise(col("amount"))
)
Push null handling to the JDBC query
For better performance, handle nulls in the SQL pushdown rather than in Spark:
query = """
(SELECT
    id,
    COALESCE(name, 'unknown') AS name,
    NULLIF(amount::text, 'NaN')::numeric AS amount
FROM transactions) AS t
"""
df = spark.read.jdbc(url=jdbc_url, table=query, properties=props)
This pushes the null/NaN handling to PostgreSQL, reducing the data transferred to Spark.
Null-handling patterns for production pipelines
Pattern 1: Validate then fill
Check for nulls first, log any issues, then fill:
from pyspark.sql.functions import col

def handle_nulls(df, critical_cols, fill_defaults):
    """Validate critical columns, fill defaults, log issues."""
    # Check critical columns (these should never be null)
    for col_name in critical_cols:
        null_count = df.filter(col(col_name).isNull()).count()
        if null_count > 0:
            print(f"WARNING: {null_count} nulls in {col_name}")
    # Fill non-critical columns with defaults
    return df.fillna(fill_defaults)

# Usage
df = handle_nulls(
    df,
    critical_cols=["order_id", "customer_id"],
    fill_defaults={"amount": 0, "status": "unknown", "city": "unknown"}
)
Pattern 2: Quarantine bad records
Instead of dropping or filling, send null records to a separate table for investigation:
# Split into clean and quarantine
clean_df = df.filter(
    col("order_id").isNotNull() &
    col("customer_id").isNotNull()
)
quarantine_df = df.filter(
    col("order_id").isNull() |
    col("customer_id").isNull()
)
# Write both
clean_df.write.mode("append").saveAsTable("orders_clean")
quarantine_df.write.mode("append").saveAsTable("orders_quarantine")
print(f"Clean: {clean_df.count()}, Quarantine: {quarantine_df.count()}")
If null handling is tripping you up, it usually means the PySpark fundamentals need strengthening. DataCamp has a hands-on PySpark track that covers DataFrames, filtering, joins, and the patterns you'll use daily — including null-safe operations.
Quick reference
| Operation | PySpark | SQL |
|---|---|---|
| Check for null | col("x").isNull() | x IS NULL |
| Check for not null | col("x").isNotNull() | x IS NOT NULL |
| Fill with constant | df.fillna(0, ["x"]) | COALESCE(x, 0) |
| Fill with column | coalesce(col("a"), col("b")) | COALESCE(a, b) |
| Drop null rows | df.dropna(subset=["x"]) | WHERE x IS NOT NULL |
| Null-safe join | df1.key.eqNullSafe(df2.key) | a <=> b |
| Check for NaN | isnan(col("x")) | isnan(x) |
Key takeaways
Null values are a data engineering problem, not just a coding problem. The right approach depends on your business context — sometimes a null should be zero, sometimes it should be "unknown", and sometimes it means the row should be dropped entirely.
Three rules I follow in every pipeline:
- Detect before you fix. Always count nulls per column before deciding how to handle them. A column that's 95% null is a different problem than one that's 0.5% null.
- Be explicit about your strategy. Don't let Spark's default null behavior silently drop rows or skew aggregations. Write the COALESCE or the fillna — make it visible in your code.
- Add null checks to your quality framework. Nulls in critical columns should trigger alerts, not silent data corruption.
For building automated quality checks that catch nulls before they reach production, see my data quality checks guide. For fixing the duplicates that often accompany null issues in MERGE operations, check out how to fix duplicate records in Delta tables. And for a complete reference on all 20 common Delta table issues including nulls, grab the Delta Table Troubleshooting Checklist — just $9.
Subscribe to PipelinePulse for weekly data engineering tutorials. No theory fluff — just working code from real production pipelines.