pyspark

PySpark Window Functions: The Complete Guide with Examples

Master PySpark window functions with real code examples. Covers row_number, rank, lag, lead, running totals, moving averages, sessionization, and more — plus 6 common gotchas that trip up even senior engineers.

PipelinePulse

18 Mar 2026 • 7 min read

Master every PySpark window function — ranking, analytics, and advanced patterns like sessionization and streak detection. Complete guide with copy-paste code.

Window functions are one of those PySpark features that separate people who write SQL-with-extra-steps from people who actually solve problems efficiently. Once you understand them, you'll use them everywhere — deduplication, running totals, time-series analysis, sessionization, and more.

This guide covers every window function you'll need in production, with real code you can drop into your Databricks notebooks today.

What Are Window Functions?

A window function performs a calculation across a set of rows that are related to the current row — without collapsing them into a single output like groupBy does.

Think of it this way: groupBy gives you one row per group. A window function gives you one result per row, but each result can "see" the other rows in its group.

from pyspark.sql import Window
from pyspark.sql import functions as F

# groupBy — collapses rows
df.groupBy('department').agg(F.avg('salary'))
# Result: 1 row per department

# Window function — keeps all rows, adds a column
w = Window.partitionBy('department')
df.withColumn('dept_avg', F.avg('salary').over(w))
# Result: every row gets its department's average as a new column

This is powerful because you can compare each row to its group without losing detail.

Building a WindowSpec

Every window function needs a WindowSpec that tells Spark how to group and order rows. There are three components:

partitionBy — Define Groups

# Each department is its own window
w = Window.partitionBy('department')

# Multiple partition columns
w = Window.partitionBy('department', 'region')

# No partition = all rows are one group
w = Window.orderBy('created_at')

orderBy — Define Sort Order

# Required for ranking functions
w = Window.partitionBy('department').orderBy('salary')

# Descending order
w = Window.partitionBy('department').orderBy(F.desc('salary'))

# Multiple sort columns (tiebreaker)
w = Window.partitionBy('department').orderBy(F.desc('salary'), 'hire_date')

Frame Boundaries — Control Which Rows

This is where most people get confused, so let me be clear.

When you add orderBy to a window, Spark applies a default frame: from the first row of the partition to the current row. This means sum() becomes a running sum, not a total.

You can override this with rowsBetween or rangeBetween:

# Running total (default with orderBy)
w_running = Window.partitionBy('dept').orderBy('date') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Full partition (override the default)
w_full = Window.partitionBy('dept').orderBy('date') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# 3-row sliding window (current + 1 before + 1 after)
w_sliding = Window.partitionBy('dept').orderBy('date') \
    .rowsBetween(-1, 1)

# 7-row lookback (current + 6 preceding)
w_7day = Window.partitionBy('dept').orderBy('date') \
    .rowsBetween(-6, 0)

The built-in boundary constants:

Window.unboundedPreceding — first row in the partition
Window.unboundedFollowing — last row in the partition
Window.currentRow — the current row (same as 0)

Ranking Functions

These assign a position to each row within its partition. All require orderBy.

w = Window.partitionBy('department').orderBy(F.desc('salary'))

df.withColumn('row_num',    F.row_number().over(w)) \
  .withColumn('rank',       F.rank().over(w)) \
  .withColumn('dense_rank', F.dense_rank().over(w)) \
  .withColumn('pct_rank',   F.percent_rank().over(w)) \
  .withColumn('ntile_4',    F.ntile(4).over(w))

How They Handle Ties

This trips people up, so here's a concrete example. Given salaries: 90, 90, 80, 70:

Function	Output	Behavior
`row_number()`	1, 2, 3, 4	Always unique — arbitrary tiebreak
`rank()`	1, 1, 3, 4	Ties share rank, next rank skips
`dense_rank()`	1, 1, 2, 3	Ties share rank, no gaps
`percent_rank()`	0.0, 0.0, 0.67, 1.0	Relative position as 0–1 fraction
`ntile(4)`	1, 2, 3, 4	Splits into N roughly equal buckets

Dedup with row_number() — The #1 Pattern

This is by far the most common window function pattern you'll write. Keep the latest record per key:

w = Window.partitionBy('user_id').orderBy(F.desc('updated_at'))

deduped = df.withColumn('rn', F.row_number().over(w)) \
            .filter(F.col('rn') == 1) \
            .drop('rn')

I use this pattern constantly when dealing with duplicate records in Delta tables. If you're running into dedup issues at scale, my Delta Table Troubleshooting Checklist covers the MERGE-based causes and fixes.

Top-N Per Group

Top 3 earners per department:

w = Window.partitionBy('department').orderBy(F.desc('salary'))

top3 = df.withColumn('rank', F.dense_rank().over(w)) \
         .filter(F.col('rank') <= 3)

Use dense_rank() here instead of row_number() if you want ties to both appear in the result.

📄 Want this as a printable reference?

Get the PySpark Window Functions Cheat Sheet — all 16 functions, comparison tables, copy-paste patterns, and gotchas in a 6-page PDF. $4.99

Analytic (Offset) Functions

These let you access values from other rows relative to the current row — previous values, next values, first and last in the group.

w = Window.partitionBy('product').orderBy('sale_date')

df.withColumn('prev_sale',  F.lag('amount', 1).over(w)) \
  .withColumn('next_sale',  F.lead('amount', 1).over(w)) \
  .withColumn('first_sale', F.first('amount').over(w)) \
  .withColumn('last_sale',  F.last('amount').over(w))

Function	What it does
`lag(col, n, default)`	Value from n rows before
`lead(col, n, default)`	Value from n rows after
`first(col)`	First value in the frame
`last(col)`	Last value in the frame

Day-Over-Day Change

w = Window.partitionBy('store_id').orderBy('date')

df.withColumn('prev_revenue', F.lag('revenue', 1).over(w)) \
  .withColumn('dod_change',
      F.when(F.col('prev_revenue').isNotNull(),
          F.round(
              (F.col('revenue') - F.col('prev_revenue'))
              / F.col('prev_revenue') * 100, 2
          )
      ).otherwise(None)
  )

lag() and lead() return null at partition boundaries — the first row has no previous row. Always handle this with .isNotNull() or pass a default: F.lag('col', 1, 0).

If you're working with null values more broadly, check out my guide on handling null values in PySpark. The patterns overlap a lot.

Warning: last() Is Non-Deterministic Without an Explicit Frame

This catches everyone at least once:

# WRONG — returns current row's value, not partition's last value
F.last('amount').over(Window.partitionBy('dept').orderBy('date'))

# CORRECT — explicit frame to get the true last value
w = Window.partitionBy('dept').orderBy('date') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
F.last('amount').over(w)

When orderBy is specified, the default frame is unboundedPreceding to currentRow. So last() just returns the current row's value — not the partition's last. Always set an explicit frame when using last() with orderBy.

Aggregate Window Functions

Standard aggregations applied over a window instead of collapsing rows.

w = Window.partitionBy('department')

df.withColumn('dept_avg',   F.avg('salary').over(w)) \
  .withColumn('dept_sum',   F.sum('salary').over(w)) \
  .withColumn('dept_count', F.count('*').over(w)) \
  .withColumn('dept_min',   F.min('salary').over(w)) \
  .withColumn('dept_max',   F.max('salary').over(w))

Running Total

w = Window.partitionBy('account_id').orderBy('txn_date') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn('running_balance', F.sum('amount').over(w))

7-Day Moving Average

w = Window.partitionBy('sensor_id').orderBy('date') \
    .rowsBetween(-6, 0)  # current row + 6 preceding = 7 rows

df.withColumn('moving_avg_7d', F.avg('reading').over(w))

rowsBetween(-6, 0) gives a 7-row window. For calendar-based windows (actual 7 days, not 7 rows), use rangeBetween with a unix timestamp column.

Percent of Total

w = Window.partitionBy('region')

df.withColumn('pct_of_region',
    F.round(F.col('sales') / F.sum('sales').over(w) * 100, 2)
)

This is one of those patterns that looks simple but would be ugly to write with groupBy + join.

Advanced Patterns

Sessionization (Gap Detection)

A classic analytics pattern. Group events into sessions based on inactivity gaps:

w = Window.partitionBy('user_id').orderBy('event_time')

sessions = df \
    .withColumn('prev_time', F.lag('event_time', 1).over(w)) \
    .withColumn('gap_seconds',
        F.unix_timestamp('event_time') - F.unix_timestamp('prev_time')
    ) \
    .withColumn('new_session',
        F.when(F.col('gap_seconds') > 1800, 1).otherwise(0)
    ) \
    .withColumn('session_id',
        F.sum('new_session').over(w)
    )

Adjust the 1800-second (30 min) threshold to your use case. This pattern chains three window functions together: lag to get the previous timestamp, a comparison to flag gaps, and a running sum to assign session IDs.

Consecutive Streak Detection

Find consecutive dates (e.g., login streaks, consecutive game appearances):

w = Window.partitionBy('player_id').orderBy('game_date')

streaks = df \
    .withColumn('rn', F.row_number().over(w)) \
    .withColumn('grp',
        F.datediff('game_date', F.lit('1970-01-01').cast('date')) - F.col('rn')
    )
# Consecutive dates share the same grp value
# Group by (player_id, grp) and count to get streak lengths

Combining Multiple Windows

Real-world queries often need several windows at once:

w_dept = Window.partitionBy('department')
w_rank = Window.partitionBy('department').orderBy(F.desc('salary'))
w_run  = Window.partitionBy('department').orderBy('hire_date') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = df \
    .withColumn('dept_avg',      F.avg('salary').over(w_dept)) \
    .withColumn('salary_rank',   F.dense_rank().over(w_rank)) \
    .withColumn('running_count', F.count('*').over(w_run)) \
    .withColumn('vs_avg',
        F.round(F.col('salary') - F.col('dept_avg'), 2)
    )

Define your windows up front and reuse them. It's more readable and Spark will optimize the execution plan.

Common Gotchas

After years of writing window functions in Databricks, these are the traps I've hit repeatedly:

1. Default frame with orderBy is not the whole partition. It's RANGE BETWEEN unboundedPreceding AND currentRow. This means sum() and avg() give running results, not totals. If you want the full partition, set an explicit frame with unboundedFollowing.

2. row_number() with ties is non-deterministic. If two rows have the same orderBy value, which gets row 1 vs row 2 is arbitrary. Add a tiebreaker column to make it deterministic.

3. last() without an explicit frame just returns the current row. See the warning in the Analytic Functions section above.

4. rangeBetween requires a numeric orderBy column. String columns throw an AnalysisException. Cast to a numeric type first.

5. Window functions can't be nested. F.sum(F.row_number().over(w)).over(w2) is invalid. Compute in separate .withColumn() steps.

6. Large partitions cause memory pressure. If one partition key has millions of rows, Spark shuffles all of them to one executor. Watch for skew — the same issue I discussed in my OPTIMIZE and Z-ORDER guide.

If window functions are still feeling shaky, it usually means the SQL and PySpark foundations need more reps. DataCamp has an interactive SQL track and a PySpark track that drill these patterns through hands-on exercises — way more effective than just reading about them.

Quick Reference Table

Function	What it does	Common use case
`row_number()`	Sequential int, no ties	Dedup, unique ordering
`rank()`	Ties share, gaps after	Competition-style ranking
`dense_rank()`	Ties share, no gaps	Top-N per group
`percent_rank()`	Rank as 0–1 fraction	Percentile position
`ntile(n)`	Split into n buckets	Quartiles, deciles
`cume_dist()`	Cumulative distribution	What % is below me?
`lag(col, n)`	Value n rows before	Previous value, DoD change
`lead(col, n)`	Value n rows after	Next value, lookahead
`first(col)`	First value in frame	Opening value, baseline
`last(col)`	Last value in frame	Closing value (use explicit frame!)
`sum(col)`	Sum over window	Running totals
`avg(col)`	Average over window	Moving averages
`count(col)`	Count over window	Running counts
`min/max(col)`	Min/max over window	Running extremes

Get the Cheat Sheet

If you want all of this in a printable 6-page PDF you can keep next to your notebook — with the comparison tables, copy-paste code patterns, and gotchas list — grab the PySpark Window Functions Cheat Sheet on Gumroad for $4.99.

Also check out the rest of the PipelinePulse quick references:

Delta Table Troubleshooting Checklist ($9)
Databricks SQL Cheat Sheet ($4.99)
PySpark Null Handling Cheat Sheet ($4.99)
Schema Evolution Quick Reference ($4.99)

Building data pipelines that don't break at 3am? That's what PipelinePulse is about. More guides at pipelinepulse.dev.