Window Functions Explained
Introduction
Window functions, or Analytic functions as they're called in BigQuery, are a way to compute values over a group of rows. They return a single value for each row, in contrast to aggregate functions which returns a single value for a group of rows.
Aggregate vs window/analytic functions:
letter | aggregate |
---|---|
A | 5 |
C | 6 |
letter | window |
---|---|
A | 5 |
A | 5 |
C | 6 |
In the first aggregate example the resulting data was grouped by letter, but in the second window example, we preserved our rows.
Why use window functions?
Window functions are very powerful, and once you've gotten your head around how to use them, you'll be surprised at they allow you to do. Some common use cases are:
- Running/Cumulative Total
- Moving Average
- Rank rows by custom criteria and groupings
- Finding the year-over-year % Change
where the PARTITION BY
denotes how to GROUP
rows into partitions, ORDER BY
how to order the rows in those partitions, and ROWS
which rows to consider in those ordered partitions.
Window Functions
In general, window functions can be grouped into 3 types:
- Navigation functions: Return the value given a specific location criteria (e.g. first_value)
- Numbering functions: Assign a number (e.g. rank) to each row based on their position in the specified window
- Analytic functions: Perform a calculation on a set of values (e.g. sum)
Navigation functions
- FIRST_VALUE :
FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
-> Returns the value_expression for the first row in the current window frame. - LAST_VALUE :
LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
-> Returns the value_expression for the last row in the current window frame. - NTH_VALUE :
NTH_VALUE (value_expression, constant_integer_expression [{RESPECT | IGNORE} NULLS])
-> Returns the value_expression for the Nth row of the current window frame. - LEAD :
LEAD (value_expression[, offset [, default_expression]])
-> Returns the value_expression on the subsequent row. - LAG :
LAG (value_expression[, offset [, default_expression]])
-> Returns the value_expression on the preceding row. - PERCENTILE_CONT :
PERCENTILE_CONT (value_expression, percentile [{RESPECT | IGNORE} NULLS])
-> Returns the percentile of the value_expression with linear interpolation (continuous scale). - PERCENTILE_DISC :
PERCENTILE_DIST (value_expression, percentile [{RESPECT | IGNORE} NULLS])
-> Returns the percentile of the value_expression (discrete scale).
For these functions we'll use the following demo data:
player | season | first_season | last_season | per_change |
---|---|---|---|---|
Damian Lillard | 2019 | 2067 | 1978 | -4.31 |
Damian Lillard | 2020 | 2067 | 1978 | -4.31 |
Devin Booker | 2019 | 1700 | 1863 | 9.59 |
Devin Booker | 2020 | 1700 | 1863 | 9.59 |
James Harden | 2019 | 2818 | 2335 | -17.14 |
James Harden | 2020 | 2818 | 2335 | -17.14 |
Kemba Walker | 2019 | 2102 | 1145 | -45.53 |
Kemba Walker | 2020 | 2102 | 1145 | -45.53 |
Paul George | 2019 | 1978 | 1033 | -47.78 |
Paul George | 2020 | 1978 | 1033 | -47.78 |
How to find a year-over-year change
season | points_rank | player | points |
---|---|---|---|
2019 | 1 | James Harden | 2818 |
2019 | 2 | Kemba Walker | 2102 |
2019 | 3 | Damian Lillard | 2067 |
2020 | 1 | James Harden | 2335 |
2020 | 2 | Damian Lillard | 1978 |
2020 | 3 | Devin Booker | 1863 |
We used FIRST_VALUE
and LAST_VALUE
to find the scores for each player in the earliest and most recent seasons of data. Then we computed the percent difference using:
Numbering functions
- RANK :
RANK()
-> Returns the rank of each row in the ordered partition (starts at 1). - DENSE_RANK :
DENSE_RANK()
-> Returns the rank, but values of the same value get the same rank (starts at 1). - PERCENT_RANK :
PERCENT_RANK()
-> Returns the percentile rank of a row. - CUME_DIST :
CUME_DIST()
-> Returns the relative rank of a row . - NTILE:
NTILE(constant_integer_expression)
-> Returns the bucket number after dividing each partition into constant_integer_expression buckets. - ROW_NUMBER :
ROW_NUMBER()
-> Returns the sequential row number for each ordered partition.
⚠️ Numbering functions are not allowed [ROWS frame_clause]
How to get top 3 results for each group?
season | player | points | running_total_points |
---|---|---|---|
2019 | Damian Lillard | 2067 | 2067 |
2020 | Damian Lillard | 1978 | 4045 |
2019 | Devin Booker | 1700 | 1700 |
2020 | Devin Booker | 1863 | 3563 |
2019 | James Harden | 2818 | 2818 |
2020 | James Harden | 2335 | 5153 |
2019 | Kemba Walker | 2102 | 2102 |
2020 | Kemba Walker | 1145 | 3247 |
2019 | Paul George | 1978 | 1978 |
2020 | Paul George | 1033 | 3011 |
In this example, we used RANK
to rank each player by points over each season. Then we used a subquery to then return only the top 3 ranked players for each season.
Aggregate functions
Aggregate functions are available outside of window functions, but can be additionally applied over a specified window.
- ANY_VALUE :
ANY_VALUE(expression) [OVER (...)]
-> Returns expression for some row chosen from the group. Basically a random selection from an expression. - ARRAY_AGG : Returns an array of expression values.
- AVG :
AVG([DISTINCT] expression) [OVER (...)]
-> Returns the average of the non-NULL input values. - CORR :
CORR(X1, X2) [OVER (...)]
-> Returns the Pearson coefficient of correlation for a set of number pairs. - COUNT :
COUNT([DISTINCT] expression] [OVER (...)]
-> Returns the number of [distinct] elements in expression - COUNTIF :
COUNTIF(expression) [OVER (...)]
-> Returns the count of True values for expression - COVAR_POP :
COVAR_POP(X1, X2) [OVER (...)]
-> Returns the population covariance of a set of numbers - COVAR_SAMP :
COVAR_SAMP(X1, X2) [OVER (...)]
-> Returns the sample covariance of a set of numbers - MAX :
MAX(expression) [OVER (...)]
-> Returns the maximum non-NULL value of expression. - MIN:
MIN(expression) [OVER (...)]
-> Returns the minimum non-NULL value of expression. - ST_CUSTERDBSCAN :
ST_CLUSTERDBSCAN(geography_column, epsilon, minimum_geographies) OVER (...)
-> Performs DBSCAN clustering on a column of geographies - STDEV_POP :
STDDEV_POP([DISTINCT] expression) [OVER (...)]
-> Returns population standard deviation of values - STDEV_SAMP :
STDDEV_SAMP([DISTINCT] expression) [OVER (...)]
-> Returns sample standard deviation of values - STRING_AGG :
STRING_AGG([DISTINCT] expression [, delimiter] [ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n]) [OVER (...)]
-> Returns a value obtained by concatenating all the non-null values - SUM :
SUM([DISTINCT] expression) [OVER (...)]
-> Returns the sum of all non-null values - VAR_POP :
VAR_POP([DISTINCT] expression) [OVER (...)]
-> Returns the population variance of results - VAR_SAMP :
VAR_SAMP([DISTINCT] expression) [OVER (...)]
-> Returns the sample variance of results
How to find a running total?
date | number | moving_avg |
---|---|---|
2020-01-01 | 1 | 1 |
2020-01-02 | 2 | 1.5 |
2020-01-03 | 3 | 2 |
2020-01-04 | 4 | 2.5 |
2020-01-05 | 5 | 3 |
2020-01-06 | 6 | 3.5 |
2020-01-07 | 7 | 4 |
2020-01-08 | 8 | 5 |
2020-01-09 | 9 | 6 |
2020-01-10 | 10 | 7 |
To find the running total simply use SUM
with an OVER
clause where you specify your groupings (PARTITION BY
), and the order in which to add them (ORDER BY
).
How to find a moving average?
date | number | moving_avg |
---|---|---|
2020-01-01 | 1 | 1 |
2020-01-02 | 2 | 1.5 |
2020-01-03 | 3 | 2 |
2020-01-04 | 4 | 2.5 |
2020-01-05 | 5 | 3 |
2020-01-06 | 6 | 3.5 |
2020-01-07 | 7 | 4 |
2020-01-08 | 8 | 5 |
2020-01-09 | 9 | 6 |
2020-01-10 | 10 | 7 |
To set the 7 day moving average window we can use the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
in our OVER
clause.
How to find the most popular item a group?
item | category | qty | most_popular |
---|---|---|---|
apple | fruit | 1 | pear |
pear | fruit | 3 | pear |
green beans | vegetable | 4 | brussel sprouts |
brussel sprouts | vegetable | 5 | brussel sprouts |