Window Functions Explained

Introduction

Window functions, or Analytic functions as they're called in BigQuery, are a way to compute values over a group of rows. They return a single value for each row, in contrast to aggregate functions which returns a single value for a group of rows.

Aggregate vs window/analytic functions:

SELECT
  letter,
  SUM(number) AS aggregate
FROM
  (
    SELECT
      'A' AS letter,
      2 AS number
    UNION ALL
(    SELECT
      'A' AS letter,
      3 AS number)
    UNION ALL
(    SELECT
      'C' AS letter,
      6 AS number)
  ) AS table_3
GROUP BY
  letter

letter	aggregate
A	5
C	6

SELECT
  letter,
  SUM(number) AS aggregate
FROM
  (
    SELECT
      'A' AS letter,
      2 AS number
    UNION ALL
(    SELECT
      'A' AS letter,
      3 AS number)
    UNION ALL
(    SELECT
      'C' AS letter,
      6 AS number)
  ) AS table_3
GROUP BY
  letter

letter	window
A	5
A	5
C	6

In the first aggregate example the resulting data was grouped by letter, but in the second window example, we preserved our rows.

Why use window functions?

Window functions are very powerful, and once you've gotten your head around how to use them, you'll be surprised at they allow you to do. Some common use cases are:

Running/Cumulative Total
Moving Average
Rank rows by custom criteria and groupings
Finding the year-over-year % Change
‍

Syntax

window_function_name ([arguments]) OVER (
   [PARTITION BY partition_expression] 
   [ORDER BY expression ASC | DESC] 
   [ROWS frame_clause])

‍
where the PARTITION BY denotes how to GROUP rows into partitions, ORDER BY how to order the rows in those partitions, and ROWS which rows to consider in those ordered partitions.

Window Functions

In general, window functions can be grouped into 3 types:

Navigation functions: Return the value given a specific location criteria (e.g. first_value)
Numbering functions: Assign a number (e.g. rank) to each row based on their position in the specified window
Analytic functions: Perform a calculation on a set of values (e.g. sum)

Navigation functions

FIRST_VALUE : FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the first row in the current window frame.
LAST_VALUE : LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the last row in the current window frame.
NTH_VALUE : NTH_VALUE (value_expression, constant_integer_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the Nth row of the current window frame.
LEAD : LEAD (value_expression[, offset [, default_expression]]) -> Returns the value_expression on the subsequent row.
LAG : LAG (value_expression[, offset [, default_expression]]) -> Returns the value_expression on the preceding row.
PERCENTILE_CONT : PERCENTILE_CONT (value_expression, percentile [{RESPECT | IGNORE} NULLS]) -> Returns the percentile of the value_expression with linear interpolation (continuous scale).
PERCENTILE_DISC : PERCENTILE_DIST (value_expression, percentile [{RESPECT | IGNORE} NULLS]) -> Returns the percentile of the value_expression (discrete scale).

For these functions we'll use the following demo data:

SELECT
  'James Harden' AS player,
  2335 AS points,
  2020 AS season
UNION ALL
(SELECT
  'Damian Lillard' AS player,
  1978 AS points,
  2020 AS season)
UNION ALL
(SELECT
  'Devin Booker' AS player,
  1863 AS points,
  2020 AS season)
UNION ALL
(SELECT
  'James Harden' AS player,
  2818 AS points,
  2019 AS season)
UNION ALL
(SELECT
  'Paul George' AS player,
  1978 AS points,
  2019 AS season)
UNION ALL
(SELECT
  'Kemba Walker' AS player,
  2102 AS points,
  2019 AS season)
UNION ALL
(SELECT
  'Damian Lillard' AS player,
  2067 AS points,
  2019 AS season)
UNION ALL
(SELECT
  'Devin Booker' AS player,
  1700 AS points,
  2019 AS season)
UNION ALL
(SELECT
  'Paul George' AS player,
  1033 AS points,
  2020 AS season)
UNION ALL
(SELECT
  'Kemba Walker' AS player,
  1145 AS points,
  2020 AS season)

player	season	first_season	last_season	per_change
Damian Lillard	2019	2067	1978	-4.31
Damian Lillard	2020	2067	1978	-4.31
Devin Booker	2019	1700	1863	9.59
Devin Booker	2020	1700	1863	9.59
James Harden	2019	2818	2335	-17.14
James Harden	2020	2818	2335	-17.14
Kemba Walker	2019	2102	1145	-45.53
Kemba Walker	2020	2102	1145	-45.53
Paul George	2019	1978	1033	-47.78
Paul George	2020	1978	1033	-47.78

‍
How to find a year-over-year change

SELECT DISTINCT
  player,
  season,
  FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS first_season,
  LAST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS last_season,
  (100 * ((LAST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) - FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)) / FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING))) AS per_change
FROM
  top_scorers

season	points_rank	player	points
2019	1	James Harden	2818
2019	2	Kemba Walker	2102
2019	3	Damian Lillard	2067
2020	1	James Harden	2335
2020	2	Damian Lillard	1978
2020	3	Devin Booker	1863

We used FIRST_VALUE and LAST_VALUE to find the scores for each player in the earliest and most recent seasons of data. Then we computed the percent difference using:

100 * ((new value - old value) / old value) per_difference

Numbering functions

RANK : RANK() -> Returns the rank of each row in the ordered partition (starts at 1).
DENSE_RANK : DENSE_RANK() -> Returns the rank, but values of the same value get the same rank (starts at 1).
PERCENT_RANK : PERCENT_RANK() -> Returns the percentile rank of a row.
CUME_DIST : CUME_DIST() -> Returns the relative rank of a row .
NTILE: NTILE(constant_integer_expression) -> Returns the bucket number after dividing each partition into constant_integer_expression buckets.
ROW_NUMBER : ROW_NUMBER() -> Returns the sequential row number for each ordered partition.

‍
⚠️ Numbering functions are not allowed [ROWS frame_clause]

How to get top 3 results for each group?

SELECT
  *
FROM
  (
    SELECT
      season,
      RANK() OVER (PARTITION BY season ORDER BY points DESC) AS points_rank,
      player,
      points
    FROM
      top_scorers
  ) AS table_1
WHERE
  (points_rank <= 3)

season	player	points	running_total_points
2019	Damian Lillard	2067	2067
2020	Damian Lillard	1978	4045
2019	Devin Booker	1700	1700
2020	Devin Booker	1863	3563
2019	James Harden	2818	2818
2020	James Harden	2335	5153
2019	Kemba Walker	2102	2102
2020	Kemba Walker	1145	3247
2019	Paul George	1978	1978
2020	Paul George	1033	3011

In this example, we used RANK to rank each player by points over each season. Then we used a subquery to then return only the top 3 ranked players for each season.

Aggregate functions

Aggregate functions are available outside of window functions, but can be additionally applied over a specified window.

ANY_VALUE : ANY_VALUE(expression) [OVER (...)] -> Returns expression for some row chosen from the group. Basically a random selection from an expression.
ARRAY_AGG : Returns an array of expression values.

ARRAY_AGG([DISTINCT] expression [{IGNORE|RESPECT} NULLS]
  [ORDER BY key [{ASC|DESC}] [, ... ]]  [LIMIT n]
) [OVER (...)]

AVG : AVG([DISTINCT] expression) [OVER (...)] -> Returns the average of the non-NULL input values.
CORR : CORR(X1, X2) [OVER (...)] -> Returns the Pearson coefficient of correlation for a set of number pairs.
COUNT : COUNT([DISTINCT] expression] [OVER (...)] -> Returns the number of [distinct] elements in expression
COUNTIF : COUNTIF(expression) [OVER (...)] -> Returns the count of True values for expression
COVAR_POP : COVAR_POP(X1, X2) [OVER (...)] -> Returns the population covariance of a set of numbers
COVAR_SAMP : COVAR_SAMP(X1, X2) [OVER (...)] -> Returns the sample covariance of a set of numbers
MAX : MAX(expression) [OVER (...)] -> Returns the maximum non-NULL value of expression.
MIN: MIN(expression) [OVER (...)] -> Returns the minimum non-NULL value of expression.
ST_CUSTERDBSCAN : ST_CLUSTERDBSCAN(geography_column, epsilon, minimum_geographies) OVER (...) -> Performs DBSCAN clustering on a column of geographies
STDEV_POP : STDDEV_POP([DISTINCT] expression) [OVER (...)] -> Returns population standard deviation of values
STDEV_SAMP : STDDEV_SAMP([DISTINCT] expression) [OVER (...)] -> Returns sample standard deviation of values
STRING_AGG : STRING_AGG([DISTINCT] expression [, delimiter] [ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n]) [OVER (...)] -> Returns a value obtained by concatenating all the non-null values
SUM : SUM([DISTINCT] expression) [OVER (...)] -> Returns the sum of all non-null values
VAR_POP : VAR_POP([DISTINCT] expression) [OVER (...)] -> Returns the population variance of results
VAR_SAMP : VAR_SAMP([DISTINCT] expression) [OVER (...)] -> Returns the sample variance of results

‍
How to find a running total?

SELECT
  season,
  player,
  points,
  SUM(top_scorers.points) OVER (PARTITION BY player ORDER BY season ASC) AS running_total_points
FROM
  top_scorers

date	number	moving_avg
2020-01-01	1	1
2020-01-02	2	1.5
2020-01-03	3	2
2020-01-04	4	2.5
2020-01-05	5	3
2020-01-06	6	3.5
2020-01-07	7	4
2020-01-08	8	5
2020-01-09	9	6
2020-01-10	10	7

To find the running total simply use SUM with an OVER clause where you specify your groupings (PARTITION BY), and the order in which to add them (ORDER BY).

How to find a moving average?

SELECT
  date,
  number,
  AVG(number) OVER (ORDER BY date ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM
  (
    SELECT
      date,
      number
    FROM
      UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2020-02-01')) AS date
      CROSS JOIN UNNEST(GENERATE_ARRAY(1, 30)) AS number
  ) AS table_1
LIMIT
  10

date	number	moving_avg
2020-01-01	1	1
2020-01-02	2	1.5
2020-01-03	3	2
2020-01-04	4	2.5
2020-01-05	5	3
2020-01-06	6	3.5
2020-01-07	7	4
2020-01-08	8	5
2020-01-09	9	6
2020-01-10	10	7

To set the 7 day moving average window we can use the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW in our OVER clause.

How to find the most popular item a group?

SELECT
  item,
  category,
  qty,
  LAST_VALUE(item) OVER (PARTITION BY category ORDER BY qty DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS most_popular
FROM
  (
    SELECT
      'apple' AS item,
      'fruit' AS category,
      1 AS qty
    UNION ALL
(    SELECT
      'pear' AS item,
      'fruit' AS category,
      3 AS qty)
    UNION ALL
(    SELECT
      'green beans' AS item,
      'vegetable' AS category,
      4 AS qty)
    UNION ALL
(    SELECT
      'brussel sprouts' AS item,
      'vegetable' AS category,
      5 AS qty)
  ) AS table_4

item	category	qty	most_popular
apple	fruit	1	pear
pear	fruit	3	pear
green beans	vegetable	4	brussel sprouts
brussel sprouts	vegetable	5	brussel sprouts

‍

Business Improvement

Product

Use cases

Pricing

Professional Services

Customers

Resources

Introduction

Aggregate vs window/analytic functions:

Why use window functions?

Syntax

Window Functions

Navigation functions

‍
How to find a year-over-year change

Numbering functions

How to get top 3 results for each group?

Aggregate functions

‍
How to find a running total?

How to find a moving average?

How to find the most popular item a group?

Start solving your organization's biggest problems with Count today.

Business Improvement

Product

Use cases

Pricing

Professional Services

Customers

Resources

Window Functions Explained

Introduction

Aggregate vs window/analytic functions:

Why use window functions?

Syntax

Window Functions

Navigation functions

‍How to find a year-over-year change

Numbering functions

How to get top 3 results for each group?

Aggregate functions

‍How to find a running total?

How to find a moving average?

How to find the most popular item a group?

Start solving your organization's biggest problems with Count today.

‍
How to find a year-over-year change

‍
How to find a running total?