Everything you need to know about SQL Array functions. Arrays are ordered lists in BigQuery. They are very powerful and flexible once you know how to use them.
Window functions, or Analytic functions as they're called in BigQuery, are a way to compute values over a group of rows. They return a single value for each row, in contrast to aggregate functions which returns a single value for a group of rows.
| letter | aggregate |
|---|---|
| A | 5 |
| C | 6 |
| letter | window |
|---|---|
| A | 5 |
| A | 5 |
| C | 6 |
In the first aggregate example the resulting data was grouped by letter, but in the second window example, we preserved our rows.
SELECT
letter,
SUM(number) AS aggregate
FROM
(
SELECT
'A' AS letter,
2 AS number
UNION ALL
( SELECT
'A' AS letter,
3 AS number)
UNION ALL
( SELECT
'C' AS letter,
6 AS number)
) AS table_3
GROUP BY
letterWindow functions are very powerful, and once you've gotten your head around how to use them, you'll be surprised at they allow you to do. Some common use cases are:
SELECT
letter,
SUM(number) AS aggregate
FROM
(
SELECT
'A' AS letter,
2 AS number
UNION ALL
( SELECT
'A' AS letter,
3 AS number)
UNION ALL
( SELECT
'C' AS letter,
6 AS number)
) AS table_3
GROUP BY
letterwhere the PARTITION BY denotes how to GROUP rows into partitions, ORDER BY how to order the rows in those partitions, and ROWS which rows to consider in those ordered partitions.
In general, window functions can be grouped into 3 types:
window_function_name ([arguments]) OVER (
[PARTITION BY partition_expression]
[ORDER BY expression ASC | DESC]
[ROWS frame_clause])FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the first row in the current window frame.LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the last row in the current window frame.NTH_VALUE (value_expression, constant_integer_expression [{RESPECT | IGNORE} NULLS]) -> Returns the value_expression for the Nth row of the current window frame.LEAD (value_expression[, offset [, default_expression]]) -> Returns the value_expression on the subsequent row.LAG (value_expression[, offset [, default_expression]]) -> Returns the value_expression on the preceding row.PERCENTILE_CONT (value_expression, percentile [{RESPECT | IGNORE} NULLS]) -> Returns the percentile of the value_expression with linear interpolation (continuous scale).PERCENTILE_DIST (value_expression, percentile [{RESPECT | IGNORE} NULLS]) -> Returns the percentile of the value_expression (discrete scale).For these functions we'll use the following demo data:
SELECT DISTINCT
player,
season,
FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS first_season,
LAST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS last_season,
(100 * ((LAST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) - FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)) / FIRST_VALUE(points) OVER (PARTITION BY player ORDER BY season ASC ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING))) AS per_change
FROM
top_scorers| player | season | first_season | last_season | per_change |
|---|---|---|---|---|
| Damian Lillard | 2019 | 2067 | 1978 | -4.31 |
| Damian Lillard | 2020 | 2067 | 1978 | -4.31 |
| Devin Booker | 2019 | 1700 | 1863 | 9.59 |
| Devin Booker | 2020 | 1700 | 1863 | 9.59 |
| James Harden | 2019 | 2818 | 2335 | -17.14 |
| James Harden | 2020 | 2818 | 2335 | -17.14 |
| Kemba Walker | 2019 | 2102 | 1145 | -45.53 |
| Kemba Walker | 2020 | 2102 | 1145 | -45.53 |
| Paul George | 2019 | 1978 | 1033 | -47.78 |
| Paul George | 2020 | 1978 | 1033 | -47.78 |
| season | points_rank | player | points |
|---|---|---|---|
| 2019 | 1 | James Harden | 2818 |
| 2019 | 2 | Kemba Walker | 2102 |
| 2019 | 3 | Damian Lillard | 2067 |
| 2020 | 1 | James Harden | 2335 |
| 2020 | 2 | Damian Lillard | 1978 |
| 2020 | 3 | Devin Booker | 1863 |
We used FIRST_VALUE and LAST_VALUE to find the scores for each player in the earliest and most recent seasons of data. Then we computed the percent difference using:
100 * ((new value - old value) / old value) per_differenceRANK() -> Returns the rank of each row in the ordered partition (starts at 1).DENSE_RANK() -> Returns the rank, but values of the same value get the same rank (starts at 1).PERCENT_RANK() -> Returns the percentile rank of a row.CUME_DIST() -> Returns the relative rank of a row .NTILE(constant_integer_expression) -> Returns the bucket number after dividing each partition into constant_integer_expression buckets.ROW_NUMBER() -> Returns the sequential row number for each ordered partition.⚠️ Numbering functions are not allowed [ROWS frame_clause]
| season | player | points | running_total_points |
|---|---|---|---|
| 2019 | Damian Lillard | 2067 | 2067 |
| 2020 | Damian Lillard | 1978 | 4045 |
| 2019 | Devin Booker | 1700 | 1700 |
| 2020 | Devin Booker | 1863 | 3563 |
| 2019 | James Harden | 2818 | 2818 |
| 2020 | James Harden | 2335 | 5153 |
| 2019 | Kemba Walker | 2102 | 2102 |
| 2020 | Kemba Walker | 1145 | 3247 |
| 2019 | Paul George | 1978 | 1978 |
| 2020 | Paul George | 1033 | 3011 |
In this example, we used RANK to rank each player by points over each season. Then we used a subquery to then return only the top 3 ranked players for each season.
SELECT
*
FROM
(
SELECT
season,
RANK() OVER (PARTITION BY season ORDER BY points DESC) AS points_rank,
player,
points
FROM
top_scorers
) AS table_1
WHERE
(points_rank <= 3)Aggregate functions are available outside of window functions, but can be additionally applied over a specified window.
ANY_VALUE(expression) [OVER (...)] -> Returns expression for some row chosen from the group. Basically a random selection from an expression.AVG([DISTINCT] expression) [OVER (...)] -> Returns the average of the non-NULL input values.CORR(X1, X2) [OVER (...)] -> Returns the Pearson coefficient of correlation for a set of number pairs.COUNT([DISTINCT] expression] [OVER (...)] -> Returns the number of [distinct] elements in expressionCOUNTIF(expression) [OVER (...)] -> Returns the count of True values for expressionCOVAR_POP(X1, X2) [OVER (...)] -> Returns the population covariance of a set of numbersCOVAR_SAMP(X1, X2) [OVER (...)] -> Returns the sample covariance of a set of numbersMAX(expression) [OVER (...)] -> Returns the maximum non-NULL value of expression.MIN(expression) [OVER (...)] -> Returns the minimum non-NULL value of expression.ST_CLUSTERDBSCAN(geography_column, epsilon, minimum_geographies) OVER (...) -> Performs DBSCAN clustering on a column of geographiesSTDDEV_POP([DISTINCT] expression) [OVER (...)] -> Returns population standard deviation of valuesSTDDEV_SAMP([DISTINCT] expression) [OVER (...)] -> Returns sample standard deviation of valuesSTRING_AGG([DISTINCT] expression [, delimiter] [ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n]) [OVER (...)] -> Returns a value obtained by concatenating all the non-null valuesSUM([DISTINCT] expression) [OVER (...)] -> Returns the sum of all non-null valuesVAR_POP([DISTINCT] expression) [OVER (...)] -> Returns the population variance of resultsVAR_SAMP([DISTINCT] expression) [OVER (...)] -> Returns the sample variance of results| date | number | moving_avg |
|---|---|---|
| 2020-01-01 | 1 | 1 |
| 2020-01-02 | 2 | 1.5 |
| 2020-01-03 | 3 | 2 |
| 2020-01-04 | 4 | 2.5 |
| 2020-01-05 | 5 | 3 |
| 2020-01-06 | 6 | 3.5 |
| 2020-01-07 | 7 | 4 |
| 2020-01-08 | 8 | 5 |
| 2020-01-09 | 9 | 6 |
| 2020-01-10 | 10 | 7 |
To find the running total simply use SUM with an OVER clause where you specify your groupings (PARTITION BY), and the order in which to add them (ORDER BY).
| date | number | moving_avg |
|---|---|---|
| 2020-01-01 | 1 | 1 |
| 2020-01-02 | 2 | 1.5 |
| 2020-01-03 | 3 | 2 |
| 2020-01-04 | 4 | 2.5 |
| 2020-01-05 | 5 | 3 |
| 2020-01-06 | 6 | 3.5 |
| 2020-01-07 | 7 | 4 |
| 2020-01-08 | 8 | 5 |
| 2020-01-09 | 9 | 6 |
| 2020-01-10 | 10 | 7 |
To set the 7 day moving average window we can use the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW in our OVER clause.
| item | category | qty | most_popular |
|---|---|---|---|
| apple | fruit | 1 | pear |
| pear | fruit | 3 | pear |
| green beans | vegetable | 4 | brussel sprouts |
| brussel sprouts | vegetable | 5 | brussel sprouts |
ARRAY_AGG([DISTINCT] expression [{IGNORE|RESPECT} NULLS]
[ORDER BY key [{ASC|DESC}] [, ... ]] [LIMIT n]
) [OVER (...)]SELECT
season,
player,
points,
SUM(top_scorers.points) OVER (PARTITION BY player ORDER BY season ASC) AS running_total_points
FROM
top_scorersSELECT
date,
number,
AVG(number) OVER (ORDER BY date ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM
(
SELECT
date,
number
FROM
UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2020-02-01')) AS date
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 30)) AS number
) AS table_1
LIMIT
10SELECT
item,
category,
qty,
LAST_VALUE(item) OVER (PARTITION BY category ORDER BY qty DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS most_popular
FROM
(
SELECT
'apple' AS item,
'fruit' AS category,
1 AS qty
UNION ALL
( SELECT
'pear' AS item,
'fruit' AS category,
3 AS qty)
UNION ALL
( SELECT
'green beans' AS item,
'vegetable' AS category,
4 AS qty)
UNION ALL
( SELECT
'brussel sprouts' AS item,
'vegetable' AS category,
5 AS qty)
) AS table_4