Window Functions
Window functions in Snowflake are a way to compute values over a group of rows. They return a single value for each row, in contrast to aggregate functions which return a single value for a group of rows.
Aggregate vs window/analytic functions
LETTER | AGGREGATE |
---|---|
A | 5 |
C | 6 |
LETTER | WINDOW |
---|---|
A | 5 |
A | 5 |
C | 6 |
In the first aggregate example, the resulting data were grouped by letter, but in the second window example, we preserved our rows.
Why use window functions?
Window functions are very powerful, and once you've gotten your head around how to use them, you'll be surprised at what they allow you to do. Some common use cases are:
- Running/Cumulative Total
- Moving Average
- Rank rows by custom criteria and groupings
- Finding the year-over-year % Change
where the PARTITION BY
denotes how to GROUP
rows into partitions,ORDER BY
how to order the rows in those partitions, and FRAME
which determines which rows to consider in those ordered partitions.
In general, window functions can be grouped into 3 types:
- Navigation functions: Return the value given a specific location criteria (e.g. first_value)
- Numbering functions: Assign a number (e.g. rank) to each row based on their position in the specified window
- Analytic functions: Perform a calculation on a set of values (e.g. sum)
Navigation functions
FIRST_VALUE
: Returns the value_expression for the first row in the current window frame.
LAST_VALUE
: Returns the value_expression for the last row in the current window frame.
NTH_VALUE
: Returns the value_expression for the Nth row of the current window frame.
LEAD
: Returns the value_expression on the subsequent row.LAG
: Returns the value_expression on the preceding row.
MODE
: Returns the most frequent value for the values within the expression
For these functions we'll use the following demo data:
PLAYER | POINTS | SEASON |
---|---|---|
James Harden | 2335 | 2020 |
Damian Lillard | 1978 | 2020 |
Devin Booker | 1863 | 2020 |
James Harden | 2818 | 2019 |
Paul George | 1978 | 2019 |
Kemba Walker | 2102 | 2019 |
Damian Lillard | 2067 | 2019 |
Devin Booker | 1700 | 2019 |
Paul George | 1033 | 2020 |
Kemba Walker | 1145 | 2020 |
How to find a year-over-year change:
PLAYER | FIRST_SEASON | LAST_SEASON | PER_CHANGE |
---|---|---|---|
Damian Lillard | 2067 | 1978 | -4.3058 |
Devin Booker | 1700 | 1863 | 9.5882 |
James Harden | 2818 | 2335 | -17.1398 |
Kemba Walker | 2102 | 1145 | -45.5281 |
Paul George | 1978 | 1033 | -47.7755 |
We used FIRST_VALUE
and LAST_VALUE
to find the scores for each player in the earliest and most recent seasons of data. Then we computed the percent difference using:
Numbering functions
RANK
: Returns the rank of each row in the ordered partition (starts at 1).DENSE_RANK
: Returns the rank, but values of the same value get the same rank (starts at 1).PERCENT_RANK
: Returns the percentile rank of a row.CUME_DIST
: Returns the relative rank of a row.NTILE
: Returns the bucket number after dividing each partition intoconstant_integer_expression
buckets.ROW_NUMBER
: Returns the sequential row number for each ordered partition.WIDTH_BUCKET
: Constructs equi-width histograms, in which the histogram range is divided into intervals of identical size, and returns the bucket number into which the value of an expression falls after it has been evaluated.CONDITIONAL_CHANGE_EVENT
: Returns a window event number for each row when the value of the expression is different from the value in the previous row.CONDITIONAL_TRUE_EVENT
: Returns a window event number for each row based on the result of the boolean expression.
How to get top 3 results for each group?
SEASON | POINTS_RANK | PLAYER | POINTS |
---|---|---|---|
2019 | 1 | James Harden | 2818 |
2019 | 2 | Kemba Walker | 2102 |
2019 | 3 | Damian Lillard | 2067 |
2020 | 1 | James Harden | 2335 |
2020 | 2 | Damian Lillard | 1978 |
2020 | 3 | Devin Booker | 1863 |
In this example, we used RANK
to rank each player by points over each season. Then we used a subquery to then return only the top 3 ranked players for each season.
Aggregate functions
Aggregate functions are available outside of window functions, but can be additionally applied over a specified window.
ANY_VALUE
: Returns expression for some row chosen from the group. Basically a random selection from an expression.AVG
: Returns the average of the non-NULL input values.CORR
: Returns the coefficient (COVAR_POP(y, x) / (STDDEV_POP(x) * STDDEV_POP(y))
) of correlation for a set of number pairs.COUNT
: Returns the number of [distinct] elements in expressionCOUNT_IF
: Returns the count of True values for expressionCOVAR_POP
: Returns the population covariance of a set of numbersCOVAR_SAMP
:Returns the sample covariance of a set of numbersLISTAGG
: Returns the concatenated input values, separated by thedelimiter
string.MAX
: Returns the maximum non-NULL value of the expression.MEDIAN
: Returns the median of a set of values.MIN
: Returns the minimum non-NULL value of the expression.PERCENTILE_CONT
: Returns the percentile value based on a continuous distribution of the input columnPERCENTILE_DISC
: Returns the percentile value based on a discrete distribution of the input columnRATIO_TO_REPORT
: Returns the ratio of a value within a group to the sum of the values within a group.STDEV
: Returns the sample standard deviation (square root of sample variance) of non-NULL values.STDEV_POP
: Returns population standard deviation of valuesSTDEV_SAMP
: Returns sample standard deviation of valuesSUM
: Returns the sum of all non-null valuesVAR_POP
: Returns the population variance of resultsVARIANCE | VAR_SAMP
: Returns the sample variance of results
There are more, less commonly used window functions available here.
How to find a running total?
SEASON | PLAYER | POINTS | RUNNING_TOTAL_POINTS |
---|---|---|---|
2019 | Damian Lillard | 2067 | 2067 |
2020 | Damian Lillard | 1978 | 4045 |
2019 | Devin Booker | 1700 | 1700 |
2020 | Devin Booker | 1863 | 3563 |
2019 | James Harden | 2818 | 2818 |
2020 | James Harden | 2335 | 5153 |
2019 | Kemba Walker | 2102 | 2102 |
2020 | Kemba Walker | 1145 | 3247 |
2019 | Paul George | 1978 | 1978 |
2020 | Paul George | 1033 | 3011 |
To find the running total simply use SUM
with an OVER
clause where you specify your groupings (PARTITION BY
), and the order in which to add them (ORDER BY
).
How We Built This
This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.
This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.