Linear regression is a powerful way to understand your data and make predictions. Doing this in SQL has always been difficult, but Snowflake has a few built-in functions that simplify the process.
Linear regression is the process of fitting a linear model to observed data. This often takes the shape of fitting a trend line to data.
These lines can then be used to:
There are many variations and methodologies to calculate this trend line, but the most common is called Least-Squares Regression, which attempts to minimize, on average, the distance between the fitted line and the observed points.
The idea is to place a line that minimizes the sum of the squared residuals, represented as the red lines above.
MINIMIZE SUM(error_1^2, error_2^2, ... , error_n^2)
The line is represented by:
y_est = mx + b
where:
We can estimate m and b using the following formulas:
m = sum((x - x_bar) / (y - y_bar)) / sum((x-x_bar)^2)
b = y_bar - (m * x_bar)
Once we use these formulas to have our line, we can evaluate how well it fits our data using a metric called R-squared.
In effect, R-squared measures how much of the variance can be explained by our fit line.
R-squared = sum((y_est - y_bar)^2) / sum((y - y_bar)^2)
This will evaluate to a number between 0 and 1, with 1 being a better fitting line.
Thankfully, Snowflake has some built-in functions to make finding a trend line much easier. These include:
Let's see how to use them to make the following plot:
SELECT AVG_YEARS_CODE, AVG_SALARY, DEV_TYPE FROM PUBLIC.SO_SALARY
Using REGR_SLOPE and REGR_INTERCEPT, we can quickly find the estimated slope and intercepts:
SELECT
REGR_SLOPE(AVG_SALARY,AVG_YEARS_CODE) m,
regr_intercept(AVG_SALARY,AVG_YEARS_CODE) b
from PUBLIC.SO_SALARY
So now our line can be modeled with:
y_est = 7022.131X + 43603.459
In order to add the line to the chart, we can generate x values for the trend line above. To do this we can use the GENERATOR function to generate rows throughout the range of the Years of Experience we have in our dataset.
SELECT row_number() over (order by seq4()) + 6 x FROM table(generator(ROWCOUNT => 12))
Then we can use our trend line equation to generate the Y values for our trend line:
Select LINE_X.X, LINE_X.X * M + B y_est from LINE_X cross join SLOPES
Select * from A full outer join LINE_XY ON ROUND(A.AVG_YEARS_CODE) = LINE_XY.X
Now we have our scatter with our calculated trend line!
Now we can evaluate how well our line fits our data using REGR_R2:
SELECT regr_r2(AVG_YEARS_CODE,AVG_SALARY) R_2 FROM PUBLIC.SO_SALARY
An R-squared of 0.465 is not ideal, but it tells us how much we should (or shouldn't) trust the predictions we make with this equation.
Now we can use our formula to work out for <X> years of coding experience, what a developer's average salary may be:
SELECT 5 years_experience, SLOPES.M * 5 + SLOPES.B Estimated_salary FROM SLOPES
union all
(SELECT 15 years_experience, SLOPES.M * 15 + SLOPES.B Estimated_salary FROM SLOPES)
This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.
This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.