SELECT knowledge FROM sql_resources WHERE category='snowflake' AND slug='linear-regression'

Linear Regression

Linear regression is a powerful way to understand your data and make predictions. Doing this in SQL has always been difficult, but Snowflake has a few built-in functions that simplify the process.

Crash course

Linear regression is the process of fitting a linear model to observed data. This often takes the shape of fitting a trend line to data.

These lines can then be used to:

MINIMIZE SUM(error_1^2, error_2^2, ... , error_n^2)

Loading code...

Better understand the relationship between the two variables
Make predictions about un-observed values

There are many variations and methodologies to calculate this trend line, but the most common is called Least-Squares Regression, which attempts to minimize, on average, the distance between the fitted line and the observed points.

The idea is to place a line that minimizes the sum of the squared residuals, represented as the red lines above.

We use the square of the error so errors above and below the line are treated the same.

Finding the line

The line is represented by:

y_est = mx + b

y_est - estimated y value
x - x value
b - y-intercept

We can estimate m and b using the following formulas:

R-squared = sum((y_est - y_bar)^2) / sum((y - y_bar)^2)

Loading code...

Goodness of fit

Once we use these formulas to have our line, we can evaluate how well it fits our data using a metric called R-squared.

In effect, R-squared measures how much of the variance can be explained by our fit line.

This will evaluate to a number between 0 and 1, with 1 being a better fitting line.

Step-By-Step

Thankfully, Snowflake has some built-in functions to make finding a trend line much easier. These include:

SELECT AVG_YEARS_CODE, AVG_SALARY, DEV_TYPE FROM PUBLIC.SO_SALARY

REGR_COUNT
REGR_INTERCEPT
REGR_SLOPE

Let's see how to use them to make the following plot:

SELECT
  REGR_SLOPE(AVG_SALARY,AVG_YEARS_CODE) m,
  regr_intercept(AVG_SALARY,AVG_YEARS_CODE) b
from PUBLIC.SO_SALARY

Loading code...

AVG_YEARS_CODE	AVG_SALARY	DEV_TYPE
10.83692773	109036.0829	Database administrator
10.50219539	147103.3569	Engineer; site reliability
10.98557182	107815.8756	System administrator
10.76276576	105691.588	Developer; desktop or enterprise applications
10.87868481	116052.123	Developer; embedded applications or devices
16.49919485	155010.0265	Senior executive/VP
13.77134029	152102.1363	Engineering manager
8.159303086	99119.27968	Developer; front-end
8.38508884	90740.52521	Developer; mobile
8.16277105	123911.1833	Data scientist or machine learning specialist
8.131296718	79984.80214	Academic researcher
10.49096595	117256.6903	DevOps specialist
9.976124375	117228.032	Data or business analyst
9.791820753	108321.9202	Designer
9.931633207	112595.8757	Scientist
10.46674401	95941.5808	Educator
10.3957529	149901.0961	Marketing or sales professional
8.653050774	103901.0658	Developer; full-stack
8.919727838	102814.5598	Developer; back-end
9.118639139	125454.2183	Engineer; data
8.939187347	102792.3237	Developer; QA or test
9.230899114	108271.3855	Developer; game or graphics
11.76445024	120308.0049	Product manager

Source: StackOverflow 2020 Survey

1. Find The Slope and Intercepts

Using REGR_SLOPE and REGR_INTERCEPT, we can quickly find the estimated slope and intercepts:

y_est = 7022.131X + 43603.459

M	B
7022.130523674	43603.458640385

So now our line can be modeled with:

SELECT row_number() over (order by seq4()) + 6 x FROM table(generator(ROWCOUNT => 12))

Loading code...

2. Generate values for the line

In order to add the line to the chart, we can generate x values for the trend line above. To do this we can use the GENERATOR function to generate rows throughout the range of the Years of Experience we have in our dataset.

X
7
8
9
10
11
12
13
14
15
16
17
18

Then we can use our trend line equation to generate the Y values for our trend line:

Select LINE_X.X, LINE_X.X * M + B y_est from LINE_X cross join SLOPES

Loading code...

X	Y_EST
7	92758.372306104
8	99780.502829778
9	106802.633353452
10	113824.763877126
11	120846.8944008
12	127869.024924474
13	134891.155448148
14	141913.285971823
15	148935.416495497
16	155957.547019171
17	162979.677542845
18	170001.808066519

3. Join trend line points back to original data

AVG_YEARS_CODE	AVG_SALARY	DEV_TYPE	X	Y_EST
10.83692773	109036.0829	Database administrator	11	120846.8944008
10.50219539	147103.3569	Engineer; site reliability	11	120846.8944008
10.98557182	107815.8756	System administrator	11	120846.8944008
10.76276576	105691.588	Developer; desktop or enterprise applications	11	120846.8944008
10.87868481	116052.123	Developer; embedded applications or devices	11	120846.8944008
16.49919485	155010.0265	Senior executive/VP	16	155957.547019171
13.77134029	152102.1363	Engineering manager	14	141913.285971823
8.159303086	99119.27968	Developer; front-end	8	99780.502829778
8.38508884	90740.52521	Developer; mobile	8	99780.502829778
8.16277105	123911.1833	Data scientist or machine learning specialist	8	99780.502829778
8.131296718	79984.80214	Academic researcher	8	99780.502829778
10.49096595	117256.6903	DevOps specialist	10	113824.763877126
9.976124375	117228.032	Data or business analyst	10	113824.763877126
9.791820753	108321.9202	Designer	10	113824.763877126
9.931633207	112595.8757	Scientist	10	113824.763877126
10.46674401	95941.5808	Educator	10	113824.763877126
10.3957529	149901.0961	Marketing or sales professional	10	113824.763877126
8.653050774	103901.0658	Developer; full-stack	9	106802.633353452
8.919727838	102814.5598	Developer; back-end	9	106802.633353452
9.118639139	125454.2183	Engineer; data	9	106802.633353452
8.939187347	102792.3237	Developer; QA or test	9	106802.633353452
9.230899114	108271.3855	Developer; game or graphics	9	106802.633353452
11.76445024	120308.0049	Product manager	12	127869.024924474
null	null	null	18	170001.808066519
null	null	null	15	148935.416495497
null	null	null	17	162979.677542845
null	null	null	13	134891.155448148
null	null	null	7	92758.372306104

Now we have our scatter with our calculated trend line!

4. Evaluate goodness of fit

Now we can evaluate how well our line fits our data using REGR_R2:

Select * from A full outer join LINE_XY ON ROUND(A.AVG_YEARS_CODE) = LINE_XY.X

Loading code...

R_2
0.4651945324

An R-squared of 0.465 is not ideal, but it tells us how much we should (or shouldn't) trust the predictions we make with this equation.

5. Make predictions!

Now we can use our formula to work out for years of coding experience, what a developer's average salary may be:

SELECT regr_r2(AVG_YEARS_CODE,AVG_SALARY) R_2 FROM PUBLIC.SO_SALARY

YEARS_EXPERIENCE	ESTIMATED_SALARY
5	78714.111258756
15	148935.416495497

How We Built This

This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

SELECT 5 years_experience, SLOPES.M * 5 + SLOPES.B Estimated_salary FROM SLOPES
union all 
(SELECT 15 years_experience, SLOPES.M * 15 + SLOPES.B Estimated_salary FROM SLOPES)

Loading code...