SELECT knowledge FROM sql_resources WHERE category='snowflake' AND slug='linear-regression'

Linear Regression

Linear regression is a powerful way to understand your data and make predictions. Doing this in SQL has always been difficult, but Snowflake has a few built-in functions that simplify the process.

Crash course

Linear regression is the process of fitting a linear model to observed data. This often takes the shape of fitting a trend line to data.

These lines can then be used to:

Loading code...
  • Better understand the relationship between the two variables
  • Make predictions about un-observed values

There are many variations and methodologies to calculate this trend line, but the most common is called Least-Squares Regression, which attempts to minimize, on average, the distance between the fitted line and the observed points.

The idea is to place a line that minimizes the sum of the squared residuals, represented as the red lines above.

We use the square of the error so errors above and below the line are treated the same.

Finding the line

The line is represented by:

y_est = mx + b
  • y_est - estimated y value
  • x - x value
  • b - y-intercept

We can estimate m and b using the following formulas:

Loading code...

Goodness of fit

Once we use these formulas to have our line, we can evaluate how well it fits our data using a metric called R-squared.

In effect, R-squared measures how much of the variance can be explained by our fit line.

This will evaluate to a number between 0 and 1, with 1 being a better fitting line.

Step-By-Step

Thankfully, Snowflake has some built-in functions to make finding a trend line much easier. These include:

SELECT AVG_YEARS_CODE, AVG_SALARY, DEV_TYPE FROM PUBLIC.SO_SALARY
  • REGR_COUNT
  • REGR_INTERCEPT
  • REGR_SLOPE

Let's see how to use them to make the following plot:

Loading code...
AVG_YEARS_CODEAVG_SALARYDEV_TYPE
10.83692773109036.0829Database administrator
10.50219539147103.3569Engineer; site reliability
10.98557182107815.8756System administrator
10.76276576105691.588Developer; desktop or enterprise applications
10.87868481116052.123Developer; embedded applications or devices
16.49919485155010.0265Senior executive/VP
13.77134029152102.1363Engineering manager
8.15930308699119.27968Developer; front-end
8.3850888490740.52521Developer; mobile
8.16277105123911.1833Data scientist or machine learning specialist
8.13129671879984.80214Academic researcher
10.49096595117256.6903DevOps specialist
9.976124375117228.032Data or business analyst
9.791820753108321.9202Designer
9.931633207112595.8757Scientist
10.4667440195941.5808Educator
10.3957529149901.0961Marketing or sales professional
8.653050774103901.0658Developer; full-stack
8.919727838102814.5598Developer; back-end
9.118639139125454.2183Engineer; data
8.939187347102792.3237Developer; QA or test
9.230899114108271.3855Developer; game or graphics
11.76445024120308.0049Product manager

Source: StackOverflow 2020 Survey

1. Find The Slope and Intercepts

Using REGR_SLOPE and REGR_INTERCEPT, we can quickly find the estimated slope and intercepts:

y_est = 7022.131X + 43603.459
MB
7022.13052367443603.458640385

So now our line can be modeled with:

Loading code...

2. Generate values for the line

In order to add the line to the chart, we can generate x values for the trend line above. To do this we can use the GENERATOR function to generate rows throughout the range of the Years of Experience we have in our dataset.

X
7
8
9
10
11
12
13
14
15
16
17
18

Then we can use our trend line equation to generate the Y values for our trend line:

Loading code...
XY_EST
792758.372306104
899780.502829778
9106802.633353452
10113824.763877126
11120846.8944008
12127869.024924474
13134891.155448148
14141913.285971823
15148935.416495497
16155957.547019171
17162979.677542845
18170001.808066519

3. Join trend line points back to original data

AVG_YEARS_CODEAVG_SALARYDEV_TYPEXY_EST
10.83692773109036.0829Database administrator11120846.8944008
10.50219539147103.3569Engineer; site reliability11120846.8944008
10.98557182107815.8756System administrator11120846.8944008
10.76276576105691.588Developer; desktop or enterprise applications11120846.8944008
10.87868481116052.123Developer; embedded applications or devices11120846.8944008
16.49919485155010.0265Senior executive/VP16155957.547019171
13.77134029152102.1363Engineering manager14141913.285971823
8.15930308699119.27968Developer; front-end899780.502829778
8.3850888490740.52521Developer; mobile899780.502829778
8.16277105123911.1833Data scientist or machine learning specialist899780.502829778
8.13129671879984.80214Academic researcher899780.502829778
10.49096595117256.6903DevOps specialist10113824.763877126
9.976124375117228.032Data or business analyst10113824.763877126
9.791820753108321.9202Designer10113824.763877126
9.931633207112595.8757Scientist10113824.763877126
10.4667440195941.5808Educator10113824.763877126
10.3957529149901.0961Marketing or sales professional10113824.763877126
8.653050774103901.0658Developer; full-stack9106802.633353452
8.919727838102814.5598Developer; back-end9106802.633353452
9.118639139125454.2183Engineer; data9106802.633353452
8.939187347102792.3237Developer; QA or test9106802.633353452
9.230899114108271.3855Developer; game or graphics9106802.633353452
11.76445024120308.0049Product manager12127869.024924474
nullnullnull18170001.808066519
nullnullnull15148935.416495497
nullnullnull17162979.677542845
nullnullnull13134891.155448148
nullnullnull792758.372306104

Now we have our scatter with our calculated trend line!

4. Evaluate goodness of fit

Now we can evaluate how well our line fits our data using REGR_R2:

Loading code...
R_2
0.4651945324

An R-squared of 0.465 is not ideal, but it tells us how much we should (or shouldn't) trust the predictions we make with this equation.

5. Make predictions!

Now we can use our formula to work out for  years of coding experience, what a developer's average salary may be:

SELECT regr_r2(AVG_YEARS_CODE,AVG_SALARY) R_2 FROM PUBLIC.SO_SALARY
YEARS_EXPERIENCEESTIMATED_SALARY
578714.111258756
15148935.416495497

How We Built This

This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

Loading code...