Linear Regression
Linear regression is a powerful way to understand your data and make predictions. Doing this in SQL has always been difficult, but Snowflake has a few built-in functions that simplify the process.
Crash course
Linear regression is the process of fitting a linear model to observed data. This often takes the shape of fitting a trend line to data.
These lines can then be used to:
- Better understand the relationship between the two variables
- Make predictions about un-observed values
There are many variations and methodologies to calculate this trend line, but the most common is called Least-Squares Regression, which attempts to minimize, on average, the distance between the fitted line and the observed points.
The idea is to place a line that minimizes the sum of the squared residuals, represented as the red lines above.
We use the square of the error so errors above and below the line are treated the same.
Finding the line
The line is represented by:
y_est = mx + by_est- estimated y valuex- x valueb- y-intercept
We can estimate m and b using the following formulas:
Goodness of fit
Once we use these formulas to have our line, we can evaluate how well it fits our data using a metric called R-squared.
In effect, R-squared measures how much of the variance can be explained by our fit line.
This will evaluate to a number between 0 and 1, with 1 being a better fitting line.
Step-By-Step
Thankfully, Snowflake has some built-in functions to make finding a trend line much easier. These include:
SELECT AVG_YEARS_CODE, AVG_SALARY, DEV_TYPE FROM PUBLIC.SO_SALARY- REGR_COUNT
- REGR_INTERCEPT
- REGR_SLOPE
Let's see how to use them to make the following plot:
| AVG_YEARS_CODE | AVG_SALARY | DEV_TYPE |
|---|---|---|
| 10.83692773 | 109036.0829 | Database administrator |
| 10.50219539 | 147103.3569 | Engineer; site reliability |
| 10.98557182 | 107815.8756 | System administrator |
| 10.76276576 | 105691.588 | Developer; desktop or enterprise applications |
| 10.87868481 | 116052.123 | Developer; embedded applications or devices |
| 16.49919485 | 155010.0265 | Senior executive/VP |
| 13.77134029 | 152102.1363 | Engineering manager |
| 8.159303086 | 99119.27968 | Developer; front-end |
| 8.38508884 | 90740.52521 | Developer; mobile |
| 8.16277105 | 123911.1833 | Data scientist or machine learning specialist |
| 8.131296718 | 79984.80214 | Academic researcher |
| 10.49096595 | 117256.6903 | DevOps specialist |
| 9.976124375 | 117228.032 | Data or business analyst |
| 9.791820753 | 108321.9202 | Designer |
| 9.931633207 | 112595.8757 | Scientist |
| 10.46674401 | 95941.5808 | Educator |
| 10.3957529 | 149901.0961 | Marketing or sales professional |
| 8.653050774 | 103901.0658 | Developer; full-stack |
| 8.919727838 | 102814.5598 | Developer; back-end |
| 9.118639139 | 125454.2183 | Engineer; data |
| 8.939187347 | 102792.3237 | Developer; QA or test |
| 9.230899114 | 108271.3855 | Developer; game or graphics |
| 11.76445024 | 120308.0049 | Product manager |
Source: StackOverflow 2020 Survey
1. Find The Slope and Intercepts
Using REGR_SLOPE and REGR_INTERCEPT, we can quickly find the estimated slope and intercepts:
y_est = 7022.131X + 43603.459| M | B |
|---|---|
| 7022.130523674 | 43603.458640385 |
So now our line can be modeled with:
2. Generate values for the line
In order to add the line to the chart, we can generate x values for the trend line above. To do this we can use the GENERATOR function to generate rows throughout the range of the Years of Experience we have in our dataset.
| X |
|---|
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
| 13 |
| 14 |
| 15 |
| 16 |
| 17 |
| 18 |
Then we can use our trend line equation to generate the Y values for our trend line:
| X | Y_EST |
|---|---|
| 7 | 92758.372306104 |
| 8 | 99780.502829778 |
| 9 | 106802.633353452 |
| 10 | 113824.763877126 |
| 11 | 120846.8944008 |
| 12 | 127869.024924474 |
| 13 | 134891.155448148 |
| 14 | 141913.285971823 |
| 15 | 148935.416495497 |
| 16 | 155957.547019171 |
| 17 | 162979.677542845 |
| 18 | 170001.808066519 |
3. Join trend line points back to original data
| AVG_YEARS_CODE | AVG_SALARY | DEV_TYPE | X | Y_EST |
|---|---|---|---|---|
| 10.83692773 | 109036.0829 | Database administrator | 11 | 120846.8944008 |
| 10.50219539 | 147103.3569 | Engineer; site reliability | 11 | 120846.8944008 |
| 10.98557182 | 107815.8756 | System administrator | 11 | 120846.8944008 |
| 10.76276576 | 105691.588 | Developer; desktop or enterprise applications | 11 | 120846.8944008 |
| 10.87868481 | 116052.123 | Developer; embedded applications or devices | 11 | 120846.8944008 |
| 16.49919485 | 155010.0265 | Senior executive/VP | 16 | 155957.547019171 |
| 13.77134029 | 152102.1363 | Engineering manager | 14 | 141913.285971823 |
| 8.159303086 | 99119.27968 | Developer; front-end | 8 | 99780.502829778 |
| 8.38508884 | 90740.52521 | Developer; mobile | 8 | 99780.502829778 |
| 8.16277105 | 123911.1833 | Data scientist or machine learning specialist | 8 | 99780.502829778 |
| 8.131296718 | 79984.80214 | Academic researcher | 8 | 99780.502829778 |
| 10.49096595 | 117256.6903 | DevOps specialist | 10 | 113824.763877126 |
| 9.976124375 | 117228.032 | Data or business analyst | 10 | 113824.763877126 |
| 9.791820753 | 108321.9202 | Designer | 10 | 113824.763877126 |
| 9.931633207 | 112595.8757 | Scientist | 10 | 113824.763877126 |
| 10.46674401 | 95941.5808 | Educator | 10 | 113824.763877126 |
| 10.3957529 | 149901.0961 | Marketing or sales professional | 10 | 113824.763877126 |
| 8.653050774 | 103901.0658 | Developer; full-stack | 9 | 106802.633353452 |
| 8.919727838 | 102814.5598 | Developer; back-end | 9 | 106802.633353452 |
| 9.118639139 | 125454.2183 | Engineer; data | 9 | 106802.633353452 |
| 8.939187347 | 102792.3237 | Developer; QA or test | 9 | 106802.633353452 |
| 9.230899114 | 108271.3855 | Developer; game or graphics | 9 | 106802.633353452 |
| 11.76445024 | 120308.0049 | Product manager | 12 | 127869.024924474 |
| null | null | null | 18 | 170001.808066519 |
| null | null | null | 15 | 148935.416495497 |
| null | null | null | 17 | 162979.677542845 |
| null | null | null | 13 | 134891.155448148 |
| null | null | null | 7 | 92758.372306104 |
Now we have our scatter with our calculated trend line!
4. Evaluate goodness of fit
Now we can evaluate how well our line fits our data using REGR_R2:
| R_2 |
|---|
| 0.4651945324 |
An R-squared of 0.465 is not ideal, but it tells us how much we should (or shouldn't) trust the predictions we make with this equation.
5. Make predictions!
Now we can use our formula to work out for years of coding experience, what a developer's average salary may be:
SELECT regr_r2(AVG_YEARS_CODE,AVG_SALARY) R_2 FROM PUBLIC.SO_SALARY| YEARS_EXPERIENCE | ESTIMATED_SALARY |
|---|---|
| 5 | 78714.111258756 |
| 15 | 148935.416495497 |
How We Built This
This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.
This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.