SQL Resources
SQL
Linear Regression

Linear Regression

Linear regression is a powerful way to understand your data and make predictions. Doing this in SQL has always been difficult, but Snowflake has a few built-in functions that simplify the process.

Crash course

Linear regression is the process of fitting a linear model to observed data. This often takes the shape of fitting a trend line to data.

These lines can then be used to:

  • Better understand the relationship between the two variables
  • Make predictions about un-observed values
  • There are many variations and methodologies to calculate this trend line, but the most common is called Least-Squares Regression, which attempts to minimize, on average, the distance between the fitted line and the observed points.

    The idea is to place a line that minimizes the sum of the squared residuals, represented as the red lines above.

    MINIMIZE SUM(error_1^2, error_2^2, ... , error_n^2)
    We use the square of the error so errors above and below the line are treated the same.

    Finding the line

    The line is represented by:

    y_est = mx + b

    where:

  • y_est - estimated y value
  • m - slope
  • x - x value
  • b - y-intercept
  • We can estimate m and b using the following formulas:

    m = sum((x - x_bar) / (y - y_bar)) / sum((x-x_bar)^2)
    b = y_bar - (m * x_bar)

    Goodness of fit

    Once we use these formulas to have our line, we can evaluate how well it fits our data using a metric called R-squared.

    In effect, R-squared measures how much of the variance can be explained by our fit line.

    R-squared = sum((y_est - y_bar)^2) / sum((y - y_bar)^2)
    

    This will evaluate to a number between 0 and 1, with 1 being a better fitting line.

    Step-By-Step

    Thankfully, Snowflake has some built-in functions to make finding a trend line much easier. These include:

  • REGR_AVGX
  • REGR_AVGY
  • REGR_COUNT
  • REGR_INTERCEPT
  • REGR_R2
  • REGR_SLOPE
  • REGR_SXX
  • REGR_SXY
  • REGR_SYY
  • Let's see how to use them to make the following plot:

    SELECT AVG_YEARS_CODE, AVG_SALARY, DEV_TYPE FROM PUBLIC.SO_SALARY
    
    AVG_YEARS_CODEAVG_SALARYDEV_TYPE
    10.83692773109036.0829Database administrator
    10.50219539147103.3569Engineer; site reliability
    10.98557182107815.8756System administrator
    10.76276576105691.588Developer; desktop or enterprise applications
    10.87868481116052.123Developer; embedded applications or devices
    16.49919485155010.0265Senior executive/VP
    13.77134029152102.1363Engineering manager
    8.15930308699119.27968Developer; front-end
    8.3850888490740.52521Developer; mobile
    8.16277105123911.1833Data scientist or machine learning specialist
    8.13129671879984.80214Academic researcher
    10.49096595117256.6903DevOps specialist
    9.976124375117228.032Data or business analyst
    9.791820753108321.9202Designer
    9.931633207112595.8757Scientist
    10.4667440195941.5808Educator
    10.3957529149901.0961Marketing or sales professional
    8.653050774103901.0658Developer; full-stack
    8.919727838102814.5598Developer; back-end
    9.118639139125454.2183Engineer; data
    8.939187347102792.3237Developer; QA or test
    9.230899114108271.3855Developer; game or graphics
    11.76445024120308.0049Product manager
    Source: StackOverflow 2020 Survey

    1. Find The Slope and Intercepts

    Using REGR_SLOPE and REGR_INTERCEPT, we can quickly find the estimated slope and intercepts:

    SELECT
      REGR_SLOPE(AVG_SALARY,AVG_YEARS_CODE) m,
      regr_intercept(AVG_SALARY,AVG_YEARS_CODE) b
    from PUBLIC.SO_SALARY
    MB
    7022.13052367443603.458640385

    So now our line can be modeled with:

    y_est = 7022.131X + 43603.459
    

    2. Generate values for the line

    In order to add the line to the chart, we can generate x values for the trend line above. To do this we can use the GENERATOR function to generate rows throughout the range of the Years of Experience we have in our dataset.

    SELECT row_number() over (order by seq4()) + 6 x FROM table(generator(ROWCOUNT => 12))
    
    X
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18

    Then we can use our trend line equation to generate the Y values for our trend line:

    Select LINE_X.X, LINE_X.X * M + B y_est from LINE_X cross join SLOPES
    
    XY_EST
    792758.372306104
    899780.502829778
    9106802.633353452
    10113824.763877126
    11120846.8944008
    12127869.024924474
    13134891.155448148
    14141913.285971823
    15148935.416495497
    16155957.547019171
    17162979.677542845
    18170001.808066519

    3. Join trend line points back to original data

    Select * from A full outer join LINE_XY ON ROUND(A.AVG_YEARS_CODE) = LINE_XY.X
    
    AVG_YEARS_CODEAVG_SALARYDEV_TYPEXY_EST
    10.83692773109036.0829Database administrator11120846.8944008
    10.50219539147103.3569Engineer; site reliability11120846.8944008
    10.98557182107815.8756System administrator11120846.8944008
    10.76276576105691.588Developer; desktop or enterprise applications11120846.8944008
    10.87868481116052.123Developer; embedded applications or devices11120846.8944008
    16.49919485155010.0265Senior executive/VP16155957.547019171
    13.77134029152102.1363Engineering manager14141913.285971823
    8.15930308699119.27968Developer; front-end899780.502829778
    8.3850888490740.52521Developer; mobile899780.502829778
    8.16277105123911.1833Data scientist or machine learning specialist899780.502829778
    8.13129671879984.80214Academic researcher899780.502829778
    10.49096595117256.6903DevOps specialist10113824.763877126
    9.976124375117228.032Data or business analyst10113824.763877126
    9.791820753108321.9202Designer10113824.763877126
    9.931633207112595.8757Scientist10113824.763877126
    10.4667440195941.5808Educator10113824.763877126
    10.3957529149901.0961Marketing or sales professional10113824.763877126
    8.653050774103901.0658Developer; full-stack9106802.633353452
    8.919727838102814.5598Developer; back-end9106802.633353452
    9.118639139125454.2183Engineer; data9106802.633353452
    8.939187347102792.3237Developer; QA or test9106802.633353452
    9.230899114108271.3855Developer; game or graphics9106802.633353452
    11.76445024120308.0049Product manager12127869.024924474
    nullnullnull18170001.808066519
    nullnullnull15148935.416495497
    nullnullnull17162979.677542845
    nullnullnull13134891.155448148
    nullnullnull792758.372306104

    Now we have our scatter with our calculated trend line!

    4. Evaluate goodness of fit

    Now we can evaluate how well our line fits our data using REGR_R2:

    SELECT regr_r2(AVG_YEARS_CODE,AVG_SALARY) R_2 FROM PUBLIC.SO_SALARY
    
    R_2
    0.4651945324

    An R-squared of 0.465 is not ideal, but it tells us how much we should (or shouldn't) trust the predictions we make with this equation.

    5. Make predictions!

    Now we can use our formula to work out for  years of coding experience, what a developer's average salary may be:

    SELECT 5 years_experience, SLOPES.M * 5 + SLOPES.B Estimated_salary FROM SLOPES
    union all 
    (SELECT 15 years_experience, SLOPES.M * 15 + SLOPES.B Estimated_salary FROM SLOPES)
    YEARS_EXPERIENCEESTIMATED_SALARY
    578714.111258756
    15148935.416495497

    How We Built This

    This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

    This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

    Subscribe to newsletter

    Subscribe to receive the latest blog posts to your inbox every week.

    By subscribing you agree to our Privacy Policy.
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.

    Start solving your organization's biggest problems with Count today.

    Stay up to date with all things count, data and problem solving.