SQL Resources
SQL
Summary Statistics

Summary Statistics

When you get a new dataset, one of the first things you want to do is find your summary statistics. These essential numbers tell you the basics about your data so you start off your analysis on an informed foot.

The Basics

The main summary statistics are:

  • Min
  • Max
  • Average
  • Median
  • Mode
  • Inter-quartile Range
  • Standard Deviation
  • select * from PUBLIC.LONDON_WEATHER
    DATETEMPWEATHERPRECIPITATIONVISIBILITYWINDSPEED
    2018-04-1148.5fog0.225.7
    2016-10-3154.7fog0.012.52.1
    2018-11-2344.2rain0.012.55.5
    2018-01-0941.6rainnull35.5
    2017-01-2631sunny037.7
    2017-02-1342.8sunny0311.9
    2016-12-3142.8fognull3.55.7
    2018-05-3165.7fog03.55.1
    2017-01-2737.8sunny03.56.1
    2018-03-0230.6rain03.514.2
    2019-02-1647.4sunny0.013.56.3
    2018-02-1947.7rain0.023.55.4
    2017-01-3145rain0.033.57.1
    2019-12-0438.9fog042.3
    2017-11-0247.5fog042.7
    2017-03-1154.2sunny045.6
    2018-05-2959.5fog046.4
    2017-03-1648.7fog047.8
    2018-03-2446.2rain0.0944.1
    2016-12-2836fog0.0142.9
    2019-11-0940.4fog0.0146
    2018-05-3060rain0.4243.3
    2018-04-1455.2sunny04.54.5
    2019-12-2846.4fog04.55.7
    2017-02-0639.2fog04.55.1
    2019-04-0242.6rain04.55.8
    2019-02-2848.8rain04.58.6
    2018-03-0443rain0.194.57
    2017-10-1856.1rain0.014.54.3
    2016-12-1448.8fog0.014.54.9
    2018-11-0148fog0.164.55.3
    2016-11-2147.8rain0.44.58.8
    2017-08-2162.1rain0.264.54.9
    2017-12-1034.9rain0.214.56.1
    2016-06-0459.9sunny054.1
    2019-11-0840.7fog053.7
    2018-10-0560.5sunny054.3
    2017-02-1444.6sunny056.9
    2019-04-3053.5sunny054.3
    2018-10-0650.1fog056.8
    2019-02-1749fog058
    2018-11-2853.3rain0.27511.7
    2016-04-1152.5rain0.0556
    2016-12-1052.8rain0.0358.4
    2018-02-2047.3rain0.0657.8
    2016-01-1736.3rain0.0756.4
    2017-12-0240.2rainnull5.54.8
    2016-09-2261.4rainnull5.56.6
    2018-10-2053.1sunny05.54
    2019-01-1145.5rain05.54.4
    2017-05-0754.5sunny05.55.4
    2018-03-0943.6rain05.55.8
    2019-04-0647.4fog05.57.6
    2018-02-2824.6snow05.58.9
    2018-03-1051.5rain0.255.58.1
    2018-05-2559.5rain0.095.53.3
    2019-06-2567.9rain0.115.54.1
    2019-06-1354.8rain0.575.510.5
    2018-04-0246.9rain0.465.59.3
    2018-12-2541.5sunny0.015.53.4
    2017-04-0453.1rain0.015.54.8
    2018-07-2964.6rain0.015.513
    2019-02-0848.6rain0.015.515.5
    2019-01-2334.8fog0.165.53.9
    2018-05-2769.3rain0.435.55.5
    2018-04-2261.7rain0.435.59.1
    2019-12-2046.8rain0.435.58.2
    2016-06-0962.7sunny0.055.54.9
    2016-04-1550rain0.15.54.2
    2018-09-1256.4rain0.135.53.8
    2017-12-0748rain0.125.512.7
    2019-11-1643.1rain0.075.54
    2017-09-1953.7rain0.145.54.3
    2016-11-0343.5rainnull65
    2016-04-0250.3rainnull67.6
    2016-05-1755.3rainnull67.7
    2019-02-2650.7sunny063
    2018-11-2643.1rain064.8
    2016-12-0135sunny064.1
    2019-11-1843.6rain064.6
    2019-04-2360.4sunny064.8
    2019-08-2576.3fog065.1
    2017-03-1049.1rain066.1
    2018-06-0265.4sunny067.8
    2019-01-2935.4rain065
    2018-10-2549.6sunny065.4
    2018-11-1452.1rain067.2
    2018-02-2237.7rain067.5
    2017-04-0351.5fog067
    2018-05-0957.1sunny067.2
    2018-03-2141.5sunny067.3
    2016-04-0145.9sunny067.7
    2016-04-2151.1sunny069.5
    2019-01-1645.6rain069.8
    2018-09-1166.6rain0611.6
    2018-11-0950.9rain0611.3
    2016-09-0564.6rain0.0966
    2017-10-1361.4sunny0.09610.5
    2018-04-0351.8rain0.18610.3
    2019-06-0858.4rain0.33615.3
    SELECT * FROM (
    SELECT
      'TEMP' COL,
      MIN(TEMP) MIN,
      PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY TEMP) "25th PERCENTILE",
      MEDIAN(TEMP) MEDIAN,
      PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY TEMP) "75th PERCENTILE",
      MAX(TEMP) MAX,
      AVG(TEMP) AVG,
      MODE(TEMP) MODE,
      STDDEV(TEMP) STD_DEV
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL)
    UNION ALL (
      SELECT
      'PRECIPITATION' COL,
      MIN(PRECIPITATION) MIN,
      PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY PRECIPITATION) "25th PERCENTILE",
      MEDIAN(PRECIPITATION) MEDIAN,
      PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY PRECIPITATION) "75th PERCENTILE",
      MAX(PRECIPITATION) MAX,
      AVG(PRECIPITATION) AVG,
      MODE(PRECIPITATION) MODE,
      STDDEV(PRECIPITATION) STD_DEV
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL
    )
    UNION ALL (
      SELECT
      'VISIBILITY' COL,
      MIN(VISIBILITY) MIN,
      PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY VISIBILITY) "25th PERCENTILE",
      MEDIAN(VISIBILITY) MEDIAN,
      PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY VISIBILITY) "75th PERCENTILE",
      MAX(VISIBILITY) MAX,
      AVG(VISIBILITY) AVG,
      MODE(VISIBILITY) MODE,
      STDDEV(VISIBILITY) STD_DEV
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL
    )
    UNION ALL (
      SELECT
      'WINDSPEED' COL,
      MIN(WINDSPEED) MIN,
      PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY WINDSPEED) "25th PERCENTILE",
      MEDIAN(WINDSPEED) MEDIAN,
      PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY WINDSPEED) "75th PERCENTILE",
      MAX(WINDSPEED) MAX,
      AVG(WINDSPEED) AVG,
      MODE(WINDSPEED) MODE,
      STDDEV(WINDSPEED) STD_DEV
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL
    )
    COLMIN25th PERCENTILEMEDIAN75th PERCENTILEMAXAVGMODESTD_DEV
    TEMP24.645.67552.7561.883.453.51274104757.710.211334572
    PRECIPITATION0000.071.440.0696046852100.1466155086
    VISIBILITY0.25.86.46.714.86.1050275486.71.080844202
    WINDSPEED1.85.27.29.522.17.6900137744.83.228622125
    SELECT
      'DATE' COL,
      MIN(DATE) MIN,
      MAX(DATE) MAX,
      MODE(DATE) MODE,
      COUNT(DISTINCT DATE) UNIQUE_VALUES
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL
    COLMINMAXMODEUNIQUE_VALUES
    DATE2016-01-012019-12-312018-04-111452
    SELECT
      'WEATHER' COL,
      MIN(WEATHER) MIN,
      MAX(WEATHER) MAX,
      MODE(WEATHER) MODE,
      COUNT(DISTINCT WEATHER) UNIQUE_VALUES
    FROM PUBLIC.LONDON_WEATHER
    GROUP BY COL
    COLMINMAXMODEUNIQUE_VALUES
    WEATHERfogthunderstormrain5

    Correlation Matrix

    Correlation matrices let you quickly see the correlations between all the different variables.

    To create them, we will make use of the CORR

    SELECT * FROM 
    (SELECT 'TEMP' COL_1, 'PRECIPITATION' COL_2, CORR(TEMP,PRECIPITATION) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL 
    (SELECT 'TEMP' COL_1, 'VISIBILITY' COL_2, CORR(TEMP,VISIBILITY) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL
    (SELECT 'TEMP' COL_1, 'WINDSPEED' COL_2, CORR(TEMP,WINDSPEED) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL
    (SELECT 'PRECIPITATION' COL_1, 'VISIBILITY' COL_2, CORR(PRECIPITATION,VISIBILITY) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL
    (SELECT 'PRECIPITATION' COL_1, 'WINDSPEED' COL_2, CORR(PRECIPITATION,WINDSPEED) CORR
    FROM PUBLIC.LONDON_WEATHER)
    UNION ALL
    (SELECT 'VISIBILITY' COL_1, 'WINDSPEED' COL_2, CORR(VISIBILITY,WINDSPEED) CORR
    FROM PUBLIC.LONDON_WEATHER)
    UNION ALL
    (SELECT 'PRECIPITATION' COL_1, 'TEMP' COL_2, CORR(PRECIPITATION,TEMP) CORR
    FROM PUBLIC.LONDON_WEATHER)
    UNION ALL
    (SELECT 'VISIBILITY' COL_1, 'PRECIPITATION' COL_2, CORR(VISIBILITY,PRECIPITATION) CORR
    FROM PUBLIC.LONDON_WEATHER)
    UNION ALL
    (SELECT 'WINDSPEED' COL_1, 'TEMP' COL_2, CORR(WINDSPEED,TEMP) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL
    (SELECT 'VISIBILITY' COL_1, 'TEMP' COL_2, CORR(VISIBILITY,TEMP) CORR
    FROM PUBLIC.LONDON_WEATHER)
    UNION ALL
    (SELECT 'WINDSPEED' COL_1, 'PRECIPITATION' COL_2, CORR(WINDSPEED,PRECIPITATION) CORR
    FROM PUBLIC.LONDON_WEATHER) UNION ALL
    (SELECT 'WINDSPEED' COL_1, 'VISIBILITY' COL_2, CORR(WINDSPEED,VISIBILITY) CORR
    FROM PUBLIC.LONDON_WEATHER)
    COL_1COL_2CORR
    TEMPPRECIPITATION-0.05288072854
    TEMPVISIBILITY0.3310294722
    TEMPWINDSPEED0.03470707836
    PRECIPITATIONVISIBILITY-0.08879966527
    PRECIPITATIONWINDSPEED0.131898554
    VISIBILITYWINDSPEED0.2166210513
    PRECIPITATIONTEMP-0.05288072854
    VISIBILITYPRECIPITATION-0.08879966527
    WINDSPEEDTEMP0.03470707836
    VISIBILITYTEMP0.3310294722
    WINDSPEEDPRECIPITATION0.131898554
    WINDSPEEDVISIBILITY0.2166210513

    We can see the highest correlations are between temperature and visibility

    Identifying Outliers

    Identifying outliers, or points that lie at the most extreme ends of your dataset, is a crucial to the rest of your analysis. Early identification, and potential removal of outliers makes for more accurate predictions and a better understanding of your data.

    There are many ways to define outliers, but for this example, we'll use Tukey's fences:

    A point is an outlier if it lies outside the range [Q1 - k(Q3-Q1) , Q3 + k(Q3-Q1)] where k is either 1.5 or 3 for identifying particularly 'far out' data.

    Step 1: Identfy our Quartiles

    We can use PERCENTILE_CONT to find our 25th and 75th percentiles:

    SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY WINDSPEED) "Q1",
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY WINDSPEED) "Q3"
    from PUBLIC.LONDON_WEATHER
    Q1Q3
    5.29.5

    This means we can define an outlier that is outside of the following upper and lower bounds:

    SELECT (Q1-1.5*(Q3-Q1)) LOWER_BOUND , (Q3+1.5*(Q3-Q1)) UPPER_BOUND FROM IQR
    LOWER_BOUNDUPPER_BOUND
    -1.2515.95

    Step 2: Identify points outside our range:

    Select WINDSPEED from PUBLIC.LONDON_WEATHER cross join IQR where WINDSPEED not between (Q1-1.5*(Q3-Q1)) AND (Q3+1.5*(Q3-Q1))
    WINDSPEED
    17.6
    16.2
    16
    16.1
    17.3
    18
    17
    18.4
    18
    16.1
    20.2
    17.8
    19.5
    18.5
    19.5
    22.1
    16.3
    20.3
    16.8
    21.6
    19.2
    17.7
    19.7
    17.7
    16
    18.3
    17.6

    How We Built This

    This page was built using Count. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

    This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

    Subscribe to newsletter

    Subscribe to receive the latest blog posts to your inbox every week.

    By subscribing you agree to our Privacy Policy.
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.

    Start solving your organization's biggest problems with Count today.

    Stay up to date with all things count, data and problem solving.