
Summary Statistics

When you get a new dataset, one of the first things you'll want to do is compute its summary statistics. These essential numbers tell you the basics about your data, so you start your analysis on an informed footing.

The Basics

The main summary statistics are:

  • Minimum and Maximum
  • Mean
  • Median
  • Mode
  • Inter-quartile Range
  • Standard Deviation

We'll work through them using the PUBLIC.LONDON_WEATHER table:

select * from PUBLIC.LONDON_WEATHER
DATE       | TEMP | WEATHER | PRECIPITATION | VISIBILITY | WINDSPEED
2018-04-11 | 48.5 | fog     | 0.2           | 2          | 5.7
2016-10-31 | 54.7 | fog     | 0.01          | 2.5        | 2.1
2018-11-23 | 44.2 | rain    | 0.01          | 2.5        | 5.5
2018-01-09 | 41.6 | rain    | null          | 3          | 5.5
2017-01-26 | 31   | sunny   | 0             | 3          | 7.7
2017-02-13 | 42.8 | sunny   | 0             | 3          | 11.9
2016-12-31 | 42.8 | fog     | null          | 3.5        | 5.7
2018-05-31 | 65.7 | fog     | 0             | 3.5        | 5.1
2017-01-27 | 37.8 | sunny   | 0             | 3.5        | 6.1
2018-03-02 | 30.6 | rain    | 0             | 3.5        | 14.2
(first 10 rows of the result shown)
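
All of these can be computed with Snowflake's built-in aggregate functions. As a rough sketch (shown for the TEMP column; the same pattern applies to PRECIPITATION, VISIBILITY and WINDSPEED):

-- Summary statistics for one numeric column (TEMP)
select
  min(temp)                                          as min_temp,
  percentile_cont(0.25) within group (order by temp) as percentile_25,
  median(temp)                                       as median_temp,
  percentile_cont(0.75) within group (order by temp) as percentile_75,
  max(temp)                                          as max_temp,
  avg(temp)                                          as avg_temp,
  mode(temp)                                         as mode_temp,
  stddev(temp)                                       as std_dev_temp
from PUBLIC.LONDON_WEATHER

For the numeric columns, the results look like this: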
COL           | MIN | 25th PERCENTILE | MEDIAN | 75th PERCENTILE | MAX  | AVG           | MODE | STD_DEV
TEMP          | 24.6 | 45.675 | 52.75 | 61.8 | 83.4 | 53.512741047  | 57.7 | 10.211334572
PRECIPITATION | 0    | 0      | 0     | 0.07 | 1.44 | 0.06960468521 | 0    | 0.1466155086
VISIBILITY    | 0.2  | 5.8    | 6.4   | 6.7  | 14.8 | 6.105027548   | 6.7  | 1.080844202
WINDSPEED     | 1.8  | 5.2    | 7.2   | 9.5  | 22.1 | 7.690013774   | 4.8  | 3.228622125

COL  | MIN        | MAX        | MODE       | UNIQUE_VALUES
DATE | 2016-01-01 | 2019-12-31 | 2018-04-11 | 1452

COL     | MIN | MAX          | MODE | UNIQUE_VALUES
WEATHER | fog | thunderstorm | rain | 5

Correlation Matrix

Correlation matrices let you quickly see the correlations between all the different variables.

To create one, we will make use of the CORR function.
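
A sketch of how the matrix could be assembled, assuming the PUBLIC.LONDON_WEATHER table above (each column pair is written out explicitly and combined with UNION ALL):

-- Pairwise correlations between the numeric columns
select 'TEMP' as col_1, 'PRECIPITATION' as col_2, corr(temp, precipitation) as correlation from PUBLIC.LONDON_WEATHER
union all
select 'TEMP', 'VISIBILITY', corr(temp, visibility) from PUBLIC.LONDON_WEATHER
union all
select 'TEMP', 'WINDSPEED', corr(temp, windspeed) from PUBLIC.LONDON_WEATHER
union all
select 'PRECIPITATION', 'VISIBILITY', corr(precipitation, visibility) from PUBLIC.LONDON_WEATHER
union all
select 'PRECIPITATION', 'WINDSPEED', corr(precipitation, windspeed) from PUBLIC.LONDON_WEATHER
union all
select 'VISIBILITY', 'WINDSPEED', corr(visibility, windspeed) from PUBLIC.LONDON_WEATHER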

COL_1         | COL_2         | CORR
TEMP          | PRECIPITATION | -0.05288072854
TEMP          | VISIBILITY    | 0.3310294722
TEMP          | WINDSPEED     | 0.03470707836
PRECIPITATION | VISIBILITY    | -0.08879966527
PRECIPITATION | WINDSPEED     | 0.131898554
VISIBILITY    | WINDSPEED     | 0.2166210513
PRECIPITATION | TEMP          | -0.05288072854
VISIBILITY    | PRECIPITATION | -0.08879966527
WINDSPEED     | TEMP          | 0.03470707836
VISIBILITY    | TEMP          | 0.3310294722
WINDSPEED     | PRECIPITATION | 0.131898554
WINDSPEED     | VISIBILITY    | 0.2166210513

We can see the highest correlation is between temperature and visibility (0.33).

Identifying Outliers

Identifying outliers, or points that lie at the most extreme ends of your dataset, is crucial to the rest of your analysis. Early identification, and potential removal, of outliers makes for more accurate predictions and a better understanding of your data.

There are many ways to define outliers, but for this example, we'll use Tukey's fences:


A point is an outlier if it lies outside the range [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)], where k is 1.5 for standard outliers or 3 for flagging only particularly 'far out' points.

Step 1: Identify our Quartiles

We can use PERCENTILE_CONT to find the 25th and 75th percentiles of WINDSPEED:

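A minimal sketch of that cell:

-- 25th and 75th percentiles (Q1 and Q3) of wind speed
select
  percentile_cont(0.25) within group (order by windspeed) as q1,
  percentile_cont(0.75) within group (order by windspeed) as q3
from PUBLIC.LONDON_WEATHER
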
Q1  | Q3
5.2 | 9.5

This means we can define an outlier as any point that falls outside the following lower and upper bounds:

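A sketch of the bounds calculation with k = 1.5; the quartile query is inlined here as a CTE (in the Count notebook, the previous cell can simply be referenced):

with quartiles as (
  select
    percentile_cont(0.25) within group (order by windspeed) as q1,
    percentile_cont(0.75) within group (order by windspeed) as q3
  from PUBLIC.LONDON_WEATHER
)
-- Tukey's fences with k = 1.5
select
  q1 - 1.5 * (q3 - q1) as lower_bound,
  q3 + 1.5 * (q3 - q1) as upper_bound
from quartiles
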
LOWER_BOUND | UPPER_BOUND
-1.25       | 15.95

Step 2: Identify points outside our range:
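
A sketch of the final filter, again inlining the earlier steps as CTEs; the rows returned are the wind speeds flagged as outliers:

with quartiles as (
  select
    percentile_cont(0.25) within group (order by windspeed) as q1,
    percentile_cont(0.75) within group (order by windspeed) as q3
  from PUBLIC.LONDON_WEATHER
),
bounds as (
  select
    q1 - 1.5 * (q3 - q1) as lower_bound,
    q3 + 1.5 * (q3 - q1) as upper_bound
  from quartiles
)
-- Any wind speed outside [lower_bound, upper_bound] is an outlier
select windspeed
from PUBLIC.LONDON_WEATHER, bounds
where windspeed < lower_bound
   or windspeed > upper_bound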

WINDSPEED
17.6
16.2
16
16.1
17.3
18
17
18.4
18
16.1
20.2
17.8
19.5
18.5
19.5
22.1
16.3
20.3
16.8
21.6
19.2
17.7
19.7
17.7
16
18.3
17.6

How We Built This

This page was built using Count. It combines the best features of a SQL IDE, a data visualization tool, and a computational notebook. In a Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

This not only makes for far more readable reports (like this one), but also for a much faster and more powerful way to work: your analysis becomes a connected graph of data frames rather than a pile of one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.
