SQL Resources/Snowflake/Summary Statistics

Summary Statistics

When you get a new dataset, one of the first things you want to do is find your summary statistics. These essential numbers tell you the basics about your data so you start off your analysis on an informed foot.

The Basics

The main summary statistics are:

  • Min
  • Max
  • Average
  • Median
  • Mode
  • Inter-quartile Range
  • Standard Deviation
select * from PUBLIC.LONDON_WEATHER
DATE
2018-04-11
2016-10-31
2018-11-23
2018-01-09
2017-01-26
2017-02-13
2016-12-31
2018-05-31
2017-01-27
2018-03-02
2019-02-16
2018-02-19
2017-01-31
2019-12-04
2017-11-02
2017-03-11
2018-05-29
2017-03-16
2018-03-24
2016-12-28
2019-11-09
2018-05-30
2018-04-14
2019-12-28
2017-02-06
2019-04-02
2019-02-28
2018-03-04
2017-10-18
2016-12-14
2018-11-01
2016-11-21
2017-08-21
2017-12-10
2016-06-04
2019-11-08
2018-10-05
2017-02-14
2019-04-30
2018-10-06
2019-02-17
2018-11-28
2016-04-11
2016-12-10
2018-02-20
2016-01-17
2017-12-02
2016-09-22
2018-10-20
2019-01-11
2017-05-07
2018-03-09
2019-04-06
2018-02-28
2018-03-10
2018-05-25
2019-06-25
2019-06-13
2018-04-02
2018-12-25
2017-04-04
2018-07-29
2019-02-08
2019-01-23
2018-05-27
2018-04-22
2019-12-20
2016-06-09
2016-04-15
2018-09-12
2017-12-07
2019-11-16
2017-09-19
2016-11-03
2016-04-02
2016-05-17
2019-02-26
2018-11-26
2016-12-01
2019-11-18
2019-04-23
2019-08-25
2017-03-10
2018-06-02
2019-01-29
2018-10-25
2018-11-14
2018-02-22
2017-04-03
2018-05-09
2018-03-21
2016-04-01
2016-04-21
2019-01-16
2018-09-11
2018-11-09
2016-09-05
2017-10-13
2018-04-03
2019-06-08
TEMP
48.5
54.7
44.2
41.6
31
42.8
42.8
65.7
37.8
30.6
47.4
47.7
45
38.9
47.5
54.2
59.5
48.7
46.2
36
40.4
60
55.2
46.4
39.2
42.6
48.8
43
56.1
48.8
48
47.8
62.1
34.9
59.9
40.7
60.5
44.6
53.5
50.1
49
53.3
52.5
52.8
47.3
36.3
40.2
61.4
53.1
45.5
54.5
43.6
47.4
24.6
51.5
59.5
67.9
54.8
46.9
41.5
53.1
64.6
48.6
34.8
69.3
61.7
46.8
62.7
50
56.4
48
43.1
53.7
43.5
50.3
55.3
50.7
43.1
35
43.6
60.4
76.3
49.1
65.4
35.4
49.6
52.1
37.7
51.5
57.1
41.5
45.9
51.1
45.6
66.6
50.9
64.6
61.4
51.8
58.4
WEATHER
fog
fog
rain
rain
sunny
sunny
fog
fog
sunny
rain
sunny
rain
rain
fog
fog
sunny
fog
fog
rain
fog
fog
rain
sunny
fog
fog
rain
rain
rain
rain
fog
fog
rain
rain
rain
sunny
fog
sunny
sunny
sunny
fog
fog
rain
rain
rain
rain
rain
rain
rain
sunny
rain
sunny
rain
fog
snow
rain
rain
rain
rain
rain
sunny
rain
rain
rain
fog
rain
rain
rain
sunny
rain
rain
rain
rain
rain
rain
rain
rain
sunny
rain
sunny
rain
sunny
fog
rain
sunny
rain
sunny
rain
rain
fog
sunny
sunny
sunny
sunny
rain
rain
rain
rain
sunny
rain
rain
PRECIPITATION
0.2
0.01
0.01
NULL
0
0
NULL
0
0
0
0.01
0.02
0.03
0
0
0
0
0
0.09
0.01
0.01
0.42
0
0
0
0
0
0.19
0.01
0.01
0.16
0.4
0.26
0.21
0
0
0
0
0
0
0
0.27
0.05
0.03
0.06
0.07
NULL
NULL
0
0
0
0
0
0
0.25
0.09
0.11
0.57
0.46
0.01
0.01
0.01
0.01
0.16
0.43
0.43
0.43
0.05
0.1
0.13
0.12
0.07
0.14
NULL
NULL
NULL
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.09
0.09
0.18
0.33
VISIBILITY
2
2.5
2.5
3
3
3
3.5
3.5
3.5
3.5
3.5
3.5
3.5
4
4
4
4
4
4
4
4
4
4.5
4.5
4.5
4.5
4.5
4.5
4.5
4.5
4.5
4.5
4.5
4.5
5
5
5
5
5
5
5
5
5
5
5
5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
5.5
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
WINDSPEED
5.7
2.1
5.5
5.5
7.7
11.9
5.7
5.1
6.1
14.2
6.3
5.4
7.1
2.3
2.7
5.6
6.4
7.8
4.1
2.9
6
3.3
4.5
5.7
5.1
5.8
8.6
7
4.3
4.9
5.3
8.8
4.9
6.1
4.1
3.7
4.3
6.9
4.3
6.8
8
11.7
6
8.4
7.8
6.4
4.8
6.6
4
4.4
5.4
5.8
7.6
8.9
8.1
3.3
4.1
10.5
9.3
3.4
4.8
13
15.5
3.9
5.5
9.1
8.2
4.9
4.2
3.8
12.7
4
4.3
5
7.6
7.7
3
4.8
4.1
4.6
4.8
5.1
6.1
7.8
5
5.4
7.2
7.5
7
7.2
7.3
7.7
9.5
9.8
11.6
11.3
6
10.5
10.3
15.3
SELECT * FROM (
SELECT
  'TEMP' COL,
  MIN(TEMP) MIN,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY TEMP) "25th PERCENTILE",
  MEDIAN(TEMP) MEDIAN,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY TEMP) "75th PERCENTILE",
  MAX(TEMP) MAX,
  AVG(TEMP) AVG,
  MODE(TEMP) MODE,
  STDDEV(TEMP) STD_DEV
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL)
UNION ALL (
  SELECT
  'PRECIPITATION' COL,
  MIN(PRECIPITATION) MIN,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY PRECIPITATION) "25th PERCENTILE",
  MEDIAN(PRECIPITATION) MEDIAN,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY PRECIPITATION) "75th PERCENTILE",
  MAX(PRECIPITATION) MAX,
  AVG(PRECIPITATION) AVG,
  MODE(PRECIPITATION) MODE,
  STDDEV(PRECIPITATION) STD_DEV
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL
)
UNION ALL (
  SELECT
  'VISIBILITY' COL,
  MIN(VISIBILITY) MIN,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY VISIBILITY) "25th PERCENTILE",
  MEDIAN(VISIBILITY) MEDIAN,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY VISIBILITY) "75th PERCENTILE",
  MAX(VISIBILITY) MAX,
  AVG(VISIBILITY) AVG,
  MODE(VISIBILITY) MODE,
  STDDEV(VISIBILITY) STD_DEV
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL
)
UNION ALL (
  SELECT
  'WINDSPEED' COL,
  MIN(WINDSPEED) MIN,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY WINDSPEED) "25th PERCENTILE",
  MEDIAN(WINDSPEED) MEDIAN,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY WINDSPEED) "75th PERCENTILE",
  MAX(WINDSPEED) MAX,
  AVG(WINDSPEED) AVG,
  MODE(WINDSPEED) MODE,
  STDDEV(WINDSPEED) STD_DEV
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL
)
COL
TEMP
PRECIPITATION
VISIBILITY
WINDSPEED
MIN
24.6
0
0.2
1.8
25th PERCENTILE
45.675
0
5.8
5.2
MEDIAN
52.75
0
6.4
7.2
75th PERCENTILE
61.8
0.07
6.7
9.5
MAX
83.4
1.44
14.8
22.1
AVG
53.512741047
0.06960468521
6.105027548
7.690013774
MODE
57.7
0
6.7
4.8
STD_DEV
10.211334572
0.1466155086
1.080844202
3.228622125
SELECT
  'DATE' COL,
  MIN(DATE) MIN,
  MAX(DATE) MAX,
  MODE(DATE) MODE,
  COUNT(DISTINCT DATE) UNIQUE_VALUES
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL
COL
DATE
MIN
2016-01-01
MAX
2019-12-31
MODE
2018-04-11
UNIQUE_VALUES
1452
SELECT
  'WEATHER' COL,
  MIN(WEATHER) MIN,
  MAX(WEATHER) MAX,
  MODE(WEATHER) MODE,
  COUNT(DISTINCT WEATHER) UNIQUE_VALUES
FROM PUBLIC.LONDON_WEATHER
GROUP BY COL
COL
WEATHER
MIN
fog
MAX
thunderstorm
MODE
rain
UNIQUE_VALUES
5

Correlation Matrix

Correlation matrices let you quickly see the correlations between all the different variables.

To create them, we will make use of the CORR

SELECT * FROM 
(SELECT 'TEMP' COL_1, 'PRECIPITATION' COL_2, CORR(TEMP,PRECIPITATION) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL 
(SELECT 'TEMP' COL_1, 'VISIBILITY' COL_2, CORR(TEMP,VISIBILITY) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL
(SELECT 'TEMP' COL_1, 'WINDSPEED' COL_2, CORR(TEMP,WINDSPEED) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL
(SELECT 'PRECIPITATION' COL_1, 'VISIBILITY' COL_2, CORR(PRECIPITATION,VISIBILITY) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL
(SELECT 'PRECIPITATION' COL_1, 'WINDSPEED' COL_2, CORR(PRECIPITATION,WINDSPEED) CORR
FROM PUBLIC.LONDON_WEATHER)
UNION ALL
(SELECT 'VISIBILITY' COL_1, 'WINDSPEED' COL_2, CORR(VISIBILITY,WINDSPEED) CORR
FROM PUBLIC.LONDON_WEATHER)
UNION ALL
(SELECT 'PRECIPITATION' COL_1, 'TEMP' COL_2, CORR(PRECIPITATION,TEMP) CORR
FROM PUBLIC.LONDON_WEATHER)
UNION ALL
(SELECT 'VISIBILITY' COL_1, 'PRECIPITATION' COL_2, CORR(VISIBILITY,PRECIPITATION) CORR
FROM PUBLIC.LONDON_WEATHER)
UNION ALL
(SELECT 'WINDSPEED' COL_1, 'TEMP' COL_2, CORR(WINDSPEED,TEMP) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL
(SELECT 'VISIBILITY' COL_1, 'TEMP' COL_2, CORR(VISIBILITY,TEMP) CORR
FROM PUBLIC.LONDON_WEATHER)
UNION ALL
(SELECT 'WINDSPEED' COL_1, 'PRECIPITATION' COL_2, CORR(WINDSPEED,PRECIPITATION) CORR
FROM PUBLIC.LONDON_WEATHER) UNION ALL
(SELECT 'WINDSPEED' COL_1, 'VISIBILITY' COL_2, CORR(WINDSPEED,VISIBILITY) CORR
FROM PUBLIC.LONDON_WEATHER)
COL_1
TEMP
TEMP
TEMP
PRECIPITATION
PRECIPITATION
VISIBILITY
PRECIPITATION
VISIBILITY
WINDSPEED
VISIBILITY
WINDSPEED
WINDSPEED
COL_2
PRECIPITATION
VISIBILITY
WINDSPEED
VISIBILITY
WINDSPEED
WINDSPEED
TEMP
PRECIPITATION
TEMP
TEMP
PRECIPITATION
VISIBILITY
CORR
-0.05288072854
0.3310294722
0.03470707836
-0.08879966527
0.131898554
0.2166210513
-0.05288072854
-0.08879966527
0.03470707836
0.3310294722
0.131898554
0.2166210513

We can see the highest correlations are between temperature and visibility

Identifying Outliers

Identifying outliers, or points that lie at the most extreme ends of your dataset, is a crucial to the rest of your analysis. Early identification, and potential removal of outliers makes for more accurate predictions and a better understanding of your data.

There are many ways to define outliers, but for this example, we'll use Tukey's fences:

Step 1: Identfy our Quartiles

We can use PERCENTILE_CONT to find our 25th and 75th percentiles:

SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY WINDSPEED) "Q1",
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY WINDSPEED) "Q3"
from PUBLIC.LONDON_WEATHER
Q1
5.2
Q3
9.5

This means we can define an outlier that is outside of the following upper and lower bounds:

SELECT (Q1-1.5*(Q3-Q1)) LOWER_BOUND , (Q3+1.5*(Q3-Q1)) UPPER_BOUND FROM IQR
LOWER_BOUND
-1.25
UPPER_BOUND
15.95

Step 2: Identify points outside our range:

Select WINDSPEED from PUBLIC.LONDON_WEATHER cross join IQR where WINDSPEED not between (Q1-1.5*(Q3-Q1)) AND (Q3+1.5*(Q3-Q1))
WINDSPEED
17.6
16.2
16
16.1
17.3
18
17
18.4
18
16.1
20.2
17.8
19.5
18.5
19.5
22.1
16.3
20.3
16.8
21.6
19.2
17.7
19.7
17.7
16
18.3
17.6

How We Built This

This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

Become a SQL Expert
Get five free SQL lessons from Count's Head of Data, Taylor Brownlow.
Email
Title