SQL Resources/Snowflake/Histograms

Histograms

Histograms are essential tools in understanding how our data is shaped or distributed. Mastering the art of building and customizing histograms in SQL can be tricky at first, but once you've got it down it's a recipe you'll be coming back to again and again.

Step-by-step

The data

We'll use some historical weather data for London for this example. We can see the temperature (in Fahrenheit) for each day below:

Select DATE, TEMP from PUBLIC.LONDON_WEATHER 
DATE
2018-04-11
2016-10-31
2018-11-23
2018-01-09
2017-01-26
2017-02-13
2016-12-31
2018-05-31
2017-01-27
2018-03-02
2019-02-16
2018-02-19
2017-01-31
2019-12-04
2017-11-02
2017-03-11
2018-05-29
2017-03-16
2018-03-24
2016-12-28
2019-11-09
2018-05-30
2018-04-14
2019-12-28
2017-02-06
2019-04-02
2019-02-28
2018-03-04
2017-10-18
2016-12-14
2018-11-01
2016-11-21
2017-08-21
2017-12-10
2016-06-04
2019-11-08
2018-10-05
2017-02-14
2019-04-30
2018-10-06
2019-02-17
2018-11-28
2016-04-11
2016-12-10
2018-02-20
2016-01-17
2017-12-02
2016-09-22
2018-10-20
2019-01-11
2017-05-07
2018-03-09
2019-04-06
2018-02-28
2018-03-10
2018-05-25
2019-06-25
2019-06-13
2018-04-02
2018-12-25
2017-04-04
2018-07-29
2019-02-08
2019-01-23
2018-05-27
2018-04-22
2019-12-20
2016-06-09
2016-04-15
2018-09-12
2017-12-07
2019-11-16
2017-09-19
2016-11-03
2016-04-02
2016-05-17
2019-02-26
2018-11-26
2016-12-01
2019-11-18
2019-04-23
2019-08-25
2017-03-10
2018-06-02
2019-01-29
2018-10-25
2018-11-14
2018-02-22
2017-04-03
2018-05-09
2018-03-21
2016-04-01
2016-04-21
2019-01-16
2018-09-11
2018-11-09
2016-09-05
2017-10-13
2018-04-03
2019-06-08
TEMP
48.5
54.7
44.2
41.6
31
42.8
42.8
65.7
37.8
30.6
47.4
47.7
45
38.9
47.5
54.2
59.5
48.7
46.2
36
40.4
60
55.2
46.4
39.2
42.6
48.8
43
56.1
48.8
48
47.8
62.1
34.9
59.9
40.7
60.5
44.6
53.5
50.1
49
53.3
52.5
52.8
47.3
36.3
40.2
61.4
53.1
45.5
54.5
43.6
47.4
24.6
51.5
59.5
67.9
54.8
46.9
41.5
53.1
64.6
48.6
34.8
69.3
61.7
46.8
62.7
50
56.4
48
43.1
53.7
43.5
50.3
55.3
50.7
43.1
35
43.6
60.4
76.3
49.1
65.4
35.4
49.6
52.1
37.7
51.5
57.1
41.5
45.9
51.1
45.6
66.6
50.9
64.6
61.4
51.8
58.4

1. Create bins

Histograms count the number of values that fall in a certain bin, so the first thing we'll do is create our bins. We'll do this using the FLOOR function, which will round down a value to the nearest integer.

If we want to create bins of 5 degrees, we can divide our temperature by 5, round down, then multiply that by 5 to get the lower end of the bin.

FLOOR(value / bin_size) * bin_size

For example:

SELECT distinct
  A.TEMP,
  floor(A.TEMP / 5) floor_div_5, 
  floor(A.TEMP/ 5) * 5 bin_start,
  floor(A.TEMP / 5) * 5 + 5 bin_end, 
  concat(
    cast(floor(A.TEMP / 5) * 5 as STRING),
    ' - ',
    cast(floor(A.TEMP / 5) * 5 + 5 as STRING)
  ) bin_range
FROM A
ORDER BY TEMP
TEMP
24.6
27.3
27.9
30.6
31
31.4
31.7
32
32.3
32.5
32.7
33.3
33.7
33.8
34.2
34.3
34.4
34.5
34.7
34.8
34.9
35
35.1
35.2
35.3
35.4
35.5
35.6
35.7
35.8
35.9
36
36.1
36.2
36.3
36.4
36.5
36.6
36.7
36.8
36.9
37.1
37.2
37.3
37.4
37.5
37.6
37.7
37.8
37.9
38
38.1
38.3
38.4
38.5
38.7
38.8
38.9
39
39.1
39.2
39.3
39.4
39.5
39.6
39.8
39.9
40
40.1
40.2
40.3
40.4
40.5
40.6
40.7
40.8
40.9
41
41.1
41.2
41.4
41.5
41.6
41.7
41.8
41.9
42
42.1
42.2
42.3
42.4
42.5
42.6
42.7
42.8
42.9
43
43.1
43.2
43.3
FLOOR_DIV_5
4
5
5
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
8
BIN_START
20
25
25
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
30
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
BIN_END
25
30
30
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
35
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
40
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
45
BIN_RANGE
20 - 25
25 - 30
25 - 30
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
30 - 35
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
35 - 40
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45
40 - 45

2. Count values per bin

Once we have our bins, we can count the number of values in each bin using a simple COUNT function.

SELECT
  COUNT(DATE) COUNT,
  floor(A.TEMP / 5) * 5 bin_start,
  floor(A.TEMP / 5) * 5 + 5 bin_end, 
  concat(
    cast(floor(A.TEMP / 5) * 5 as STRING),
    ' - ',
    cast(floor(A.TEMP / 5) * 5 + 5 as STRING)
  ) bin_range
FROM A 
GROUP BY 2, 3, 4
ORDER BY BIN_START
COUNT
1
2
28
99
191
255
244
180
224
162
45
20
1
BIN_START
20
25
30
35
40
45
50
55
60
65
70
75
80
BIN_END
25
30
35
40
45
50
55
60
65
70
75
80
85
BIN_RANGE
20 - 25
25 - 30
30 - 35
35 - 40
40 - 45
45 - 50
50 - 55
55 - 60
60 - 65
65 - 70
70 - 75
75 - 80
80 - 85
Interpretation: Most days in London fall between 45 - 50 degrees Fahrenheit.

ALTERNATE: Use WIDTH_BUCKET

Snowflake has a handy function width_bucket that will also automatically assign your data into buckets based on a pre-determined range and number of buckets:

WIDTH_BUCKET( <expr> , <min_value> , <max_value> , <num_buckets> )
  • exprThe expression for which the histogram is created. This expression must evaluate to a numeric value or to a value that can be implicitly converted to a numeric value.
  • min_value and max_valueThe low and high end points of the acceptable range for the expression. The end points must also evaluate to numeric values and not be equal.
  • num_bucketsThe desired number of buckets; must be a positive integer value. A value from the expression is assigned to each bucket, and the function then returns the corresponding bucket number.When an expression falls outside the range, the function returns:0 if the expression is less than min_value, num_buckets + 1 if the expression is greater than or equal to max_value.
select count(DATE) count, width_bucket(TEMP, 0, 100, 10) bucket
from A group by bucket
COUNT
446
424
127
386
3
1
65
BUCKET
5
6
4
7
3
9
8

How we built this

This page was built using Count, the first notebook built around SQL. It combines the best features of a SQL IDE, Data Visualization Tool, and Computational Notebooks. In the Count notebook, each cell acts like a CTE, meaning you can reference any other cell in your queries.

This makes not only for far more readable reports (like this one) but also a much faster and more powerful way to do your analysis, essentially turning your analysis into a connected graph of data frames rather than one-off convoluted queries and CSV files. And with a built-in visualization framework, you won't have to export your data to make your charts. Go from raw data to interactive report in one document.

Become a SQL Expert
Get five free SQL lessons from Count's Head of Data, Taylor Brownlow.
Email
Title