String Functions Explained
Introduction
STRING
s are a crucial part of any dataset and being able to confidently manipulate and transform them can make all the difference in your analysis. This notebook covers the common STRING
manipulations in BigQuery.
The Basics
STRING
s are their own data type in Big Query. They are the most flexible type - often our dates if not formatted exactly right will be read in as a STRING
, or our survey results will be listed as STRING
s of responses. Therefore it's crucial to know how to master them.
Manipulating Strings
How to modify and transform entire STRING
s. Each of these functions returns a STRING
.
Removing and adding characters
There are several functions available in BigQuery to add and remove whitespace to your STRING
s. They either involve:
For trimming, the functions are:
TRIM(value1[, value2])
-> Removes all leading and trailing characters that match value2
(whitespace if not specified)LTRIM(value1[, value2])
-> Removes only leading characters that match value2
(whitespace if not specified)RTRIM(value1[, value2])
-> Removes only trailing characters that match value2
(whitespace if not specified)SELECT
' Original String__',
TRIM(' Original String__') AS trimmed,
LTRIM(' Original String__') AS left_trim,
RTRIM(' Original String__', "__") AS right_trim
💡 TRIM
is very useful when dealing Substrings as shown in the next section.
For padding, the functions available are:
RPAD(original_value, return_length[, pattern])
-> Returns the original_value
appended with the pattern
up to the return_length number of characters.LPAD(original_value, return_length[, pattern])
-> Returns the original_value
prepended with the pattern
up to the return_length number of characters.SELECT
RPAD('abc', 5) AS right_spaces,
LPAD('abc', 5, '--') AS left_pad_hypen
Changing case
A common way to re-format STRINGs is to change the case. To do that in BigQuery, you can use:
LOWER(value)
-> Returns value
in lowercaseUPPER(value)
-> Returns value
in uppercaseINITCAP(value[, delimiters])
-> Returns the first character in each word as uppercase and the rest as lowercaseSELECT
UPPER('AbCd ef') AS upper,
LOWER('AbCd ef') AS lower,
INITCAP('AbCd ef') AS word_caps
Re-arranging values
To re-arrange the characters in a STRING
, the most common function is:
REVERSE(value)
-> Returns string in reverse orderSELECT
REVERSE('racecar') AS palindrome,
REVERSE('palindrome') AS not_palindrome
Analysing Strings
How to get information about each STRING
. These functions may return INT64
or BOOL
.
String length
To find the length of a STRING
, there are a surprising number of options in BigQuery:
BYTE_LENGTH(value)
-> Returns the length of the STRING
in BYTE
sCHAR_LENGTH(value)
or CHARACTER_LENGTH(value)
-> Returns the length of the STRING
in charactersLENGTH(value)
-> Returns number of charactersSELECT
BYTE_LENGTH('Hello! 👪') AS bytes,
CHAR_LENGTH('Hello! 👪') AS characters,
LENGTH('Hello! 👪') AS length
Comparing strings
When building filters or CASE
statements, it's common to check whether a STRING
is or is similar to another string or substring. To do that we can use one of the STRING
comparison functions:
STARTS_WITH(value1, value2)
-> Returns True/False if value1
starts with the substring value2
ENDS_WITH(value1, value2)
-> Returns True/False if value1
ends with with the substring value2
REGEXP_CONTAINS(value, regexp)
-> Returns True/False if value contains the pattern matched in the regexp
expressionYou can see BigQuery's regex library here.
in combination with one of the following comparison operators:
=
, !=
IN
, NOT IN
LIKE
, NOT LIKE
SELECT
STARTS_WITH('Hello, there', 'Hello') AS starts_with_hello,
'Bob' IN ('Mary', 'John', 'Barry') AS in_list,
('xyzFind Mejf3' LIKE '%Find Me%') AS like_string,
REGEXP_CONTAINS('xyzFind Mejf3', r'Find Me') AS regex_contains
💡 For LIKE, the %
is a wildcard character
Substrings
How to find, extract, and modify substrings, or parts of STRING
s. These functions may return INT64
, ARRAY
or STRING
.
Finding substrings
When dealing with substrings you often want to first locate a substring within a STRING
. To do that, we can use:
REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])
-> Returns the lowest 1-based index of a regular expression, regexp
, in source_value
STRPOS(value1, value2)
-> Returns the 1-based index of the first occurrence of value2
inside value1
. Returns 0
if value2
is not found.SELECT
REGEXP_INSTR('hello@cool_stuff.com', r'@[^.]+.') AS regex_position,
STRPOS('racecar', 'car') AS string_position
These functions tell you which index the substring first appears, useful for replacing or extracting substrings given this index.
Replacing substrings
Once you've found a substring, you likely want to either extract it or replace it. To replace a substring, you can use one of the following:
REPLACE(original_value, from_value, to_value)
->Replaces all occurrences of from_value
with to_value
in original_value
.REGEXP_REPLACE(value, regexp, replacement)
-> Returns a STRING
where all substrings of value
that match regular expression regexp
are replaced with replacement
.SELECT
REPLACE('My Name is ____', '____', 'Judge') AS replace,
REGEXP_REPLACE('My Name is ____', r'(_+)', 'Judge') AS regex_replace
Extracting substrings
Perhaps the most valuable manipulation to perform on substrings is to extract them to their own column or entity. To do that, there are several options:
LEFT(value, length)
-> Returns a STRING
value that consists of the specified number of leftmost characters or bytes from value
RIGHT(value, length)
-> Returns a STRING
value that consists of the specified number of rightmost characters or bytes from value
SPLIT(value[, delimiter])
-> Returns an ARRAY
of STRING
s split by the delimiter
(or space if omitted) SUBSTR(value, position[, length])
-> Returns a substring of value
from the position
index up to the length
of characters specified REGEXP_EXTRACT(value, regexp[, position[, occurrence]])
or REGEXP_SUBSTR(value, regexp[, position[, occurrence]])
-> Returns a substring of value
that matches the regexp
expression starting at position
index.REGEXP_EXTRACT_ALL(value, regexp)
-> Returns an ARRAY of substrings that match the regexp
expressionSELECT
RIGHT('@hello.com',9) domain,
SPLIT('This is a sentence',' ') words,
SUBSTR('_xyz_',2,3) substring,
REGEXP_EXTRACT_ALL('anything in CAPS is SPECIAL',r'([A-Z]+)') special_words
Combining substrings
To combine substrings together you can use:
CONCAT(value1[, ...])
-> Concatenates 2 or more STRING
s into a single result. SELECT
CONCAT('Hello', " ", "World")
How Do I...
Count the number of occurrences of a character in a string?
To do this we can make use of REGEXP_EXTRACT_ALL
and ARRAY_LENGTH
.
SELECT
ARRAY_LENGTH(REGEXP_EXTRACT_ALL("how many a's in this sentence?", r'(a)')) AS a_count
Extract numbers from string using regex?
To do this we can again use REGEXP_EXTRACT_ALL
We can choose to make our regex greedy, meaning once it found one number it will look for another one. This is the difference between:
r'([0-9]+)'
: which says find 1 or more digits together and r'([0-9])'
which says find any digits. Depending on what you want, both can be useful.
SELECT
REGEXP_EXTRACT_ALL('12x12=144', r'([0-9]+)') AS greedy_number_count,
REGEXP_EXTRACT_ALL('12x12=144', r'([0-9])') AS not_greedy_number_count