STRINGs are a crucial part of any dataset and being able to confidently manipulate and transform them can make all the difference in your analysis. This notebook covers the common STRING manipulations in BigQuery.
STRINGs are their own data type in Big Query. They are the most flexible type - often our dates if not formatted exactly right will be read in as a STRING, or our survey results will be listed as STRINGs of responses. Therefore it's crucial to know how to master them.
How to modify and transform entire STRINGs. Each of these functions returns a STRING.
There are several functions available in BigQuery to add and remove whitespace to your STRINGs. They either involve:
For trimming, the functions are:
SELECT
' Original String__',
TRIM(' Original String__') AS trimmed,
LTRIM(' Original String__') AS left_trim,
RTRIM(' Original String__', "__") AS right_trim
For padding, the functions available are:
SELECT RPAD('abc', 5) AS right_spaces, LPAD('abc', 5, '--') AS left_pad_hypen
A common way to re-format STRINGs is to change the case. To do that in BigQuery, you can use:
SELECT
UPPER('AbCd ef') AS upper,
LOWER('AbCd ef') AS lower,
INITCAP('AbCd ef') AS word_caps
To re-arrange the characters in a STRING, the most common function is:
SELECT REVERSE('racecar') AS palindrome, REVERSE('palindrome') AS not_palindrome
How to get information about each STRING. These functions may return INT64 or BOOL.
To find the length of a STRING, there are a surprising number of options in BigQuery:
SELECT BYTE_LENGTH('Hello! 👪') AS bytes, CHAR_LENGTH('Hello! 👪') AS characters, LENGTH('Hello! 👪') AS length
When building filters or CASE statements, it's common to check whether a STRING is or is similar to another string or substring. To do that we can use one of the STRING comparison functions:
in combination with one of the following comparison operators:
SELECT STARTS_WITH('Hello, there', 'Hello') AS starts_with_hello, 'Bob' IN ('Mary', 'John', 'Barry') AS in_list, ('xyzFind Mejf3' LIKE '%Find Me%') AS like_string, REGEXP_CONTAINS('xyzFind Mejf3', r'Find Me') AS regex_contains
How to find, extract, and modify substrings, or parts of STRINGs. These functions may return INT64, ARRAY or STRING.
When dealing with substrings you often want to first locate a substring within a STRING. To do that, we can use:
SELECT REGEXP_INSTR('hello@cool_stuff.com', r'@[^.]+.') AS regex_position, STRPOS('racecar', 'car') AS string_position
These functions tell you which index the substring first appears, useful for replacing or extracting substrings given this index.
Once you've found a substring, you likely want to either extract it or replace it. To replace a substring, you can use one of the following:
SELECT REPLACE('My Name is ____', '____', 'Judge') AS replace, REGEXP_REPLACE('My Name is ____', r'(_+)', 'Judge') AS regex_replace
Perhaps the most valuable manipulation to perform on substrings is to extract them to their own column or entity. To do that, there are several options:
SELECT RIGHT('@hello.com',9) domain, SPLIT('This is a sentence',' ') words, SUBSTR('_xyz_',2,3) substring, REGEXP_EXTRACT_ALL('anything in CAPS is SPECIAL',r'([A-Z]+)') special_words
To combine substrings together you can use:
SELECT CONCAT('Hello', " ", "World")
To do this we can make use of REGEXP_EXTRACT_ALL and ARRAY_LENGTH.
SELECT ARRAY_LENGTH(REGEXP_EXTRACT_ALL("how many a's in this sentence?", r'(a)')) AS a_count
To do this we can again use REGEXP_EXTRACT_ALL
We can choose to make our regex greedy, meaning once it found one number it will look for another one. This is the difference between:
r'([0-9]+)' : which says find 1 or more digits together and r'([0-9])' which says find any digits. Depending on what you want, both can be useful.
SELECT REGEXP_EXTRACT_ALL('12x12=144', r'([0-9]+)') AS greedy_number_count, REGEXP_EXTRACT_ALL('12x12=144', r'([0-9])') AS not_greedy_number_count