String Functions Explained
Introduction
STRINGs are a crucial part of any dataset and being able to confidently manipulate and transform them can make all the difference in your analysis. This notebook covers the common STRING manipulations in BigQuery.
The Basics
STRINGs are their own data type in Big Query. They are the most flexible type - often our dates if not formatted exactly right will be read in as a STRING, or our survey results will be listed as STRINGs of responses. Therefore it's crucial to know how to master them.
Manipulating Strings
How to modify and transform entire STRINGs. Each of these functions returns a STRING.
Removing and adding characters
There are several functions available in BigQuery to add and remove whitespace to your STRINGs. They either involve:
- trimming: removing whitespace
- padding: adding whitespace
For trimming, the functions are:
TRIM(value1[, value2])-> Removes all leading and trailing characters that matchvalue2(whitespace if not specified)LTRIM(value1[, value2])-> Removes only leading characters that matchvalue2(whitespace if not specified)RTRIM(value1[, value2])-> Removes only trailing characters that matchvalue2(whitespace if not specified)
| f0_ | trimmed | left_trim | right_trim |
|---|---|---|---|
| Original String__ | Original String__ | Original String__ | Original String |
💡 TRIM is very useful when dealing Substrings as shown in the next section.
For padding, the functions available are:
RPAD(original_value, return_length[, pattern])-> Returns theoriginal_valueappended with thepatternup to the return_length number of characters.LPAD(original_value, return_length[, pattern])-> Returns theoriginal_valueprepended with thepatternup to the return_length number of characters.
| right_spaces | left_pad_hypen |
|---|---|
| abc | --abc |
Changing case
A common way to re-format STRINGs is to change the case. To do that in BigQuery, you can use:
LOWER(value)-> Returnsvaluein lowercaseUPPER(value)-> Returnsvaluein uppercaseINITCAP(value[, delimiters])-> Returns the first character in each word as uppercase and the rest as lowercase
| upper | lower | word_caps |
|---|---|---|
| ABCD EF | abcd ef | Abcd Ef |
Re-arranging values
To re-arrange the characters in a STRING, the most common function is:
REVERSE(value)-> Returns string in reverse order
| palindrome | not_palindrome |
|---|---|
| racecar | emordnilap |
Analysing Strings
How to get information about each STRING. These functions may return INT64 or BOOL.
String length
To find the length of a STRING, there are a surprising number of options in BigQuery:
BYTE_LENGTH(value)-> Returns the length of theSTRINGinBYTEsCHAR_LENGTH(value)orCHARACTER_LENGTH(value)-> Returns the length of theSTRINGin charactersLENGTH(value)-> Returns number of characters
| bytes | characters | length |
|---|---|---|
| 11 | 8 | 8 |
Comparing strings
When building filters or CASE statements, it's common to check whether a STRING is or is similar to another string or substring. To do that we can use one of the STRING comparison functions:
STARTS_WITH(value1, value2)-> Returns True/False ifvalue1starts with the substringvalue2ENDS_WITH(value1, value2)-> Returns True/False ifvalue1ends with with the substringvalue2REGEXP_CONTAINS(value, regexp)-> Returns True/False if value contains the pattern matched in theregexpexpression
You can see BigQuery's regex library here.
in combination with one of the following comparison operators:
IN,NOT INLIKE,NOT LIKE
| starts_with_hello | in_list | like_string | regex_contains |
|---|---|---|---|
| true | false | true | true |
💡 For LIKE, the % is a wildcard character
Substrings
How to find, extract, and modify substrings, or parts of STRINGs. These functions may return INT64, ARRAY or STRING.
Finding substrings
When dealing with substrings you often want to first locate a substring within a STRING. To do that, we can use:
SELECT
CONCAT('Hello', " ", "World")REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])-> Returns the lowest 1-based index of a regular expression,regexp, insource_valueSTRPOS(value1, value2)-> Returns the 1-based index of the first occurrence ofvalue2insidevalue1. Returns0ifvalue2is not found.
| regex_position | string_position |
|---|---|
| 6 | 5 |
These functions tell you which index the substring first appears, useful for replacing or extracting substrings given this index.
Replacing substrings
Once you've found a substring, you likely want to either extract it or replace it. To replace a substring, you can use one of the following:
SELECT
ARRAY_LENGTH(REGEXP_EXTRACT_ALL("how many a's in this sentence?", r'(a)')) AS a_countREPLACE(original_value, from_value, to_value)->Replaces all occurrences offrom_valuewithto_valueinoriginal_value.REGEXP_REPLACE(value, regexp, replacement)-> Returns aSTRINGwhere all substrings ofvaluethat match regular expressionregexpare replaced withreplacement.
SELECT
REPLACE('My Name is ____', '____', 'Judge') AS replace,
REGEXP_REPLACE('My Name is ____', r'(_+)', 'Judge') AS regex_replace
| replace | regex_replace |
|---|---|
| My Name is Judge | My Name is Judge |
Extracting substrings
Perhaps the most valuable manipulation to perform on substrings is to extract them to their own column or entity. To do that, there are several options:
LEFT(value, length)-> Returns aSTRINGvalue that consists of the specified number of leftmost characters or bytes fromvalueRIGHT(value, length)-> Returns aSTRINGvalue that consists of the specified number of rightmost characters or bytes fromvalueSPLIT(value[, delimiter])-> Returns anARRAYofSTRINGs split by thedelimiter(or space if omitted)SUBSTR(value, position[, length])-> Returns a substring ofvaluefrom thepositionindex up to thelengthof characters specifiedREGEXP_EXTRACT(value, regexp[, position[, occurrence]])orREGEXP_SUBSTR(value, regexp[, position[, occurrence]])-> Returns a substring ofvaluethat matches theregexpexpression starting atpositionindex.REGEXP_EXTRACT_ALL(value, regexp)-> Returns an ARRAY of substrings that match theregexpexpression
| domain | words | substring | special_words |
|---|---|---|---|
| hello.com | This,is,a,sentence | xyz | CAPS,SPECIAL |
Combining substrings
To combine substrings together you can use:
CONCAT(value1[, ...])-> Concatenates 2 or moreSTRINGs into a single result.
| f0_ |
|---|
| Hello World |
How Do I...
Count the number of occurrences of a character in a string?
To do this we can make use of REGEXP_EXTRACT_ALL and ARRAY_LENGTH.
Extract numbers from string using regex?
To do this we can again use REGEXP_EXTRACT_ALL
We can choose to make our regex greedy, meaning once it found one number it will look for another one. This is the difference between:
r'([0-9]+)' : which says find 1 or more digits together and r'([0-9])' which says find any digits. Depending on what you want, both can be useful.
| greedy_number_count | not_greedy_number_count |
|---|---|
| 12,12,144 | 1,2,1,2,1,4,4 |