Extract parts of a string in Data Explorer
You might run into a scenario where you have a column that has pieces of information you’d like to extract separated by a character of some sort. In the Data Pipeline, you have the ability to extract the sub-string to the left or right of the delimiting character.
If your start and end positions are fixed, you could instead use the Extract Text Pre-built Formula.
In our example, we have a column called “State/City” that has state and city information, where the state abbreviation and the city name are separated by a hyphen:
Let’s say we want to parse the string to only have the city name in a separate column. To do this, we turn to the Data Pipeline to help transform the initial string. First, we need to use an Add Column step, choose Custom formula, then use the following formula:
substr("State/City", instr("State/City", '-')+1, length("State/City"))
Let’s try to breakdown what’s happening in this formula. First, we’ll look at the substr
function. This function extracts a substring from a string. There are three parameters for the function, substr(1, 2, 3)
.
- Parameter 1 is the column name containing the initial string.
- Parameter 2 is an integer representing the starting position of the substring to extract. This can be at the beginning of the string, the middle, or even towards the end of the string. To set the position, you just need to know the numerical position of where you’d like to start the extraction.
- Parameter 3 is also an integer, and it represents the position of where our extracted substring ends in the initial string.
In parameter 2 of our substr
function, we use the instr
function. instr()
returns the position of a character in a string. There are two parameters for the instr() function, instr(A, B)
.
- Parameter A is the column containing the string which will be searched.
- Parameter B is the character(s) you’re searching for. The output is the position of the first instance of the character in the string, which is why the
instr
function is in parameter 2 of thesubstr
function. In our example, the functioninstr("State/City", '-')+1
searches for the position of the hyphen character,'-'
, within the strings in our “State/City” column. The+1
is to return the position to the right of the hyphen, in order to not include the hyphen in our new column.
Finally, for parameter 3, we use the length
function, which just calculates the length of a string. Since we want to get the entire city name for each row, and each city name length varies, we can account for this by using the length
function. This returns the last character’s position in the string for each varying city name.
Running the full formula outputs the following:
We can also extract the state abbreviation using the following formula:
substr("State/City", 1, instr("State/City", '-')-1)
We’ll quickly go through this example.
- Parameter 1 remains the same as our first example, since we’re extracting a substring from the same initial string.
- Parameter 2 is simply a
1
to indicate our substring starts at the first position of our initial string. - We use the
instr
function again, but this time we use it in parameter 3. We want to get the position of the character just before the hyphen, so we search for the hyphen character, get its position from theinstr
function, then subtract1
from that position to get the end position of our substring.
Running the full formula outputs the following: