Extracting parts of a string

Data Pipeline

You might run into a scenario where you have a column that has pieces of information you’d like to extract separated by a character of some sort. In our data pipeline, you have the ability to extract the first instance to the left, or to the right of that character string acting as a separator.

Using the Chartio Demo datasource, we have a column that has state and city information, the state abbreviation and the city name separated by a dash:

Say we were interested in parsing out the city names into a separate column. In order to do so, we turn to the data pipeline to help transform the current output. First, we have to utilize an add column step, title it, choose custom formula, then use a formula containing the following functions:

substr(“State/City”, instr(“State/City”, ‘-‘)+1, length(“State/City”))

The substr() function extracts a substring from a string. There are three parameters for the function, substr(1, 2, 3). Position 1 is the column name from which a substring will be extracted. Position 2, is an integer parameter that is the starting position to extract the substring, this can be at the beginning of the string, the middle, or even towards the end of the string. To set the position, you just need to know the numerical position of where you’d like to start the extraction. Position 3, also an integer parameter, is the position of where our substring should be extracted up to in our string.

In our formula above, we see another function, instr(), in position 2 of the substr() function. The instr() function returns the position of a character in a string. There are two parameters for the instr() function, instr(A, B), where position A is the column containing the string which will be searched and position B is the character(s) you’re looking to search. The output is the first instance for the position of the character in the string coded, which is why the instr() function is in position 2 for the substr() function. In our example, the function instr(“State/City”, ‘-’)+1 is added at position 2 to search for the position of the dash character, ‘-’, within the strings in our State/City column. The +1 is to return the position right of the dash, in order to not include the dash in our new column.

Finally, for position 3, we have the function length(). This just calculates the length of a string. Since we want to get, for each row, the entire city name, and each city name length varies, we can account for this by using the length() function. This would bring back the final character position for the last character string for each varying city name.

Our output produces:

Extracting parts of a string

We can also extract the state abbreviation using the formula:
substr(“State/City”, 1, instr(“State/City”, ‘-‘)-1)

Now, we’re starting from the beginning of our string indicated by the number 1, the first position in the string, up to the position right before the dash. Again, we used the instr() function on our string column, but this time we use it in position 3, where the substr() function should end the substring extract right before the position of the dash. The final output produces the results below:

Final result

Related Data Pipeline Help Articles

See more