How to Implement SQL's LIKE Operator in Google BigQuery

Data Tutorial

Most common SQL database engines implement the LIKE operator – or something functionally similar – to allow queries the flexibility of finding string pattern matches between one column and another column (or between a column and a specific text string). Luckily, Google BigQuery is no exception and includes support for the common LIKE operator. For situations that require something with more flexibility and power, BigQuery also allows for the use of regular xxpressions using the RE2 engine by Google.

Below we’ll explore using both methods for narrowing your queries down to the results you’re after.

Using the LIKE Operator

For anyone familiar with other common SQL engines, using the LIKE operator in BigQuery should be a breeze. Simply add the LIKE operator and a comparitor to your WHERE clause and that’s all there is to it:

SELECT
  repository_name,
  repository_size,
  repository_forks,
  created_at
FROM
  [bigquery-public-data:samples.github_timeline]
WHERE
  repository_name LIKE 'node'
LIMIT
  1000

The results, as expected, are the first 1000 records where repository_name is equal to (i.e. LIKE) ‘node’:

[
  {
    "repository_name": "node",
    "repository_size": "23500",
    "repository_forks": "1568",
    "created_at": "2012-03-12 21:37:46"
  },
  {
    "repository_name": "node",
    "repository_size": "23500",
    "repository_forks": "1568",
    "created_at": "2012-03-12 21:45:17"
  },
  {
    "repository_name": "node",
    "repository_size": "23500",
    "repository_forks": "1568",
    "created_at": "2012-03-12 21:49:11"
  },
  ...
]

As the official documentation indicates, we can also add special characters into our pattern, such as the % sign, which will match any number of characters. Typically, a word or phrase is surrounded by two % signs to retrieve records where that field contains a specific string:

SELECT
  repository_name,
  repository_size,
  repository_forks,
  created_at
FROM
  [bigquery-public-data:samples.github_timeline]
WHERE
  repository_name LIKE '%node%'
LIMIT
  1000

Now we get results where the repository_name simply has the letters 'node' somewhere in there:

[
  {
    "repository_name": "node-geos",
    "repository_size": "120",
    "repository_forks": "4",
    "created_at": "2012-03-12 21:29:10"
  },
  {
    "repository_name": "nyc-nodejs-packages",
    "repository_size": "928",
    "repository_forks": "1",
    "created_at": "2012-03-12 21:30:56"
  },
  {
    "repository_name": "node-async-testing",
    "repository_size": "420",
    "repository_forks": "8",
    "created_at": "2012-03-12 21:32:11"
  },
  ...
]

The LIKE operator works for both Legacy SQL syntax (which is the default for BigQuery), as well as the updated SQL syntax (if enabled). Here’s the same example as above using the updated syntax:

SELECT
  repository_name,
  repository_size,
  repository_forks,
  created_at
FROM
  `bigquery-public-data.samples.github_timeline`
WHERE
  repository_name LIKE '%node%'
LIMIT
  1000

Using the CONTAINS Function

If you find yourself using LIKE along with surrounding % symbols as above, you’re better off using the CONTAINS string function, which behaves identically but typically has better readability.

Consequently, this WHERE clause:

WHERE
  repository_name LIKE '%node%'

is functionallity identical to this WHERE clause:

WHERE
  repository_name CONTAINS 'node'

Using Regular Expressions

For situations where you need to be very precise with your search pattern in your WHERE clause, the best option is to use the REGEX_MATCH function.

By taking full advantage of the power of the RE2 regular expression engine and syntax, we can ensure the WHERE clause we’re creating returns exactly the record matches we want.

To implement REGEX_MATCH, just pass the column_name as the first argument and the regex pattern as the second argument:

SELECT
  repository_name,
  repository_size,
  repository_forks,
  created_at
FROM
  [bigquery-public-data:samples.github_timeline]
WHERE
  REGEXP_MATCH (repository_name, r'^node-[[:alnum:]]+js$')
LIMIT
  1000

Here we only want matches where repository_name begins with the word node followed by a hyphen, then contains at least one or more alphanumeric characters, followed by js as the final two characters. The results are as expected:

[
  {
    "repository_name": "node-jinjs",
    "repository_size": "104",
    "repository_forks": "5",
    "created_at": "2012-03-13 02:48:09"
  },
  {
    "repository_name": "node-ldapjs",
    "repository_size": "172",
    "repository_forks": "17",
    "created_at": "2012-04-23 16:29:19"
  },
  {
    "repository_name": "node-xml2js",
    "repository_size": "204",
    "repository_forks": "39",
    "created_at": "2012-04-23 14:37:43"
  },
  ,,,
]