How to Use Partitioned Tables in Google BigQuery

Data Tutorial

Google BigQuery is designed to house some truly monstrous datasets, sometimes hosting tables billions of rows. It is therefore no surprise that Google has implemented the handy capability of partitioned tables, which allow otherwise daunting datasets to be broken up into smaller, more manageable chunks without losing performance or scalability.

What is a Partitioned Table?

In the past, it was common for database administrators using BigQuery to split large datasets into smaller tables that were divided by date and/or time. For example, each day a table may be generated where the table name is suffixed with the current date (e.g. books_20160724, books_20160725, etc). While this method is functional, it creates a lot of headache to not only manage all these split tables, but queries rapidly become quite complex as the potential date ranges increase and crossover.

To resolve this, Google BigQuery introduced the partitioned table, which is essentially just a normal table, except that it is automatically partitioned for each date.

The _PARTITIONTIME Psuedo Column

Functionally, a partitioned table acts just like any other table, except it contains a special _PARTITIONTIME psuedo column which contains the date-based timestamp representing when a particular record was loaded. This field uses UTC time and is the number of microseconds since the unix epoch.

Creating a Partitioned Table

To create a partitioned table, you must issue a call to the Tables.insert API method. This can be done either through the API or through the command-line tool.

The critical point here is that you do not need to specify a schema for the partitioned table at this time, but instead are effectively telling BigQuery to create a special partitioned table, and then await the insertion of data at a later time which will give it a proper schema.

Here is the API configuration JSON to create our table (named partition) via the API:

{
  "tableReference": {
    "projectId": "bookstore-1382",
    "datasetId": "exports",
    "tableId": "partition"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}

And here’s the same creation using the command-line tool:

bq mk --time_partitioning_type=DAY exports.partition

Loading Data in a Partitioned Table

Loading data into the partitioned table is no different than loading data into any other table in BigQuery. For the purposes of this example, we’re just using the WebUI and grabbing some data from the [bigquery-public-data:samples.github_timeline] dataset and setting our Destination Table to the previously created bookstore-1382:exports.partition table.

Below is our query:

SELECT
  repository_name,
  repository_language,
  repository_size,
  repository_watchers,
  created_at
FROM
  [bigquery-public-data:samples.github_timeline]
WHERE
  created_at IS NOT NULL
ORDER BY
  created_at ASC
LIMIT
  1000

And the output:

[
  {
    "repository_name": "node-orm",
    "repository_language": "JavaScript",
    "repository_size": "208",
    "repository_watchers": "101",
    "created_at": "2012-03-11 06:36:13"
  },
  {
    "repository_name": "nimbus",
    "repository_language": "Objective-C",
    "repository_size": "18304",
    "repository_watchers": "1259",
    "created_at": "2012-03-11 06:36:13"
  },
  {
    "repository_name": "impress.js",
    "repository_language": "JavaScript",
    "repository_size": "388",
    "repository_watchers": "7149",
    "created_at": "2012-03-11 06:36:13"
  },
  ...
]

Querying a Partitioned Table

Now that we have loaded some data into our partitioned table, we can make use of the “automagical” nature when querying our table.

Performing a basic query with no regard for the _PARTITIONTIME is standard fare, but often we’ll want to retrieve just the data that we loaded into our table for a particular date or daterange. To do this, we simply add a WHERE clause and compare the _PARTITIONTIME value to the daterange in question.

For example, here we’re retrieving only records that were loaded today (which, therefore, exist in the partition for today’s date):

SELECT
  repository_name,
  repository_language,
  repository_size,
  repository_watchers,
  created_at
FROM
  exports.partition
WHERE
    _PARTITIONTIME == TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));

The key is the WHERE clause comparing TIMESTAMP values to the _PARTITIONTIME field.

Here we’re getting data from the partition exactly 5 days ago:

SELECT
  repository_name,
  repository_language,
  repository_size,
  repository_watchers,
  created_at
FROM
  exports.partition
WHERE
    _PARTITIONTIME == TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 5 * 60 * 60 * 24 * 1000000));

And here we’re getting data from all partitions within the last 3 days:

SELECT
  repository_name,
  repository_language,
  repository_size,
  repository_watchers,
  created_at,
  _PARTITIONTIME pt
FROM
  exports.partition
WHERE
    _PARTITIONTIME
  BETWEEN
    TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 3 * 60 * 60 * 24 * 1000000))
  AND
    TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));

The options are limitless for querying the partitioned data. Note that if we want to actually SELECT the _PARTITIONTIME value (as seen in the above example), since it’s a psuedo column we must assign it to an alias name before output (e.g. SELECT _PARTITIONTIME pt).