FLATTEN Data Using Google BigQuery's Legacy vs Standard SQL

Typical Handling of Repeated Records
How BigQuery Handles Repeated Records
Inherent Flattening
Querying with FLATTEN
Using BigQuery’s Updated SQL

In addition to the standard relational database method of one-to-one relationships within a record and it’s fields, Google BigQuery also supports schemas with nested and repeated data. This allows BigQuery to store complex data structures and relationships between many types of Records, but doing so all within one single table.

In this tutorial we’ll briefly explore how nested and repeated Records work in BigQuery, and how using functions such as FLATTEN allow us to easily manage these types of Records.

Typical Handling of Repeated Records

It has been common practice within most relational SQL-like databases to store associated data across multiple tables using ID fields and keys to confer relationships between records.

For example, if we had a persons table listing a number of people and we wanted to indicate parent/child relationships, we’d typically use a second table such as lineages.

Our persons table has a list of names and the unique personId value:

personId	name
1	Bob
2	Jane
3	Jennifer

Now to indicate that Bob and Jane are the parents of Jennifer, we’d typically add some associative records in the lineages table using the personId values for each:

lineageId	parentId	childId
1	1	3
2	2	3

How BigQuery Handles Repeated Records

While BigQuery can (and often does) handle associative records in the same standard manner as seen above, it also allows records to be nested and REPEATED from the outset. This means that instead of creating two tables, persons and lineages, as seen above in order to associate parents and children, BigQuery can add children Records directly into the persons table, and set the children Record to a REPEATED type. This is, in fact, the example the official documentation uses with the personsDataSchema.json.

Consequently, every person entry can have one or more children Records, all functionally contained within the same persons table.

Inherent Flattening

The power of storing and managing nested and repeated Records comes at the cost of requiring query outputs to be inherently FLATTENED, which effectively duplicates the rows returned in a query to accomodate for every REPEATED value.

An issue arises when BigQuery is asked to output unassociated REPEATED fields within a query, producing an error.

For example, using the above persons.json data imported into our own table, we can attempt to query everything in the table like so:

SELECT
  *
FROM
  [primary.persons]

Doing so returns Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place.

While the error message implies the issue is with the sub-fields children.age and citiesLived.place, the actual issue is because of their associated parent Records both being REPEATABLE types. The error message simply picked the first sub-field it found in each Record to report the error.

If we bypassed this issue by only SELECTING one of the REPEATABLE fields (children in this case), the query functions fine:

SELECT
  fullName,
  age,
  gender,
  children.*
FROM
  [primary.persons]

And returned results are automatically FLATTENED, duplicating the primary persons.fullName, .age, and .gender values as many times as necessary to list each REPEATED children Record:

Row fullName    age gender  children_name   children_gender children_age
 John Doe    22  Male    Jane    Female  6
 John Doe    22  Male    John    Male    15
 Mike Jones  35  Male    Earl    Male    10
 Mike Jones  35  Male    Sam Male    6
 Mike Jones  35  Male    Kit Male    8
 Anna Karenina   45  Female  null    null    null

Querying with `FLATTEN`

In order to query multiple REPEATED Records as we intended to do originally, we’ll need to make use of the FLATTEN function. This acts similarly to Entity SQL’s FLATTEN function by purposefully flattening the specified field into the rest of the dataset.

For example, if we want to perform our original query to return all the data from our persons table, we’ll need to FLATTEN one of the REPEATED records:

SELECT
  *
FROM
  FLATTEN([primary.persons], children)

Here we’re FLATTENING the children REPEATED Record into the rest of the table, so our results are duplicated as often as necessary to accomodate for every repetition of nested fields (children and citiesLives):

Row kind    fullName    age gender  phoneNumber_areaCode    phoneNumber_number  children_name   children_gender children_age    citiesLived_place   citiesLived_yearsLived
 person  John Doe    22  Male    206 1234567 Jane    Female  6   Seattle 1995
 person  John Doe    22  Male    206 1234567 Jane    Female  6   Stockholm   2005
 person  John Doe    22  Male    206 1234567 John    Male    15  Seattle 1995
 person  John Doe    22  Male    206 1234567 John    Male    15  Stockholm   2005
 person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1989
 person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1993
 person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1998
 person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 2002
 person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1990
person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1993
...

Using BigQuery’s Updated SQL

The good news is that if you are using BigQuery’s updated SQL syntax (and thus not Legacy SQL), you don’t need to bother with the FLATTEN function at all: BigQuery returns results that retain their nested and REPEATED associations automatically.

For example, this query:

SELECT
  *
FROM
  `primary.persons`

Returns nested data like so:

kind	fullName	age	gender	phoneNumber.areaCode	phoneNumber.number	children.name	children.gender	children.age	citiesLived.place	citiesLives.yearsLives
person	Mike Jones	35	Male	622	1567845	Earl	Male	10	Los Angeles	1989
						Sam	Male	6		1993
						Kit	Male	8		1998
										2002
									Washington DC	1990
										1993
										1998

How to FLATTEN Data Using Google BigQuery's Legacy vs Standard SQL

Typical Handling of Repeated Records

How BigQuery Handles Repeated Records

Inherent Flattening

Querying with `FLATTEN`

Using BigQuery’s Updated SQL

similar articles

Typical Handling of Repeated Records

How BigQuery Handles Repeated Records

Inherent Flattening

Querying with FLATTEN

Using BigQuery’s Updated SQL

similar articles

How to Use Partitioned Tables in Google BigQuery

BigQuery vs Athena

How to Use Google BigQuery's Wildcard Functions in Legacy SQL vs. Standard SQL

Querying with `FLATTEN`