Close
Data TutorialsGoogle BigQuery

How to FLATTEN Data Using Google BigQuery's Legacy vs Standard SQL

Posted by AJ Welch

In addition to the standard relational database method of one-to-one relationships within a record and it’s fields, Google BigQuery also supports schemas with nested and repeated data. This allows BigQuery to store complex data structures and relationships between many types of Records, but doing so all within one single table.

In this tutorial we’ll briefly explore how nested and repeated Records work in BigQuery, and how using functions such as FLATTEN allow us to easily manage these types of Records.

Typical Handling of Repeated Records

It has been common practice within most relational SQL-like databases to store associated data across multiple tables using ID fields and keys to confer relationships between records.

For example, if we had a persons table listing a number of people and we wanted to indicate parent/child relationships, we’d typically use a second table such as lineages.

Our persons table has a list of names and the unique personId value:

personId name
1 Bob
2 Jane
3 Jennifer

Now to indicate that Bob and Jane are the parents of Jennifer, we’d typically add some associative records in the lineages table using the personId values for each:

lineageId parentId childId
1 1 3
2 2 3

How BigQuery Handles Repeated Records

While BigQuery can (and often does) handle associative records in the same standard manner as seen above, it also allows records to be nested and REPEATED from the outset. This means that instead of creating two tables, persons and lineages, as seen above in order to associate parents and children, BigQuery can add children Records directly into the persons table, and set the children Record to a REPEATED type. This is, in fact, the example the official documentation uses with the personsDataSchema.json.

Consequently, every person entry can have one or more children Records, all functionally contained within the same persons table.

Inherent Flattening

The power of storing and managing nested and repeated Records comes at the cost of requiring query outputs to be inherently FLATTENED, which effectively duplicates the rows returned in a query to accomodate for every REPEATED value.

An issue arises when BigQuery is asked to output unassociated REPEATED fields within a query, producing an error.

For example, using the above persons.json data imported into our own table, we can attempt to query everything in the table like so:

SELECT
  *
FROM
  [primary.persons]

Doing so returns Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place.

While the error message implies the issue is with the sub-fields children.age and citiesLived.place, the actual issue is because of their associated parent Records both being REPEATABLE types. The error message simply picked the first sub-field it found in each Record to report the error.

If we bypassed this issue by only SELECTING one of the REPEATABLE fields (children in this case), the query functions fine:

SELECT
  fullName,
  age,
  gender,
  children.*
FROM
  [primary.persons]

And returned results are automatically FLATTENED, duplicating the primary persons.fullName, .age, and .gender values as many times as necessary to list each REPEATED children Record:

Row fullName    age gender  children_name   children_gender children_age
1   John Doe    22  Male    Jane    Female  6
2   John Doe    22  Male    John    Male    15
3   Mike Jones  35  Male    Earl    Male    10
4   Mike Jones  35  Male    Sam Male    6
5   Mike Jones  35  Male    Kit Male    8
6   Anna Karenina   45  Female  null    null    null

Querying with FLATTEN

In order to query multiple REPEATED Records as we intended to do originally, we’ll need to make use of the FLATTEN function. This acts similarly to Entity SQL’s FLATTEN function by purposefully flattening the specified field into the rest of the dataset.

For example, if we want to perform our original query to return all the data from our persons table, we’ll need to FLATTEN one of the REPEATED records:

SELECT
  *
FROM
  FLATTEN([primary.persons], children)

Here we’re FLATTENING the children REPEATED Record into the rest of the table, so our results are duplicated as often as necessary to accomodate for every repetition of nested fields (children and citiesLives):

Row kind    fullName    age gender  phoneNumber_areaCode    phoneNumber_number  children_name   children_gender children_age    citiesLived_place   citiesLived_yearsLived
1   person  John Doe    22  Male    206 1234567 Jane    Female  6   Seattle 1995
2   person  John Doe    22  Male    206 1234567 Jane    Female  6   Stockholm   2005
3   person  John Doe    22  Male    206 1234567 John    Male    15  Seattle 1995
4   person  John Doe    22  Male    206 1234567 John    Male    15  Stockholm   2005
5   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1989
6   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1993
7   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 1998
8   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Los Angeles 2002
9   person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1990
10  person  Mike Jones  35  Male    622 1567845 Earl    Male    10  Washington DC   1993
...

Using BigQuery’s Updated SQL

The good news is that if you are using BigQuery’s updated SQL syntax (and thus not Legacy SQL), you don’t need to bother with the FLATTEN function at all: BigQuery returns results that retain their nested and REPEATED associations automatically.

For example, this query:

SELECT
  *
FROM
  `primary.persons`

Returns nested data like so:

kind fullName age gender phoneNumber.areaCode phoneNumber.number children.name children.gender children.age citiesLived.place citiesLives.yearsLives
person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1989
            Sam Male 6   1993
            Kit Male 8   1998
                    2002
                  Washington DC 1990
                    1993
                    1998