- Typical Handling of Repeated Records
- How BigQuery Handles Repeated Records
- Inherent Flattening
- Querying with
FLATTEN - Using BigQuery’s Updated SQL
In addition to the standard relational database method of one-to-one relationships within a record and it’s fields, Google BigQuery also supports schemas with nested and repeated data. This allows BigQuery to store complex data structures and relationships between many types of Records, but doing so all within one single table.
In this tutorial we’ll briefly explore how nested and repeated Records work in BigQuery, and how using functions such as FLATTEN allow us to easily manage these types of Records.
Typical Handling of Repeated Records
It has been common practice within most relational SQL-like databases to store associated data across multiple tables using ID fields and keys to confer relationships between records.
For example, if we had a persons table listing a number of people and we wanted to indicate parent/child relationships, we’d typically use a second table such as lineages.
Our persons table has a list of names and the unique personId value:
| personId | name |
|---|---|
| 1 | Bob |
| 2 | Jane |
| 3 | Jennifer |
Now to indicate that Bob and Jane are the parents of Jennifer, we’d typically add some associative records in the lineages table using the personId values for each:
| lineageId | parentId | childId |
|---|---|---|
| 1 | 1 | 3 |
| 2 | 2 | 3 |
How BigQuery Handles Repeated Records
While BigQuery can (and often does) handle associative records in the same standard manner as seen above, it also allows records to be nested and REPEATED from the outset. This means that instead of creating two tables, persons and lineages, as seen above in order to associate parents and children, BigQuery can add children Records directly into the persons table, and set the children Record to a REPEATED type. This is, in fact, the example the official documentation uses with the personsDataSchema.json.
Consequently, every person entry can have one or more children Records, all functionally contained within the same persons table.
Inherent Flattening
The power of storing and managing nested and repeated Records comes at the cost of requiring query outputs to be inherently FLATTENED, which effectively duplicates the rows returned in a query to accomodate for every REPEATED value.
An issue arises when BigQuery is asked to output unassociated REPEATED fields within a query, producing an error.
For example, using the above persons.json data imported into our own table, we can attempt to query everything in the table like so:
SELECT
*
FROM
[primary.persons]
Doing so returns Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place.
While the error message implies the issue is with the sub-fields children.age and citiesLived.place, the actual issue is because of their associated parent Records both being REPEATABLE types. The error message simply picked the first sub-field it found in each Record to report the error.
If we bypassed this issue by only SELECTING one of the REPEATABLE fields (children in this case), the query functions fine:
SELECT
fullName,
age,
gender,
children.*
FROM
[primary.persons]
And returned results are automatically FLATTENED, duplicating the primary persons.fullName, .age, and .gender values as many times as necessary to list each REPEATED children Record:
Row fullName age gender children_name children_gender children_age
1 John Doe 22 Male Jane Female 6
2 John Doe 22 Male John Male 15
3 Mike Jones 35 Male Earl Male 10
4 Mike Jones 35 Male Sam Male 6
5 Mike Jones 35 Male Kit Male 8
6 Anna Karenina 45 Female null null null
Querying with FLATTEN
In order to query multiple REPEATED Records as we intended to do originally, we’ll need to make use of the FLATTEN function. This acts similarly to Entity SQL’s FLATTEN function by purposefully flattening the specified field into the rest of the dataset.
For example, if we want to perform our original query to return all the data from our persons table, we’ll need to FLATTEN one of the REPEATED records:
SELECT
*
FROM
FLATTEN([primary.persons], children)
Here we’re FLATTENING the children REPEATED Record into the rest of the table, so our results are duplicated as often as necessary to accomodate for every repetition of nested fields (children and citiesLives):
Row kind fullName age gender phoneNumber_areaCode phoneNumber_number children_name children_gender children_age citiesLived_place citiesLived_yearsLived
1 person John Doe 22 Male 206 1234567 Jane Female 6 Seattle 1995
2 person John Doe 22 Male 206 1234567 Jane Female 6 Stockholm 2005
3 person John Doe 22 Male 206 1234567 John Male 15 Seattle 1995
4 person John Doe 22 Male 206 1234567 John Male 15 Stockholm 2005
5 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1989
6 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1993
7 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1998
8 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 2002
9 person Mike Jones 35 Male 622 1567845 Earl Male 10 Washington DC 1990
10 person Mike Jones 35 Male 622 1567845 Earl Male 10 Washington DC 1993
...
Using BigQuery’s Updated SQL
The good news is that if you are using BigQuery’s updated SQL syntax (and thus not Legacy SQL), you don’t need to bother with the FLATTEN function at all: BigQuery returns results that retain their nested and REPEATED associations automatically.
For example, this query:
SELECT
*
FROM
`primary.persons`
Returns nested data like so:
| kind | fullName | age | gender | phoneNumber.areaCode | phoneNumber.number | children.name | children.gender | children.age | citiesLived.place | citiesLives.yearsLives |
|---|---|---|---|---|---|---|---|---|---|---|
| person | Mike Jones | 35 | Male | 622 | 1567845 | Earl | Male | 10 | Los Angeles | 1989 |
| Sam | Male | 6 | 1993 | |||||||
| Kit | Male | 8 | 1998 | |||||||
| 2002 | ||||||||||
| Washington DC | 1990 | |||||||||
| 1993 | ||||||||||
| 1998 |