- Typical Handling of Repeated Records
- How BigQuery Handles Repeated Records
- Inherent Flattening
- Querying with
FLATTEN
- Using BigQuery’s Updated SQL
In addition to the standard relational database method of one-to-one relationships within a record and it’s fields, Google BigQuery also supports schemas with nested and repeated data. This allows BigQuery to store complex data structures and relationships between many types of Records
, but doing so all within one single table
.
In this tutorial we’ll briefly explore how nested and repeated Records
work in BigQuery, and how using functions such as FLATTEN
allow us to easily manage these types of Records
.
Typical Handling of Repeated Records
It has been common practice within most relational SQL-like databases to store associated data across multiple tables using ID
fields and keys to confer relationships between records.
For example, if we had a persons
table listing a number of people and we wanted to indicate parent/child relationships, we’d typically use a second table such as lineages
.
Our persons
table has a list of names
and the unique personId
value:
personId | name |
---|---|
1 | Bob |
2 | Jane |
3 | Jennifer |
Now to indicate that Bob
and Jane
are the parents of Jennifer
, we’d typically add some associative records in the lineages
table using the personId
values for each:
lineageId | parentId | childId |
---|---|---|
1 | 1 | 3 |
2 | 2 | 3 |
How BigQuery Handles Repeated Records
While BigQuery can (and often does) handle associative records in the same standard manner as seen above, it also allows records to be nested
and REPEATED
from the outset. This means that instead of creating two tables, persons
and lineages
, as seen above in order to associate parents and children, BigQuery can add children
Records
directly into the persons
table, and set the children
Record
to a REPEATED
type. This is, in fact, the example the official documentation uses with the personsDataSchema.json.
Consequently, every person
entry can have one or more children
Records
, all functionally contained within the same persons
table.
Inherent Flattening
The power of storing and managing nested and repeated Records
comes at the cost of requiring query outputs to be inherently FLATTENED
, which effectively duplicates the rows returned in a query to accomodate for every REPEATED
value.
An issue arises when BigQuery is asked to output unassociated REPEATED
fields within a query, producing an error.
For example, using the above persons.json
data imported into our own table, we can attempt to query everything in the table like so:
SELECT
*
FROM
[primary.persons]
Doing so returns Error: Cannot output multiple independently repeated fields at the same time. Found children_age and citiesLived_place.
While the error message implies the issue is with the sub-fields children.age
and citiesLived.place
, the actual issue is because of their associated parent Records
both being REPEATABLE
types. The error message simply picked the first sub-field it found in each Record
to report the error.
If we bypassed this issue by only SELECTING
one of the REPEATABLE
fields (children
in this case), the query functions fine:
SELECT
fullName,
age,
gender,
children.*
FROM
[primary.persons]
And returned results are automatically FLATTENED
, duplicating the primary persons.fullName
, .age
, and .gender
values as many times as necessary to list each REPEATED
children
Record
:
Row fullName age gender children_name children_gender children_age
1 John Doe 22 Male Jane Female 6
2 John Doe 22 Male John Male 15
3 Mike Jones 35 Male Earl Male 10
4 Mike Jones 35 Male Sam Male 6
5 Mike Jones 35 Male Kit Male 8
6 Anna Karenina 45 Female null null null
Querying with FLATTEN
In order to query multiple REPEATED
Records
as we intended to do originally, we’ll need to make use of the FLATTEN
function. This acts similarly to Entity SQL’s FLATTEN function by purposefully flattening the specified field
into the rest of the dataset.
For example, if we want to perform our original query to return all the data from our persons
table, we’ll need to FLATTEN
one of the REPEATED
records:
SELECT
*
FROM
FLATTEN([primary.persons], children)
Here we’re FLATTENING
the children
REPEATED
Record
into the rest of the table, so our results are duplicated as often as necessary to accomodate for every repetition of nested fields (children
and citiesLives
):
Row kind fullName age gender phoneNumber_areaCode phoneNumber_number children_name children_gender children_age citiesLived_place citiesLived_yearsLived
1 person John Doe 22 Male 206 1234567 Jane Female 6 Seattle 1995
2 person John Doe 22 Male 206 1234567 Jane Female 6 Stockholm 2005
3 person John Doe 22 Male 206 1234567 John Male 15 Seattle 1995
4 person John Doe 22 Male 206 1234567 John Male 15 Stockholm 2005
5 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1989
6 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1993
7 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 1998
8 person Mike Jones 35 Male 622 1567845 Earl Male 10 Los Angeles 2002
9 person Mike Jones 35 Male 622 1567845 Earl Male 10 Washington DC 1990
10 person Mike Jones 35 Male 622 1567845 Earl Male 10 Washington DC 1993
...
Using BigQuery’s Updated SQL
The good news is that if you are using BigQuery’s updated SQL syntax (and thus not Legacy SQL), you don’t need to bother with the FLATTEN
function at all: BigQuery returns results that retain their nested
and REPEATED
associations automatically.
For example, this query:
SELECT
*
FROM
`primary.persons`
Returns nested data like so:
kind | fullName | age | gender | phoneNumber.areaCode | phoneNumber.number | children.name | children.gender | children.age | citiesLived.place | citiesLives.yearsLives |
---|---|---|---|---|---|---|---|---|---|---|
person | Mike Jones | 35 | Male | 622 | 1567845 | Earl | Male | 10 | Los Angeles | 1989 |
Sam | Male | 6 | 1993 | |||||||
Kit | Male | 8 | 1998 | |||||||
2002 | ||||||||||
Washington DC | 1990 | |||||||||
1993 | ||||||||||
1998 |