# How COUNT(DISTINCT [field]) Works in Google BigQuery

###### Data Tutorial

- Statistical Approximations
- Increasing the
`DISTINCT`

Approximation Threshold - Using the
`EXACT_COUNT_DISTINCT`

Function

Typically in SQL database engines, the use of `COUNT(DISTINCT [field])`

within a query is used to count the exact number of `DISTINCT`

items within the specified field. In Google BigQuery, however, `COUNT(DISTINCT [field])`

functions slightly differently due to the massive quantities of data that are often involved when performing queries.

In this tutorial we’ll briefly explore how BigQuery handles `COUNT(DISTINCT [field])`

to support better performance and scalability, and how you can choose to circumvent those limitations, if necessary.

### Statistical Approximations

As stated directly in the official documentation, BigQuery’s implementation of `DISTINCT`

returns a value that is a “statistical approximation and is not guaranteed to be exact.” Obviously this is for performance reasons and to reduce the cost to the end-user. In most cases, this difference is largely irrelevant, since when you perform a Google search, it doesn’t matter to you whether it says “About 10,400,000 results” or it says “10,415,027 results” – you’re still clicking on the first handful of links and going about your business.

BigQuery uses approximation for all `DISTINCT`

quantities greater than the default `threshold`

value of `1000`

.

As an example, if we execute the following query, which aggregates the total number of `DISTINCT`

authors, publishers, and titles from all books in the `gdelt-bq:hathitrustbooks`

dataset between 1920 and 1929, we will not get exact results:

```
SELECT
COUNT(DISTINCT BookMeta_Author) AS authors,
COUNT(DISTINCT BookMeta_Publisher) AS publishers,
COUNT(DISTINCT BookMeta_Title) AS titles
FROM
TABLE_QUERY([gdelt-bq:hathitrustbooks],
'REGEXP_MATCH(table_id , r"^192[\d]")')
```

The `DISTINCT`

quantities it returned can be seen below:

```
[
{
"authors": "48911",
"publishers": "34517",
"titles": "77386"
}
]
```

### Increasing the `DISTINCT`

Approximation Threshold

As mentioned above, by default, the approximation threshold for `DISTINCT`

queries is set to `1000`

, but by including the second numeric parameter (`n`

) in the `COUNT(DISTINCT [field], n)`

function call, we can increase this threshold, forcing BigQuery to return an **exact count** for any number at or below that threshold.

For example, let’s change our query from above to use a threshold of `50,000`

:

```
SELECT
COUNT(DISTINCT BookMeta_Author, 50000) AS authors,
COUNT(DISTINCT BookMeta_Publisher, 50000) AS publishers,
COUNT(DISTINCT BookMeta_Title, 50000) AS titles
FROM
TABLE_QUERY([gdelt-bq:hathitrustbooks],
'REGEXP_MATCH(table_id , r"^192[\d]")')
```

Our expectation now is that the first two quantities, `authors`

and `publishers`

, will be exact counts since those quantities are below our threshold:

```
[
{
"authors": "48642",
"publishers": "35140",
"titles": "76938"
}
]
```

The results are all different from our default threshold example above, but we cannot yet determine if the threshold setting worked as intended. To verify that the threshold is working, we can perform one final test, by increasing the threshold yet again to exceed all three quantities, this time to `80,000`

:

```
SELECT
COUNT(DISTINCT BookMeta_Author, 80000) AS authors,
COUNT(DISTINCT BookMeta_Publisher, 80000) AS publishers,
COUNT(DISTINCT BookMeta_Title, 80000) AS titles
FROM
TABLE_QUERY([gdelt-bq:hathitrustbooks],
'REGEXP_MATCH(table_id , r"^192[\d]")')
```

Our expectation is that the first two values for `authors`

and `publishers`

should remain identical to the returned values from our `50,000`

threshold query, and sure enough they are the same.

We can therefore conclude that all three numbers are now exact counts of the `DISTINCT`

quantities for each field across all `1920s`

tables in the dataset.

```
[
{
"authors": "48642",
"publishers": "35140",
"titles": "77155"
}
]
```

### Using the `EXACT_COUNT_DISTINCT`

Function

Obviously, if there are situations where you don’t care about the processing time or performance of the query, and instead absolutely **must** ensure your `DISTINCT`

counter returns the exact quantities, you can make use of the `EXACT_COUNT_DISTINCT`

function.

Here we’re performing the same query as the above example, but using `EXACT_COUNT_DISTINCT`

so we don’t need to manually specify any approximation threshold value:

```
SELECT
EXACT_COUNT_DISTINCT(BookMeta_Author) AS authors,
EXACT_COUNT_DISTINCT(BookMeta_Publisher) AS publishers,
EXACT_COUNT_DISTINCT(BookMeta_Title) AS titles
FROM
TABLE_QUERY([gdelt-bq:hathitrustbooks],
'REGEXP_MATCH(table_id , r"^192[\d]")')
```

We would expect our results to match the query above where we specicied a threshold of `80,000`

, giving us the exact values, and sure enough the data is identical:

```
[
{
"authors": "48642",
"publishers": "35140",
"titles": "77155"
}
]
```