How to Select the Right Data Types

Data Tutorial Amazon Redshift

As a typical company’s amount of data has grown exponentially it’s become even more critical to optimize data storage. The size of your data doesn’t just impact storage size and costs, it also affects query performance. A key factor in determining the size of your data is the data type you select. This tutorial will explain how to select the right data types.

What are Data Types?

Data types define what kind and range of data can be stored in a given field (or column).

Consider a sales table with these sample records:

figure 1

Each field has its own type and range of values:

purchase_time: date and time of the sale sale_id: integer values incrementing by one for every new sale customer_id: integer values incrementing by one for every new customer currency: text always in the 3-character currency code amount_paid: monetary real numeric values between $0.00 and $1,000.00 device: text, where the values can be: ‘desktop’, ‘mobile app’, and ‘mobile web’ has_discount: boolean where entries can be TRUE or FALSE notes: text, where the entry can be as long as what is allowed in our agent input tool (250 characters)

The kind of data (integers, text, real numbers, etc…) and the possible value ranges (0 to 1,000; any 3 characters; etc…) correspond to specific database data types.

What are the Possible Data Types?

Different databases have different data types available, but most fit into these categories:

Numeric:

  • integers: for numbers without fractions. Can be signed (allow positive and negative values) or unsigned (only allow positive numbers). Commonly used for ID fields and counts of something

  • decimals(x,y): for numbers with fractions requiring exact precision. Can be signed (allow positive and negative values) or unsigned (only allow positive numbers). Commonly used for monetary fields. The user specifies the number of significant digits allowed overall and after the decimal point in the parentheses

  • float / doubles: for numbers with fractions not requiring exact precision. Can be signed (allow positive and negative values) or unsigned (only allow positive numbers). Commonly used for all real numbers except monetary fields

Date/time:

date: for date values

  • time: for time values

  • timestamp / datetime: for date and time values

Text:

  • character(n):for fixed-length character strings, where the value in the parenthesis dictates the fixed size of each entry

  • varchar(n): for variable-length character strings, where the value in the parenthesis dictates the maximum accepted size of each entry

Boolean:

  • boolean: for boolean (true/false) values. Some databases (like MySQL) don’t have boolean data type and instead convert boolean values into integers (1=TRUE, 0 = FALSE)

Most databases offer size variations for each type. For example, MySQL provides these possible integer data types:

figure 2

How to Select the Right Data Types

The basic strategy for selecting the best data type is to select the smallest data type that matches the kind of data you have and that allows for all the feasible values of your data.

For example, customer_id in our sample sales table is a whole number starting with 0. Today our fictional company only has 15,000 customers. Using the MySQL integer data type table from the previous section, we may be tempted to select SMALLINT unsigned as the data type, since that’s the smallest data type that will accept our current integer values from 0 to 15,000. However, we expect to get to 100,000 customers over the next 6-12 months. When we go over 65,535, the SMALLINT will no longer be sufficient. Thus, a better selection is MEDIUMINT unsigned, which should cover us for the next several years.

You could let your database select data types for you when creating your table with sample records. However, this will rarely give you the best data type selection. For example, letting MySQL select the data types for a sales table with the sample values shown previously gives several issues.

figure 3

The issues we encounter with the data types MySQL automatically selected are:

  1. wrong data type: you won’t be able to use the field in the way that you would expect. For example, having purchase_time as a string and not a time data type means you won’t be able to perform time operations (such as calculating how long it’s been since the purchase) on the field

  2. too constrained: MySQL will give errors when you try to input values that are larger than what the data type allows. For example, we’ll get errors if we have a sale with an amount_paid of $100.00 or more or an agent_comment that is longer than 65 characters

  3. too conservative:while nothing will break from being too conservative with data types, you will be wasting storage space. With our example data, we could save 15% of storage by using the less conservative preferred options

As smart as modern databases are, the owners of the data still know best about what kind of data is stored and what are the possible values the data can take in the foreseeable future. So you need to carefully specify the data types for each of your fields.

Documentation on data types for different common databases are provided here:

Conclusion

For your database to function correctly and be as optimized as possible, it’s important to carefully select the data types for each field in your tables.