Understanding Amazon Aurora's Multi-AZ Deployment

To understand what a Multi-AZ Deployment means for your infrastructure, it helps to first see how Amazon Web Services is organized across the globe, and how that layout lets it provide redundancy wherever you are located.

As discussed in the official documentation, the AWS Cloud is made up of a number of Regions, which are physical locations around the world, such as Oregon, United States; Northern Virginia, United States; Ireland; and Tokyo.

Within each Region are a number of physically separate facilities, known as Availability Zones. Each Availability Zone is self-contained, with its own power, connectivity, and networking capabilities. Each Region contains multiple Availability Zones (typically three or more), which provides redundancy within that Region.

While Amazon is always expanding its Region and Availability Zone coverage, you can view a map of the AWS Cloud infrastructure in the image below:

[Image: Amazon Regions. Image courtesy of Amazon Web Services]

All Availability Zones within a single Region are connected to one another through private fiber-optic networking, allowing them to communicate and transfer data quickly and efficiently as required.

Identifying an Availability Zone Code

When creating a new instance through the AWS dashboard, you may be presented with the option to select a specific Availability Zone. In many cases, however, you simply select a Region and the system chooses the Availability Zone for you.

Each Region is labeled with a short string that represents the country or area and, if necessary, the sub-region. For example, us-west-2 is the designation for the Oregon, United States Region, while us-west-1 is for Northern California, United States.

An Availability Zone code is the Region code followed by a single letter, such as us-west-1b or us-west-2a.
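
As a small illustration of the naming scheme described above, the following Python sketch splits an Availability Zone code into its Region code and zone letter. The function name split_az_code and the regular expression are assumptions made for this example, based only on the codes shown in this article; they are not part of any AWS library, and other zone naming formats are not handled.

```python
import re

# Assumed pattern, inferred from codes such as us-west-1b:
# a Region code (for example us-west-1) followed by one lowercase letter.
AZ_PATTERN = re.compile(r"^(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)(?P<zone>[a-z])$")


def split_az_code(az_code):
    """Return (region_code, zone_letter) for a code like 'us-west-2a'."""
    match = AZ_PATTERN.match(az_code)
    if match is None:
        raise ValueError(f"not an Availability Zone code: {az_code!r}")
    return match.group("region"), match.group("zone")


print(split_az_code("us-west-2a"))  # ('us-west-2', 'a')
```

Passing a bare Region code such as us-west-2 raises a ValueError, because it has no trailing zone letter.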

Storage Layers vs Server Instances

To grasp what Multi-AZ Deployments entail, it is also important to understand the difference between the storage layer and the server instance.

The server instance for your database is best thought of as the physical machine that controls the structure of your database and handles all access to the data held in the storage layer.

The storage layer is an SSD-backed, virtualized representation of all the actual data within your database. The keyword to focus on here is virtualized: the storage layer is not attached to any one physical location or machine, but is instead replicated to numerous locations (in most cases, six copies in total across three Availability Zones).

What Does Multi-AZ Deployment Provide?

With Amazon Aurora, it is standard practice for the storage layer (where all the data resides) to be stored redundantly across multiple Availability Zones within the given Region at no extra cost. In the event that one Availability Zone goes offline for some reason (as unlikely as that might be), the system can automatically continue serving your database from an identical copy of the storage layer in one of the other connected Availability Zones.
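
The arithmetic behind this storage redundancy can be sketched in a few lines of Python. This is only an illustrative model, not Aurora's actual implementation: the function names are hypothetical, the zone codes are examples, and the even split of two copies per Availability Zone is an assumption derived from the "six copies across three Availability Zones" figure mentioned earlier.

```python
def place_copies(zones, copies_per_zone=2):
    """Return one list entry per storage copy, naming the zone that holds it.

    Assumption for this sketch: copies are spread evenly, two per zone.
    """
    return [zone for zone in zones for _ in range(copies_per_zone)]


def surviving_copies(placement, offline_zone):
    """Return the copies that remain reachable when one zone goes offline."""
    return [zone for zone in placement if zone != offline_zone]


placement = place_copies(["us-west-2a", "us-west-2b", "us-west-2c"])
remaining = surviving_copies(placement, "us-west-2a")
print(len(placement), len(remaining))  # 6 4
```

Under these assumptions, losing an entire Availability Zone still leaves four of the six copies available in the other two zones.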

However, unless otherwise specified, this redundancy applies only to the storage layer, not to the physical machine of your actual server instance. If something were to cause the Availability Zone where your server instance resides to shut down, your database would cease to function, because the physical server instance would be offline.

This is where Multi-AZ Deployment comes in for services like Amazon Aurora. Just like the automatic redundancy of the data in your storage layer, a Multi-AZ Deployment means that your server instance is also redundantly copied across multiple Availability Zones. As a result, if the Availability Zone where the physical server instance resides goes offline, an Amazon Aurora Multi-AZ Deployment automatically fails over to an up-to-date standby replica in another connected Availability Zone.
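
The failover behavior described above can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not how Aurora is implemented: the Instance class, the fail_over function, the instance names db-1 and db-2, and the role labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    role: str  # "primary", "standby", or "offline" (labels assumed for this sketch)


def fail_over(instances, offline_zone):
    """Return the instance acting as primary after offline_zone is lost."""
    primary = next(i for i in instances if i.role == "primary")
    if primary.zone != offline_zone:
        return primary  # the primary is unaffected, so nothing changes
    for candidate in instances:
        if candidate.role == "standby" and candidate.zone != offline_zone:
            # Promote the first standby that lives in a healthy zone.
            candidate.role = "primary"
            primary.role = "offline"
            return candidate
    raise RuntimeError("no standby available outside the failed zone")


cluster = [
    Instance("db-1", "us-west-2a", "primary"),
    Instance("db-2", "us-west-2b", "standby"),
]
print(fail_over(cluster, "us-west-2a").name)  # db-2
```

With only a single instance and no standby in another zone, the same call raises RuntimeError, which mirrors the single-AZ case where the database simply stops functioning.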

As discussed in the official documentation, in order to maximize your system’s uptime, the failover procedure (which typically takes only 1-2 minutes) is performed automatically in the case of any of the following events:

  • Loss of availability in primary Availability Zone
  • Loss of network connectivity to primary
  • Compute unit failure on primary
  • Storage failure on primary
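
The four events listed above can be expressed as a simple decision rule. The Python sketch below is purely illustrative: the health-check keys (zone_available, network_reachable, compute_healthy, storage_healthy) and the function names are hypothetical and do not correspond to any AWS API. AWS performs this detection internally.

```python
from enum import Enum


class FailoverTrigger(Enum):
    AZ_UNAVAILABLE = "Loss of availability in primary Availability Zone"
    NETWORK_LOSS = "Loss of network connectivity to primary"
    COMPUTE_FAILURE = "Compute unit failure on primary"
    STORAGE_FAILURE = "Storage failure on primary"


# Hypothetical health-check keys mapped to the event each one signals when False.
CHECKS = {
    "zone_available": FailoverTrigger.AZ_UNAVAILABLE,
    "network_reachable": FailoverTrigger.NETWORK_LOSS,
    "compute_healthy": FailoverTrigger.COMPUTE_FAILURE,
    "storage_healthy": FailoverTrigger.STORAGE_FAILURE,
}


def triggers_from_health(health):
    """Return the failover events implied by a dict of boolean checks.

    A missing key is treated as healthy.
    """
    return [trigger for key, trigger in CHECKS.items() if not health.get(key, True)]


def should_fail_over(health):
    """Any single event from the list is enough to start a failover."""
    return bool(triggers_from_health(health))


print(should_fail_over({"network_reachable": False}))  # True
```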