In this post, I will discuss Azure’s availability zones feature, sharing what they can offer you and some of the things to be aware of.
Uptime Versus SLA
Noobs to hosting and cloud focus on three magic letters: S, L, A – the service level agreement. This is a contractual promise that something will be running for a certain percentage of time in the billing period, or the hosting/cloud vendor will credit or compensate the customer.
You’ll hear phrases like “three nines”, or “four nines” to express the measure of uptime. The first is a 99.9% measure, and the second is a 99.99% measure. Either is quite a high level of uptime. Azure does have SLAs for all sorts of things. For example, a service deployed in a valid virtual machine availability set has a connectivity (uptime) SLA of 99.95%.
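To make those nines concrete, here is a quick sketch of the downtime each figure allows. This is illustrative only – the real calculation depends on the billing period defined in each SLA document – but the 30-day month approximation below is the one commonly quoted.

```python
# Rough downtime allowance implied by an SLA percentage.
# Illustration only; real SLA maths depends on the billing
# period defined in the relevant SLA document.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 – a common approximation


def allowed_downtime_minutes(sla_percent, minutes=MINUTES_PER_MONTH):
    """Minutes of downtime per period before the SLA is breached."""
    return minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} minutes/month")
```

So “three nines” still allows roughly 43 minutes of downtime per month, while “four nines” allows only about 4 minutes – a big difference when a mission-critical system is on the line.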
Why did I talk about noobs? Promises are easy to make. I once worked for a hosting company that offered a ridiculous 100% SLA for everything, including cheap-ass generic Pentium “servers” from eBay with single IDE disks. 100% is an unachievable target because … let’s be real here … things break. Even systems with redundant components have downtime. I prefer to see realistic SLAs and honest statements on what you must do to get that guarantee.
Azure gives us those sorts of SLAs. For virtual machines we have:
- 99.9% for single machines with just Premium SSD disks
- 99.95% for services running in a valid availability set
- 99.99% for services running in multiple availability zones
Ah… let’s talk about that last one!
First, we must discuss availability sets and what they are before we move one step higher. An availability set is Azure’s version of anti-affinity, a feature found in vSphere and in Hyper-V Failover Clustering (via PowerShell or SCVMM): a label on a virtual machine that instructs the compute cluster to spread virtual machines with that label across different parts of the cluster. In Azure, virtual machines in the same availability set are placed into different:
- Update domains: Avoid downtime caused by (rare) host reboots for updates.
- Fault domains: Enable services to remain operational despite hardware/software failure in a single rack.
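A toy model of that placement might look like the sketch below. The domain counts reflect the usual Azure defaults of 5 update domains and up to 3 fault domains (fewer in some regions), and the round-robin assignment is an assumption for illustration – not Azure’s actual placement algorithm.

```python
# Toy sketch of how an availability set spreads VMs across update
# and fault domains. Round-robin assignment is assumed for
# illustration; Azure's real placement logic is internal.
def place_vms(vm_names, update_domains=5, fault_domains=3):
    return {
        name: {
            "update_domain": i % update_domains,
            "fault_domain": i % fault_domains,
        }
        for i, name in enumerate(vm_names)
    }


placement = place_vms([f"web{i}" for i in range(6)])
# Only once the set holds more VMs than there are domains do two
# VMs end up sharing an update or fault domain.
```

With 3 VMs, each lands in a different fault domain, so a single rack failure or host reboot takes down at most one of them.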
The above solution spreads your machines around a single compute (Hyper-V) cluster, in a single room, in a single building. That’s amazing for on-premises, but there can still be an issue. Last summer, a faulty humidity sensor brought down one such room and affected a “small subset” of customers. “Small subset” is OK, unless you are included and some mission critical system was down for several hours. At that point, SLAs are meaningless – a refund for the lost runtime cost of a pair of Linux VMs running network appliance software won’t compensate for thousands or millions of Euros of lost business!
We can go one step further by instructing Azure to deploy virtual machines into different availability zones. A single region can be made up of different physical locations with independent power and networking. These locations might be close together, as is typically the case in North Europe or West Europe. Or they might be on the other side of a city from each other, as is the case in some North American regions. There is a low level of latency between the buildings, but it is still higher than that of a LAN connection.
A region that supports availability zones is split into 4 zones. You see three zones (assigned round robin between customers), labeled as 1, 2, and 3. You can deploy many services across availability zones, and the list is growing:
- VNet: A virtual network is software-defined, so it can cross all zones in a single region.
- Virtual machines: Can connect to the same subnet/address space but be in different zones. They are not in availability sets but Azure still maintains service uptime during host patching/reboots.
- Public IP Addresses: Standard IP supports anycast and can be used to NAT/load balance across zones in a single region.
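Note that the zone numbers above are logical, not physical: because of that round-robin assignment, “zone 1” in my subscription is not necessarily the same building as “zone 1” in yours. The sketch below is a hypothetical model of such a per-subscription mapping – the real assignment is internal to Azure – but it captures the idea that each subscription gets its own stable mapping of logical zones to physical locations.

```python
# Hypothetical sketch: map logical zones 1-3 to physical locations
# per subscription. Azure's real mapping is internal; this only
# illustrates that the mapping is stable per subscription but
# differs between subscriptions.
import hashlib

PHYSICAL_ZONES = ["phys-a", "phys-b", "phys-c", "phys-d"]


def zone_map(subscription_id):
    """Deterministically pick 3 of the physical zones for a subscription."""
    digest = int(hashlib.sha256(subscription_id.encode()).hexdigest(), 16)
    start = digest % len(PHYSICAL_ZONES)
    chosen = [PHYSICAL_ZONES[(start + i) % len(PHYSICAL_ZONES)]
              for i in range(3)]
    return dict(zip((1, 2, 3), chosen))
```

The practical consequence: you cannot assume that two different subscriptions deploying to “zone 1” are co-located, so cross-subscription designs should not rely on zone numbers matching.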
Other network resources can work with availability zones in one of two ways:
- Zonal: Instances are deployed to a specific zone, giving optimal latency performance within that zone, but can connect to all zones in the region.
- Zone Redundant: Instances are spread across the zones for an active/active configuration.
Examples of the above are:
- The zone-aware VNet gateways for VPN/ExpressRoute
- Standard load balancer
- WAGv2 / WAFv2
There are some things to consider when looking at availability zones.
- Regions: The list of regions that support availability zones is increasing slowly, but it is far from complete. Some regions will not offer this highest level of availability.
- Catchup: Not every service in Azure is aware of availability zones, but this is changing.
Let me give you two examples. The first is VM Boot Diagnostics, a service that I consider critical for seeing the console of the VM and getting serial console access without a network connection to the virtual machine. Boot Diagnostics uses an agent in the VM to write to a storage account. That storage account can be:
- LRS: 3 replicas reside in a single compute cluster, in a single room, in a single building (availability zone).
- GRS: LRS plus 3 asynchronous replicas in the paired region, that are not available for write unless Microsoft declares a total disaster for the primary region.
So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics, however, the agent will not support this type of storage configuration. Gotcha!
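The underlying issue is which storage redundancy options keep data writable when a single zone fails. Here is a simplified decision table based on the descriptions above – an illustration, not an exhaustive list of Azure storage SKUs:

```python
# Simplified sketch: does a storage redundancy option keep data
# writable if one availability zone goes down? Illustration only.
def survives_zone_outage(sku):
    sku = sku.upper()
    if sku in ("ZRS", "GZRS"):
        # Replicas are spread across zones in the region.
        return True
    if sku in ("LRS", "GRS"):
        # All writable replicas sit in one zone; GRS secondaries in the
        # paired region are not writable unless Microsoft declares a
        # total disaster for the primary region.
        return False
    raise ValueError(f"unknown SKU: {sku}")
```

This is why ZRS would be the logical choice for a zone-spanning deployment – and why the Boot Diagnostics agent’s lack of support for it is such a gotcha.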
Azure Advisor will also be a pain in the ass. Noobs are told to rely on Advisor (it features in several questions in the new Azure infrastructure exams) for configuration and deployment advice. Advisor will see the above two VMs as not being highly available because they are not (and cannot be) in a common availability set, so you are advised to degrade their SLA by migrating them to a single zone for an availability set configuration – ignore that advice and be prepared to defend the decision from Azure noobs, such as management, auditors, and ill-informed consultants.
Availability zones are important – I use them in an architecture pattern that I am working on with several customers. But you need to be aware of what they offer and how certain things do not understand them yet or do not support them yet.
9 thoughts on “Azure Availability Zones in the Real World”
I am considering moving VMs from an availability set to an availability zone. I can take care of the migration steps, but I have one question: if there are 3 VMs in one availability zone, how do I make sure that all three VMs do not go down due to power loss or an update? (I know it is similar to an availability set, but there I do not have the protection from the data center going down.)
There are two scenarios here:
a) You will only have 3 VMs in this service tier. Place all 3 VMs either in different zones (1, 2, 3) to protect against data center failure/update, or in a single availability set to protect against cluster failure/update.
b) You scale out so that one data center being offline doesn’t bring the service tier down, e.g. there are 3 VMs in each availability zone.
“So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics, however, the agent will not support this type of storage configuration. Gotcha!”
For this scenario, I do not see how a Boot Diagnostics storage account can cause a failure on the second VM running in a secondary zone. By failure, do you mean I won’t be able to run boot diagnostics, or that I won’t be able to RDP to the second VM and the applications within it would not work?
You will lose boot diagnostics and therefore Serial Console. And you better hope that the redundant VM doesn’t need to restart – unless MSFT has changed the dependency on the boot diagnostics storage account being available to allow a VM to boot up.
What is the behaviour of a VM if there is a failure of an availability zone?
Will it restart automatically in another zone?
No. You are expected to use highly available workloads spanning multiple VMs in this scenario. Or use Azure Site Recovery between zones. It’s a rare scenario, though, where a single building goes offline.
While creating a VM in the Azure portal, we see two options related to availability/redundancy.
The first option is ‘Availability Options’, like availability set or availability zone.
The second option is ‘Disk Type’, showing LRS/ZRS etc.
Can you please shed some light on whether these two options are related, or does only one take effect in the backend?
I am finding it really difficult to find any documentation that clarifies how these two options work together in the backend.
LRS and ZRS are documented by Microsoft under Storage Accounts. ZRS means the disks are spread across availability zones, so even if the VM is only in one zone (which it always is), the storage is resilient.