Azure Availability Zones in the Real World

In this post, I will discuss Azure’s availability zones feature, sharing what it can offer you and some of the things to be aware of.

Uptime Versus SLA

Noobs to hosting and cloud focus on three magic letters: S, L, A, or service level agreement. This is a contractual promise that something will be running for a certain percentage of the billing period, or the hosting/cloud vendor will credit or compensate the customer.

You’ll hear phrases like “three nines” or “four nines” to express the measure of uptime. The first is a 99.9% measure, and the second is a 99.99% measure. Either is quite a high level of uptime. Azure has SLAs for all sorts of things. For example, a service deployed in a valid virtual machine availability set has a connectivity (uptime) SLA of 99.95%.
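
To put those nines in context, here is the arithmetic on how much downtime each level actually permits, as a quick PowerShell sketch (using the common ~730-hour month approximation):

```powershell
# Downtime budget permitted by common SLA levels, per ~730-hour month
$hoursPerMonth = 730
foreach ($sla in 0.999, 0.9995, 0.9999) {
    $minutes = (1 - $sla) * $hoursPerMonth * 60
    '{0:P2} uptime allows ~{1:N0} minutes of downtime per month' -f $sla, $minutes
}
# 99.90% ~44 min, 99.95% ~22 min, 99.99% ~4 min
```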

Why did I talk about noobs? Promises are easy to make. I once worked for a hosting company that offered a ridiculous 100% SLA for everything, including cheap-ass generic Pentium “servers” from eBay with single IDE disks. 100% is an unachievable target because … let’s be real here … things break. Even systems with redundant components have downtime. I prefer to see realistic SLAs and honest statements on what you must do to get that guarantee.

Azure gives us those sorts of SLAs. For virtual machines we have:

  • 99.9% for machines with just Premium SSD disks
  • 99.95% for services running in a valid availability set
  • 99.99% for services running in multiple availability zones

Ah… let’s talk about that last one!

Availability Sets

First, we must discuss availability sets and what they are before we move one step higher. An availability set is Azure’s version of anti-affinity, a feature found in vSphere and in Hyper-V Failover Clustering (via PowerShell or SCVMM); it is a label on a virtual machine that instructs the compute cluster to spread the labeled machines across different parts of the cluster. In Azure, virtual machines in the same availability set are placed into different (see the sketch after this list):

  • Update domains: Avoid downtime caused by (rare) host reboots for updates.
  • Fault domains: Keep services operational despite a hardware/software failure in a single rack.
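
Here is that sketch, using the Az PowerShell module. The resource names, location, and VM size below are hypothetical placeholders, not anything Azure prescribes:

```powershell
# Minimal sketch (Az module; hypothetical names): create an availability set
# and bind a VM configuration to it. The 'Aligned' SKU is required for VMs
# that use managed disks.
$avSet = New-AzAvailabilitySet -ResourceGroupName 'myRG' -Name 'myAvSet' `
    -Location 'northeurope' -Sku 'Aligned' `
    -PlatformUpdateDomainCount 5 -PlatformFaultDomainCount 2

# Every VM built from a config like this lands in a different update/fault
# domain than its peers in the same availability set.
$vmConfig = New-AzVMConfig -VMName 'myVM01' -VMSize 'Standard_D2s_v3' `
    -AvailabilitySetId $avSet.Id
```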

The above solution spreads your machines around a single compute (Hyper-V) cluster, in a single room, in a single building. That’s amazing by on-premises standards, but there can still be an issue. Last summer, a faulty humidity sensor brought down one such room and affected a “small subset” of customers. “Small subset” is OK, unless you are included and some mission-critical system was down for several hours. At that point, SLAs are meaningless – a refund of the lost runtime cost of a pair of Linux VMs running network appliance software won’t compensate for thousands or millions of Euros of lost business!

Availability Zones

We can go one step further by instructing Azure to deploy virtual machines into different availability zones. A single region can be made up of different physical locations with independent power and networking. These locations might be close together, as is typically the case in North Europe or West Europe, or they might be on opposite sides of a city, as is the case with some North American regions. There is a low level of latency between the buildings, but it is still higher than that of a LAN connection.

A region that supports availability zones is split into 4 zones. You see three zones (assigned round-robin between customers), labeled as 1, 2, and 3. You can deploy many services across availability zones – and the list is growing (a deployment sketch follows this list):

  • VNet: A virtual network is software-defined, so it can span all zones in a single region.
  • Virtual machines: Can connect to the same subnet/address space but be placed in different zones. They are not in availability sets, but Azure still maintains service uptime during host patching/reboots.
  • Public IP Addresses: The Standard tier supports anycast and can be used to NAT/load balance across zones in a single region.
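
Here is the promised deployment sketch, using the simplified New-AzVM cmdlet. Everything below (names, image alias, size, region) is a hypothetical example, assuming a region that offers zones:

```powershell
# Sketch (Az module; hypothetical names): two VMs on the same subnet,
# pinned to different availability zones of the same region.
$cred = Get-Credential
1..2 | ForEach-Object {
    New-AzVM -ResourceGroupName 'myRG' -Location 'northeurope' `
        -Name "myVM0$_" -VirtualNetworkName 'myVNet' -SubnetName 'mySubnet' `
        -Image 'Win2016Datacenter' -Size 'Standard_D2s_v3' `
        -Zone $_ -Credential $cred
}
```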

Other network resources can work with availability zones in one of two ways (a sketch follows the examples below):

  • Zonal: Instances are deployed to a specific zone, giving optimal latency within that zone, but they can connect to all zones in the region.
  • Zone Redundant: Instances are spread across the zones for an active/active configuration.

Examples of the above are:

  • The zone-aware VNet gateways for VPN/ExpressRoute
  • Standard load balancer
  • WAGv2 / WAFv2 (Web Application Gateway / Web Application Firewall v2)
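
And here is the zonal versus zone-redundant sketch, using a Standard-tier public IP address as the example (Az module; hypothetical names). The only difference between the two deployments is the -Zone parameter:

```powershell
# Zone-redundant: with the Standard SKU, omitting -Zone gives an IP that is
# served from multiple zones (check the current default behaviour in the docs)
$zrIp = New-AzPublicIpAddress -ResourceGroupName 'myRG' -Name 'zrPip' `
    -Location 'northeurope' -Sku 'Standard' -AllocationMethod 'Static'

# Zonal: the same resource pinned to zone 1 only
$zonalIp = New-AzPublicIpAddress -ResourceGroupName 'myRG' -Name 'zonalPip' `
    -Location 'northeurope' -Sku 'Standard' -AllocationMethod 'Static' -Zone 1
```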

Considerations

There are some things to consider when looking at availability zones.

  • Regions: The list of regions that support availability zones is growing slowly, but it is far from complete. Some regions will not offer this highest level of availability.
  • Catchup: Not every service in Azure is aware of availability zones, but this is changing.

Let me give you two examples. The first is VM Boot Diagnostics, a service that I consider critical for seeing the console of the VM and getting serial console access without a network connection to the virtual machine. Boot Diagnostics uses an agent in the VM to write to a storage account. That storage account can be:

  • LRS: Three replicas reside in a single storage cluster, in a single room, in a single building (availability zone).
  • GRS: LRS plus three asynchronous replicas in the paired region, which are not available for writes unless Microsoft declares a total disaster for the primary region.

So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics; however, the agent does not support this type of storage configuration. Gotcha!
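
If you want to check your own VMs, a sketch like this (Az module; hypothetical names) shows the Boot Diagnostics storage URI and the replication SKU of the account behind it:

```powershell
# Which storage account does this VM's Boot Diagnostics agent write to?
$vm = Get-AzVM -ResourceGroupName 'myRG' -Name 'myVM01'
$vm.DiagnosticsProfile.BootDiagnostics.StorageUri

# The account's SKU reveals the replication type (e.g. Standard_LRS)
Get-AzStorageAccount -ResourceGroupName 'myRG' -Name 'mydiagstore' |
    Select-Object StorageAccountName, @{ n = 'Sku'; e = { $_.Sku.Name } }
```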

Azure Advisor will also be a pain in the ass. Noobs are told to rely on Advisor (it features in several questions in the new Azure infrastructure exams) for configuration and deployment advice. Advisor will see the above two VMs as not being highly available because they are not (and cannot be) in a common availability set, so you are advised to degrade their SLA by migrating them into a single zone for an availability set configuration – ignore that advice and be prepared to defend the decision from Azure noobs, such as management, auditors, and ill-informed consultants.

Opinion

Availability zones are important – I use them in an architecture pattern that I am working on with several customers. But you need to be aware of what they offer and how certain services do not understand or support them yet.


Azure Schedules Maintenance & Downtime For January 9th

Microsoft are currently distributing the following email template:

Performance, security, and quality are always top priorities for us. I am reaching out to give you an advanced notice about an upcoming planned maintenance of the Azure host OS. The vast majority of updates are performed without impacting VMs running on Azure, but for this specific update, a clean reboot of your VMs may be necessary. The VMs associated with your Azure subscription may be scheduled to be rebooted as part of the next Azure host maintenance event starting January 9th, 2018. The best way to receive notifications of the time your VM will undergo maintenance is to setup Scheduled Events <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events> .

If your VMs are maintained, they will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This is usually completed within a few minutes. For any VM in an availability set or a VM scale set, Azure will update the VMs one update domain at a time to limit the impact to your environments. Additionally, operating system and data disks as well as the temporary disk on your VM will be retained (Aidan: the VM stays on the host) during this maintenance.

Between January 2nd and 9th 2018, you will be able to proactively initiate the maintenance to control the exact time of impact on some of your VMs. Choosing this option will result in the loss of your temporary disk (Aidan: The VM redeploys to another host and gets a new temporary disk). You may not be able to proactively initiate maintenance on some VMs, but they could still be subject to scheduled maintenance from January 9th 2018. The best way to receive notifications of the time your VM will undergo maintenance is to setup Scheduled Events <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events> .

I have put together a list of resources that should be useful to you.

* Planned maintenance how-to guide and FAQs for Windows <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/maintenance-notifications> or Linux <https://docs.microsoft.com/en-us/azure/virtual-machines/linux/maintenance-notifications> VMs.

* Information about types of maintenance <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/maintenance-and-updates> performed on VMs.

* Discussion topics for maintenance on the Azure Virtual Machines forums.

I am committed to helping you through this process, please do reach out if I can be of any assistance.

Regards

<Insert signature>

In short, a deployment will start on Jan 9th that will introduce some downtime to services that are not in valid availability sets. If you are running VMs that might be affected, you can use the new Planned Maintenance feature between Jan 2-9 to move your VMs to previously updated hosts. There will be downtime for the Redeploy action, but it happens at a time of your choosing, and not Microsoft’s.
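
If you would rather script the proactive option than click through the portal, the general approach looks like this sketch (Az module; hypothetical names):

```powershell
# Is self-service maintenance available for this VM?
$status = Get-AzVM -ResourceGroupName 'myRG' -Name 'myVM01' -Status
$status.MaintenanceRedeployStatus   # IsCustomerInitiatedMaintenanceAllowed etc.

# If it is, trigger the redeploy at a time of your choosing
Restart-AzVM -ResourceGroupName 'myRG' -Name 'myVM01' -PerformMaintenance
```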

For you cloud noobs who want to know “what time on Jan 9th will the updates happen?”, imagine this: you have a server farm with north of 1,000,000 physical hosts. Do you think you’d patch them all at 3am? Instead, Microsoft will be rolling out the deployment, one update domain (group of hosts in a compute cluster) at a time, from Jan 9th.

And what about the promise that In-Place Migration would keep downtime to approximately 30 seconds? Back when the “warm reboot” feature was announced, Microsoft said that some updates would require more downtime. I guess the Jan 9th update is one of those exceptions.

My advice: follow the guidance in the mail template, and do planned maintenance when you can.

Want to Learn About In-Place Migration, Availability Sets, Update & Fault Domains?

If you found this information useful, then imagine what 2 days of training might offer you. I’m delivering a 2-day course in Amsterdam on April 19-20, teaching newbies and experienced Azure admins about Azure Infrastructure. There’ll be lots of in-depth information, covering the foundations, best practices, troubleshooting, and advanced configurations. You can learn more here.

Restore An Azure VM to an Availability Set From Azure Backup in the Azure Portal

Microsoft has shared how to restore an Azure VM to an availability set from Azure Backup using PowerShell. It’s nasty-hard-looking PowerShell, and my problem with examples of VM creation in PowerShell is that they’re never feature-complete.

While writing some Azure VM training recently, I stumbled across a cool option in the Azure Portal that I tried out … and it worked … and it means that I never have to figure out that nasty PowerShell!

The key to all this is to start using Managed Disks. Even if your existing VMs are using un-managed (storage account) disks, that’s not a problem because you can still use this restore method. The other thing you should remember is that the metadata of the VM is irrelevant – everything of value is in the disks.

Restore the Disks of the VM

Using these steps, you can restore the disks of your VM, managed or un-managed, to a storage location referred to as the staging account. Each disk is restored as a blob VHD file, and a JSON file describes the disks so that you can identify which one is the “osDisk”.
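
If you do want the scripted equivalent of this step, a rough sketch looks like the following (Az module; the vault, VM, and staging account names are hypothetical):

```powershell
# Sketch: restore a VM's disks as VHD blobs into a staging storage account
$vault = Get-AzRecoveryServicesVault -Name 'myVault'
Set-AzRecoveryServicesVaultContext -Vault $vault

$container = Get-AzRecoveryServicesBackupContainer -ContainerType AzureVM `
    -FriendlyName 'myVM01'
$item = Get-AzRecoveryServicesBackupItem -Container $container -WorkloadType AzureVM
$rp = Get-AzRecoveryServicesBackupRecoveryPoint -Item $item | Select-Object -First 1

# Writes the VHD blobs plus the JSON config file into the staging account
Restore-AzRecoveryServicesBackupItem -RecoveryPoint $rp `
    -StorageAccountName 'stagingsa' -StorageAccountResourceGroupName 'myRG'
```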

Create Managed Disks from the Restored VHDs

In this process, you create a managed disk from each restored VHD or blob file in the staging location. You have the option to restore the disks as Standard (HDD) or Premium (SSD) disks, which offers you some flexibility in your restore (you can switch storage types!). Make sure you ID the osDisk from the JSON file and mark it as either a Windows or Linux OS disk, depending on the contents.
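
A hedged sketch of this step (Az module; the names, URI, and SKU below are hypothetical placeholders):

```powershell
# Import a restored VHD blob as a managed disk
$sa = Get-AzStorageAccount -ResourceGroupName 'myRG' -Name 'stagingsa'
$vhdUri = 'https://stagingsa.blob.core.windows.net/vhds/myVM01-osdisk.vhd'

# Pick Premium_LRS or Standard_LRS; set -OsType (Windows/Linux) only on the
# disk that the JSON file identifies as the "osDisk"
$diskConfig = New-AzDiskConfig -Location 'northeurope' -CreateOption Import `
    -SourceUri $vhdUri -StorageAccountId $sa.Id -SkuName 'Premium_LRS' `
    -OsType 'Windows'
New-AzDisk -ResourceGroupName 'demorestore' -DiskName 'myVM01-osdisk' -Disk $diskConfig
```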

Create a VM From the OS Managed Disk

The third set of steps brings your VM back online. You use the previously restored and identified osDisk to create a new virtual machine from that managed disk. Make sure you select the availability set that you want to restore the VM to.
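
For reference, the scripted equivalent is roughly this (Az module; hypothetical names), with the availability set supplied via -AvailabilitySetId:

```powershell
# Build a new VM (just metadata) around the restored OS managed disk
$disk  = Get-AzDisk -ResourceGroupName 'demorestore' -DiskName 'myVM01-osdisk'
$avSet = Get-AzAvailabilitySet -ResourceGroupName 'myRG' -Name 'myAvSet'
$nic   = Get-AzNetworkInterface -ResourceGroupName 'myRG' -Name 'myVM01-nic'

$vmConfig = New-AzVMConfig -VMName 'myVM01' -VMSize 'Standard_D2s_v3' `
    -AvailabilitySetId $avSet.Id
$vmConfig = Set-AzVMOSDisk -VM $vmConfig -ManagedDiskId $disk.Id `
    -CreateOption Attach -Windows
$vmConfig = Add-AzVMNetworkInterface -VM $vmConfig -Id $nic.Id

New-AzVM -ResourceGroupName 'myRG' -Location 'northeurope' -VM $vmConfig
```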

Clean Up

The last step is the clean-up. If you had any data disks in the original machine, then you need to re-attach them to the new virtual machine. You’ll also need to configure the network settings of the Azure NIC resource. For example, if the new VM is replacing the old one, you should enter the old VM’s IP settings into the new NIC resource and update any NAT/load balancing rules, NSGs, PIPs, etc.
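
The re-attach can also be scripted; a minimal sketch (Az module; the names, LUN, and caching mode are hypothetical):

```powershell
# Re-attach a restored data disk to the new VM
$vm = Get-AzVM -ResourceGroupName 'myRG' -Name 'myVM01'
$dataDisk = Get-AzDisk -ResourceGroupName 'demorestore' -DiskName 'myVM01-data1'

$vm = Add-AzVMDataDisk -VM $vm -Name $dataDisk.Name -ManagedDiskId $dataDisk.Id `
    -Lun 0 -CreateOption Attach -Caching ReadOnly
Update-AzVM -ResourceGroupName 'myRG' -VM $vm
```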

And that’s it! There’s no PowerShell required, and it’s all pretty simple clicking in the Azure Portal that won’t take long once the disks are restored from the recovery services vault.

Create a New VM From An Existing Managed Disk

In previous posts I have shown how to restore the disks of a VM to a storage account and how to create managed disks from those VHD blobs. In this post, I will show how to create a new VM from a managed disk. When these 3 steps are done together, this is an easy way to restore an Azure virtual machine from backup to an availability set.

I previously created a managed disk from a restored VHD blob, and stored it in a resource group called demorestore. I deliberately named the new managed disk after the VM that I am going to create.


You can only create a new VM from a managed disk that contains an operating system. The disk blade in the portal shows which OS the disk contains – in my case, Windows. If it is an OS disk, then you can click the magic button called + Create VM.


What you are doing by clicking the button is short-circuiting the usual Create Virtual Machine blade/wizard. A blade you probably know appears, but some of the features are greyed out because they’re already determined by your choice to create the VM from an existing managed disk.

Enter the name of the new VM, and select the resource group.


In the Size blade, choose the size of the new VM. In Settings, choose the availability set (the key to restoring a VM to an availability set) and then configure all the other stuff like network, subnet, extensions, etc.

When you complete the wizard, a VM (which is just metadata) is created using your pre-existing OS managed disk. If you have any data disks to re-use, open Disks in the settings of the VM and add those managed disks with the required host caching mode. And that’s all there is to it!