Backup Versus Resiliency Versus Disaster Recovery

Most of us are no strangers to the backup versus disaster recovery conversation. Each is a different problem, typically (but not always) with different business expectations. Lately, resiliency has crawled into the mix, and a lot of social media commentary isn’t helping. In this post, I’m going to explain how I define backup, resiliency, and disaster recovery, and discuss how they impact my Azure designs for service & data availability.

Essential Terminology

There are two essential terms that we have to understand to discuss these problems/solutions:

  • RPO: The recovery point objective is how much data, measured in time, is lost when our solution kicks in.
  • RTO: The recovery time objective is how long, measured in time, services are offline while the solution kicks in.

Backup/Restore

A backup is when we take a copy of our data and (ideally) store that copy elsewhere, and even in several places. The concept is that we can restore our data from a backup if the original data (files, database, VM files, etc) are deleted either accidentally or deliberately.

The base product for backup in Azure is Azure Backup, which supports:

  • Azure VMs
  • Managed disks
  • Azure Files
  • SQL Server in Azure VMs
  • SAP HANA databases in Azure VMs
  • Azure Database for PostgreSQL servers
  • Azure Blobs
  • Azure Database for PostgreSQL Flexible server
  • Azure Kubernetes service
  • Azure Database for MySQL – Flexible Server
  • SAP ASE (Sybase) database on Azure VMs
  • Azure Data Lake Storage
  • Azure Elastic SAN

Quite honestly, that list is much longer than the last time I searched for it! Azure Backup covers a lot, but it doesn’t cover everything. Some solutions, like Azure SQL, feature their own backup solution.

It’s not unusual for people to bring another backup tool to Azure. The one I hear most of is Veeam Backup for Microsoft Azure. While I’ve never used Veeam hands-on, its reputation is excellent, and it has the unique ability to be platform agnostic. Want to restore VMs from Azure to Hyper-V, Nutanix, or VMware if you’re that way inclined ;)? You can with Veeam.

  • RPO: Backup features the longest RPO here. The data loss is depdendent on how often your backup jobs run. Daily backups? You can lose up to 24 hours of data. Backups every 15 minutes? You might lose up to 15 minutes of data.
  • RTO: This is where the pain can be; the RTO is how long it takes to copy your data from the backup storage to the production storage? Restoring an Azure Backup snapshot recovery point is a disk-to-disk copy. Restoring a 10 TB VM from blob storage over the network is going to be a long wait.

Disaster Recovery (DR)

The purpose of DR is to recover from a disaster. Let’s define what a disaster could be using real examples:

  • Hurricane Katrina was a natural disaster that wiped out huge areas of the USA in 2005.
  • The “black summer” bushfires in Australia destroyed millions of hectares of land in 2019-2020.
  • The Indian Ocean Tsunami in 2004 caused devastation in the coastal areas of many countries.
  • Keeping it local for me: post-storm winter floods have caused widespread damage throughout Ireland in the last few years.
  • Three AWS data centres were hit by drone attacks in the UAE & Bahrain in March of this year.

Disasters can be natural or they can be man-made. Disasters rarely target 1 building; they wipe out an area. They are rare – but they happen. There is another kind of disaster, which few think about:

  • KNP Logistics Group, a 125-year-old UK transport firm with over 700 employees, was put out of business because of a ransomware attack in 2025.

Pending (and passed in some countries) EU regulations (NIS2) consider this a disaster that subject organisations must be prepared for.

For cloud planning, if we need to prepare for disaster recovery, then we must plan for the loss of the Azure region ny replicating services/data to another region – typically the paired region. There is no one solution, and there are plenty of complicating factors. Techs that will be in scope include:

  • Azure Site Recovery (ASR) for Azure VMs
  • Geo-redundant storage (GRS) and the various geo-variants
  • PaaS resources that include GRS
  • Database replication
  • DevOps pipelines/workflows to redeploy resources (but not data)

There is a fun grey area here. Veeam is not only a backup solution; it is also a DR solution! You will also find that some people use backup as a budget DR solution – they replicate data from the primary location to the secondary location (Azure Backup Geo-Redundant). The right solution for your organisation is often based on business requirements and budget, with budget being the big elephant in the room.

  • RPO: DR replication is typically based on asynchronous replication. The RPO is often measured in seconds/minutes.
  • RTO: The RTO really is dependent on the complexity of services, the quantity of services to restore, the interdependencies, and how automated the process is once it starts. The RTO should be measured in hours, but a backup solution might be measured in days/weeks.

Resilience

The purpose of resilience is to enable a service to survive a localised issue, such as:

  • A VM crashes.
  • Microsoft are patching an App Service compute instance.
  • An Azure host is getting a firmware update.
  • Microsoft had a networking issue in a single data centre building.

We use resilience to keep the service operational with no perceivable outage to the service consumer. There are many ways to tackle resilience, but they are all based on scaling out:

  • Availability Zones: Most Azure regions have multiple data centre buildings. The buildings are split into what we see as 3 Availability Zones. Each Availability Zone has independent external network connections, power, and cooling. The theory is that if I spread the tier of a service across 3 zones, then that tier can survive 2 zones going offline. Some PaaS services, like Bastion, default to using zones; some have to be opted in. Beware of some PaaS resources, like App Service Environment, that have minimum consumption requirements to be placed across Availability Zones.
  • Availability Sets: If we cannot use Availability Zones (more later on this), then we can place virtual machines in Availability Sets. We can think of Availability Sets as a form of anti-affinity; machines in the same set are placed into different update domains (Azure platform updates) and fault domains (racks) in the same room in the same data centre. Microsoft does this for multi-instance PaaS services that are not using Availability Zones.
  • Zone Redundant Storage (ZRS): Azure storage is based on the concept of storing each block 3 times. ZRS places the replica blocks across 3 different Availability Zones. Your data remains operational even if 2 of the data centres are lost.

There are many architectural considerations to handle when you start resiliency planning.

The old pain-in-the-a** is the legacy line-of-business app that supports just a single VM. There is no scaling out to gain resiliency. Traditionally, VMs used LRS (locally redundant storage) managed disks. LRS managed disks are stored in a single data centre with the VM. There have been issues in the past where storage in a single room has gone offline, taking all three LRS replicas of the disks’ blocks offline. You can choose to use ZRS managed disks. The VM will continue to primarily use the local replica, but two replicas are stored in other Availability Zones in the same region. If the primary storage cluster goes offline, you can perform a manual process to get the VM back online with another replica.

  • RPO: Depending on the architecture and technologies, there is either a zero-RPO (active/active services) or an RPO of a few seconds (replicated storage).
  • RTO: In most cases, the RTO is 0. The one exception that I can think of is the single VM with a ZRS where the RTO is how long it takes you to force-detach the disk and create a new VM with the existing disks in another Availability Zone.

By the way, there are whole areas on networking resiliency that I could type about for hours too!

Confusion

As I have alluded to, I’ve seen some discussions on LinkedIn recently stating that Availability Zones can be used for disaster recovery. They could. Can they? Should they?

What is the disaster that you are planning for? If it’s any of the above natural disasters then I would argue that spreading your services/data across data centres located beside each other is going to lead to a sudden career-ending meeting.

Don’t give me the “Availability Zones are spread apart from each other” line. Suuuuure they are – except any of the ones that I’ve located on Google Maps, such as North Europe, West Europe, or US East to begin with.

Now, let’s get on with the practical realities of following the concept of using Availability Zones for DR. When was the last time you tried to deploy Azure VMs across Availability Zones? What about a firewall? Or App Services? Did you get an “Internal Service Error”, a weird quota error, or at least some helpful message to inform you that there was no capacity in “zone 2”? That’s been my experience for the last 14+ months in any regions that I’ve worked in. So, don’t recommend me to use a technology for emergency DR if I cannot even use it for operational resiliency!

Yes, I know that capacity issues also impact inter-region DR designs. If West Europe were to be flooded, you can be all but sure that you are not getting into North Europe thanks to the instant massive demand from many customers. I know that’s an unlikely scenario – but it’s one that some organisations must plan for. For example, I had a central government customer ask me about Azure region choice. The country in question has an “aggressive” neighbour to the east that likes to wage war on its neighbours. The local Microsoft office asked them to move into the new local Azure region soon after Ukraine was invaded. I asked the customer: “Where would Ukraine be now if all of its IT services were based in a local Azure region under 300 KM from Russia?” I’d extend that with a follow-up question now: “What if you used Availability Zones in that single region for DR?” Yes, the scenario is real – see above. Or consider if a hurricane reached Boydton in Virginia, USA, or a bushfire ran rampant in New South Wales/Victoria, Australia.

Before you go planning, please:

  • Understand the risks you are planning for
  • Have a budget
  • Understand the technologies
  • Comprehend how or if the technologies counter the risks
  • If the technologies are available to you at all!

Migrating Azure Firewall To Availability Zones

Microsoft recently added support for availability zones to Azure firewall in regions that offer this higher level of SLA. In this post, I will explain how you can convert an existing Azure Firewall to availability zones.

Before We Proceed

There are two things you need to understand:

  1. If you have already deployed and configured Azure Firewall then there is no easy switch to turn on availability zones. What I will be showing is actually a re-creation.
  2. You should do a “dress rehearsal” – test this process and validate the results before you do the actual migration.

The Process

The process you will do will go as follows:

  1. Plan a maintenance window when the Azure Firewall (and dependent communications) will be unavailable for 1 or 2 hours. Really, this should be very quick but, as Scotty told Geordi La Forge, a good engineer overestimates the effort, leaves room for the unexpected, and hopefully looks like a hero if all goes to the unspoken plan.
  2. Freeze configuration changes to the Azure Firewall.
  3. Perform a backup of the Azure Firewall.
  4. Create a test environment in Azure – ideally a dedicated subscription/virtual network(s) minus the Azure Firewall (see the next step).
  5. Modify the JSON file to include support for availability zones.
  6. Restore the Azure Firewall backup as a new firewall in the test environment.
  7. Validate that that new firewall has availability zones and that the rules configuration matches that of the original.
  8. Confirm & wait for the maintenance window.
  9. Delete the Azure Firewall – yes, delete it.
  10. Restore the Azure Firewall from your modified JSON file.
  11. Validate the restore
  12. Celebrate – you have an Azure Firewall that supports multiple zones in the region.

Some of the Technical Bits

The processes of backing up and restoring the Azure Firewall are covered in my post here.

The backup is a JSON export of the original Azure Firewall, describing how to rebuild and re-configure it exactly as is – without support for availability zones. Open that JSON and make 2 changes.

The first change is to make sure that the API for deploying the Azure Firewall is up to date:

        {
            "apiVersion": "2019-04-01",
            "type": "Microsoft.Network/azureFirewalls",

The next change is to instruct Azure which availability zones (numbered 1, 2, and 3) that you want to use for availability zones in the region:

        {
            "apiVersion": "2019-04-01",
            "type": "Microsoft.Network/azureFirewalls",
            "name": "[variables('FirewallName')]",
            "location": "[variables('RegionName')]",
            "zones": [
                "1",
                "2",
                "3"
            ],
            "properties": {
                "ipConfigurations": [
                    {

And that’s that. When you deploy the modified JSON the new Azure Firewall will exist in all three zones.

Note that you can use this method to place an Azure Firewall into a single specific zone.

Costs Versus SLAs

A single zone Azure Firewall has a 99.95% SLA. Using 2 or 3 zones will increase the SLA to 99.99%. You might argue “what’s the point?”. I’ve witnessed a data center (actually, it was a single storage cluster) in an Azure region go down. That can have catastrophic results on a service. It’s rare but it’s bad. If you’re building a network where the Azure Firewall is the centre of secure, then it becomes mission critical and should, in my opinion, span availability zones, not for the contractual financial protections in an SLA but for protecting mission critical services.  That protection comes at a cost – you’ll now incur the micro-costs of data flows between zones in a region. From what I’ve seen so far, that’s a tiny number and a company that can afford a firewall will easily absorb that extra relatively low cost.

Azure Availability Zones in the Real World

I will discuss Azure’s availability zones feature in this post, sharing what they can offer for you and some of the things to be aware of.

Uptime Versus SLA

Noobs to hosting and cloud focus on three magic letters: S, L, A or service level agreement. This is a contractual promise that something will be running for a certain percentage of time in the billing period or the hosting/cloud vendor will credit or compensate the customer.

You’ll hear phrases like “three nines”, or “four nines” to express the measure of uptime. The first is a 99.9% measure, and the second is a 99.99% measure. Either is quite a high level of uptime. Azure does have SLAs for all sorts of things. For example, a service deployed in a valid virtual machine availability set has a connectivity (uptime) SLA of 99.9%.

Why did I talk about noobs? Promises are easy to make. I once worked for a hosting company that offers a ridiculous 100% SLA for everything, including cheap-ass generic Pentium “servers” from eBay with single IDE disks. 100% is an unachievable target because … let’s be real here … things break. Even systems with redundant components have downtime. I prefer to see realistic SLAs and honest statements on what you must do to get that guarantee.

Azure gives us those sorts of SLAs. For virtual machines we have:

  • 5% for machines with just Premium SSD disks
  • 9% for services running in a valid availability set
  • 99% for services running in multiple availability zones

Ah… let’s talk about that last one!

Availability Sets

First, we must discuss availability sets and what they are before we move one step higher. An availability set is anti-affinity, a feature of vSphere and in Hyper-V Failover Clustering (PowerShell or SCVMM); this is a label on a virtual machine that instructs the compute cluster to spread the virtual machines across different parts of the cluster. In Azure, virtual machines in the same availability set are placed into different:

  • Update domains: Avoiding downtime caused by (rare) host reboots for updates.
  • Fault domains: Enable services to remain operational despite hardware/software failure in a single rack.

The above solution spreads your machines around a single compute (Hyper-V) cluster, in a single room, in a single building. That’s amazing for on-premises, but there can still be an issue. Last summer, a faulty humidity sensor brought down one such room and affected a “small subset” of customers. “Small subset” is OK, unless you are included and some mission critical system was down for several hours. At that point, SLAs are meaningless – a refund for the lost runtime cost of a pair of Linux VMs running network appliance software won’t compensate for thousands or millions of Euros of lost business!

Availability Zones

We can go one step further by instructing Azure to deploy virtual machines into different availability zones. A single region can be made up of different physical locations with independent power and networking. These locations might be close together, as is typically the case in North Europe or West Europe. Or they might be on the other side of a city from each other, as is the case in some in North America. There is a low level of latency between the buildings, but this is still higher than that of a LAN connection.

A region that supports availability zones is split into 4 zones. You see three zones (round robin between customers), labeled as 1, 2, and 3. You can deploy many services across availability zones – this is improving:

  • VNet: Is software-defined so can cross all zones in a single region.
  • Virtual machines: Can connect to the same subnet/address space but be in different zones. They are not in availability sets but Azure still maintains service uptime during host patching/reboots.
  • Public IP Addresses: Standard IP supports anycast and can be used to NAT/load balance across zones in a single region.

Other network resources can work with availability zones in one of two ways:

  • Zonal: Instances are deployed to a specific zone, giving optimal latency performance within that zone, but can connect to all zones in the region.
  • Zone Redundant: Instances are spread across the zone for an active/active configuration.

Examples of the above are:

  • The zone-aware VNet gateways for VPN/ExpressRoute
  • Standard load balancer
  • WAGv2 / WAFv2

Considerations

There are some things to consider when looking at availability zones.

  • Regions: The list of regions that supports availability zones is increasing slowly but it is far from complete. Some regions will not offer this highest level of availability.
  • Catchup: Not every service in Azure is aware of availability zones, but this is changing.

Let me give you two examples. The first is VM Boot Diagnostics, a service that I consider critical for seeing the console of the VM and getting serial console access without a network connection to the virtual machine. Boot Diagnostics uses an agent in the VM to write to a storage account. That storage account can be:

  • LRS: 3 replicas reside in a single compute cluster, in a single room, in a single building (availability zone).
  • GRS: LRS plus 3 asynchronous replicas in the paired region, that are not available for write unless Microsoft declares a total disaster for the primary region.

So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics, however, the agent will not support this type of storage configuration. Gotcha!

Azure Advisor will also be a pain in the ass. Noobs are told to rely on Advisor (it is several questions in the new Azure infrastructure exams) for configuration and deployment advice. Advisor will see the above two VMs as being not highly available because they are not (and cannot) be in a common availability set, so you are advised to degrade their SLA by migrating them to a single zone for an availability set configuration – ignore that advice and be prepared to defend the decision from Azure noobs, such as management, auditors, and ill-informed consultants.

Opinion

Availability zones are important – I use them in an architecture pattern that I am working on with several customers. But you need to be aware of what they offer and how certain things do not understand them yet or do not support them yet.