Disaster recovery | Aidan Finn, IT Pro

5 Most Common Azure Review Findings

An Azure architecture review is something I’ve done many times. Some are focused on networking. Some take a broader look at governance, security, and disaster recovery. Some are urgent — a customer has a problem and needs to understand the full picture before they can fix it. Others are scheduled health checks. The nature of each engagement varies, but the findings? They’re remarkably consistent.

After completing several Azure architecture reviews across very different organisations – different sizes, sectors, and levels of Azure maturity – I’ve noticed the same issues surfacing time and again. I thought it was worth documenting them, because if these problems appear this consistently, they’re likely to appear in your environment too.

Here are the five most common findings.

1. Governance Is Either Missing or Broken

This one appears in every single review. Without exception.

The most common anti-pattern is the “everything in one subscription” model. I understand how it happens – an IT manager kicks off a cloud migration, picks up a subscription, and starts deploying things. It works, for a while. Then the environment grows, the resource groups multiply, and suddenly you have a sprawling mess where cost management is a nightmare, RBAC delegations are a headache, and nobody can tell which resources belong to which workload.

The Microsoft Cloud Adoption Framework (CAF) has a clear answer to this: Landing Zones. One subscription per workload. No cost. No catch. The result is a level of granularity that simplifies cost management, role assignment, quota management, naming, and troubleshooting in one move.

Beyond subscriptions, I typically find that Management Groups haven’t been set up correctly – or at all. Azure Policy is either absent or consists of a handful of default assignments nobody has reviewed. Naming standards are inconsistent, making the environment harder to read and operate at scale.
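Inconsistent naming is one of the easier findings to check mechanically. What follows is only a minimal sketch in Python: the convention it enforces (type, workload, environment, region, two-digit instance) is a hypothetical example invented for illustration, not a Microsoft or CAF standard, and the resource names are made up.

```python
import re

# Hypothetical convention: <type>-<workload>-<environment>-<region>-<two-digit instance>
NAME_PATTERN = re.compile(r"(rg|vnet|nsg|vm)-[a-z0-9]+-(dev|test|prod)-[a-z0-9]+-\d{2}")


def non_compliant(names):
    """Return the names that do not match the convention, in input order."""
    return [name for name in names if NAME_PATTERN.fullmatch(name) is None]


# Made-up inventory for illustration only.
inventory = [
    "vnet-payroll-prod-northeurope-01",
    "rg-crm-prod-westeurope-1",         # instance number is not two digits
    "PayrollVNET",                      # no structure at all
    "nsg-intranet-dev-northeurope-02",
]

print(non_compliant(inventory))
```

In a real environment the same rule would more likely live in Azure Policy than in a script; the point here is only that a written-down convention can be tested.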

The fix isn’t a multi-year transformation project. The fix is a minimum viable product: get the right structure in place, assign sensible policies, and improve from there. I’ve designed starter governance architectures in a single afternoon that gave organisations a solid foundation to build on. I’ve written previously about how I interpret and apply the CAF with customers, and why it’s never too late to apply it – even if you’ve been in Azure for years.

My company, Cloud Mechanix, offers a Cloud Strategy consulting service built around the CAF that gets the right foundations in place without the overhead of a months-long engagement.

2. The Network Architecture Is Overly Complex and Doesn’t Enforce Zero Trust

The second finding is closely related to the first. When governance is weak, networks tend to be large, flat, and complicated.

The most common pattern I encounter is what I call the “big VNet” design. Everything lives in one or two large Virtual Networks. Multiple workloads share the same address space. Route Tables get bigger and bigger as more exceptions are added. The network becomes unpredictable. Nobody is entirely sure what path traffic takes from A to B.

The security implication of this is significant. Without workload isolation, without proper routing via a central firewall, and without meaningful NSG enforcement, the environment defaults to a “full trust” model. Every workload can, in principle, reach every other workload. That is the opposite of Zero Trust.

The right design is a proper hub-and-spoke architecture, with application Landing Zones providing the granularity needed to enforce isolation. Each workload gets its own small Virtual Network, peered with the hub. The hub contains the firewall, connectivity resources, and nothing else. Traffic between spokes goes through the firewall and is subject to rules and IDPS inspection. I covered this in more depth in There Is More To Azure Networking Than Connectivity & Security.
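The "one small Virtual Network per workload" idea is easier to picture with addresses attached. The sketch below uses Python's standard ipaddress module to carve equally sized blocks out of one supernet, first for the hub and then for each spoke. The 10.10.0.0/16 range, the /24 spoke size, and the workload names are all assumptions chosen for illustration; a real plan would size each spoke to the workload.

```python
import ipaddress


def plan_spokes(supernet, workloads, spoke_prefix=24):
    """Give the hub and each workload the next free block from the supernet."""
    pool = ipaddress.ip_network(supernet).subnets(new_prefix=spoke_prefix)
    plan = {"hub": next(pool)}  # first block is reserved for the hub
    for name in workloads:
        plan[name] = next(pool)  # raises StopIteration if the supernet is exhausted
    return plan


def overlapping_pairs(plan):
    """Return every pair of networks in the plan whose address spaces overlap."""
    names = list(plan)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if plan[a].overlaps(plan[b])
    ]


# Hypothetical workloads and address space.
plan = plan_spokes("10.10.0.0/16", ["payroll", "crm", "intranet"])
for name, network in plan.items():
    print(f"{name:10} {network}")
print("overlaps:", overlapping_pairs(plan))
```

Non-overlapping, predictable ranges like these are what keep peering and firewall rules readable as the number of spokes grows.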

Azure Virtual Network Manager (AVNM) makes this scalable. Automatic peering, routing, and IP Address Management mean that a new workload Landing Zone can be connected correctly with minimal manual effort. Cloud Mechanix has published a Bicep module for AVNM if you want a head start. We also do a fixed-price, 5-day review of your (selected) Azure networks.

3. The Firewall Has Significant Limitations

I review a lot of firewalls. Very few of them are doing the job that was intended when they were deployed.

The problems vary. In some environments, the firewall is only inspecting a fraction of the traffic. The rest bypasses it entirely because the Route Tables aren’t configured correctly, or because workloads are co-located in the hub where they communicate directly. In others, the firewall is a single instance in a single Availability Zone with no redundancy. One data centre issue and the organisation loses its primary security control.

Network Security Groups are another recurring issue. They are either missing from most subnets, configured with overly permissive rules, or duplicated inconsistently across the environment. In several environments I’ve reviewed, a single NSG was associated with just one subnet while all others had open traffic. That’s not a security boundary. That’s a gap.
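A coverage gap like this is simple to spot once the subnets are in a list. The following is a rough Python sketch over a hand-written inventory; the Subnet class, the sample data, and the exemption list are assumptions for illustration, and in practice the inventory would come from an export of your own environment.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: platform subnets that are not expected to carry an NSG.
EXEMPT_SUBNETS = {"GatewaySubnet", "AzureFirewallSubnet"}


@dataclass
class Subnet:
    vnet: str
    name: str
    nsg: Optional[str] = None  # name of the associated NSG, if any


def subnets_without_nsg(subnets):
    """List non-exempt subnets that have no NSG associated."""
    return [
        f"{s.vnet}/{s.name}"
        for s in subnets
        if s.nsg is None and s.name not in EXEMPT_SUBNETS
    ]


# Made-up inventory: only one workload subnet has an NSG.
inventory = [
    Subnet("vnet-hub", "AzureFirewallSubnet"),
    Subnet("vnet-hub", "GatewaySubnet"),
    Subnet("vnet-payroll", "web", nsg="nsg-payroll-web"),
    Subnet("vnet-payroll", "data"),
    Subnet("vnet-crm", "app"),
]

print(subnets_without_nsg(inventory))
```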

WAF configurations also warrant attention. It’s not unusual to find a Web Application Firewall deployed in a way that places unnecessary load on the network firewall, or where the WAF itself has no high availability and is restricted to a single Availability Zone.

There is rarely a simple fix here. These issues tend to be symptomatic of a broader architectural problem – the network was built incrementally without a coherent design. The right answer, in most cases, is a rebuild using a proper hub-and-spoke design with a cloud-native, scalable firewall. If your team needs to get up to speed on how to design this correctly, Cloud Mechanix runs a Designing Secure Azure Networks course for exactly that purpose.

4. Disaster Recovery Is Backup-Based and Wouldn’t Survive a Real Incident

This one concerns me the most.

Almost universally, the disaster recovery capability I encounter in reviews is backup-based rather than replication-led. On the surface, this looks like disaster recovery – data is being backed up, and some of those backups are geo-replicated. But look at what would actually happen if a major incident occurred, and the picture changes quickly.

Recovery Time Objectives are measured in days or weeks rather than hours. Recovery Point Objectives are up to 24 hours because backups run once a day. Multiple backup solutions introduce complexity and inconsistency. Retention periods are short, meaning a ransomware attack that went undetected for several weeks could render the backups useless. Active Directory is being restored from backup, which is widely regarded as error-prone and risky. And in several environments, the disaster recovery region hasn’t been pre-built or secured to the same standard as production.
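The retention point is worth a quick bit of arithmetic. If an attacker has been in the environment for longer than the oldest retained backup, every recovery point postdates the compromise. A minimal Python sketch, with made-up retention and dwell-time figures:

```python
from datetime import timedelta


def has_clean_recovery_point(retention: timedelta, dwell_time: timedelta) -> bool:
    """True if at least one retained backup is older than the compromise."""
    return retention > dwell_time


# Illustrative figures only, not taken from any specific review.
dwell_time = timedelta(weeks=3)       # attack went undetected for three weeks
short_retention = timedelta(days=14)  # two weeks of daily backups
long_retention = timedelta(days=90)

print(has_clean_recovery_point(short_retention, dwell_time))
print(has_clean_recovery_point(long_retention, dwell_time))
```

With two weeks of retention and three weeks of dwell time there is nothing clean to restore from; with 90 days there is, although the restored data will be at least three weeks old.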

The regulatory stakes are rising. EU NIS2 makes clear that subject organisations must demonstrate tested recovery plans, reasonable recovery objectives, and appropriate governance. Backup-based disaster recovery will be difficult to defend to a regulator following a major incident. I explored the distinction between backup, resiliency, and genuine disaster recovery in Backup Versus Resiliency Versus Disaster Recovery – worth a read if you’re trying to explain the difference to stakeholders.

The right direction is a replication-led strategy with a warm secondary Azure region. Azure Site Recovery handles virtual machine replication. Azure Backup with geo-redundant replication handles retention and clean-room restores. Infrastructure-as-code ensures that the secondary environment stays consistent with production. And critically – it should be tested regularly, with documented, automated recovery plans.

Disaster recovery should be treated as a core business risk management capability, not an IT optimisation exercise.

5. Monitoring and Security Visibility Are Inadequate

The last finding is perhaps the least glamorous, but it enables everything else.

Across the environments I’ve reviewed, visibility is typically poor. Virtual Network Flow Logs are not enabled. Defender for Cloud is either unused or operating with a limited set of plans that don’t reflect the actual risk profile of the workloads. Subscription-level diagnostic logs and activity logs aren’t being forwarded to a central Log Analytics Workspace. Alerts – whether for threat intelligence signals, IDPS events, or operational anomalies – are either absent or minimal.

This matters for two reasons. First, without visibility, security incidents go undetected. The assumption that no alerts means no problems is dangerously wrong. The assumption should be the opposite. Second, troubleshooting complex connectivity issues without Flow Logs, firewall logs, or PaaS diagnostics is genuinely difficult. I’ve helped diagnose problems that should have taken minutes but took hours because the logging was never turned on.

The fix here isn’t particularly expensive. Virtual Network Flow Logs with Traffic Analytics, a centralised Log Analytics Workspace, Defender for Cloud with appropriate plans enabled, and a sensible set of alerts will transform the visibility of an Azure environment. These should be baseline requirements in any well-governed deployment, not optional extras.

A Pattern Worth Noting

Reading back through that list, there’s a common thread. Each of these findings is a consequence of deploying Azure without a framework. Without a governance strategy, without a landing zone architecture, without a security policy – teams make decisions in isolation, workloads accumulate, and complexity grows in ways that nobody fully intended.

The Cloud Adoption Framework exists precisely to avoid this. It’s not a lengthy consulting exercise. Done right, it provides a practical process for building Azure correctly – one that starts with business motivations, produces a clear architecture, and enables continuous improvement. Cloud Mechanix has developed its own interpretation of the CAF that keeps the process lean and focused on results rather than documentation.

If any of the above findings sound familiar, it may be worth taking stock.

Is Your Azure Environment on the Right Track?

If the findings in this post ring any bells, a structured Azure architecture review is the fastest way to find out.

Cloud Mechanix offers a Fixed-Rate Cloud Environment Review – an expert-led review of your Azure environment, delivered in five business days. The scope is agreed upfront, access is read-only, and the output is a comprehensive report with clear, prioritised recommendations. No vague observations. No 200-page documents that nobody reads.

Whether the concern is security, governance, network architecture, disaster recovery, or the broader picture – get in touch and we can take it from there.

Backup Versus Resiliency Versus Disaster Recovery

Most of us are no strangers to the backup versus disaster recovery conversation. Each is a different problem, typically (but not always) with different business expectations. Lately, resiliency has crawled into the mix, and a lot of social media commentary isn’t helping. In this post, I’m going to explain how I define backup, resiliency, and disaster recovery, and discuss how they impact my Azure designs for service & data availability.

Essential Terminology

There are two essential terms that we have to understand to discuss these problems/solutions:

RPO: The recovery point objective is how much data, measured in time, we lose when our solution kicks in.
RTO: The recovery time objective is how long services are offline while the solution kicks in.
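To make the two terms concrete, here is a small Python sketch that measures both against an incident timeline and compares them with the objectives. The timestamps and the objectives are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline.
last_good_copy = datetime(2025, 3, 10, 1, 0)     # last backup/replica completed
failure = datetime(2025, 3, 10, 14, 30)          # production is lost
service_restored = datetime(2025, 3, 11, 9, 0)   # users can work again

data_lost = failure - last_good_copy   # what the RPO is measured against
downtime = service_restored - failure  # what the RTO is measured against

# Hypothetical objectives agreed with the business.
rpo = timedelta(hours=4)
rto = timedelta(hours=8)

print(f"Data lost: {data_lost} (objective {rpo}) -> met: {data_lost <= rpo}")
print(f"Downtime:  {downtime} (objective {rto}) -> met: {downtime <= rto}")
```

In this made-up case both objectives are missed: 13.5 hours of data are lost against a 4-hour RPO, and the service is down for 18.5 hours against an 8-hour RTO.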

Backup/Restore

A backup is when we take a copy of our data and (ideally) store that copy elsewhere, and even in several places. The concept is that we can restore our data from a backup if the original data (files, database, VM files, etc.) is deleted, either accidentally or deliberately.

The base product for backup in Azure is Azure Backup, which supports:

Azure VMs
Managed disks
Azure Files
SQL Server in Azure VMs
SAP HANA databases in Azure VMs
Azure Database for PostgreSQL servers
Azure Blobs
Azure Database for PostgreSQL Flexible server
Azure Kubernetes Service
Azure Database for MySQL – Flexible Server
SAP ASE (Sybase) database on Azure VMs
Azure Data Lake Storage
Azure Elastic SAN

Quite honestly, that list is much longer than the last time I searched for it! Azure Backup covers a lot, but it doesn’t cover everything. Some solutions, like Azure SQL, feature their own backup solution.

It’s not unusual for people to bring another backup tool to Azure. The one I hear about most is Veeam Backup for Microsoft Azure. While I’ve never used Veeam hands-on, its reputation is excellent, and it has the unique ability to be platform agnostic. Want to restore VMs from Azure to Hyper-V, Nutanix, or VMware, if you’re that way inclined ;)? You can with Veeam.

RPO: Backup features the longest RPO here. The data loss is dependent on how often your backup jobs run. Daily backups? You can lose up to 24 hours of data. Backups every 15 minutes? You might lose up to 15 minutes of data.
RTO: This is where the pain can be; the RTO is how long it takes to copy your data from the backup storage to the production storage. Restoring an Azure Backup snapshot recovery point is a disk-to-disk copy. Restoring a 10 TB VM from blob storage over the network is going to be a long wait.
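How long is "a long wait"? A back-of-the-envelope estimate is just size divided by sustained throughput. The Python sketch below uses decimal units (1 TB = 1,000,000 MB), and the throughput figures are pure assumptions for illustration, not published Azure Backup numbers; measure your own restores to get real ones.

```python
def restore_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate the copy time in hours for a restore at a sustained throughput."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = size_mb / throughput_mb_per_s
    return seconds / 3600


# Assumed sustained throughputs in MB/s (hypothetical, for illustration).
for throughput in (50, 200, 500):
    hours = restore_hours(10, throughput)
    print(f"10 TB at {throughput} MB/s: about {hours:.1f} hours")
```

That is the copy time for a single VM only. It excludes detection, decision making, validation, and every other VM queued behind it, which is how backup-based recovery times stretch into days.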

Disaster Recovery (DR)

The purpose of DR is to recover from a disaster. Let’s define what a disaster could be using real examples:

Hurricane Katrina was a natural disaster that wiped out huge areas of the USA in 2005.
The “black summer” bushfires in Australia destroyed millions of hectares of land in 2019-2020.
The Indian Ocean Tsunami in 2004 caused devastation in the coastal areas of many countries.
Keeping it local for me: post-storm winter floods have caused widespread damage throughout Ireland in the last few years.
Three AWS data centres were hit by drone attacks in the UAE & Bahrain in March of this year.

Disasters can be natural or they can be man-made. Disasters rarely target 1 building; they wipe out an area. They are rare – but they happen. There is another kind of disaster, which few think about:

KNP Logistics Group, a 125-year-old UK transport firm with over 700 employees, was put out of business because of a ransomware attack in 2025.

Pending (and passed in some countries) EU regulations (NIS2) consider this a disaster that subject organisations must be prepared for.

For cloud planning, if we need to prepare for disaster recovery, then we must plan for the loss of the Azure region by replicating services/data to another region – typically the paired region. There is no one solution, and there are plenty of complicating factors. Techs that will be in scope include:

Azure Site Recovery (ASR) for Azure VMs
Geo-redundant storage (GRS) and the various geo-variants
PaaS resources that include GRS
Database replication
DevOps pipelines/workflows to redeploy resources (but not data)

There is a fun grey area here. Veeam is not only a backup solution; it is also a DR solution! You will also find that some people use backup as a budget DR solution – they replicate data from the primary location to the secondary location (Azure Backup Geo-Redundant). The right solution for your organisation is often based on business requirements and budget, with budget being the big elephant in the room.

RPO: DR replication is typically based on asynchronous replication. The RPO is often measured in seconds/minutes.
RTO: The RTO really is dependent on the complexity of services, the quantity of services to restore, the interdependencies, and how automated the process is once it starts. The RTO should be measured in hours, but with a backup-based solution it might be measured in days/weeks.

Resilience

The purpose of resilience is to enable a service to survive a localised issue, such as:

A VM crashes.
Microsoft are patching an App Service compute instance.
An Azure host is getting a firmware update.
Microsoft has a networking issue in a single data centre building.

We use resilience to keep the service operational with no perceivable outage to the service consumer. There are many ways to tackle resilience, but they are all based on scaling out:

Availability Zones: Most Azure regions have multiple data centre buildings. The buildings are split into what we see as 3 Availability Zones. Each Availability Zone has independent external network connections, power, and cooling. The theory is that if I spread the tier of a service across 3 zones, then that tier can survive 2 zones going offline. Some PaaS services, like Bastion, default to using zones; some have to be opted in. Beware of some PaaS resources, like App Service Environment, that have minimum consumption requirements to be placed across Availability Zones.
Availability Sets: If we cannot use Availability Zones (more later on this), then we can place virtual machines in Availability Sets. We can think of Availability Sets as a form of anti-affinity; machines in the same set are placed into different update domains (Azure platform updates) and fault domains (racks) in the same room in the same data centre. Microsoft does this for multi-instance PaaS services that are not using Availability Zones.
Zone Redundant Storage (ZRS): Azure storage is based on the concept of storing each block 3 times. ZRS places the replica blocks across 3 different Availability Zones. Your data remains operational even if 2 of the data centres are lost.
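The "spread a tier across 3 zones" theory can be sketched in a few lines of Python. This is a toy model only: it places instances round-robin across three zones and lists which ones are left when zones fail. The instance names are hypothetical, and the model ignores whether the survivors have the capacity to carry the full load.

```python
from itertools import cycle

ZONES = ("1", "2", "3")


def place(instance_count):
    """Assign instances to zones round-robin, for example vm-1 to zone 1."""
    zone_iter = cycle(ZONES)
    return {f"vm-{i}": next(zone_iter) for i in range(1, instance_count + 1)}


def survivors(placement, failed_zones):
    """Return the instances that are not in a failed zone."""
    return [name for name, zone in placement.items() if zone not in failed_zones]


three_instances = place(3)
two_instances = place(2)

print(survivors(three_instances, {"1", "2"}))  # one instance left in zone 3
print(survivors(two_instances, {"1", "2"}))    # nothing left
```

The tier only survives the loss of two zones if it has at least one instance in each of the three zones, so two instances are not enough for that claim.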

There are many architectural considerations to handle when you start resiliency planning.

The old pain-in-the-a** is the legacy line-of-business app that supports just a single VM. There is no scaling out to gain resiliency. Traditionally, VMs used LRS (locally redundant storage) managed disks. LRS managed disks are stored in a single data centre with the VM. There have been issues in the past where storage in a single room has gone offline, taking all three LRS replicas of the disks’ blocks offline. You can choose to use ZRS managed disks. The VM will continue to primarily use the local replica, but two replicas are stored in other Availability Zones in the same region. If the primary storage cluster goes offline, you can perform a manual process to get the VM back online with another replica.

RPO: Depending on the architecture and technologies, there is either a zero-RPO (active/active services) or an RPO of a few seconds (replicated storage).
RTO: In most cases, the RTO is 0. The one exception that I can think of is the single VM with a ZRS disk, where the RTO is how long it takes you to force-detach the disk and create a new VM with the existing disks in another Availability Zone.

By the way, there are whole areas on networking resiliency that I could type about for hours too!

Confusion

As I have alluded to, I’ve seen some discussions on LinkedIn recently stating that Availability Zones can be used for disaster recovery. They could. Can they? Should they?

What is the disaster that you are planning for? If it’s any of the above natural disasters then I would argue that spreading your services/data across data centres located beside each other is going to lead to a sudden career-ending meeting.

Don’t give me the “Availability Zones are spread apart from each other” line. Suuuuure they are – except any of the ones that I’ve located on Google Maps, such as North Europe, West Europe, or East US, to begin with.

Now, let’s get on with the practical realities of following the concept of using Availability Zones for DR. When was the last time you tried to deploy Azure VMs across Availability Zones? What about a firewall? Or App Services? Did you get an “Internal Service Error”, a weird quota error, or at least some helpful message to inform you that there was no capacity in “zone 2”? That’s been my experience for the last 14+ months in every region that I’ve worked in. So, don’t recommend that I use a technology for emergency DR if I cannot even use it for operational resiliency!

Yes, I know that capacity issues also impact inter-region DR designs. If West Europe were to be flooded, you can be all but sure that you are not getting into North Europe thanks to the instant massive demand from many customers. I know that’s an unlikely scenario – but it’s one that some organisations must plan for. For example, I had a central government customer ask me about Azure region choice. The country in question has an “aggressive” neighbour to the east that likes to wage war on its neighbours. The local Microsoft office asked them to move into the new local Azure region soon after Ukraine was invaded. I asked the customer: “Where would Ukraine be now if all of its IT services were based in a local Azure region under 300 km from Russia?” I’d extend that with a follow-up question now: “What if you used Availability Zones in that single region for DR?” Yes, the scenario is real – see above. Or consider if a hurricane reached Boydton in Virginia, USA, or a bushfire ran rampant in New South Wales/Victoria, Australia.

Before you go planning, please:

Understand the risks you are planning for
Have a budget
Understand the technologies
Comprehend how, or if, the technologies counter the risks
Check whether the technologies are available to you at all!