Aidan Finn, IT Pro

What Happens When An Azure Region Is Destroyed?

This is a topic that has been “top of mind” (I sound like a management consulting muppet) recently: how can I recover from an Azure region being destroyed?

Why Am I Thinking About This?

Data centres host critical services. If one of these data centres disappears then everything that was hosted in it is gone. The cause of the disaster might be a forest fire, a flood, or even a military attack – the latter was once considered part of a plot for a far-fetched airport novel but now we have to consider that it’s a real possibility, especially for countries close to now-proven enemies.

We have to accept that there is a genuine risk that an area that hosts several data centres could be destroyed, along with everything contained in those data centres.

Azure Resilience Features

Availability Sets

The first level of facility resilience in Microsoft’s global network (hosting all of their cloud/internal services) is the availability set concept; this is the default level of high availability designed to keep highly-available services online during a failure in a single fault domain (rack of computers) or deployment of changes/reboots to an update domain (virtual selection of computers) in a single row/room (rooms are referred to as colos). With everything in a single room/building we cannot consider an availability set to be a disaster resilience feature.

Availability Zones

The next step up is availability zones. Many Azure regions have multiple data centres. Those data centres are split into availability zones. Each availability zone has independent resources for networking, cooling and power. The theory is that if you spread a highly-available service across three zones, then it should remain operational even if two of the zones go down.
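As a rough illustration of the zone-spreading idea, here is a minimal Python sketch. The zone labels, instance names, and both helper functions are hypothetical; it only models where instances are placed, not how Azure schedules or fails anything.

```python
from itertools import cycle

ZONES = ["1", "2", "3"]  # hypothetical labels for a three-zone region


def spread_across_zones(instance_names, zones=ZONES):
    """Assign instances to zones round-robin and return {instance: zone}."""
    return {name: zone for name, zone in zip(instance_names, cycle(zones))}


def survivors(placement, failed_zones):
    """Return the instances whose zone is not in the set of failed zones."""
    return sorted(n for n, z in placement.items() if z not in failed_zones)


placement = spread_across_zones(["web-0", "web-1", "web-2"])

# Losing two of the three zones still leaves one instance running.
remaining = survivors(placement, failed_zones={"1", "2"})

# Losing every zone at once (the whole region) leaves nothing.
none_left = survivors(placement, failed_zones=set(ZONES))
```

The second call is the case that zones cannot help with: all three zones sit in the same region.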

Paired Regions

An Azure region is a collection of data centres that are built close to each other (in terms of networking latency). For example, North Europe (Grangecastle, Dublin Ireland) has many physical buildings hosting Microsoft cloud services. Microsoft has applied to build more data centres in Newhall, Naas, Kildare, which is ~20 miles away but will only be a few milliseconds away on the Microsoft global network. Those new data centres will be used to expand North Europe – the existing site is full and more land would be prohibitively expensive.

Many Azure regions are deployed as pairs. Microsoft has special rules for picking the locations of those paired regions, including:

They must be a minimum distance apart from each other
They do not share risks of a common natural disaster

For example, North Europe in Dublin, Ireland is paired with West Europe in Middenmeer, Netherlands.

The pairing means that systems that have GRS-based storage are able to replicate to each other. The obvious example of that is a Storage Account. Less obvious examples are things like Recovery Services Vaults and some PaaS database systems that are built on Storage Account services such as blob or file.

Mythbusting

Microsoft Doesn’t Do Your Disaster Recovery By Default

Many people enter the cloud thinking that lots of things are done for them which are not. For example, when one deploys services in Azure, Azure does not replicate those things to the paired region for you unless:

You opt-in/configure it
You pay for it

That means if I put something in West US, I will need to configure and pay for replication somewhere else. If my resources use Virtual Networks, then I will need to deploy those Virtual Networks in the other Azure region myself.

The Paired Region Is Available To Me

In the case of hero regions, such as East US/West US or North Europe/West Europe, then the paired region is available to you. But in most cases that I have looked into, that is not the case with local regions.

Several regions do not have a paired region. And if you look at the list of paired regions, look for the * which denotes that the paired region is not available to you. For example:

Germany North is not available to customers of Germany West Central
Korea South is not available to users of Korea Central
Australia Central 2 is not available to customers of Australia Central
Norway West is not available to users of Norway East
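To make the distinction concrete, here is a small Python sketch of the check I would want in any DR planning script. The two dictionaries contain only the examples named in this post; they are not a complete or current list, and the function name is made up for illustration.

```python
from typing import Optional

# Pairs from this post where the paired region is open to customers.
OPEN_PAIRS = {
    "North Europe": "West Europe",
    "West Europe": "North Europe",
}

# Pairs from this post where the paired region is restricted (marked with *).
RESTRICTED_PAIRS = {
    "Germany West Central": "Germany North",
    "Korea Central": "Korea South",
    "Australia Central": "Australia Central 2",
    "Norway East": "Norway West",
}


def usable_paired_region(region: str) -> Optional[str]:
    """Return the paired region a customer can actually deploy to, or None."""
    return OPEN_PAIRS.get(region)


def describe(region: str) -> str:
    """Summarise the pairing situation for one region."""
    if region in OPEN_PAIRS:
        return f"{region}: paired with {OPEN_PAIRS[region]} (available)"
    if region in RESTRICTED_PAIRS:
        return f"{region}: paired with {RESTRICTED_PAIRS[region]} (restricted)"
    return f"{region}: not covered by this sketch, check the published list"
```

A region that returns None here needs a replication design that does not depend on the pair.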

The Norway case is a painful one. Many Norwegian organisations must comply with national laws that restrict the placement of data outside of Norwegian borders. This means that if they want to use Azure, they have to use Norway East. Most of those customers assume that Norway West will be available to them in the event of a disaster. Norway West is a restricted region; I am led to believe that:

Norway West is available to just three important Microsoft customers (3 or 10, it’s irrelevant because it’s not generally available to all customers).
It is hosted by a third-party company called Green Mountain near Stavanger, which is considerably smaller than an Azure region. This means that it will be small and offer a small subset of typical Azure services.

Let’s Burn Down A Region! (Hypothetically)

What’ll happen when an Azure region is destroyed?

The Disaster

We can push the cause aside for one moment – there are many possible causes and the probability of each varies depending on the country that you are talking about. Certainly, I have discovered, both public and private organisations in some countries genuinely plan for some circumstances that one might consider a Tom Clancy fantasy.

I have heard Microsoft staff and heard of Microsoft staff telling people that we should use availability zones as our form of disaster recovery, not paired regions. What good will an availability zone do me if a missile, fire, flood, chemical disaster, or earthquake takes out my Azure region? Could it be that there might be other motivations for this advice?

Paired Region Failover

Let’s just say that I was using a region with an available pair. In the case of GRS-based services, we will have to wait for Microsoft to trigger a failover. I wonder how that will fare? Do you think that’s ever been tested? Will those storage systems ever have had the load that’s about to be placed on them?

As for your compute, you can forget it. You’re not going to start up/deploy your compute in the paired region. We all know that Azure is bursting at the seams. Everyone has seen quota limits of one kind or another restrict our deployments. The advice from Microsoft is to reserve your capacity – yes, you will need to pre-pay for the compute that you hope you will never need to use. That goes against the elastic and bottomless glass concepts we expect from The Cloud but reality bites – Azure is a business and Microsoft cannot afford to have oodles of compute sitting around “just in case”.

Non-Available Pair Failover

This scenario sucks! Let’s say that you are in Sweden Central or the new, not GA, region in Espoo, Finland. The region goes up in a cloud of dust & smoke, and now you need to get up and running elsewhere. The good news is that stateless compute is easy to bring online anywhere else – as long as there is capacity. But what about all that data? Your Data Lake is based on blob storage and couldn’t replicate anywhere. Your databases are based on blob/file storage and couldn’t replicate anywhere. Azure Backup is based on blob and you couldn’t enable cross-region restore. Unless you chose your storage very carefully, your data is gone along with the data centres.

Resource Groups

This one is fun! Let’s say I deploy some resources in Korea Central. Where will my resource group be? I will naturally pick Korea Central. Now let’s enable DR replication. Some Azure services will place the replica resources in the same resource group.

Now let’s assume that Korea Central is destroyed. My resources are hopefully up and running elsewhere. But the resource IDs of those resources include the resource group that is in Korea Central (the destroyed region), so you will have some problems. According to Microsoft:

If a resource group’s region is temporarily unavailable, you might not be able to update resources in the resource group because the metadata is unavailable. The resources in other regions still function as expected, but you might not be able to update them. This condition might also apply to global resources like Azure DNS, Azure DNS Private Zones, Azure Traffic Manager, and Azure Front Door. You can view which types have their metadata managed by Azure Resource Manager in the list of types for the Azure Resource Graph resources table.

The same article mentions that you should pick an Azure region that is close to you to optimise metadata operations. I would say that if disaster recovery is important, maybe you should pick an Azure region that is independent of both your primary and secondary locations and likely to survive the same event that affects your primary region – if your resource types support it.
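Here is a minimal Python sketch of how you might audit for this. It relies on the standard resource ID layout (/subscriptions/.../resourceGroups/.../providers/...); the subscription ID is a placeholder, and the resource group names, storage account names, and the `stranded_by` helper are all hypothetical.

```python
def resource_group_of(resource_id: str) -> str:
    """Extract the resource group name from an Azure resource ID."""
    parts = resource_id.strip("/").split("/")
    lowered = [p.lower() for p in parts]
    # IDs look like /subscriptions/<sub>/resourceGroups/<rg>/providers/...
    return parts[lowered.index("resourcegroups") + 1]


def stranded_by(failed_region, resources, rg_locations):
    """Resources running elsewhere whose resource group lives in the failed region."""
    return [
        r["id"]
        for r in resources
        if r["location"] != failed_region
        and rg_locations[resource_group_of(r["id"])] == failed_region
    ]


SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"  # placeholder
STORAGE = "providers/Microsoft.Storage/storageAccounts"

# Hypothetical inventory: where each resource group's metadata lives.
rg_locations = {"rg-app": "koreacentral", "rg-dr": "northeurope"}

resources = [
    {"id": f"{SUB}/resourceGroups/rg-app/{STORAGE}/stprimary", "location": "koreacentral"},
    {"id": f"{SUB}/resourceGroups/rg-app/{STORAGE}/streplica", "location": "westeurope"},
    {"id": f"{SUB}/resourceGroups/rg-dr/{STORAGE}/stother", "location": "westeurope"},
]

stranded = stranded_by("koreacentral", resources, rg_locations)
```

In this made-up inventory only the replica in rg-app is flagged: it survives the disaster, but its resource group metadata does not.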

The Solution?

I don’t have one, but I’m thinking about it. Here are a few thoughts:

Where possible, architect workloads where compute is stateless and easy to rebuild (from IaC).
Make sure that your DevOps/GitHub/etc solutions will be available after a disaster if they are a part of your recovery strategy.
Choose data storage types/SKUs/tiers (if you can) that offer replication that is independent of region pairing.
Consider using IaaS for compute. IaaS, by the way, isn’t just simple Windows/Linux VMs. AKS is a form of very complicated IaaS. IaaS has the benefit of being independent of Azure, and can be restored elsewhere.
Use a non-Microsoft backup solution. Veeam for example (thank you Didier Van Hoye, MVP) can restore to Azure, on-premises, AWS, or GCP.

What Do You Think?

I know that there are people in some parts of the world who will think I’ve fallen off something and hit my head 🙂 I get that. But I also know, and it’s been confirmed by recent private discussions, that my musings here are already considered by some markets when adoption of The Cloud is raised as a possibility. Some organisations/countries are forced to think along these lines. Just imagine how silly folks in Ukraine would have felt if they’d deployed all their government and business systems in a local (hypothetical) Azure region without any disaster recovery planning; one of the first things to be at the wrong end of a missile would have been those data centres.

Please use the comments or social media and ping me your thoughts.

Azure Virtual Networks Do Not Exist

In this post, I want to share the most important thing that you should know when you are designing connectivity and security solutions in Microsoft Azure: Azure virtual networks do not exist.

A Fiction Of Your Own Mind

I understand why Microsoft has chosen to use familiar terms and concepts with Azure networking. It’s hard enough for folks who have worked exclusively with on-premises technologies to get to grips with all of the (ongoing) change in The Cloud. Imagine how bad it would be if we ripped out everything they knew about networking and replaced it with something else.

In a way, that’s exactly what happens when you use Azure’s networking. It is most likely very different to what you have previously used. Azure is a multi-tenant cloud. Countless thousands of tenants are signed up and using a single global physical network. If we want to avoid all the pains of traditional hosting and enable self-service, then something different has to be done to abstract the underlying physical network. Microsoft has used VXLAN to create software-defined networking; this means that an Azure customer can create their own networks with address spaces that have nothing to do with the underlying physical network. The Azure fabric tracks what is running where and what NICs can talk to each other, and forwards packets as required.

In Azure, everything is either a physical (rare) or a virtual (most common) machine. This includes all the PaaS resources and even those so-called serverless resources. When you drill down far enough in the platform, you will find a machine with an operating system and a NIC. That NIC is connected to a network of some kind, either an Azure-hosted one (in the platform) or a virtual network that you created.

The NIC Is The Router

The above image is from a slide I use quite often in my Azure networking presentations. I use it to get a concept across to the audience.

Every virtual machine (except for Azure VMware Services) is hosted on a Hyper-V host, and remember that most PaaS services are hosted in virtual machines. In the image, there are two virtual machines that want to talk to each other. They are connected to a common virtual network that uses a customer-defined prefix of 10.0.0.0/8.

The source VM sends a packet to 10.10.1.5. The packet exits the VM’s guest OS and hits the Azure NIC. The NIC is connected to a virtual switch in the host – did you know that in Hyper-V, the switch port is a part of the NIC to enable consistent processing no matter what host the VM is moved to? The virtual switch encapsulates the packet to enable transmission across the physical network – the physical network has no idea about the customer’s prefix of 10.0.0.0/8. How could it? I’d guess that 80% of customers use all or some of that prefix. Encapsulation allows the packet to hide the customer-defined source and destination addresses. The Azure Fabric knows where the customer’s destination (10.10.1.5) is running, so it uses the physical destination host’s address in the encapsulated packet.

Now the packet is free to travel across the physical Azure network – across the rack, data centre, region or even the global network – to reach the destination host. Now the packet moves up the stack, is decapsulated and dropped into the NIC of the destination VM where things like NSG rules (how the NSG is associated doesn’t matter) are processed.

Here’s what you need to learn here:

The packet went directly from source to destination at the customer level. Sure it travelled along a Microsoft physical network but we don’t see that. We see that the packet left the source NIC and arrived directly at the destination NIC.
Each NIC is effectively its own router.
Each NIC is where NSG rules are processed: source NIC for outbound rules and destination NIC for inbound rules.
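The encapsulation step can be modelled in a few lines of Python. This is only a mental model, not how the fabric is implemented: the mapping table, the host names, the virtual network name, and the source address 10.10.1.4 are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str       # customer-defined source address
    dst: str       # customer-defined destination address
    payload: str


@dataclass(frozen=True)
class Encapsulated:
    outer_src: str  # physical host sending the frame
    outer_dst: str  # physical host receiving the frame
    inner: Packet   # the customer's packet, untouched


# Hypothetical fabric mapping: (virtual network, customer IP) -> physical host.
FABRIC_MAP = {
    ("vnet-a", "10.10.1.4"): "host-17",
    ("vnet-a", "10.10.1.5"): "host-42",
}


def encapsulate(vnet: str, packet: Packet) -> Encapsulated:
    """Wrap the customer packet with the physical hosts looked up in the map."""
    return Encapsulated(
        outer_src=FABRIC_MAP[(vnet, packet.src)],
        outer_dst=FABRIC_MAP[(vnet, packet.dst)],
        inner=packet,
    )


def decapsulate(frame: Encapsulated) -> Packet:
    """Unwrap at the destination host and hand the original packet to the NIC."""
    return frame.inner


sent = Packet(src="10.10.1.4", dst="10.10.1.5", payload="hello")
on_the_wire = encapsulate("vnet-a", sent)
received = decapsulate(on_the_wire)
```

The physical network only ever sees host-17 and host-42; the customer only ever sees 10.10.1.4 and 10.10.1.5.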

The Virtual Network Does Not Exist

Have you ever noticed that every Azure subnet has a default gateway that you cannot ping?

In the above example, no packets travelled across a virtual network. There were no magical wires. Packets didn’t go to a default gateway of the source subnet, get routed to a default gateway of a destination subnet and then to the destination NIC. You might have noticed in the diagram that the source and destination were on different peered virtual networks. When you peer a virtual network, an operator is not sent sprinting into the Azure data centres to install patch cables. There is no mysterious peering connection.

This is the beauty and simplicity of Azure networking in action. When you create a virtual network, you are simply stating:

Anything connected to this network can communicate with each other.

Why do we create subnets? In the past, subnets were for broadcast control. We used them for network isolation. In Azure:

We can isolate items from each other in the same subnet using NSG rules.
We don’t have broadcasts – they aren’t possible.

Our reasons for creating subnets are greatly reduced, and so are our subnet counts. We create subnets when there is a technical requirement – for example, an Azure Bastion requires a dedicated subnet. We should end up with much simpler, smaller virtual networks.

How To Think of Azure Networks

I cannot say that I know how the underlying Azure fabric works. But I can imagine it pretty well. I think of it simply as a mapping system. And I explain it using Venn diagrams.

Here’s an example of a single virtual network with some connected Azure resources.

Connecting these resources to the same virtual network is an instruction to the fabric to say: “Let these things be able to route to each other”. When the app service (with VNet Integration) wants to send a packet to the virtual machine, the NIC on the source VM will send the packets directly to the NIC of the destination VM.

Two more virtual networks, blue and green, are created. Note that none of the virtual networks are connected/peered. Resources in the black network can talk only to each other. Resources in the blue network can talk only to each other. Resources in the green network can talk only to each other.

Now we will introduce some VNet peering:

Black <> Blue
Black <> Green

As I stated earlier, no virtual cables are created. Instead, the fabric has created new mappings. These new mappings enable new connectivity:

Black resources can talk with blue resources
Black resources can talk with green resources.

However, green resources cannot talk directly to blue resources – this would require routing to be enabled via the black network with the current peering configuration.
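The Venn-diagram view maps neatly onto a set of pairs. Here is a minimal Python sketch of the black/blue/green example; the `can_talk` function is my own illustration of the idea, not anything exposed by Azure.

```python
# Each peering is an unordered pair of virtual networks.
PEERINGS = {
    frozenset({"black", "blue"}),
    frozenset({"black", "green"}),
}


def can_talk(vnet_a: str, vnet_b: str, peerings=PEERINGS) -> bool:
    """True if the two networks are the same network or are directly peered."""
    return vnet_a == vnet_b or frozenset({vnet_a, vnet_b}) in peerings


black_to_blue = can_talk("black", "blue")
black_to_green = can_talk("black", "green")
green_to_blue = can_talk("green", "blue")  # no direct peering exists
```

Note that the function never follows a chain of peerings: a mapping either exists for the pair or it does not.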

I can implement isolation within the VNets using NSG rules. If I want further inspection and filtering from a firewall appliance then I can deploy one and force traffic to route via it using BGP or User-Defined Routing.
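Forcing traffic through a firewall works because a more specific route wins over a less specific one. As far as I understand it, Azure selects the route with the longest matching prefix, and the Python sketch below shows that selection rule using only the standard library. The prefixes, the firewall address 10.0.1.4, and the next-hop labels are made up for the example.

```python
import ipaddress

# Hypothetical effective routes for one NIC: (prefix, next hop).
ROUTES = [
    ("10.0.0.0/8", "VirtualNetwork"),      # stands in for a default system route
    ("10.2.0.0/16", "firewall 10.0.1.4"),  # stands in for a user-defined route
    ("0.0.0.0/0", "Internet"),
]


def next_hop(destination: str, routes=ROUTES) -> str:
    """Return the next hop of the matching route with the longest prefix."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if address in ipaddress.ip_network(prefix)
    ]
    return max(matches, key=lambda match: match[0].prefixlen)[1]


via_firewall = next_hop("10.2.3.4")  # matches /8 and /16, the /16 wins
direct = next_hop("10.9.9.9")        # only the /8 and /0 match, the /8 wins
outbound = next_hop("8.8.8.8")       # only the default route matches
```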

Wrapping Up

The above simple concept is, I think, the biggest barrier that many people have when it comes to good Azure network design. If you grasp the facts that virtual networks do not exist and that packets route directly from source to destination, and you can apply those two facts, then you are well on your way to designing well-connected/secured networks and being able to troubleshoot them.

If You Liked This Article

If you liked this article, then why don’t you check out my custom Azure training with my company, Cloud Mechanix. My next course is Azure Firewall Deep Dive, a two day virtual course where I go through how to design and implement Azure Firewall, including every feature. This two day course runs on February 12/13, timed for (but not limited to) European attendees.

Will International Events Impact Cloud Computing

You must have been hiding under a rock if you haven’t noticed how cloud computing has become the default in IT. I have started to wonder about the future of cloud computing. Certain international events have the potential to disrupt cloud computing in a major way. I’m going to play out two scenarios in this post and illustrate what the possible problems may be.

Bear In The East

Russia expanded their conflict with Ukraine in February 2022. This was the largest signal so far that the leadership of Russia wanted to expand their post-Soviet borders to include some of the former USSR nations. The war in Ukraine is taking much longer than expected and has eaten the Russian military, thanks to the determination of the Ukrainian people. However, we know that Russia has eyes elsewhere.

The Baltic nations (Lithuania, Latvia and Estonia) provide a potential land link between Russia and the Baltic Sea. North of those nations is Finland, a country with a long & wild border with Russia – and also one with a history of conflict with Russia. Finland and Sweden have recognised the potential of this expanded threat by joining NATO.

If you read “airport thrillers” like me, then you’ll know that Sweden has an island called Gotland in the Baltic Sea. It plays a huge strategic role in controlling that sea. If Russia were to take that island, they could prevent resupply via the Baltic Sea to the Baltic countries and Finland, leaving only air, land, and the long route up North – speaking of which …

Norway also shares a land border with Russia to the north of Finland. The northern Norwegian coast faces the main route from Murmansk (a place I attacked many times when playing the old Microprose F-19 game). Murmansk is the home of the Russian Northern Fleet. Their route to the Atlantic is north of the Norwegian coast and south between Iceland and Ireland.

In the Arctic is Svalbard, a group of islands that is host to polar bears and some pretty tough people. These islands are also eyed up by Russia – I’m told that it’s not unusual to hear stories of some kind of espionage there.

So Russia could move west and attack. What would happen then?

Nordic Azure Regions

There are several Azure regions in the Nordics:

Norway East, paired with Norway West
Sweden Central, paired with Sweden South
One is “being built” in Espoo, Finland, just outside the capital of Helsinki.

Norway West is a small facility that is hosted in a third-party data centre and is restricted to a few customers.

I say “being built” with the Finnish region because I suspect that it’s been active for a while with selected customers. Not long after the announcement of the region (2022) I had a nationally strategic customer tell me that the local Microsoft data centre salesperson was telling them to stop deploying in Azure West Europe (Netherlands) and to start using the new Finnish region.

FYI: the local Microsoft data centre salesperson has a target of selling only the local Azure region. The local subsidiary has to make a usage commitment to HQ before a region is approved. Adoption in another part of Azure doesn’t contribute to this target.

I remember this conversation because it was not long after tanks rolled into Ukraine and talk of Finland joining NATO began heating up. I asked my customer: “Let’s say you place nationally critical services into the new Finnish region. What is one of the first things that Russia will send missiles to?” Yes, they will aim to shut down any technology and communications systems first … including Azure regions. All the systems hosted in Espoo will disappear in a flaming pile of debris. I advised the customer that if I were them, I would continue to use cloud regions that were as far away as possible while still meeting legal requirements.

Norway’s situation is worse. Their local and central governments have to comply with a data placement law, which prevents the placement of certain data outside of Norway. If you’re using Azure, you have no choice, you must use Norway East, which is in urban Oslo (the capital on the south coast). Private enterprises can choose any of the European regions (they typically take West Europe/Netherlands, paired with North Europe/Ireland) so they have a form of disaster recovery (I’ll come back to this topic later). However, Norway East users cannot replicate into Norway West – the Stavanger-located region is only available to a select (allegedly) three customers and it is very small.

FYI: restricted access paired regions are not unusual in Azure.

Disaster Recovery

So a hypersonic missile just took out my Azure region – what do I do next? In an ideal world, all of your data was replicated in another location. Critical systems were already built with redundant replicas. Other systems can be rebuilt by executing pipelines with another Azure region selected.

Let’s shoot all of that down, shall we?

So I have used Norway East. And I’ve got a bunch of PaaS data storage systems. Many of those storage systems (Azure Backup Recovery Services vaults, for example) are built on blob storage. Blob storage offers geo-redundancy which is restricted to the paired region. If my data storage can only replicate to the paired region and there is no paired region available to me, then there is no replication option. You will need to bake your own replication system.

Some compute/data resource types offer replication in any region. For example, Cosmos DB can replicate to other regions but that comes with potential sync/latency issues. Azure VMs offer Azure Site Recovery which enables replication to any region. This is where I expect the “cloud native” types to shout “GitOps!” but they always seem to focus only on compute and forget things like data – no we won’t be putting massive data stores in an AKS container 🙂

Has anyone not experienced capacity issues in an Azure region in the last few years? There are probably many causes for that so we won’t go down that rabbit hole. But a simple task of deploying a new AVD worker pool or a firewall with zone resilience commonly results in a failure because the region doesn’t have capacity. What would happen if Norway East disappeared and all of the tenants started to failover/redeploy to other European regions? Let’s just say that there would be massive failures everywhere.

Orange Man In The West

Greenland is an autonomous territory of the Kingdom of Denmark, and Denmark is a member of the EU. US president-elect, Donald Trump, has been sabre-rattling about Greenland recently. He wants the US to take it over by either economic (trade war) or military means.

If the USA goes into a trade war with Denmark, then it will go into a trade war with all of the EU. Neither side will win. If the tech giants continue to personally support Donald Trump then I can imagine the EU retaliating against them. Considering that Microsoft, Amazon, and Google are American companies, sanctions against those companies would be bad – the cost of cloud computing could rocket and make it unviable.

If the USA invaded Greenland (a NATO ally by virtue of being a Danish territory) then it would lead to a very unpleasant situation between NATO/EU and the USA. One could imagine that American companies would be shunned, not just emotionally but also legally. That would end Azure, AWS, and Google in the EU.

So how would one recover from losing their data and compute platform? It’s not like you can just live migrate a petabyte data lake or a workload based on Azure Functions.

The Answer

I don’t have a good answer. I know of an organisation that had an “only do VMs in Azure” policy. I remember being dumbfounded at the time. They explained that it was for support reasons. But looking back on it, they abstracted themselves from Azure by use of an operating system. They could simply migrate/restore their VMs to another location if necessary – on-prem, another cloud, another country. They are not tied to the cloud platform, the location, or the hardware. But they do lose so many of the benefits of using the cloud.

I expect someone will say “use on-prem for DR”. OK, so you’ll build a private cloud, at huge expense and let it sit there doing nothing on the off-chance that it might be used. If I was in that situation then I wouldn’t be using Azure/etc at all!

I’ve been wondering for a while if the EU could fund/sponsor the creation of an IT sector in Europe that is independent from the USA. It would need an operating system, productivity software, and a cloud platform. We don’t have any tech giants as big or as cash rich as Microsoft in the EU so this would have to be sponsored. I also think that it would have to be a collaboration. My fear is that it would be bogged down in bureaucracy and have a heavy Germany/France first influence. But I am looking at the news every day and realising that we need to consider a non-USA solution.

Wrapping Up

I’m all doom and gloom today. Maybe it’s all of the negativity in the news that is bringing me down. I see continued war in Ukraine, Russia attacking infrastructure in the Baltic Sea, and threats from the USA. The world has changed and we all will need to start thinking about how we act in it.

Manage Existing Azure Firewall With Firewall Policy Using Bicep

In this post, I want to discuss how I recently took over the management of an existing Azure Firewall using Firewall Policy/Azure Firewall Manager and Bicep.

Background

We had a customer set up many years ago using our old templated Azure deployment based on ARM. At the centre of their network is Azure Firewall. That firewall plays a big role in the customer’s micro-segmented network, with over 40,000 lines of ARM code defining the many firewall rules.

The firewall was deployed before Azure Firewall Manager (AFM) was released. AFM is a pretty GUI that enables the management of several Azure networking resource types, including Azure Firewall. But when it comes to managing the firewall, AFM uses a resource called Firewall Policy; you don’t have to touch AFM at all – you can deploy a Firewall Policy, link the firewall to it (via Resource ID), and edit the Firewall Policy directly (Azure Portal or code) to manage the firewall settings or rules.

One of the nicest features of Azure Firewall is a result of it being an Azure PaaS resource. Like every other resource type (there are exceptions sometimes) Azure Firewall is completely manageable via code. Not only can you deploy the firewall, you can also operate it on a day-to-day basis using ARM/Bicep/Terraform/Pulumi if you want: the settings and the firewall rules. That means you can have complete change control and rollback using the features of Git in DevOps, GitHub, etc.

All new features in Azure Firewall have surfaced only via Firewall Policy since the general availability release of AFM. A legacy Azure Firewall that doesn’t have a Firewall Policy is missing many security and management features. The team that works regularly with this customer approached me about adding Firewall Policy to the customer’s deployment and including that in the code.

The Old Code

As I said before, the old code was written in ARM. I won’t get into it here, but we couldn’t add the required code to do the following without significant risk:

A module for Firewall Policy
Updating the module for Azure Firewall to include the link to the Firewall Policy.

I got a peer to give me a second opinion and he agreed with my original assessment. We should:

Create a new set of code to manage the Azure Firewall using Bicep.
Introduce Firewall Policy via Bicep.
Remove the ARM module for Azure Firewall from the ARM code.
Leave the rest of the hub as is (ARM) because this is a mission-critical environment.

The High-Level Plan

I decided to do the following:

Set up a new repo just for the Azure Firewall and Firewall Policy.
Deploy the new code in there.
Create a test environment and test like crazy there.
The existing Azure Firewall public IP could not change because it was used in DNAT rules and by remote parties in their firewall rules.
We agreed that there should be “no” downtime in the process but I wanted time for a rollback just in case. I would create non-parameterised ARM exports of the entire hub, the GatewaySubnet route table (critical to routing intent and a risk point in this kind of work), and the Azure Firewall. Our primary rollback plan would be to run the un-modified ARM code to restore everything as it was.

The Build

I needed an environment to work in. I did a non-parameterised export of the hub, including the Azure Firewall. I decompiled that to Bicep and deployed it to a dedicated test subscription. This did require some clean-up:

The public IP of the firewall would be different so DNAT rules would need a new destination IP.
Every rules collection group (many hundreds of them) had a resource ID that needed to be removed – see regex searches in Visual Studio Code.
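I did that clean-up with regex searches, but the same idea can be scripted. Here is a minimal Python sketch that removes every "id" key from a decoded JSON structure. The shape of the sample export is made up for the example and is not the real export format; in a real template you would apply this only to the rules subtree, because other "id" values (subnets, public IPs) are genuine references.

```python
import json


def strip_ids(node):
    """Recursively drop every "id" key from dicts nested inside dicts/lists."""
    if isinstance(node, dict):
        return {key: strip_ids(value) for key, value in node.items() if key != "id"}
    if isinstance(node, list):
        return [strip_ids(item) for item in node]
    return node


# Hypothetical fragment standing in for the exported rules.
exported = json.loads("""
{
  "ruleCollections": [
    {
      "name": "rc-web",
      "id": "/hypothetical/resource/id/rc-web",
      "rules": [{"name": "allow-https", "id": "/hypothetical/rule/id"}]
    }
  ]
}
""")

cleaned = strip_ids(exported)
```

The function returns a new structure and leaves the original untouched, which makes it easy to diff the two before deploying.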

The deployment into the test environment was a two-stage job – I needed the public IP address to obtain the destination address value for the DNAT rules.

Now I had a clone of the production environment, including all the settings and firewall rules.

The Bicep Code

I’ve been doing a lot of Bicep since the Spring of this year (2024). I’ve been using Azure Verified Modules (AVM) since early Summer – it’s what we’ve decided should be our standard approach, emulating the styling of Azure Verified Solutions.

We don’t use Microsoft’s landing zones. I have dug into them and found a commonality. The code is too impressive. The developer has been too clever. Very often, “customer configuration” is hard-coded into the Bicep. For example, the image template for Azure Image Builder (in the AVD landing zone) is broken up across many variables which are unioned until a single variable is produced. The image template is a file that should be easy to get at and commonly updated.

A managed service provider knows that architecture (the code) should be separated from customer configuration. This allows the customer configuration to be frequently updated separately from the architecture. And, in turn, it should be possible to update the architecture without having to re-import the customer configuration.

My code design is simple:

Main.bicep which deploys the Azure Firewall (AVM) and the Firewall Policy (AVM).
A two-property parameter controls the true/false (bool) condition of whether or not the two resources are deployed.
A main.bicepparam supplies parameters to configure the SKUs/features/settings of the Azure Firewall and Firewall Policy using custom types (enabling complete Intellisense in VS Code).
A simple module documents the Rules Collections in a single array. This array is returned as an output to main.bicep and fed as a single value to the Firewall Policy module.
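
The effect of the two-property bool parameter can be modelled outside of Bicep. This is a hedged sketch in Python, not the Bicep itself: the class and property names are invented, and the only thing it shows is how two flags decide which resources a pipeline run touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployToggles:
    """Hypothetical stand-in for the two-property bool parameter."""
    firewall_policy: bool = False
    azure_firewall: bool = False


def planned_resources(toggles: DeployToggles) -> list[str]:
    """List the resources a run would touch for the given toggles."""
    planned = []
    if toggles.firewall_policy:
        planned.append("Firewall Policy")
    if toggles.azure_firewall:
        planned.append("Azure Firewall")
    return planned


# Three possible runs: nothing enabled, policy only, then both resources
stages = [
    DeployToggles(),
    DeployToggles(firewall_policy=True),
    DeployToggles(firewall_policy=True, azure_firewall=True),
]
for stage in stages:
    print(planned_resources(stage) or "nothing deployed")
```

A run with both flags off is still useful because it exercises the pipeline without changing anything.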

I did attempt to document the Rules Collections as ARM and use the Bicep function to load an ARM file. This was my preference because it would simplify producing the firewall rules from the Azure Portal and inputting them into the file, both for the migration and for future operations. However, the Bicep function to load a file is limited to too few characters. The eventual Rules Collection Group module had over 40,000 lines!

My test process eventually gave me a clean result from start to finish.

The Migration

The migration was scheduled for late at night. Earlier in the afternoon, a freeze was put in place on the firewall rules. That enabled me to:

Use Azure Firewall Manager to start the process of producing a Firewall Policy. I chose the option to import the rules from the existing production firewall. I then clicked the link to export the rules to ARM and saved the file locally.
I decompiled the ARM code to Bicep. I copied and pasted the 3 Rules Collection Groups into my Rules Collection Group module.
I then ran the deployment with no resources enabled. This told me that the pipeline was functioning correctly against the production environment.
When the time came, I made my “backups” of the production hub and firewall.
I updated the parameters to enable the deployment of the Firewall Policy. That was a quick run – the Azure Firewall was not touched so there was no update to the Firewall. This gave me one last chance to compare the firewall settings and rules before the final steps began.
I removed the DNS settings from the Azure Firewall. I found in testing that I could not attach a Firewall Policy to an Azure Firewall if both contained DNS settings. I had to remove those settings from the production firewall. This could have caused some downtime to any clients using the firewall as their DNS server but the feature is not rolled out yet.
I updated the parameters to enable management of the Azure Firewall. The code here included the name of the in-place Public IP Address. The parameters also included the resource IDs of the hub Virtual Network and the Log Analytics Workspace (Resource Specific tables in the code). The pipeline ran … this was the key part because the Bicep code was updating the firewall with the resource ID of the Firewall Policy. Everything worked perfectly … almost … the old diagnostics settings were still there and had to be removed because the new code used a new naming standard. One quick deletion and a re-run and all was good.
One of my colleagues ran a bunch of pre-documented and pre-verified tests to confirm that all was well.
I then commented out the code for the Azure Firewall from the old ARM code for the hub. I re-ran the pipeline and cleaned up some errors until we had a repeated clean run.

The technical job was done:

Azure Firewall was managed using a Firewall Policy.
Azure Firewall had modern diagnostics settings.
The configuration is being done using code (Bicep).

You might say “Aidan, there’s a PowerShell script to do that job”. Yes there is, but it wasn’t going to produce the code that we needed to leave in place. This task did the work and has left the customer with code that is extremely flexible with every resource property available as a mandatory/optional property through a documented type specific to the resource type. As long as no bugs are found, the code can be used as is to configure any settings/features/rules in Azure Firewall or Azure Firewall Manager either through the parameters files (SKUs and settings) or the Rules Collection Groups module (firewall rules).

Azure Firewall Deep Dive Training

If you thought that this post was interesting then please do check out my Azure Firewall Deep Dive course that is running on February 12th – February 13th, 2025 from 09:30-16:00 UK/Irish time/10:30-17:00 Amsterdam/Berlin time. I’ve run this course twice in the last two weeks and the feedback has been super.

Azure Firewall Deep Dive Training

I’ll tell you about my new virtual training course on Azure Firewall and share some schedule information in this post.

Background

I’ve been talking about Azure Firewall for years. I’ve done lots of sessions at user groups and conferences. I’ve done countless handovers with customers and colleagues. One of my talking points is that I reckoned that I could teach someone with a little Azure/networking knowledge everything there is to know about Azure Firewall in 2 days. And that’s what I decided to do!

I was updating one of my sessions earlier in the year when I realised that it was pretty much the structure of a training course. Instead of me just listing out features or barely discussing architecture to squeeze it into a 45-60 minute-long session, I could take the time to dive deep and share all that I know or could research.

The Course

I produced a 2-day course that could be taught in-person, but my primary vector is virtual/online – it’s hard to get a bunch of people from all over into one place and there is also a cost to me in hosting a physical event that would increase the cost of the course. I decided that virtual was best, with an option of doing it in person if a suitable opportunity arose.

The course content is delivered using a combination of presentation and demo. Presentation lets me explain the what’s, why’s and so on. Demonstration lets me show you how.

The demo lab is built from a Bicep deployment, based on Azure Verified Modules (AVM). A hub & spoke network architecture is created with an Application Gateway, a simple VM workload, and a simple App Services (Private Endpoint) workload. The demonstrations follow a “hands-on guide”; this guide is written as if this was a step-by-step hands-on course, instructing the reader exactly which button to click and what/where to type. Each exercise builds on the last, eventually resulting in a secure network architecture with all of the security, monitoring, and management bells and whistles.

Why did I opt for demo instead of hands-on? Hands-on works for in-person classes, but online you cannot assist in the same way when people struggle. In addition, waiting for attendees to complete labs would add another day (and cost) to the class.

Before the class, I share all of the content that I use:

System requirements and setup instructions.
The Bicep files for the demo lab.
The hands-on lab instructions
The PowerPoint
And a few more useful bits

I always update content – for example, my first run of this class was during Microsoft Ignite 2024 and I added a few bits from the news. Therefore I share the updated content with attendees after the course.

The First Run

I ran the class for the first time earlier this week, November 20-21 2024. Attendees from all around Europe joined me for 2 days. At first they were quiet. Online is tough for speakers like me because I look for visual feedback on how I’m doing. But then the questions started coming – people were interested in what I was saying. Interaction also makes the class more interesting for me – sometimes you get comments that cover things you didn’t originally include and everyone benefits – I updated the course with one such item at the end of day 1!

I shared a 4-question anonymous survey to learn what people thought. The feedback was awesome.

Feedback

This course was previously run in November 2024 for a European audience. The survey feedback was as follows:

How would you rate this course?

Excellent: 83%
Good: 17%

Was This Course Worth Your Time?

Yes: 100%

Would you recommend this course to others?

Yes: 100%

Some of the comments:

“I think it is a very good introduction to Azure Firewall, but it goes beyond foundational concepts so medium- experienced admins will also get value from this. I like the sections on architecture and explanations of routing and DNS. I think this course will enable people to do a good job more than for example az 700 because of the more practical approach. You are good at explaining the material”.

“Just what I wanted from a Deep dive course.”

“Perfectly delivered. Crystal clear content and very well explained”.

Future Classes

I have this class scheduled for two more runs, each timed for different parts of the world:

The classes are ultra-affordable. A few hundred Euros/dollars gets you custom content based on real-world usage. I did find a virtual 2-day course on Palo Alto firewalls that cost $1700! You’ll also find that I run early-bird registration costs and discounts for more than 1 booking. If you have a large group (5+) then we might be able to figure out a lower rate 🙂

More To Come

More classes are coming! I have an old one to reinvent based on lots of experience over the years and at least 1 new one to write from scratch. Watch out for more!

Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR]connection error: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR]connection error: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded
[ERROR] SSH connection err: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

The virtual machine should not be reachable on the Internet.
Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft which told me to allow TCP 60000-60001 from Azure Load Balancer to Virtual Network:

I also added my customary DenyAll rule at priority 4000 – the built-in Deny rule doesn’t deny all that much!

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

Deploy Log Analytics and a Storage Account for logs
Enable VNet Flow Logs with Traffic Analytics on the staging subnet (where the build is happening) NSG
Recreate the problem (do a new build)
Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:

Now when I test I see the status switch from Running, to Distributing, to Success.

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I need to create rules to allow that traffic.
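
The priority logic behind this can be sketched with a tiny rule evaluator. This is only a model in Python, not how Azure implements NSGs. The proxy and build VM addresses are taken from the error messages above, while the container address, the 10.0.0.0/8 stand-in for the VirtualNetwork tag, the rule names, and the priorities 200/210 are all assumptions for illustration. The Azure Load Balancer rule is left out to keep it short.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int         # lower number is evaluated first
    source: str           # CIDR prefix
    destination: str      # CIDR prefix
    port: Optional[int]   # None means any port
    allow: bool


def evaluate(rules, src, dst, port):
    """Return (rule name, allowed) for the first rule that matches the flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (ip_address(src) in ip_network(rule.source)
                and ip_address(dst) in ip_network(rule.destination)
                and rule.port in (None, port)):
            return rule.name, rule.allow
    return "no match", False


VNET = "10.0.0.0/8"                      # assumed stand-in for the VirtualNetwork tag
CONTAINER = "10.0.2.4"                   # hypothetical container address
PROXY, BUILD_VM = "10.0.1.4", "10.1.10.9"  # addresses seen in the errors

default = Rule("AllowVnetInBound", 65000, VNET, VNET, None, True)
deny_all = Rule("DenyAll", 4000, "0.0.0.0/0", "0.0.0.0/0", None, False)
fixes = [
    Rule("AllowContainerToProxy", 200, CONTAINER + "/32", PROXY + "/32", 60000, True),
    Rule("AllowProxyToBuildVm", 210, PROXY + "/32", BUILD_VM + "/32", 5986, True),
]

print(evaluate([default], CONTAINER, PROXY, 60000))
print(evaluate([default, deny_all], CONTAINER, PROXY, 60000))
print(evaluate([default, deny_all] + fixes, CONTAINER, PROXY, 60000))
print(evaluate([default, deny_all] + fixes, PROXY, BUILD_VM, 5986))
```

With only the default rule the flow is allowed, with DenyAll added it is blocked at priority 4000, and the two explicit allow rules restore it. For a Linux build the second rule would use port 22 instead of 5986.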

Lots of Speaking Activity

After a quiet few pandemic years with no in-person events and the arrival of twins, my in-person presentation activity was minimal. My activity has started to increase, and there have been plenty of recent events and more are scheduled for the near future.

The Recent Past

Experts Live Netherlands 2024

It was great to return to The Netherlands to present at Experts Live Netherlands. Many European IT workers know the Experts Live brand; Isidora Maurer (MVP) has nurtured & shepherded this conference brand over the years, starting with a European-wide conference and then working with others to branch it out to localised events that give more people a chance to attend. I presented at this event a few years ago but personal plans prevented me from submitting again until this year. And I was delighted to be accepted as a speaker.

Hosted in Nieuwegein, just a short train ride from Schiphol airport in Amsterdam (Dutch public transport is amazing) the conference featured a packed expo hall and many keen attendees. I presented my “Azure Firewall: The Legacy Firewall Killer” session to a standing-room-only room.

TechMentor Microsoft HQ 2024

The first conference I attended was WinConnections 2004 in Lake Las Vegas. That conference changed my career. I knew that TechMentor had become something like that – the quality of the people I knew who were presenting at the event in the past was superb. I had the chance to submit some sessions this time around and was happy to have 3 accepted, including a pre-conference all-day seminar.

I worked my tail off on that “pre-con”. It’s an expansion of one of my favourite sessions that many events are scared of, probably because they think it’s too niche or too technical: “Routing – The Virtual Cabling of Secure Azure Networking”. Expanding a 1 hour session to a full day might seem daunting but I had to limit how much content I included! Plus I had to make this a demo session. I worked endless hours on a Bicep deployment to build a demo lab for the attendees. This was necessary because it would take too long to build by hand. I had issues with Azure randomly failing and with version stuff changing inside Microsoft. As one might expect, the demo gods were not kind on the day and I quickly had to pivot from hands-on labs to demos. While the questions were few during the class, there were lots of conversations during the breaks and even on the following days.

My second session was “Azure Firewall: The Legacy Firewall Killer” – this is a popular session. I like doing this topic because it gives me a chance to crack a few jokes – my family will groan at that thought!

My final session was the one that I was most worried about. “Your Azure Migration Project Is Doomed To FAIL” was never accepted by any event before. I think the title might seem negative but it’s meant to be funny. The content is based on my experience dealing with mid-large organisations who never quite understand the difference between cloud migration and cloud adoption. I explain this through several fictional stories. There is liberal use of images from Unsplash and opportunities to make some laughter. This had been the session that I was least confident in, but it worked.

TechMentor puts a lot of effort into mixing the attendees and the presenters. On the first night, attendees and presenters went to a local pizza place/bar and sat in small booths. We had to talk to each other. The booth that I was at featured people from all over the USA with different backgrounds. People came and went, but we talked and were the last to leave. On the second day, lunch was an organised affair where each presenter was host to a table. Attendees could grab lunch and sit with a presenter to discuss what was on their minds. I knew that migrations were a hot topic. And I also knew that some of those attendees were either doing their first migration or first re-attempt at a migration. I was able to tune my session a tiny bit to the audience and it hit home. I think the best thing about this was the attention I saw in the room, the verbal feedback that I heard just after the session, and the folks who came up to talk to me after.

A Break

I brought my family to the airport the day before I flew to TechMentor. They were going to Spain for 4 weeks and I joined them a few days later after a l-o-n-g Seattle-Los Angeles-Dublin-Alicante journey (I really should have stayed one extra night in Seattle and taken the quicker 1-hop via Iceland).

33+ Celsius heat, sunshine, a pool, a relaxed atmosphere in a residential town (we didn’t go to a “hotel town”) was a great place to work for a week and then do two weeks of vacation.

I went running most mornings, doing 5-7KMs. I enjoy getting up early in places like this, picking a route to explore on a map, and hitting the streets to discover the locality and places to go with my family. It’s so different to home where I have just two routes with footpaths that I can use.

Coming home was a shock. Ireland isn’t the sunniest or the warmest place in the world, but it feels like mid-winter at the moment. I think I acclimatised to Spain as much as a pasty Irish person can. This morning I even had to put a jacket on and do a couple of KMs to wait for my legs to warm up before picking up the pace.

Upcoming Events

There are three confirmed events coming up:

Nieuwegein (Netherlands) September 11: Azure Fest 2025

I return to this Dutch city in a few days to do a new session “Azure Virtual Network Manager”. I’ve been watching this product develop since the private preview. It’s not quite ready (pricing is hopefully being fixed) but it could be a complete game changer for managing complex Azure networks for secure/compliant PaaS and IaaS deployments. I’ll discuss and demo the product, sharing what I like and don’t like.

Dublin (Ireland) October 7: Microsoft Azure Community & AI Day

Organised by Nicolas Chang (MVP) this event will feature a long list of Irish MVPs discussing Azure and AI in a rapid-fire set of short sessions. I don’t think that the event page has gone live yet so watch out for it. I will be presenting the “Azure Virtual Network Manager” again at this event.

TBA: Nordics

I’ve confirmed my speaking slots for 2 sessions at an event that has not publicly announced the agenda yet. I look forward to heading north and sharing some of my experiences.

My Sessions

If you are curious, then you can see my Sessionize public profile here, which is where you’ll see my collection of available sessions.

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are now many more networks and subnets than there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

Network groups: A way to identify virtual networks or subnets that we want to manage.
Connectivity configurations: Manage how multiple virtual networks are connected.
Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
Routing configurations: Deploy user-defined routes by policy.
Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

Identify a collection of networks/subnets you want to configure by creating a Network Group.
Build a configuration, such as connectivity, security admin rules, or routing.
Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Part of a scalable configuration feature of AVNM is network groups. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have a large collection of targets for one configuration deployment.

Network Groups can be:

Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?
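
The idea of dynamic membership can be modelled as a simple filter over an inventory of networks. This is a sketch in Python only: in AVNM the query is an Azure Policy definition, not code like this, and the tag name, network names, and regions below are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualNetwork:
    name: str
    region: str
    tags: dict = field(default_factory=dict)


def dynamic_members(vnets, required_tags, regions):
    """Names of the virtual networks that currently satisfy the group query."""
    return [
        vnet.name
        for vnet in vnets
        if vnet.region in regions
        and all(vnet.tags.get(key) == value for key, value in required_tags.items())
    ]


# Hypothetical inventory and query
inventory = [
    VirtualNetwork("vnet-app1", "northeurope", {"network-group": "spokes"}),
    VirtualNetwork("vnet-app2", "westeurope", {"network-group": "spokes"}),
    VirtualNetwork("vnet-sandbox", "northeurope"),
]
query = {"required_tags": {"network-group": "spokes"}, "regions": {"northeurope"}}

before = dynamic_members(inventory, **query)

# A developer creates a new network and governance tags it
inventory.append(VirtualNetwork("vnet-app3", "northeurope", {"network-group": "spokes"}))
after = dynamic_members(inventory, **query)

print(before)
print(after)
```

Nobody edits the group itself: the new network becomes a member because it matches the query, which is the property that makes dynamic groups attractive in an agile environment.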

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
Full mesh: Every virtual network is connected directly to every other virtual network.
Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! Mesh uses something different, called a Connected Group, that can handle the scale of many virtual networks being interconnected. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.
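
To see why 1-1 peering would not scale for a mesh, it helps to count the connections. The arithmetic below is a quick sketch in Python. It counts network pairs, and with classic peering each pair needs a peering resource on both sides, so the number of resources is double the figure shown. The function names and the example sizes are my own.

```python
def hub_and_spoke_pairs(spokes: int) -> int:
    """Each spoke is connected only to the hub."""
    return spokes


def full_mesh_pairs(vnets: int) -> int:
    """Every virtual network is connected to every other one: n(n-1)/2 pairs."""
    return vnets * (vnets - 1) // 2


for count in (10, 50):
    print(count, hub_and_spoke_pairs(count), full_mesh_pairs(count))
# 10 networks: 10 hub connections versus 45 mesh pairs
# 50 networks: 50 hub connections versus 1225 mesh pairs
```

The hub-and-spoke count grows in a straight line while the mesh count grows with the square of the number of networks.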

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs but they are always processed first. You can create a rule or a set of rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.
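
"Processed first" can be modelled in a few lines. This Python sketch is a simplification and not the platform's implementation: it uses the three security admin rule actions (Allow, Always Allow, Deny) and reduces the NSG side to a single yes/no answer. The function and enum names are my own.

```python
from enum import Enum
from typing import Optional


class AdminAction(Enum):
    ALLOW = "Allow"                # passes the flow on to NSG evaluation
    ALWAYS_ALLOW = "AlwaysAllow"   # allows the flow and skips NSG evaluation
    DENY = "Deny"                  # drops the flow; NSGs are never consulted


def final_verdict(admin_action: Optional[AdminAction], nsg_allows: bool) -> bool:
    """admin_action is None when no security admin rule matched the flow."""
    if admin_action is AdminAction.DENY:
        return False
    if admin_action is AdminAction.ALWAYS_ALLOW:
        return True
    # Allow, or no matching admin rule: the NSG has the final say
    return nsg_allows


print(final_verdict(AdminAction.DENY, nsg_allows=True))
print(final_verdict(AdminAction.ALWAYS_ALLOW, nsg_allows=False))
print(final_verdict(AdminAction.ALLOW, nsg_allows=False))
```

A Deny from the central team cannot be undone by a local NSG rule, and an Always Allow cannot be blocked by one, which is why no juggling of NSG priorities is needed.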

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.
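
One way that a BGP-propagated route can be overridden comes down to how Azure picks a route: the longest matching prefix wins and, for identical prefixes, a user-defined route is preferred over a BGP route, which is preferred over a system route. The sketch below models only that selection logic in Python. The firewall address, the next hop labels, and the destination addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# For identical prefixes: user-defined beats BGP, which beats system routes
SOURCE_RANK = {"User": 0, "BGP": 1, "Default": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str     # "User", "BGP" or "Default"
    next_hop: str


def select_route(routes, destination):
    """Pick the effective route: longest prefix first, then route source."""
    candidates = [r for r in routes if ip_address(destination) in ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ip_network(r.prefix).prefixlen, SOURCE_RANK[r.source]),
    )


# Hypothetical effective routes for a spoke subnet
routes = [
    Route("0.0.0.0/0", "BGP", "firewall 10.0.0.4"),
    Route("10.1.0.0/16", "Default", "VNet"),
]
print(select_route(routes, "203.0.113.10").next_hop)

# Someone adds a UDR with the same prefix and the BGP route loses
routes_with_udr = routes + [Route("0.0.0.0/0", "User", "Internet")]
print(select_route(routes_with_udr, "203.0.113.10").next_hop)
print(select_route(routes_with_udr, "10.1.2.3").next_hop)
```

In this model a single UDR in the spoke is enough to send Internet-bound traffic around the firewall.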

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

One is shared with virtual networks.
One is shared with all subnets in a virtual network.
One per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations go generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!
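
To put numbers on that, here is the arithmetic as a short Python sketch using the $73 figure above. The subscription counts are examples I picked, not figures from any customer.

```python
PRICE_PER_SUBSCRIPTION_PER_MONTH = 73  # USD, the figure quoted above


def avnm_cost(subscriptions: int, months: int = 12) -> int:
    """Total AVNM charge in USD for the given number of subscriptions."""
    return subscriptions * PRICE_PER_SUBSCRIPTION_PER_MONTH * months


print(avnm_cost(5))     # 4380 per year for a handful of subscriptions
print(avnm_cost(300))   # 262800 per year for a subscription-per-workload estate
```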

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit of bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed between the other cloud and a VNet-based hub in Azure – a VNet-based hub is always my preference because of the scalability, security/governance concepts, and the superiority over Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN, and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer had two clouds:

The other vendor via a leased line.
Azure via SD-WAN.
And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client was over months previously. I got a message early one morning from a colleague. The client was having a serious networking issue and could I get online. The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other cloud were unavailable:

App integration and services with the on-premises data centre.
App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route Server is to enable branch-to-branch connectivity between “on-premises” locations that are connected via ExpressRoute/VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the above Microsoft diagram makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

The “on-premises” location via ExpressRoute
The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other Cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).
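To make the effect of that one setting concrete, here is a minimal Python sketch of the idea. It is a toy model, not the Azure API: the class name, the prefixes, and the source labels ("nva" for the Meraki, "expressroute" for the other cloud) are all made up for illustration, and it assumes a single NVA peer and a single ExpressRoute gateway.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouteServer:
    """Toy model of route exchange. Hypothetical, not the Azure API."""

    vnet_prefixes: set[str]
    branch_to_branch: bool = False
    # Prefixes learned per source, e.g. "nva" or "expressroute".
    learned: dict[str, set[str]] = field(default_factory=dict)

    def learn(self, source: str, prefix: str) -> None:
        """Record a prefix received from a source."""
        self.learned.setdefault(source, set()).add(prefix)

    def advertised_to(self, target: str) -> set[str]:
        """Return the prefixes that the target would be told about."""
        routes = set(self.vnet_prefixes)  # VNet prefixes are always shared
        if self.branch_to_branch:
            # Only with the setting on do routes from the other "branch" pass through.
            for source, prefixes in self.learned.items():
                if source != target:
                    routes |= prefixes
        return routes


rs = ToyRouteServer(vnet_prefixes={"10.0.0.0/16"})
rs.learn("nva", "192.168.0.0/16")          # assumed SD-WAN prefix
rs.learn("expressroute", "172.16.0.0/12")  # assumed other-cloud prefix

before = rs.advertised_to("nva")
rs.branch_to_branch = True
after = rs.advertised_to("nva")
print(sorted(before))
print(sorted(after))
```

In this model the Meraki only hears about the Azure prefix until the flag is flipped, after which the other cloud's prefix appears as well. The same logic applies in the opposite direction for the ExpressRoute side.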

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.

The Results

I had not heard anything in a while. This morning I heard that the client was happy with the fix. In fact, user experience was faster.

Go back to the original diagram before Azure and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route is a “backhaul”:

SD-WAN to central data centre
Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

SD-WAN to Azure
A very short distance hop over ExpressRoute

Latency was reduced quite a bit so user experience improved. On the other hand, latency between the on-premises data centre and the other cloud has increased because the SD-WAN is a new hop, but at least the path is available. The original leased line is still down after a few weeks – this is not the fault of the client!

Some Considerations

Ideally one would have two leased lines in place for failover. That incurs costs and it was not possible. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the leased line is repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take its place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we do keep branch-to-branch connectivity then we need to consider “what is the best path”? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

(Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be used. This would offer sub-optimal routes because on-premises prefixes will be received from the other cloud.
Prefer VPN: Any routes received over VPN will be used. This would offer sub-optimal routes because other cloud prefixes will be received from on-premises.
Use AS path: Let the admin/network advertise a preferred path. This would offer the desired control – “use this path unless something goes wrong”.
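The difference between those three options can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not how Azure implements it: the prefix and the private AS numbers are invented, and routes learned via the SD-WAN appliance are lumped in with "vpn" for the purpose of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prefix: str
    source: str               # "expressroute" or "vpn" (SD-WAN grouped here by assumption)
    as_path: tuple[int, ...]  # AS numbers the advertisement has passed through


def pick(candidates, preference):
    """Pick a route using a toy preference: 'expressroute', 'vpn' or 'as_path'."""
    if preference == "as_path":
        return min(candidates, key=lambda c: len(c.as_path))
    preferred = [c for c in candidates if c.source == preference]
    pool = preferred or candidates  # fall back if the preferred source has no route
    return min(pool, key=lambda c: len(c.as_path))


# The on-premises data centre prefix (hypothetical) as seen from Azure:
data_centre = [
    # Directly from the SD-WAN.
    Candidate("10.50.0.0/16", "vpn", (65010,)),
    # Re-advertised by the other cloud over ExpressRoute: a longer way round.
    Candidate("10.50.0.0/16", "expressroute", (65020, 65010)),
]

for preference in ("expressroute", "vpn", "as_path"):
    print(preference, "->", pick(data_centre, preference).source)
```

With the default preference the longer route via the other cloud wins simply because it arrived over ExpressRoute. With AS path, the shorter advertisement wins, and an admin can lengthen a path (AS path prepending) to mark it as the backup.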

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago I added a question for the audience to some of my presentations on Azure networking. I asked who was an ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said “the ‘server admins’ are going to understand what I will teach more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

Appliances such as routers decide what turn to take at a junction point
Firewalls either block or allow packets
Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

External
Management
Site-to-site connectivity
DMZ
Internal
Secure zone

Each port connects to a subnet that is a certain network. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that a server in the DMZ network must pass through the firewall, via the cable to the firewall, to get to another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber, and having to open a support call to create a network or a subnet. Maybe you need to wait 3 days while some operators plug in cables and run Cisco commands. Or they need to order more switches because they’ve run out of capacity and you might need to wait weeks. Is this the hosting of the 2000s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.

This is because the physical network of Azure is overlaid with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because they actually have nothing to do with how packets route at the physical layer – that’s handled by traditional networking at the physical layer – but that’s a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
The hosts for those virtual machines are running (drumroll please) Hyper-V, which as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.
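The envelope-in-an-envelope idea can be sketched in Python. This is only an illustration: the lookup table, the virtual network names, and the host addresses are hypothetical, and the real fabric does far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: object


# Hypothetical fabric map: (virtual network, customer address) -> physical host address.
HOST_OF = {
    ("vnet1", "10.0.1.4"): "100.64.0.11",
    ("vnet2", "10.10.1.4"): "100.64.0.27",
    # Another tenant reusing the same customer address lands on a different host.
    ("other-tenant-vnet", "10.0.1.4"): "100.64.0.93",
}


def encapsulate(src_vnet: str, dst_vnet: str, inner: Packet) -> Packet:
    """Put the customer's packet in a second envelope addressed host to host."""
    return Packet(
        src=HOST_OF[(src_vnet, inner.src)],
        dst=HOST_OF[(dst_vnet, inner.dst)],
        payload=inner,
    )


def decapsulate(outer: Packet) -> Packet:
    """Unpack the original packet at the destination host."""
    return outer.payload


inner = Packet("10.0.1.4", "10.10.1.4", "hello")
outer = encapsulate("vnet1", "vnet2", inner)
print(outer.src, "->", outer.dst)  # only host addresses are visible on the wire
print(decapsulate(outer) == inner)
```

The physical network only ever sees the outer addresses, which is why two tenants can both use 10.0.1.4 without any conflict in this model.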

This is why you cannot implement GRE networking in Azure.

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to the second question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it as being like a Venn diagram. Once you implement peering, the loops that define what can talk to what have a new circle. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Now a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly (using encapsulation/decapsulation) to any destination in VNet2 and vice versa without going through some resource in the virtual networks.
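Here is a small Python sketch of that "map, not cable" idea. The class and the NIC/VNet names are hypothetical and the model ignores NSGs and routing; it only shows that peering adds an entry to a lookup rather than a device in the path.

```python
class ToyFabric:
    """Toy mapping of which NICs may talk directly. Hypothetical, not Azure code."""

    def __init__(self):
        self.nic_vnet = {}     # NIC name -> virtual network name
        self.peerings = set()  # each peering is an unordered pair of VNet names

    def attach(self, nic: str, vnet: str) -> None:
        self.nic_vnet[nic] = vnet

    def peer(self, vnet_a: str, vnet_b: str) -> None:
        # No link is built; the map simply gains a new "circle".
        self.peerings.add(frozenset((vnet_a, vnet_b)))

    def can_talk(self, nic_a: str, nic_b: str) -> bool:
        va, vb = self.nic_vnet[nic_a], self.nic_vnet[nic_b]
        return va == vb or frozenset((va, vb)) in self.peerings


fabric = ToyFabric()
fabric.attach("hub-nic", "hub")
fabric.attach("spoke1-nic", "spoke1")
fabric.attach("spoke2-nic", "spoke2")

print(fabric.can_talk("hub-nic", "spoke1-nic"))     # False until peered
fabric.peer("hub", "spoke1")
fabric.peer("hub", "spoke2")
print(fabric.can_talk("hub-nic", "spoke1-nic"))     # True
print(fabric.can_talk("spoke1-nic", "spoke2-nic"))  # False: peering is not transitive
```

The last line reflects a property of real VNet peering: two spokes that are each peered with the hub are not automatically able to reach each other.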

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why using tools like Test-NetConnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor is very important.
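If PowerShell is not to hand, the same kind of end-to-end check (can I open a TCP connection to that port?) can be approximated in Python. This is a rough sketch, not a replacement for the tools above; the function name is my own.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (hypothetical address): tcp_check("10.10.1.4", 443)
```

It tests the destination rather than the hops in between, which is the point: in a software-defined network the hops are not there to be seen.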

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we take what I’ve explained above we’ll understand what will really happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the NIC of the destination in one of the spokes as if they were teleported.

To get the flow that you require for security purposes you need to understand Azure routing and either implement the flow via BGP or User-Defined Routing.
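As a rough illustration of why a route is needed, here is a Python sketch of next-hop selection for packets leaving the gateway subnet. It assumes the commonly documented behaviour (longest prefix match first, then user-defined routes ahead of BGP ahead of system routes); the prefixes and the firewall address are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

# Lower number wins when two routes have the same prefix length.
PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    origin: str  # "user", "bgp" or "system"


def effective_next_hop(routes, destination):
    """Pick the next hop: longest matching prefix, then origin priority."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    best = min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, PRIORITY[r.origin]),
    )
    return best.next_hop


# System routes as the gateway subnet might see them (spoke is 10.1.0.0/16).
base = [
    Route("10.0.0.0/16", "VirtualNetwork", "system"),
    Route("10.1.0.0/16", "VNetPeering", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
]

# A user-defined route that is too broad loses to the more specific peering route.
too_broad = base + [Route("10.0.0.0/8", "Firewall 10.0.2.4", "user")]
# A user-defined route matching the spoke prefix wins.
fixed = base + [Route("10.1.0.0/16", "Firewall 10.0.2.4", "user")]

print(effective_next_hop(base, "10.1.2.4"))
print(effective_next_hop(too_broad, "10.1.2.4"))
print(effective_next_hop(fixed, "10.1.2.4"))
```

Without a user-defined route, the spoke is reached directly over peering and the firewall is bypassed. The too-broad route is a pitfall worth noting: in this model it does not help, because the more specific peering route still wins.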

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is the purpose of them? Oh I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So those NICs in the firewall do what? They don’t isolate, they complicate! They aren’t for performance, because the VM size controls overall NIC throughput and speed. They don’t add performance, they complicate!

The real reason for all those NICs is to simulate eth0, eth1, etc that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall uses a single IP address on the virtual network (via the Standard tier load balancer, but you might notice each compute instance IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets really get from A to B then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.