Virtual WAN Is Not Required For SD-WAN

Did you know that you do not need to use Virtual WAN to implement an SD-WAN with Azure? In fact, contrary to the recommendations from Microsoft, Virtual WAN might be the worst way to add Azure networks to an SD-WAN.

My History With Virtual WAN

You might think that the introduction of this post paints me as a complete hater who has never given Virtual WAN a chance. I have. In fact, I can point out features that some of my 1:1 feedback calls probably contributed to. I’ve implemented Virtual WAN with customers.

However, I’ve seen the problems. I’ve seen that the hype doesn’t match reality. I’ve personally experienced the lack of troubleshooting capabilities and had to fall back on my deep understanding of the hidden networking. I’ve seen colleagues struggle with the complexity. I’ve seen how some customers’ routing requirements cannot be met with Virtual WAN. And many architectural features that some organisations require cannot be deployed with Virtual WAN.

I concluded that my time with Virtual WAN was over during a proof of concept that I insisted a customer do. They had previously used Virtual WAN without a firewall. I was asked to build a new multi-region Azure environment (multiple hubs) with firewalls. I was not sure that it would go well – this was before routing intent was in preview. I tested and confirmed that Virtual WAN was not going to work; the customer implemented a Meraki SD-WAN using Virtual Network-based hubs and lost no functionality. In fact, they gained functionality.

In an older case, I convinced a customer to go with Virtual WAN. I regret this one. There was a lot of hype. They used Meraki. There was a solution from Meraki to integrate with the Virtual WAN VPN Gateway. We found bugs in the script and fixed them. But the most annoying thing about that solution was that every time the customer changed anything in the SD-WAN, every VPN tunnel to Azure was torn down and recreated. I heard recently that the customer is looking to remove SD-WAN. I don’t blame them, and I regret ever recommending it to them.

The Microsoft Claims

The Azure Cloud Adoption Framework incorrectly states the following:

Use a Virtual WAN topology if any of the following requirements apply to your organization:

  • Your organization intends to deploy resources across several Azure regions and requires global connectivity between virtual networks in these Azure regions and multiple on-premises locations.
  • Your organization intends to use a software-defined WAN (SD-WAN) deployment to integrate a large-scale branch network directly into Azure, or requires more than 30 branch sites for native IPSec termination.
  • You require transitive routing between a virtual private network (VPN) and Azure ExpressRoute. For example, if you use a site-to-site VPN to connect remote branches or a point-to-site VPN to connect remote users, you might need to connect the VPN to an ExpressRoute-connected DC through Azure.

I will burst those bubbles one by one.

Several Regions & Global Connectivity

Do you want to deploy across multiple regions? Not a problem. You can very easily do that with Virtual Network-based hubs. I’ve done it again and again.

Do you want to connect the spokes in different regions? Yup, also easy:

  • Build each hub-and-spoke from a single IP prefix.
  • Your spokes already route via the hub.
  • Peer the hubs.
  • Create User-Defined Routes in each firewall subnet (you will be using firewalls in this day and age) to route to remote hub-and-spoke IP prefixes via the remote hub firewalls.

Job done! The only additional steps were:

  • Peer the hubs
  • Add UDRs to each firewall subnet for each remote hub-and-spoke IP prefix

You do that once. Once!
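
That one-off task can even be scripted. Here’s a hedged Python sketch (hub names, prefixes, and firewall IPs are all invented for illustration) that computes the UDR entries each hub’s firewall subnet needs for every remote hub-and-spoke prefix:

```python
from ipaddress import ip_network

# Hypothetical hub-and-spoke supernets and hub firewall private IPs.
hubs = {
    "weu": {"prefix": "10.1.0.0/16", "firewall_ip": "10.1.0.4"},
    "neu": {"prefix": "10.2.0.0/16", "firewall_ip": "10.2.0.4"},
    "eus": {"prefix": "10.3.0.0/16", "firewall_ip": "10.3.0.4"},
}

def firewall_subnet_routes(local: str) -> list[dict]:
    """Routes for the local firewall subnet's route table: one UDR per
    remote hub-and-spoke prefix, next-hopping to the remote hub firewall."""
    return [
        {
            "name": f"to-{remote}",
            "addressPrefix": str(ip_network(cfg["prefix"])),
            "nextHopType": "VirtualAppliance",
            "nextHopIpAddress": cfg["firewall_ip"],
        }
        for remote, cfg in hubs.items()
        if remote != local
    ]

for hub in hubs:
    for route in firewall_subnet_routes(hub):
        print(hub, route["addressPrefix"], "->", route["nextHopIpAddress"])
```

Because each hub-and-spoke is built from a single IP prefix, adding a region is just one more entry in the map, and the per-hub route tables fall out automatically.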

How about connecting the remote sites? Simples: you connect them as usual.

There is some marketing material about how we can use the Microsoft WAN as the company WAN using vWAN. Yes, in theory. The concept is that the Microsoft global network is amazing. You VPN from site A (let’s say Oslo, Norway) to a local Azure region and you VPN from site B (let’s say Houston, Texas) to a local Azure region. Then vWAN automatically enables Oslo <> Houston connectivity over the Microsoft global network. Yes, it does. And the performance should be amazing. I did a proof-of-concept in 2 hours with a customer. The performance of a VPN directly between Oslo <> Houston was much better. Don’t buy the hype! Question it and test. And by the way, we can build this with VNets too – I was told by an MS partner that they did this solution between two sites on different continents years before vWAN existed.

SD-WAN

Microsoft suggests that you can only add Azure networks to an SD-WAN if you use Virtual WAN.

Here’s some truth. Under the covers, a vWAN hub is built on a traditional Virtual Network. You then use either a VPN Gateway (don’t) or a third-party SD-WAN appliance for connectivity.

The list of partners supporting vWAN has grown considerably recently – I remember looking for Meraki support a few months ago, and it was not there (it is now). But guess what: I bet that every one of those partners offers the exact same solution for Virtual Networks via the Marketplace. And I bet:

  • There are more partner options
  • There are no trade-offs
  • The resilience is just the same

I have done Azure/Meraki SD-WAN twice since the customer mentioned above. In both cases, we went with the Azure Marketplace and Virtual Network. And in both cases, it was:

  • Dead simple to set up.
  • Working the first time.

Transitive Routing

Virtual WAN is powered by a feature that is hidden unless you do an ARM export. That feature is Azure Route Server. Did you know:

  • You can deploy Azure Route Server to a Virtual Network. The deployment is a next-next-next.
  • It can be easily BGP peered with a third-party networking appliance.
  • The Azure Route Server will learn remote site prefixes from the networking appliance/SD-WAN.
  • The Azure Route Server will advertise routes to the networking appliance/SD-WAN.

Azure Route Server BGP propagation is managed using the same VNet peering settings as Virtual Network Gateway.

There is a single checkbox (true/false property) to enable transitive routing between VPN/ExpressRoute remote sites. And that setting is amazing.

I signed in to work one day and was asked a question. I had built out the environment for a large customer with an HQ in Oslo:

  • Remote sites around the world with a Meraki SD-WAN.
  • Leased line to Oracle Cloud – the global sites backhauled through Oslo.
  • The VNet-based hub in Azure was added to the SD-WAN. All offices were connected directly to Azure via VPN.
  • Azure Route Server was added and peered to the Meraki SD-WAN.
  • Azure had an ExpressRoute connection (Oracle Cloud Interconnect) to Oracle Cloud.

An excavator had torn up the leased line to Oracle. The essential services in Oracle Cloud were unavailable. I was asked if the Azure connection to Oracle Cloud could be leveraged to get the business back online. I thought for 30 seconds and said, “Yes, give me 5 minutes”. Here’s what I did:

  1. I checked the box to enable transitive routing in Azure Route Server.
  2. I clicked Save/Apply and waited a few minutes for the update task.
  3. I asked the client to test.

And guess what? Contrary to the above CAF text, the client was back online. A few weeks later, I was told that not only did they get back online, but the SD-WAN connection to the VIRTUAL NETWORK-BASED hub in Azure gave the global branch offices lower latency connections than their backhaul through Oslo to Oracle Cloud. Whoda-thunk-it?

vWAN is PaaS

One of the arguments for the vWAN hub is that it pushes complexity down into the platform; it’s a PaaS sub-resource.

Yes, it’s a PaaS sub-resource. Is a well-designed hub complex? A hub should contain very few resources, based around:

  • Remote connectivity resource
  • Firewall
  • Maybe Azure Bastion

There’s not much more to a hub than that if you value security. What exactly am I saving with the more-expensive vWAN?

Limitations of vWAN

Let’s start with performance. A hub in Virtual WAN has a throughput limitation of 50 Gbps. I thought that was a theoretical limit … until I did a network review for a client a few years ago. They had a single workload that pushed 29 Gbps through the hub, 1 Gbps shy of the limit for a Standard tier Azure Firewall. I recommended an increase to the 100 Gbps Premium tier, but warned that the bottleneck was always going to be the vWAN hub.

The architectural limitations of vWAN are many – so many that I will miss some:

  • No VNet Flow Logs
  • Impossible to troubleshoot routing/connectivity in a real way
  • No support for Azure Bastion in the hub
  • No support for NAT Gateway for firewall egress traffic (SNAT port exhaustion)
  • Secured traffic between different secured (firewall) hubs requires Routing Intent
  • No Forced Tunnelling in Azure Firewall without Routing Intent
  • Routing Intent is overly simplistic – everything goes through the firewall
  • No support for IP Prefix for the firewall
  • Azure Firewall cannot use Route Server Integration (auto-configuration of non-RFC1918 usage in private networks)
  • Hub Route Tables are a complexity nightmare

Impossible Solution

Anyone who has deployed more than a couple of Azure networks has heard the following statement made regarding failing connections over site-to-site networking:

The Azure network is broken

A new site-to-site appliance or firewall has been placed in Azure, and the root cause of the issue is “never the remote network”.

Proving that the issue isn’t the firewall can be tricky. That’s because firewall appliances are black boxes. I updated my standard hub design last year to assist with this:

  • Add a subnet with identical routing configuration (BGP propagation and user-defined routes) as the (private) firewall subnet.
  • Add a low-spec B-series VM to this subnet with an autoshutdown. This VM is used only for diagnostics.

The design allows an Azure admin to log into the VM. The VM mimics the connectivity of the firewall and allows tests to be done against failing connections. If the test fails from the VM, it proves that the firewall is not at fault.

No other compute resources are placed in the hub.

Here’s the gotcha. I can do this in a VNet hub. I cannot do this in a vWAN hub. The vWAN hub Virtual Network is in a Microsoft-managed tenant/subscription. You have no access, and you cannot troubleshoot it. You are entirely at the mercy of Azure support – and, sadly, we know how that process will go.

Virtual WAN In Summary

You do not need Virtual WAN for connectivity or SD-WAN. So why would one adopt it instead of VNet-based hubs, especially when you consider costs and the loss of functionality? I just do not understand (a) why Microsoft continues to push Virtual WAN and (b) why it continues to exist.

Tracing Packets in Azure Networks

While I’m on the topic of troubleshooting, I thought that I would add some tips on how to trace packets in Microsoft Azure.

The Problem

Here are a few scenarios, in descending order from most common, that I’ve been through over the years:

Remote Desktop To New VMs

You’ve just established a new site-to-site connection between a remote location and a (probably) new Azure network. The remote site admin complains:

No packets are getting through. Your Azure network is broken.

You know that everything in Azure is in good order, and you’re pretty sure the remote site firewall is blocking the traffic. However, many systems administrators jump to “the new Azure network is broken” when something doesn’t work – even if they configured their firewall to block that damned RDP traffic (it’s nearly always RDP in the first tests!).

Connecting to PaaS Services

You’ve deployed some PaaS services in Azure. Something can’t connect to them. The client might be in the same Virtual Network. Maybe it’s in another spoke that must route through a hub? Or maybe the client is in a remote site? The developer or operator is going to say:

We’re getting timeouts when we connect. Your network is broken.

So many things could be wrong here.

SSL Goes Wrong

SSL/PKI feels to me like a dinosaur technology that:

  • Most never learn it
  • Few of those who learn it completely master it (I’m here, I’d estimate)
  • Those of us who did learn it have forgotten most of it

And modern application/network security is built on this deck of cards (from a knowledge perspective). I’ve seen a few scenarios:

My app gets a weird response from the database when it attempts to connect.

That one’s probably because something is reverse proxying the connection and something is going wrong in the connection – see Application Rules NATing east-west connections in Azure Firewall, causing the client IP to change.

How about this one I saw recently:

My application is failing to connect to a remote server.

When I dug into it, I saw that the TLS handshake was failing and the TCP connection was cleanly terminated. A self-signed certificate was to blame. Other scenarios I’ve seen are where Linux-based appliances fail the same handshake because the server cert doesn’t contain the full certificate chain. Tip: Windows LIES to you when it shows the whole chain, which it builds itself from the trusted publishers store on the machine. Most appliances require the full chain in the cert, which many online CAs do not include by default. You’d be amazed how many weeks are wasted and how many repeated discussions are had because of this.

But how have I proven this?

Complex Routing

Not everyone builds a simple hub-and-spoke. Sometimes there is a need for complexity. I had one of these a few years ago, where a customer required an ExpressRoute connection to a third-party data provider. The data provider mandated:

  • An ExpressRoute connection
  • The use of SNAT

The ExpressRoute Gateway doesn’t offer SNAT (unlike the VPN Gateway), so I had to conjure an interesting design. Luckily, I know Azure routing pretty well, and I tested this design in a lab. I was sure it would work – it did. But what if something went wrong? I would have had to troubleshoot what was happening.

The Need

What we need is:

  1. The ability to prove that packets are routing, to confirm that the infrastructure is doing its job.
  2. The ability to check how a PaaS resource has responded to connections.
  3. The ability to see inside those packets to investigate application-layer issues.

Folks, most of this is basic logging/querying. But there are a few tricks.

Packet Travel

I want to confirm that a packet reached A, then went to B, then got to the destination. For example:

  1. A packet from a remote client entered the hub and went through the firewall.
  2. It then routed across peering – GAH! More on this in a moment.
  3. And the packet routed through the destination spoke Virtual Network to reach the destination server – double GAH!

Before we proceed: I literally get session audiences to repeat the following 3 times each to reinforce some basic knowledge:

Virtual Networks do not exist.

Subnets do not exist

Peering does not exist

Packets go directly from the source to the destination

This is why tracert is useless in Azure.

Understanding the above is halfway to mastering Azure networking. Please read this post before asking me questions or attempting to debate me on this topic of existence.

By the way, if you are using Azure Firewall, then (PowerShell) Test-NetConnection is useful only to generate logs. The result may not be the actual result: Azure Firewall application rules can feed back “200” results even when denying traffic. I always advise: generate the traffic and then check the logs.

Back to the topic …

The basic tool we need is a log of a packet or flow (a series of packets in a “conversation” between the client and server). Fortunately, we have a few sources of those.

The first is your firewall. Azure Firewall’s diagnostics settings send logs to your preferred destination. I prefer Log Analytics. You might prefer Splunk or similar. Potayto, potahto. In Azure Firewall, the “decision making logs” include:

  • Threat Intelligence (an under-appreciated and oh-so useful feature)
  • IDPS
  • Network rules
  • Application rules

Log Analytics has a built-in query to search all those logs in a union. I can search for any combination of source IP, source port (not typically useful), protocol, destination IP, and destination port (very useful).
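
In Log Analytics that’s a KQL query, but the idea is just a filtered union. Here’s a hedged Python sketch of the same idea – the table names and log records are invented stand-ins for the real firewall log schemas:

```python
# Invented sample records mimicking a union of Azure Firewall decision logs.
logs = [
    {"table": "NetworkRules", "src_ip": "10.1.2.4", "dst_ip": "10.2.3.4",
     "dst_port": 443, "protocol": "TCP", "action": "Allow"},
    {"table": "ApplicationRules", "src_ip": "10.1.2.4", "dst_ip": "10.9.9.9",
     "dst_port": 443, "protocol": "HTTPS", "action": "Deny"},
    {"table": "ThreatIntel", "src_ip": "10.5.5.5", "dst_ip": "10.2.3.4",
     "dst_port": 3389, "protocol": "TCP", "action": "Deny"},
]

def search(**criteria):
    """Return every log row matching all supplied field=value criteria,
    e.g. search(src_ip="10.1.2.4", dst_port=443)."""
    return [row for row in logs
            if all(row.get(k) == v for k, v in criteria.items())]

for hit in search(src_ip="10.1.2.4", dst_port=443):
    print(hit["table"], hit["action"])
```

Any combination of the 5-tuple fields narrows the result set, which is exactly how I use the built-in query during an incident.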

A third-party firewall has similar logs, often locked away in the precious grip of the firewall administrator. Sorry, I’m binge-watching Lord of the Rings, and I couldn’t help myself, firewall admins 🙂 Some firewalls can make those logs more available to other Azure operators. For example, the Palo Alto Cloud NGFW can route logs, via Application Insights, into Log Analytics, where queries, dashboards, and workbooks can share that data. Nice!

The firewall logs will show me:

  • If packets entered the firewall
  • If those packets were allowed or denied

The simple mention of a flow from a client to a server in the firewall log means that packets made it there:

  • A spoke routed via the firewall to another spoke or a remote site.
  • Packets from a remote site passed successfully over a site-to-site network connection.

The firewall log is often my first port of call. Sometimes, however, it doesn’t go deep enough. There have been a number of times where I’ve been told something along the lines of:

I can ping VM X in Azure, but I cannot make an HTTPS connection to it.

I know from experience that they have made a successful connectionless ping (ICMP). But they have failed to make a connection-oriented (TCP) HTTPS request. The stateful firewall is blocking a response to the connection request because it never saw the original SYN. Thank you to my 3rd-year networking lecturer – I can picture the guy demonstrating a luggable PC to us around 1993, but I don’t remember his name. Experience has taught me that:

  • A route for the spoke network prefix is missing from the GatewaySubnet, and the request is bypassing the firewall.
  • A Private Endpoint has added a /32 route to the GatewaySubnet (see network policies for Private Endpoint), and longest-prefix-match routing has chosen that system route over your User-Defined Route for the spoke prefix.
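
That /32 behaviour is worth demonstrating. A minimal Python sketch (prefixes and next hops are invented) of how longest-prefix-match routing lets a Private Endpoint’s /32 system route beat your broader UDR:

```python
from ipaddress import ip_address, ip_network

# Invented effective route table for a GatewaySubnet.
routes = [
    {"prefix": ip_network("10.1.0.0/16"), "next_hop": "firewall (UDR)"},
    {"prefix": ip_network("10.1.5.10/32"), "next_hop": "InterfaceEndpoint (system route)"},
]

def next_hop(destination: str) -> str:
    """Longest-prefix match: of all matching routes, the most specific wins."""
    matches = [r for r in routes if ip_address(destination) in r["prefix"]]
    return max(matches, key=lambda r: r["prefix"].prefixlen)["next_hop"]

print(next_hop("10.1.5.10"))  # the /32 wins, so traffic bypasses the firewall
print(next_hop("10.1.5.11"))  # only the /16 matches, so traffic goes via the firewall
```

One Private Endpoint in the wrong subnet, and a single address quietly stops flowing through your firewall while its neighbours still do.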

For these crazy situations, you need to dig a little deeper into the firewall logs. I cannot speak for third-party firewalls here. Azure Firewall doesn’t capture dropped connections such as these. For that deep dive, we need Flow Trace logs to be enabled. Note that:

  • Enabling the logs does not enable the feature; this must be enabled using PowerShell.
  • The logs will be very detailed – and expensive to ingest into your monitoring solution. Only leave this feature enabled while troubleshooting the issue – set a calendar entry to unset it.

I haven’t had the opportunity to use this one in the real world personally, but I wonder if I sent those JSON logs to blob storage, could I download them to Copilot and get a reasonable response to my queries? Note to self.

Did the packet traverse a Virtual Network? Now you should know that’s a dumb question. The Azure fabric takes packets from source NICs and drops them into destination NICs. The correct question is: Did a packet reach the destination NIC?

The correct solution to answer that question today is Virtual Network Flow Logs with Traffic Analytics.

The wrong answer is the deprecated NSG Flow Logs. Virtual Network Flow Logs are current and capture much better data, including Private Endpoints.

Flow Logs will tell me about:

  • Outbound flows – did a packet leave a client?
  • Inbound flows – did a packet reach a server?
  • NSG Rules – what rule allowed/denied a connection?

Now I know if a connection:

  • Left an Azure client
  • Reached an Azure server
  • Was allowed or denied by an NSG rule
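
Under the hood, each flow log record stores flows as comma-separated tuples. Here’s a hedged Python sketch of decoding one – simplified to the first eight fields, with the field order following the documented NSG flow log v2 layout and the sample values invented:

```python
# A simplified decoder for the comma-separated flow tuples found in
# Azure flow log records (truncated to the fields we care about here).
FIELDS = ["timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
          "protocol", "direction", "decision"]
PROTOCOLS = {"T": "TCP", "U": "UDP"}
DIRECTIONS = {"I": "inbound", "O": "outbound"}
DECISIONS = {"A": "allowed", "D": "denied"}

def parse_flow_tuple(raw: str) -> dict:
    flow = dict(zip(FIELDS, raw.split(",")))
    flow["protocol"] = PROTOCOLS[flow["protocol"]]
    flow["direction"] = DIRECTIONS[flow["direction"]]
    flow["decision"] = DECISIONS[flow["decision"]]
    return flow

flow = parse_flow_tuple("1542110377,10.0.0.4,203.0.113.7,44931,443,T,O,A")
print(flow["direction"], flow["decision"])  # outbound allowed
```

Traffic Analytics does this parsing (and enrichment) for you, but knowing what’s in the raw tuple helps when you need to query the blobs directly.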

Flow Logs take time to generate:

  1. The logs will take 30+ seconds to be written to blob storage. Honestly, I’ve seen this take longer during the pandemic – I think MSFT might throttle monitoring when CPU demand is high.
  2. Traffic Analytics is configured to run every 10 or 60 minutes. I prefer the 10-minute option.
  3. Log Analytics will take time to process the data. I was told many years ago to allow up to 15 minutes for NSG Flow Logs to be processed.

Between the firewall logs and the Virtual Network Flow Logs, I have visibility of the traffic. Or some of it.

PaaS Resources

A PaaS resource may be deployed with:

  • Public endpoint: Firewall or Virtual Network Flow Logs will show my traffic leaving my network, but not the last mile.
  • Private Endpoint: Private Endpoint NICs fool us, because the packet is sent directly by the fabric from the client NIC to the NIC of the machine hosting the PaaS resource instance. Virtual Network Flow Logs show us “connectability” but not the full connection.
  • VNet Injection and VNet Integration: The PaaS resources don’t really live in our Virtual Network. I know that it’s confusing.

Let me give you a working example. You have an App Service with VNet Integration that is attempting to talk to a Key Vault with a Private Endpoint. We can see the flows in the previously discussed logs. But are the packets really getting to the Key Vault? What happens when the App Service attempts to access a secret?

The only answer to this is to enable the diagnostics settings in the Key Vault. Querying those logs in Log Analytics, Splunk, etc., will tell you exactly what’s going on:

  • Was there a connection?
  • Was the connection successful?
  • Why did the connection fail?

Packet Capture

Don’t get scared! I promise that packet capture is easier than ever now. I’ll explain later.

The results of a packet capture show you the contents (as much as encryption allows) of packets in a flow between a client and a server. This is super useful for investigating further. Let me explain two scenarios:

In the first scenario, we have proven that packets get from A to B, but the customer/developer/operator doesn’t accept that because their application is failing. If we can see the packets going from client to server, then we know the error is further up the stack – it’s an application configuration or authoring issue. The only way to prove it to the other person is to show them the actual packets.

Network Watcher provides a feature called Packet Capture. The only place you need the free Wireshark client is on your PC to open the capture. Network Watcher will automatically add an Azure extension (agent) to the client/server VM, based on your Azure rights over that VM. You can capture all or filtered packets and save the resulting .CAP file to blob storage. Unfortunately, this ability is limited to VMs.

The second scenario is where we have a remote admin complaining about their failing RDP connection (it’s always this) over site-to-site networking. You’ve proven the traffic doesn’t reach the firewall/Azure VM. You know their firewall is blocking the outbound connection, but they won’t accept that. You have to prove that the traffic never crossed the site-to-site connection. You can enable packet capture on a VPN Virtual Network Gateway or a Virtual WAN VPN Gateway. This will ultimately prove that packets never got across the tunnel, and the remote admin must face the mirror.

Back to the scary part about packet captures. Who the heck can read those things? Not many of us can. I understand some basics, such as control flags like SYN, SYN-ACK, and RST. But what would I do if I had to really understand a packet capture? Enter Copilot or another AI:

  1. In Wireshark, click File > Export Packet Dissections > As JSON and select “Packet Range: All packets” and “Packet Format: JSON”. JSON is nice for AI to parse.
  2. Upload the capture to your AI and ask it your questions.
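
If you’d rather pre-digest the capture yourself before (or instead of) handing it to an AI, the exported JSON is easy to walk. A hedged Python sketch – the layer paths follow Wireshark’s JSON dissection structure, and the two-packet sample (a SYN answered by a RST) is invented:

```python
import json

# Invented two-packet sample mimicking Wireshark's "Export Packet
# Dissections > As JSON" output (only the fields we need).
capture = json.loads("""
[
  {"_source": {"layers": {"ip": {"ip.src": "10.0.0.4", "ip.dst": "10.0.1.4"},
    "tcp": {"tcp.flags_tree": {"tcp.flags.syn": "1", "tcp.flags.ack": "0",
                               "tcp.flags.reset": "0"}}}}},
  {"_source": {"layers": {"ip": {"ip.src": "10.0.1.4", "ip.dst": "10.0.0.4"},
    "tcp": {"tcp.flags_tree": {"tcp.flags.syn": "0", "tcp.flags.ack": "0",
                               "tcp.flags.reset": "1"}}}}}
]
""")

def summarise(packets):
    """One line per TCP packet: src -> dst plus the control flags that are set."""
    lines = []
    for pkt in packets:
        layers = pkt["_source"]["layers"]
        flags = layers["tcp"]["tcp.flags_tree"]
        set_flags = [name.split(".")[-1].upper()
                     for name, value in flags.items() if value == "1"]
        lines.append(f'{layers["ip"]["ip.src"]} -> {layers["ip"]["ip.dst"]}: '
                     + ",".join(set_flags))
    return lines

for line in summarise(capture):
    print(line)
```

A summary like this (connection attempt, immediate reset, no handshake) is often all the evidence you need before anyone opens Wireshark.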

You’ll get an answer that you can work with. I used this recently for an application issue to help a (good guy) developer get to the root cause of an issue.

By the way:

  • ExpressRoute does not offer packet capture, but Traffic Collector provides a Flow Log experience.
  • Azure Firewall with a Management NIC (recommended by me for the last 1.5+ years) has packet capture.

Some Other Tricks

Network Watcher can be useful for doing some basic diagnostics:

  • IP flow verify: Checks whether a specific traffic flow would be allowed or denied by NSG rules.
  • NSG diagnostics: Analyse NSG rules across hops to identify which rule permits or blocks traffic.
  • Next hop: Identifies the next routing hop a packet will take from a selected VM.
  • Effective security rules: Displays the combined, active security rules applied to a network interface after all NSGs are merged.
  • VPN troubleshoot: Diagnoses issues with Azure VPN gateways and site‑to‑site or point‑to‑site tunnels.
  • Packet capture: Captures packet data directly from a VM’s network stack for deep traffic analysis.
  • Connection troubleshoot: Tests end‑to‑end connectivity between a VM and a target to identify routing or NSG issues.

Connection Troubleshoot is especially nice:

We can send a bunch of probe packets from a source to a destination and see if the connection was successful. If not, the tool gives you some indication why – keep in mind that remote destinations will result in vague failure reasoning because Azure doesn’t control remote locations.

The sources can be:

  • VMs and VM Scale Sets: Using the Network Watcher extension.
  • Application Gateway: Great for figuring out those pesky backend health issues and proving that the CA-provided cert (lacking the complete trust chain) is the cause of the failure.
  • Bastion Host: Bastion-to-VM connections can be a head-wrecker.

If there is a connection that is working but you consider to be critical, then I recommend using Connection Monitor in Network Watcher:

  • Works with any mix of Arc agents (non-Azure VMs) and Azure VMs – consider those remote site connections!
  • Model application connections.
  • Tests success and speed (latency).
  • Can trigger an alert/Action Group.

I used this a few years ago for a SaaS company that was using Proximity Placement Groups as a part of their need to minimise latency. I wanted proof of the platform performance, just in case. My colleague who wrote the Terraform for modelling the application in Connection Monitor probably didn’t like me for requiring this 😉 I started seeing alerts one day, so I let the customer know that I was opening a support ticket with Microsoft. We found out that there was a physical issue with one network appliance, and Microsoft fixed it. Wow – not only were we monitoring our infrastructure and the application’s networking, but we were monitoring Azure’s physical network too!

Last Tool

The last tool is you, not Copilot. Honestly, Azure Copilot is not good at this stuff. I’ve tested it in my build labs, and it hasn’t a clue (thankfully for us IT pros). You need a combination of:

  • Experience: What’s most common?
  • Intuition: Listen to the customer – did they just mention a cert, for example?
  • Knowledge: Understanding how Azure networks function is critical – did you know that not setting the network policies for Private Endpoint in your subnet causes asymmetric routing through the firewall?

Honing those skills will better prepare you to use the above Azure tools.

If You Liked This …

Maybe you liked this post and are wondering: “Could Aidan help me?” Maybe I can, through my company, Cloud Mechanix. Whether you need a review, a design, help figuring out some issue, a large deployment, or an answer to why the cloud is not working for your organisation, I can help – and with other things too. Cloud Mechanix works with large and small organisations and service providers throughout Europe. Check out the site, and contact me if you are interested.

What’s The Deal With Azure Virtual Network Routing Appliance?

I’ve seen a lot of chatter about the new Azure Virtual Network Routing Appliance that has just gone into preview. Here are my thoughts.

My Opinion

In summary: huh?

Based on the single page of lightweight content, this appears to be a router, powered by physical hardware, that enables high-bandwidth routing. I’m being careful with my words here. I avoided saying “high speed” because speed can mean one of two things:

  • Latency
  • Bandwidth

Using hardware rather than software for a router will minimise latency, but I cannot imagine the difference will be much. 99% of customers won’t care about that difference. The main cause of latency in The Cloud is the distance between a client and a server – always remember that (without Proximity Placement Groups) a client and server in the same region could be in different physical buildings, which may even be kilometres or miles apart. For example, North Europe (Dublin) is in Grangecastle in West Dublin (search for Cuisine De France). Microsoft is planning to expand the region with new data centres in Newhall, near Naas, about 20 minutes (at midnight) down the road from Grangecastle. Switching from software to hardware to route between the client and server won’t make much difference there.

The other thing that I’ve noted in the skimpy doc is that this “router” doesn’t replace the firewall in a hub. If you use the firewall in the hub to isolate landing zones/spokes, then the firewall is the router:

  • Next hop to leave the spoke
  • Next hop to enter the Azure networks from remote locations

So that means we must have a software router. There is no role for the Virtual Network Routing Appliance in a regular secured Azure network. So what the heck are Microsoft up to?

Odd Azure Announcements

Weird feature announcements, such as the Virtual Network Routing Appliance, are not unusual in Azure. I have a slightly informed suspicion as to who the target customer is. This announcement fits a pattern: Azure often releases features primarily meant to solve Microsoft’s own internal challenges.

Who are Azure’s customers? There are the likes of your employer/organisation. And then there is Microsoft – probably Azure’s single biggest customer. Think about it; Storage is used by Office 365. The Standard Load Balancer is used by just about every PaaS resource there is (if not all of them). Many of the things that Azure creates are used by other Azure features and other Microsoft cloud services.

Azure Networking is a perfect example of that. They build not only for us, but to provide connectivity for Microsoft’s services, which are built on Azure.

I teach attendees of my network conference sessions and training courses that everything is a VM, even so-called “serverless” computing. There are rare exceptions, such as the Virtual Network Routing Appliance, the Xbox appliance, or the hosts in Azure VMware Solution. Somewhere in Azure, a VM is hosting a service. That VM is part of a pool. That VM is on a network. That network is an Azure Virtual Network. That network requires routing.

Now let’s get back to the Virtual Network Routing Appliance. Why does it exist? What has been the biggest talking point in IT for the past few years? What has Microsoft focused their attention on, to the detriment of customers and business, in my opinion? Yes, AI.

We know that AI is all about bigger, faster, better. Every new iteration of ChatGPT/Copilot requires more. The demand to get these “HPC” clusters talking faster must be incredible for Azure Networking – thousands of GPU-enabled machines across many networks, all working in unison.

I think that the Virtual Network Routing Appliance was created for AI in Microsoft. Imagine the scale of an AI HPC cluster. There must be a need to create routes between many VNets, and they have sacrificed the isolation of a hub firewall, opting to lean on NSGs or (more likely) AVNM Security Admin Rules.

I believe that AVNM was originally created for Azure’s own configuration of the Virtual Networks used by PaaS services. The original release and associated marketing made no sense to us Azure customers. But over time, the product shaped into something that I now consider a “must have”. I don’t know if the Virtual Network Routing Appliance has the same future ahead of it, but I’m pretty sure that my guess is right: this is designed for Microsoft’s unique needs, and few of us will find it useful.

Takeaway

I’m sorry for the buzzkill. The Virtual Network Routing Appliance sounds interesting, but that’s all. We might need to know about it for an exam. But I really do not expect it to be a factor in network designs for many outside of Microsoft.

Building A Hub & Spoke Using Azure Virtual Network Manager

In this post, I will show how to use Azure Virtual Network Manager (AVNM) to enforce peering and routing policies in a zero-trust hub-and-spoke Azure network. The goal will be to deliver ongoing consistency of the connectivity and security model, reduce operational friction, and ensure standardisation over time.

Quick Overview

AVNM is a tool that has been evolving, and continues to evolve, from something that I considered overpriced and under-featured into something that, with its recently updated pricing, I would want to deploy first in my networking architecture. In summary, AVNM offers:

  • Network/subnet discovery and grouping
  • IP Address Management (IPAM)
  • Connectivity automation
  • Routing automation

There is (and will be) more to AVNM, but I want to focus on the above features because together they simplify the task of building out Azure platform and application landing zones.

The Environment

One can manage virtual networks using static groups, but that ignores the fact that The Cloud is a dynamic and agile place. Developers, operators, and (other) service providers will be deploying virtual networks. Our goal will be to discover and manage those networks. An organisation might be simple, and there will be a one-size-fits-all policy. However, we might need to engineer for complexity. We can reduce that complexity by:

  • Adopting the Cloud Adoption Framework and Zero Trust recommendations of 1 subscription/virtual network per workload.
  • Organising subscriptions (workloads) using Management Groups.
  • Designing a Management Group hierarchy based on policy/RBAC inheritance instead of basing it on an organisation chart.
  • Using tags to denote roles for virtual networks.

I have built a demo lab where I am creating a hub & spoke in the form of a virtual data centre (an old term used by Microsoft). This concept will use a hub to connect and segment workloads in an Azure region. Based on Route Table limitations, the hub will support up to 400 networked workloads placed in spoke virtual networks. The spokes will be peered to the hub.

A Management Group has been created for dub01. All subscriptions for the hub and workloads in the dub01 environment will be placed into the dub01 Management Group.

Each workload will be classified based on security, compliance, and any other requirements that the organisation may have. Three policies have been predefined and named gold, silver, and bronze. Each of these classifications has a Management Group inside dub01, called dub01gold, dub01silver, and dub01bronze. Workloads are placed into the appropriate Management Group based on their classification and are subject to Azure Policy initiatives that are assigned to dub01 (regional policies) and to the classification Management Groups.

You can see two subscriptions above. The platform landing zone, p-dub01, is going to be the hub for the network architecture. It has therefore been classified as gold. The workload (application landing zone) called p-demo01 has been classified as silver and is placed in the appropriate Management Group. Both gold and silver workloads should be networked and use private networking only where possible, meaning that p-demo01 will have a spoke virtual network for its resources. Spoke virtual networks in dub01 will be connected to the hub virtual network in p-dub01.

Keep in mind that no virtual networks exist at this time.

AVNM Resource

AVNM is based on an Azure resource and subresources for the features/configurations. The AVNM resource is deployed with a management scope; this means that a single AVNM resource can be created to manage a certain scope of virtual networks. One can centrally manage all virtual networks. Or one can create many AVNM resources to delegate management (and the cost) of managing various sets of virtual networks.

I’m going to keep this simple and use one AVNM resource as most organisations that aren’t huge will do. I will place the AVNM resource in a subscription at the top of my Management Group hierarchy so that it can offer centralised management of many hub-and-spoke deployments, even if we only plan to have 1 now; plans change! This also allows me to have specialised RBAC for managing AVNM.

Note that AVNM can manage virtual networks across many regions so my AVNM resource will, for demonstration purposes, be in West Europe while my hub and spoke will be in North Europe. I have enabled the Connectivity, Security Admin, and User-Defined Routing features.

AVNM has one or more management scopes. This is a central AVNM for all networks, so I’m setting the Tenant Root Group as the top of the scope. In a lab, you might use a single subscription or a dedicated Management Group.

Defining Network Groups

We use Network Groups to assign a single configuration to many virtual networks at once. There are two kinds of members:

  • Static: You add/remove members to or from the group
  • Dynamic: You use a friendly wizard to define an Azure Policy to automatically find virtual networks and add/remove them for you. Keep in mind that Azure Policy might take a while to discover virtual networks because of how irregularly it runs. However, once added, the configuration deployment is immediately triggered by AVNM.

There are two kinds of members in a group:

  • Virtual networks: The virtual network and contained subnets are subject to the policy. Virtual networks may be static or dynamic members.
  • Subnets: Only the subnet is targeted by the configuration. Subnets are only static members.

Keep in mind that something like peering only targets a virtual network and User-Defined Routes target subnets.

I want to create a group to target all virtual networks in the dub01 scope. This group will be the basis for configuring any virtual network (except the hub) to be a secured spoke virtual network.

I created a Network Group called dub01spokes with a member type of Virtual Networks.

I then opened the Network Group and configured dynamic membership using this Azure Policy editor:

Any discovered virtual network that is not in the p-dub01 subscription and is in North Europe will be automatically added to this group.
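
That membership condition can be mimicked as a simple predicate. This is only a sketch of the logic – the real evaluation is performed by Azure Policy against resource fields, and the dictionary keys below are illustrative, not the actual policy alias names:

```python
def is_dub01_spoke(vnet: dict) -> bool:
    """Mimics the dynamic membership condition: any virtual network
    in North Europe that is NOT in the p-dub01 (hub) subscription."""
    return (
        vnet["location"] == "northeurope"
        and vnet["subscription"] != "p-dub01"
    )

# A spoke in a workload subscription is matched...
print(is_dub01_spoke({"location": "northeurope", "subscription": "p-demo01"}))  # True
# ...but the hub subscription and other regions are excluded.
print(is_dub01_spoke({"location": "northeurope", "subscription": "p-dub01"}))   # False
print(is_dub01_spoke({"location": "westeurope", "subscription": "p-demo01"}))   # False
```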

The resulting policy is visible in Azure Policy with a category of Azure Virtual Network Manager.

IP Address Management

I’ve been using an approach of assigning a /16 to all virtual networks in a hub & spoke for years. This approach blocks the prefix in the organisation and guarantees IP capacity for all workloads in the future. It also simplifies routing and firewall rules. For example, a single route will be needed in other hubs if we need to interconnect multiple hub-and-spoke deployments.

I can reserve this capacity in AVNM IP Address Management. You can see that I have reserved 10.1.0.0/16 for dub01:

Every virtual network in dub01 will be created from this pool.

Creating The Hub Virtual Network

I’m going to save some time/money here by creating a skeleton hub. I won’t deploy a route NVA/Virtual Network Gateway so I won’t be able to share it later. I also won’t deploy a firewall, but the private address of the firewall will be 10.1.0.4.

I’m going to deploy a virtual network to use as the hub. I can use Bicep, Terraform, PowerShell, AZ CLI, or the Azure Portal. The important thing is that I refer to the IP address pool (above) when assigning an address prefix to the new virtual network. A check box called Allocate Using IP Address Pools opens a blade in the Azure Portal. Here you can select the Address Pool to take a prefix from for the new virtual network. All I have to do is select the pool and then use a subnet mask to decide how many addresses to take from the pool (/22 for my hub).

Note that the only time that I’ve had to ask a human for an address was when I created the pool. I can create virtual networks with non-conflicting addresses without any friction.
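
The pool behaviour can be mimicked with Python’s standard `ipaddress` module. This is only a toy sketch of the allocation logic (AVNM IPAM does the real book-keeping); it hands out non-overlapping prefixes from the dub01 /16 on request:

```python
import ipaddress

class PrefixPool:
    """Toy allocator: hands out non-overlapping child prefixes from a pool."""
    def __init__(self, pool: str):
        self.pool = ipaddress.ip_network(pool)
        self.allocated = []

    def allocate(self, prefixlen: int) -> ipaddress.IPv4Network:
        # Walk the candidate child prefixes and take the first free one
        for candidate in self.pool.subnets(new_prefix=prefixlen):
            if not any(candidate.overlaps(a) for a in self.allocated):
                self.allocated.append(candidate)
                return candidate
        raise ValueError("pool exhausted")

pool = PrefixPool("10.1.0.0/16")
hub = pool.allocate(22)     # 10.1.0.0/22 for the hub
spoke = pool.allocate(24)   # next free /24 for a spoke: 10.1.4.0/24
print(hub, spoke)
```

The point is the same as the portal experience: nobody has to ask a human for an address after the pool is created.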

Create Connectivity Configuration

A Connectivity Configuration is a method of connecting virtual networks. We can implement:

  • Hub-spoke peering: A traditional peering between a hub and a spoke, where the spoke can use the Virtual Network Gateway/Azure Route Server in the hub.
  • Mesh: A mesh using a Connected Group (full mesh peering between all virtual networks). This is used to minimise latency between workloads with the understanding that a hub firewall will not have the opportunity to do deep inspection (performance over security).
  • Hub & spoke with mesh: The targeted VNets are meshed together for interconnectivity. They will route through the hub to communicate with the outside world.

I will create a Connectivity Configuration for a traditional hub-and-spoke network. This means that:

  • I don’t need to add code for VNet peering to my future templates.
  • No matter who deploys a VNet in the scope of dub01, they will get peered with the hub. My design will be implemented, regardless of their knowledge or their willingness to comply with the organisation’s policies.

I created a new Connectivity Configuration called dub01spokepeering.

In Topology I set the type to hub-and-spoke. I select my hub virtual network from the p-dub01 subscription as the hub Virtual Network. I then select my group of networks that I want to peer with the hub by selecting the dub01spokes group. I can configure the peering connections; normally I would select Hub As Gateway here, but I don’t have a Virtual Network Gateway or an Azure Route Server in the hub, so the box is greyed out.

I am not enabling inter-spoke connectivity using the above configuration – AVNM has a few tricks, and this is one of them, where it uses Connected Groups to create a mesh of peering in the fabric. Instead, I will be using routing (later) via a hub firewall for secure transitive connectivity, so I leave Enable Connectivity Within Network Group blank.

Did you notice the checkbox to delete any pre-existing peering configurations? If a peering isn’t to the hub then I’m removing it, so nobody uses their rights to bypass my networking design.

I completed the wizard and executed the deployment against the North Europe region. I know that there is nothing to configure, but this “cleans up” the GUI.

Create Routing Configuration

Folks who have heard me discuss network security in Azure should have learned that the most important part of running a firewall in Azure is routing. We will configure routing in the spokes using AVNM. The hub firewall subnet(s) will have full knowledge of all other networks by design:

  • Spokes: Using system routes generated by peering.
  • Remote networks: Using BGP routes. The VPN Local Network Gateway creates BGP routes in the Azure Virtual Networks for “static routes” when BGP is not used in VPN tunnels. Azure Route Server will peer with NVA routers (SD-WAN, for example) to propagate remote site prefixes using BGP into the Azure Virtual Networks.

The spokes routing design is simple:

  • A Route Table will be created for each subnet in the spoke Virtual Networks. Route Tables are free resources, and this design allows customised routing for specific scenarios, such as VNet-integrated PaaS resources that require dedicated routes.
  • A single User-Defined Route (UDR) forces traffic leaving a spoke Virtual Network to pass through the hub firewall, where firewall rules will deny all traffic by default.
  • Traffic inside the Virtual Network will flow by default (directly from source to destination) and be subject to NSG rules, depending on support by the source and destination resource types.
  • The spoke subnets will be configured not to accept BGP routes from the hub; this is to prevent the spoke from bypassing the hub firewall when routing to remote sites via the Virtual Network Gateway/NVA.

I created a Routing Configuration called dub01spokerouting. In this Routing Configuration I created a Rule Collection called dub01spokeroutingrules.

A User-Defined Route, known as a Routing Rule, was created called everywhere:

The new UDR will override (deactivate) the System route to 0.0.0.0/0 via Internet and set the hub firewall as the new default next hop for traffic leaving the Virtual Network.

Here you can see the Routing Collection containing the Routing Rule:

Note that Enable BGP Route Propagation is left unchecked and that I have selected dub01spokes as my target.

And here you can see the new Routing Configuration:

Completed Configurations

I now have two configurations completed and configured:

  • The Connectivity Configuration will automatically peer in-scope Virtual Networks with the hub in p-dub01.
  • The Routing Configuration will automatically configure routing for in-scope Virtual Network subnets to use the p-dub01 firewall as the next hop.

Guess what? We have just created a Zero Trust network! All that’s left is to set up spokes with their NSGs and a WAF/WAFs for HTTPS workloads.

Deploy Spoke Virtual Networks

We will create spoke Virtual Networks from the IPAM block just like we did with the hub. Here’s where the magic is going to happen.

The evaluation-style Azure Policy assignments that are created by AVNM will run approximately every 30 minutes. That means a new Virtual Network won’t be discovered straight after creation – but they will be discovered not long after. A signal will be sent to AVNM to update group memberships based on added or removed Virtual Networks, depending on the scope of each group’s Azure Policy. Configurations will be deployed or removed immediately after a Virtual Network is added or removed from the group.

To demonstrate this, I created a new spoke Virtual Network in p-demo01. I created a new Virtual Network called p-demo01-net-vnet in the resource group p-demo01-net:

You can see that I used the IPAM address block to get a unique address space from the dub01 /16 prefix. I added a subnet called CommonSubnet with a /28 prefix. What you don’t see is that I configured the following for the subnet in the subnet wizard:

As you can see, the Virtual Network has not been configured by AVNM yet:

We will have to wait for Azure Policy to execute – or we can force a scan to run against the resource group of the new spoke Virtual Network:

  • Az CLI: az policy state trigger-scan --resource-group <resource group name>
  • PowerShell: Start-AzPolicyComplianceScan -ResourceGroupName <resource group name>

You could add a command like above into your deployment code if you wished to trigger automatic configuration.

This force process is not exactly quick either! 6 minutes after I forced a policy evaluation, I saw that AVNM was informed about a new Virtual Network:

I returned to AVNM and checked out the Network Groups. The dub01spokes group has a new member:

You can see that a Connectivity Configuration was deployed. Note that the summary doesn’t have any information on Routing Configurations – that’s an oversight by the AVNM team, I guess.

The Virtual Network does have a peering connection to the hub:

The routing has been deployed to the subnet:

A UDR has been created in the Route Table:

Over time, more Virtual Networks are added and I can see from the hub that they are automatically configured by AVNM:

Summary

I have done presentations on AVNM and demonstrated the above configurations in 40 minutes at community events. You could deploy the configurations in under 15 minutes. You can also create them using code! With this setup we can take control of our entire Azure networking deployment – and I didn’t even show you the Admin Rules feature for essential “NSG” rules (they aren’t NSG rules but use the same underlying engine to execute before NSG rules).

Want To Learn More?

Check out my company, Cloud Mechanix, where I share this kind of knowledge through:

  • Consulting services for customers and Microsoft partners using a build-with approach.
  • Custom-written and ad-hoc Azure training.

Together, we can educate your team and bring great Azure solutions to your organisation.

Day Two Devops – Azure VNets Don’t Exist

I had the pleasure of chatting with Ned Bellavance and Kyler Middleton on Day Two DevOps one evening recently to discuss the basics of Azure networking, using my line “Azure Virtual Networks Do Not Exist”. I think I talked almost non-stop for nearly 40 minutes 🙂 Tune in and you’ll hear my explanation of why many people get so much wrong in Azure networking/security.

Designing A Hub And Spoke Infrastructure

How do you plan a hub & spoke architecture? Based on much of what I have witnessed, I think very few people do any planning at all. In this post, I will explain some essential things to plan and how to plan them.

Rules of Engagement

Microsoft has shared some concepts in the Well-Architected Framework (simplicity) and the documentation for networking & Zero Trust (micro-segmentation, resilience, and isolation).

The hub & spoke will contain networks in a single region, following these concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

Management Groups

I agree with many things in the Cloud Adoption Framework “Enterprise Scale” and I disagree with some other things.

I agree that we should use Management Groups to organise subscriptions based on Policy architecture and role-based access control (RBAC – granting access to subscriptions via Entra groups).

I agree that each workload (CAF calls them landing zones) should have a dedicated subscription – this simplifies operations and governance like you wouldn’t believe.

I can see why they organise workloads based on their networking status:

  • Corporate: Workloads that are internal only and are connected to the hub for on-premises connectivity. No public IP addresses should be allowed where technically feasible.
  • Online: Workloads that are online only and are not permitted to be connected to the hub.
  • Hybrid: This category is missing from CAF and many have added it themselves – WAN and Internet connectivity are usually not binary exclusive OR decisions.

I don’t like how Enterprise Scale buckets all of those workloads into a single grouping because it fails to acknowledge that a truly large enterprise will have many ownership footprints in a single tenant.

I also don’t like how Enterprise Scale merges all hubs into a single subscription or management group. Yes, many organisations have central networking teams. Large organisations may have many networking teams. I like to separate hub resources (not feasible with Virtual WAN) into different subscriptions and management groups for true scaling and governance simplicity.

Here is an example of how one might achieve this. I am going to have two hub & spoke deployments in this example:

  • DUB01: Located in Azure North Europe
  • AMS01: Located in Azure West Europe

Some of you might notice that I have been inspired by Microsoft’s data centre naming for the naming of these regional footprints. The reasons are:

  • Naming regions after “North Europe” or “East US” is messy when you think about naming network footprints in East US2, West US2, and so on.
  • Microsoft has already done the work for us. The Dublin (North Europe) region data centres are called DUB05-DUB15 and Microsoft uses AMS01, etc for Middenmeer (West Europe).
  • A single virtual network may have up to 500 peers. Once we hit 500 peers then we need to deploy another hub & spoke footprint in the region. The naming allows DUB02, DUB03, etc.

The change from CAF Enterprise Scale is subtle but look how instantly more scalable and isolated everything is. A truly large organisation can delegate duties as necessary.

If an identity responsible for the AMS01 hub & spoke is compromised, the DUB01 hub & spoke is untouched. Resources are in dedicated subscriptions so the blast radius of a subscription compromise is limited too.

There is also a logical placement of the resources based on ownership/location.

You don’t need to recreate policy – you can add more associations to your initiatives.

If an enterprise currently has a single networking team, their IDs are simply added to more groups as new hub & spoke deployments are added.

IP Planning

One of the key principles in the design is simplicity: keep it simple stupid (KISS). I’m going to jump ahead a little here and give you a peek into the future. We will implement “Network segmentation: Many ingress/egress cloud micro-perimeters with some micro-segmentation” from the Azure zero-trust guidance.

The only connection that will exist between DUB01 and AMS01 is a global VNet peering connection between the hubs. All traffic between DUB01 and AMS01 must route via the firewalls in the hubs. This will require some user-defined routing and we want to keep this as simple as possible.

For example, the firewall subnet in DUB01 must have a route(s) to all prefixes in AMS01 via the firewall in the hub of AMS01. The more prefixes there are in AMS01, the more routes we must add to the Route Table associated with the firewall subnet in the hub of DUB01. So we will keep this very simple.

Each hub & spoke will be created from a single IP prefix allocation:

  • DUB01: All virtual networks in DUB01 will be created from 10.1.0.0/16.
  • AMS01: All virtual networks in AMS01 will be created from 10.2.0.0/16.

You might have noticed that Azure Virtual Network Manager uses a default of /16 for an IP address block in the IPAM feature – how convenient!

That means I only have to create one route in the DUB01 firewall subnet to reach all virtual networks in AMS01:

  • Name: AMS01
  • Prefix: 10.2.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the AMS01 firewall

A similar route will be created in AMS01 firewall subnet to reach all virtual networks in DUB01:

  • Name: DUB01
  • Prefix: 10.1.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the DUB01 firewall

Honestly, that is all that is required. I’ve been doing it for years. It’s beautifully simple.
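
A quick sanity check with Python’s standard `ipaddress` module shows why one /16 route per remote hub is enough. The AMS01 spoke prefixes below are hypothetical examples carved from the pool:

```python
import ipaddress

ams01_route = ipaddress.ip_network("10.2.0.0/16")  # the single UDR prefix

# Hypothetical AMS01 VNets allocated from the pool over time
ams01_vnets = ["10.2.0.0/22", "10.2.4.0/24", "10.2.5.0/28"]

# Every current and future VNet from the pool falls under the one route,
# so the DUB01 firewall Route Table never needs to be updated.
assert all(
    ipaddress.ip_network(v).subnet_of(ams01_route) for v in ams01_vnets
)
```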

The firewall(s) are in total control of the flows. This design means that neither location is dependent on the other. Neither AMS01 nor DUB01 trust each other. If a workload is compromised in AMS01 its reach is limited to whatever firewall/NSG rules permit traffic. With threat detection, flow logs, and other features, you might even discover an attack using a security information & event management (SIEM) system before it even has a chance to spread.

Workloads/Landing Zones

Every workload will have a dedicated subscription with the appropriate configurations, such as enabling budgets and configuring Defender for Cloud. Standards should be as automated as possible (Azure Policy). The exact configuration of the subscription should depend on the zone (corporate, online, or hybrid).

When there is a virtual network requirement, then the virtual network will be as small as is required with some spare capacity. For example, a workload with a web VM and a SQL Server doesn’t need a /24 subnet!

Essential Workloads

Are you going to migrate legacy workloads to Azure? Are you going to run Citrix or Azure Virtual Desktop (AVD)? If so, then you are going to require domain controllers.

You might say “We have a policy of running a single ADDS site and our domain controllers are on-premises”. Lovely, at least it was when Windows Server 2003 came out. Remember that I want my services in Azure to be resilient and not to depend on other locations. What happens to all of your Azure services when the network connection to on-premises fails? Or what happens if on-premises goes up in a cloud of smoke? I will put domain controllers in Azure.

Then you might say “We will put domain controllers in DUB01 and AMS01 can use them”. What happens if DUB01 goes offline? That does happen from time to time. What happens if DUB01 is compromised? Not only will I put domain controllers in DUB01, but I will also put them in AMS01. They are low end virtual machines and the cost will be minor. I’ll also do some good ADDS Sites & Services stuff to isolate as much as ADDS lets you:

  • Create subnets for each /16 IP prefix.
  • Create an ADDS site for AMS01 and another for DUB01.
  • Associate each site with the related subnet.
  • Create and configure replication links as required.

The placement and resilience of other things like DNS servers/Private DNS Resolver should be similar.

And none of those things will go in the hub!

Micro-Segmentation

The hub will be our transit network, providing:

  • Site-to-site connectivity, if required.
  • Point-to-site connectivity, if required.
  • A firewall for security and routing purposes.
  • A shared Azure Bastion, if required.

The firewall will be the next hop, by default (expect exceptions) for traffic leaving every virtual network. This will be configured for every subnet (expect exceptions) in every workload.

The firewall will be the glue that routes every spoke virtual network to each other and the outside world. The firewall rules will restrict which of those routes is possible and what traffic is possible – in all directions. Don’t be lazy and allow * to Internet; do you want to automatically enable malware to call home for further downloads or discovery/attack/theft instructions?

The firewall will be carefully chosen to ensure that it includes the features that your organisation requires. Too many organisations pick the cheapest firewall option. Few look at the genuine risks that they face and pick something that best defends against those risks. Allow/deny is not enough any more. Consider the features that pay careful attention to what must be allowed; the ports that must be open are the very ones that attackers are using to compromise their victims.

Every subnet (expect exceptions) will have an NSG. That NSG will have a custom low-priority inbound rule to deny everything; this means that no traffic can enter a NIC (from anywhere, including the same subnet) without being explicitly allowed by a higher priority rule.

“Web” (this covers a lot of HTTPS based services, excluding AVD) applications will not be published on the Internet using the hub firewall. Instead, you will deploy a WAF of some kind (or different kinds depending on architectural/business requirements). If you’re clever, and it is appropriate from a performance perspective, you might route that traffic through your firewall for inspection at layers 4-7 using TLS Inspection and IDPS.

Logging and Alerting

You have put all the barriers in place. There are two interesting quotes to consider. The first warns us that we must assume a penetration has already taken place or will take place.

Fundamentally, if somebody wants to get in, they’re getting in…accept that. What we tell clients is: Number one, you’re in the fight, whether you thought you were or not. Number two, you almost certainly are penetrated.

Michael Hayden Former Director of NSA & CIA

The second warns us that attackers don’t think like defenders. We build walls expecting a linear attack. Attackers poke, explore, and prod, looking for any way, including very indirect routes, to get from A to B.

Biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.

John Lambert

Each of our walls offers some kind of monitoring. The firewall has logs, which ideally we can either monitor/alert from or forward to a SIEM.

Virtual Networks offer Flow Logs which track traffic at the VNet level. VNet Flow Logs are superior to NSG Flow Logs because they catch more traffic (Private Endpoints, for example) and include more interesting data. This is more data that we can send to a SIEM.

Defender for Cloud creates data/alerts. Key Vaults do. Azure databases do. The list goes on and on. All of this is data that we can use to:

  • Detect an attack
  • Identify exploration
  • Uncover an expansion
  • Understand how an attack started and happened

And it amazes me how many organisations choose not to configure these features in any way at all.

Wrapping Up

There are probably lots of finer details to consider but I think that I have covered the essentials. When I get the chance, I’ll start diving into the fun detailed designs and their variations.

How Many Azure Route Tables Should I Have?

In this Azure Networking deep dive, I’m going to share some of my experience around planning the creation and association of Route Tables in Microsoft Azure.

Quick Recap

The purpose of a Route Table is to apply User-Defined Routes (UDRs). The Route Table is associated with a subnet. The UDRs in the Route Table are applied to the NICs in the subnet. The UDRs override System and/or BGP routes to force routes on outgoing packets to match your desired flows or security patterns.

Remember: There are no subnets or default gateways in Azure; the NIC is the router and packets go directly from the source NIC to the destination NIC. A route can be used to alter that direct flow and force the packets through a desired next hop, such as a firewall, before continuing to the destination.
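
When a NIC routes a packet, the most specific (longest) matching prefix wins, with UDRs taking precedence over BGP and System routes of the same length. Here is a minimal sketch of the longest-prefix-match part, using a simplified, hypothetical route table:

```python
import ipaddress

def select_route(destination: str, routes: list[tuple[str, str]]) -> str:
    """Return the next hop whose prefix is the longest match for destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    # The longest prefix (largest prefix length) wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]

routes = [
    ("0.0.0.0/0", "Firewall (UDR)"),    # the spoke's single UDR
    ("10.1.4.0/24", "VirtualNetwork"),  # system route for the local VNet
]

# Intra-VNet traffic still flows directly; everything else hits the firewall.
print(select_route("10.1.4.10", routes))  # VirtualNetwork
print(select_route("8.8.8.8", routes))    # Firewall (UDR)
```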

Route Table Association

A Route Table is associated with one or more subnets. The purpose of this is to cause the UDRs of the Route Table to be deployed to the NICs that are connected to the subnet(s).

Technically speaking, there is nothing wrong with associating a single Route Table with more than one subnet. But I would question the wisdom of this practice.

1:N Association

The concept here is that one creates a single Route Table that will be used across many subnets. The desire is to reduce effort – there is no cost saving because Route Tables are free:

  1. You create a Route Table
  2. You add all the required UDRs for your subnets
  3. You associate the Route Table with the subnets

It all sounds good until you realise:

  • That individual subnets can require different routes. For example, a simple subnet containing some compute might only require a route for 0.0.0.0/0 to use a firewall as a next hop. On the other hand, a subnet containing VNet-integrated API Management might require 60+ routes. Your security model at this point can become complicated, unpredictable, and contradictory.
  • Centrally managing network resources, such as Route Tables, for sharing and “quality control” contradicts one of the main purposes of The Cloud: self-service. Watch how quickly the IT staff that the business does listen to (the devs) rebel against what you attempt to force upon them! Cloud is how you work, not where you work.
  • Certain security models won’t work.

1:1 Association

The purpose of 1:1 association is to:

  • Enable granular routing configuration; routes are generated for each subnet depending on the resource/networking/security requirements of the subnet.
  • Enable self-service for developers/operators.

The downside is that you can end up with a lot of Route Tables – keep in mind that some people create too many subnets. One might argue that this is a lot of effort but I would counter that by saying:

  • I can automate the creation of Route Tables using several means including infrastructure-as-code (IaC), Azure Policy, or even Azure Virtual Network Manager (with its new per-VNet pricing model).
  • Most subnets will have just one UDR: 0.0.0.0/0 via the firewall.
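
The idea above can be sketched in a few lines of code. This is a purely illustrative model – the VNet/subnet names and the firewall IP are made up, and a real deployment would emit IaC (Bicep/Terraform) rather than dictionaries – but it shows how cheap the 1:1 pattern is to automate:

```python
# Hypothetical sketch: generate one Route Table definition per subnet (1:1
# association), named after its VNet/subnet. Names and the firewall IP are
# illustrative, not from any real deployment.

FIREWALL_IP = "10.0.0.4"  # assumed private IP of the hub firewall

def route_table_for(vnet: str, subnet: str) -> dict:
    """Build one Route Table, containing the single UDR most subnets need:
    0.0.0.0/0 via the firewall."""
    return {
        "name": f"rt-{vnet}-{subnet}",
        "associated_subnet": f"{vnet}/{subnet}",
        "routes": [
            {
                "name": "udr-default",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip": FIREWALL_IP,
            }
        ],
    }

subnets = [("vnet-prod", "snet-web"), ("vnet-prod", "snet-app")]
tables = [route_table_for(v, s) for v, s in subnets]
for t in tables:
    print(t["name"], "->", t["routes"][0]["address_prefix"])
```

Subnets with special requirements (API Management, for example) simply get a different generator function; the naming and association pattern stays the same.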

What Do I Do & Recommend?

I use the approach of 1:1 association. Each subnet, subject to support, gets its own Route Table. The Route Table is named after the VNet/subnet and is associated only with that subnet.

I’ve been using that approach for as long as I can remember. It was formalised 6 years ago and it has worked at scale. As I stated, it’s no effort because the creation/association of the Route Tables is automated. The real benefit is the predictability of the resulting security model.

Routing Is The Security Cabling of Azure

In this post, I want to explain why routing is so important in Microsoft Azure. Without truly understanding routing, and implementing predictable and scaleable routing, you do not have a secure network. What one needs to understand is that routing is the security cabling of Azure.

My Favourite Interview Question

Now and then, I am asked to do a technical interview of a new candidate at my employer. I enjoy doing technical interviews because you get to have a deep tech chat with someone who is on their career journey. Sometimes it’s a hopeful youngster who is still new to the business but demonstrates an ability and a desire to learn – they’re a great find, by the way. Sometimes it’s a veteran that you learn something from. And sometimes, they fall into the trap of discussing my favourite Azure topic: routing.

Before I continue, I should warn potential interviewees that the thing I dislike most in a candidate is when they talk about things that “happened while I was there” and then they claim to be experts in that stuff.

The candidate will say “I deployed a firewall in Azure”. The little demon on my shoulder says “ask them, ask them, ASK THEM!”. I can’t help myself – “How did you make traffic go through the firewall?”. The wrong answer here is: “it just did”.

The Visio Firewall Fallacy

I love diagrams like this one:

Look at that beauty. You’ve got Azure networks in the middle (hub) and the right (spoke). And on the left is the remote network connected by some kind of site-to-site networking. The deployment even has the rarely used and pricey Network SKU of DDoS protection. Fantastic! Security is important!

And to re-emphasise that security is important, the firewall (it doesn’t matter what brand you choose in this scenario) is slap-bang in the middle of the whole thing. Not only is that firewall important, but all traffic will have to go through it – nothing happens in that network without the firewall controlling it.

Except, that the firewall is seeing absolutely no traffic at all.

Packets Route Directly From Source To Destination

At this point, I’d like you to (re-)read my post, Azure Virtual Networks Do Not Exist. There I explained two things:

  • Everything is a VM in the platform, including NVA routers and Virtual Network Gateways (2 VMs).
  • Packets always route directly from the source NIC to the destination NIC.

In our above firewall scenario, let’s consider two routes:

  • Traffic from a client in the remote site to an Azure service in the spoke.
  • A response from the service in the Azure spoke to the client in the remote site.

The client sends traffic from the remote site across the site-to-site connection. The physical part of that network is the familiar flow that you’d see in tracert. Things change once that packet hits Azure. The site-to-site connection terminates in the NVA/virtual network gateway. Now the packet needs to route to the service in the spoke. The scenario is that the NVA/virtual network gateway is the source (in Azure networking) and the spoke service is the destination. The packet leaves the NIC of the NVA/virtual network gateway and routes (via the underlying physical Azure network) directly to the NIC of one of the load-balanced VMs in the spoke. The packet did not route through the firewall. The packet did not go through a default gateway. The packet did not go across some virtual peering wire. Repeat it after me:

Packets route directly from source to destination.

Now for the response. The VM in the spoke is going to send a response. Where will that response go? You might say “The firewall is in the middle of the diagram, Aidan. It’s obvious!”. Remember:

Packets route directly from source to destination.

In this scenario, the destination is the NVA/virtual network gateway. The packet will leave the VM in the spoke and appear in the NIC of the NVA/virtual network gateway.

It doesn’t matter how pretty your Visio is (Draw.io is a million times better, by the way – thanks for the tip, Haakon). It doesn’t matter what your intention was. Packets … route directly from source to destination.

User-Defined Routes – Right?

You might be saying, “Duh, Aidan, User-Defined Routes (UDRs) in Route Tables will solve this”. You’re sort of on the right track – maybe even mostly there. But I know from talking to many people over the years that they completely overlook the two (I’d argue three) other sources of routes in Azure. Those other routes are playing a role here that you’re not appreciating, and if you do not configure your UDRs/Route Tables correctly, you’ll either change nothing or break your network.
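
The route selection behaviour itself is documented and predictable: the longest matching prefix wins, and when several routes share the same prefix, a User-Defined Route beats a BGP route, which beats a system route. Here’s a minimal conceptual model of that logic – the prefixes, next hops, and firewall IP are illustrative, not a real Azure implementation:

```python
# Conceptual sketch of Azure's effective-route selection: longest prefix
# match wins; on a tie, UserDefined > BGP > System. Addresses are made up.
import ipaddress

PRECEDENCE = {"UserDefined": 0, "BGP": 1, "System": 2}

def effective_route(routes, destination):
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes
                  if dest in ipaddress.ip_network(r["prefix"])]
    # Longest prefix first, then route-source precedence.
    candidates.sort(key=lambda r: (-ipaddress.ip_network(r["prefix"]).prefixlen,
                                   PRECEDENCE[r["source"]]))
    return candidates[0]

routes = [
    {"prefix": "10.1.0.0/16", "source": "System", "next_hop": "VnetLocal"},
    {"prefix": "0.0.0.0/0", "source": "System", "next_hop": "Internet"},
]
# Without a UDR, traffic to a peered/local prefix stays direct:
print(effective_route(routes, "10.1.2.4")["next_hop"])  # VnetLocal

# Add a UDR for the same prefix to force traffic through the firewall:
routes.append({"prefix": "10.1.0.0/16", "source": "UserDefined",
               "next_hop": "10.0.0.4"})
print(effective_route(routes, "10.1.2.4")["next_hop"])  # 10.0.0.4
```

Notice that the system route for the VNet prefix is always there; your UDR doesn’t delete it, it outranks it. Forget that, and your pretty Visio firewall sees nothing.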

Routing Is The Security Cabling of Azure

In the on-premises world, we use cables to connect network appliances. You can’t get from one top-of-rack switch/VLAN to another without going through a default gateway. That default gateway can be a switch, a switch core, a router, or a firewall. Connections are made possible via cables. Just like water flow is controlled by pipes, packets can only transit cables that you lay down.

If you read my Azure Virtual Networks Do Not Exist post then you should understand that NICs in a VNet or in peered VNets are a mesh of NICs that can route directly to each other. There is no virtual network cabling; this means that we need to control the flows via some other means and that means is routing.

One must understand the end state, how routing works, and how to manipulate routing to end up in the desired end state. That’s the obvious bit – but often overlooked is that the resulting security model should be scaleable, manageable, and predictable.

How Do Network Security Groups Work?

A Greek Phalanx, protected by a shield wall made up of many individuals working under one instruction as a unit – like an NSG.

Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).

Port ACLs

In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a feature, mostly overlooked by customers at the time, called Port ACLs. I liked Port ACLs; they were mostly unknown, could only be managed using PowerShell, and made for great demo content in some TechEd/Ignite sessions that I did back in the day.

Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.

The concept of Port ACLs was that they gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.
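
That “deny by default, allow by exception” evaluation is simple enough to model in a few lines. This is a toy simulation of priority-based rule processing (lower priority number = processed first, first match wins) – the rule names and numbers are illustrative:

```python
# Toy model of priority-based rule evaluation: rules are checked from the
# lowest priority number to the highest, and the first match wins.
# Rules and ports are illustrative.

def evaluate(rules, protocol, port):
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["protocol"] in (protocol, "Any") and rule["port"] in (port, "*"):
            return rule["action"]
    return "Deny"  # nothing matched

inbound = [
    {"priority": 100, "protocol": "TCP", "port": 443, "action": "Allow"},
    {"priority": 4096, "protocol": "Any", "port": "*", "action": "Deny"},
]
print(evaluate(inbound, "TCP", 443))   # Allow - HTTPS gets in
print(evaluate(inbound, "TCP", 3389))  # Deny - everything else is blocked
```

The guest OS never sees the denied packets; the decision is made by the platform before the traffic reaches the virtual machine.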

Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.

A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.

NSG and Admin Rules Are Port ACLs

Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.

What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.

A Little About NSG Rules

This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.

A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (deny all by default) to control egress traffic.

When I create an NSG I can associate it with:

  • A NIC: Only that NIC is affected
  • A subnet: All NICs, including VNet-integrated PaaS resources and Private Endpoints, are affected

The association is simply a management scaling feature. When you associate an NSG with a subnet, the rules are not processed at the subnet.

Tip: virtual networks do not exist!

Associating an NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.

This means:

  • Inbound rules prevent traffic from entering the virtual machine.
  • Outbound rules prevent traffic from leaving the virtual machine.

Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.

If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.

If you hate yourself and everyone around you, then use both options at the same time:

  • The subnet NSG is processed first for inbound traffic.
  • The NIC NSG is processed first for outbound traffic.
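
For inbound traffic, the effect of using both associations is simple to state: the subnet NSG is evaluated first, then the NIC NSG, and the traffic must be allowed by both. A toy sketch (NSGs reduced to sets of permitted TCP ports, which is an illustration, not how Azure stores rules):

```python
# Toy model of dual NSG association for inbound traffic: the subnet NSG is
# checked first, then the NIC NSG, and both must allow the traffic.
# NSGs are simplified to sets of permitted TCP ports.

def allowed_by(nsg, port):
    return port in nsg

def inbound_allowed(subnet_nsg, nic_nsg, port):
    # Short-circuits just like Azure: a subnet-level deny means the
    # NIC NSG is never consulted.
    return allowed_by(subnet_nsg, port) and allowed_by(nic_nsg, port)

subnet_nsg = {443, 22}
nic_nsg = {443}
print(inbound_allowed(subnet_nsg, nic_nsg, 443))  # True
print(inbound_allowed(subnet_nsg, nic_nsg, 22))   # False - NIC NSG blocks it
```

Two places to look, two places to get it wrong – which is exactly why I say pick one association model and stick to it.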

Keep it simple, stupid (the KISS principle).

Micro-Segmentation

As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:

  • We zone using NSG rules.
  • Virtual networks and their subnets do not exist!

The only time we need to create additional subnets is when there are compatibility issues, such as NSG/Route Table association, or when a PaaS resource requires a dedicated subnet.

Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.

Azure Virtual Networks Do Not Exist

I see many bad designs where people bring cable-oriented designs from physical locations into Azure. I hear lots of incorrect assumptions when people are discussing network designs. For example: “I put shared services in the hub because they will be closer to their clients”. Or my all-time favourite: people assuming that ingress traffic from site-to-site connections will go through a hub firewall because it’s in the middle of their diagram. All this is because of one common mistake – people don’t realise that Azure Virtual Networks do not exist.

Some Background

Azure was designed to be a multi-tenant cloud capable of hosting an “unlimited” number of customers.

I realise that “unlimited” easily leads us to jokes about endless capacity issues 🙂

Traditional hosting (and I’ve worked there) is based on good old fashioned networking. There’s a data centre. At the heart of the data centre network there is a network core. The entire facility has a prefix/prefixes of address space. Every time that a customer is added, the network administrators carve out a prefix for the new customer. It’s hardly self-service and definitely not elastic.

Azure settled on VXLAN to enable software-defined networking – a process where customer networking could be layered upon the physical networks of Microsoft’s physical data centres/global network.

Falling Into The Myth

Virtual Networks make life easy. Do you want a network? There’s no need to open a ticket. You don’t need to hear from a snarky CCIE who snarls when you ask for a /16 address prefix as if you’ve just asked for Flash “RAID10” from a SAN admin. No; you just open up the Portal/PowerShell/VS Code and you deploy a network of whatever size you want. A few seconds later, it’s there and you can start connecting resources to the new network.

You fire up a VM and you get an address from “DHCP”. It’s not really DHCP. How can it be when Azure virtual networks do not support broadcasts or multicasts? You log into that VM, have networking issues, and so you troubleshoot like you learned how to:

  1. Ping the local host
  2. Ping the default gateway – oh! That doesn’t work.
  3. Traceroute to a remote address – oh! That doesn’t work either.

And then you start to implement stuff just like you would in your own data centre.

How “It Works”

Let me start by stating that I do not know how the Azure fabric works under the covers. Microsoft aren’t keen on telling us how the sausage is made. But I know enough to explain the observable results.

When you click the Create button at the end of the Create Virtual Machine wizard, an operator is given a ticket, they get clearance from data center security, they grab some patch leads, hop on a bike, and they cycle as fast as they can to the patch panel that you have been assigned in Azure.

Wait … no … that was a bad stress dream. But really, from what I see/hear from many people, they think that something like that happens. Even if that was “virtual”, the whole thing just would not be scaleable.

Instead, I want you to think of an Azure virtual network as a Venn diagram. The process of creating a virtual network instructs the Azure fabric that any NIC that connects to the virtual network can route to any other NIC in the virtual network.

Two things here:

  1. You should take “route” as meaning a packet can go from the source NIC to the destination NIC. It doesn’t mean that it will make it through either NIC – we’ll cover that topic in another post soon.
  2. Almost everything in Azure is a virtual machine at some level. For example, “serverless” Functions run in Microsoft-managed VMs in a Microsoft tenant. Microsoft surfaces the Function functionality to you in your tenant. If you connect PaaS services (like ASE or SQL Managed Instance) to your virtual network, then there will be NICs that connect to a subnet.

Connecting a NIC to a virtual network adds the new NIC to the Venn Diagram. The Azure fabric now knows that this new NIC should be able to route to other NICs in the same virtual network and all the previous NICs can route to it.

Adding Virtual Network Peering

Now we create a second virtual network. We peer those virtual networks and then … what happens now? Does a magic/virtual pipe get created? Nope – it’s fault tolerant so two magic/virtual lines connect the virtual networks? Nope. It’s Venn diagram time again.

The Azure fabric learns that the NICs in Virtual Network 1 can now route to the NICs in Virtual Network 2 and vice versa. That’s all. There is no magic connection. From a routing/security perspective, the NICs in Virtual Network 1 are no different to the NICs in Virtual Network 2. You’ve just created a bigger mesh from (at least) two address prefixes.
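
The Venn diagram idea can be reduced to a few lines of code: reachability is set membership, nothing more. This is a deliberately simplified model (the VNet and NIC names are made up, and real peering has options this ignores):

```python
# Toy model of the "Venn diagram": a NIC can route to another NIC if both
# are in the same VNet, or if their VNets are peered. There is no pipe or
# wire object anywhere in this model - just membership and peering facts.

peerings = {("vnet1", "vnet2")}  # peering is effectively bidirectional

def peered(a, b):
    return (a, b) in peerings or (b, a) in peerings

def can_route(nic_a, nic_b):
    return nic_a["vnet"] == nic_b["vnet"] or peered(nic_a["vnet"], nic_b["vnet"])

web = {"name": "nic-web", "vnet": "vnet1"}
app = {"name": "nic-app", "vnet": "vnet2"}
db = {"name": "nic-db", "vnet": "vnet3"}

print(can_route(web, app))  # True - vnet1 and vnet2 are peered
print(can_route(web, db))   # False - no peering, no shared VNet
```

Note what the model does not contain: no hub, no gateway, no link objects. Peering just makes the mesh bigger.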

Repeat after me:

Virtual networks do not exist

How Do Packets Travel?

OK Aidan (or “Joe” if you arrived here from Twitter), how the heck do packets get from one NIC to another?

Let’s melt some VMware fanboy brains – that sends me to my happy place. Azure is built using Windows Server Hyper-V; the same Hyper-V that you get with commercially available Windows Server. Sure, Azure layers a lot of stuff on top of the hypervisor, but if you dig down deep enough, you will find Hyper-V.

Virtual machines, yours or those run by Microsoft, are connected to a virtual switch on the host. The virtual switch is connected to a physical ethernet port on the host. The host is addressable on the Microsoft physical network.

You come along and create a virtual network. The fabric knows to track NICs that are being connected. You create a virtual machine and connect it to the virtual network. Azure will place that virtual machine on one host. As far as you are concerned, that virtual machine has an address from your network.

You then create a second VM and connect it to your virtual network. Azure places that machine on a different host – maybe even in a different data centre. The fabric knows that both virtual machines are in the same virtual network so they should be able to reach each other.

You’ve probably used a 10.something address, like most other customers, so how will your packets stay in your virtual network and reach the other virtual machine? We can thank software-defined networking for this.

Let’s use the addresses of my above diagram for this explanation. The source VM has a customer IP address of 10.0.1.4. It is sending a packet to the destination VM with a customer address of 10.0.1.5. The packet leaves the source NIC, 10.0.1.4, and reaches the host’s virtual switch. This is where the magic happens.

The packet is encapsulated, changing the destination address to that of the destination virtual machine’s host. Imagine you are sending a letter (remember those?) to an apartment number. It’s not enough to say “Apartment 1”; you have to include other information to encapsulate it. That’s what the fabric enables by tracking where your NICs are hosted. Encapsulation wraps the customer packet up in an Azure packet that is addressed to the host’s address, capable of travelling over the Microsoft Global Network – supporting single virtual networks and peered (even globally) virtual networks.

The packet routes over the Microsoft physical network, unbeknownst to us. It reaches the destination host, and the encapsulation is removed at the virtual switch. The customer packet is dropped into the memory of the destination virtual machine and, bingo, the transmission is complete.
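
Here’s the apartment-letter analogy as code. This is an illustrative sketch only – the host names are invented, and real encapsulation (VXLAN framing, fabric lookups) is far richer than a pair of dictionaries:

```python
# Illustrative sketch of the encapsulation step: the customer packet keeps
# its 10.x addresses, and the virtual switch wraps it in an outer packet
# addressed host-to-host. Host names are made up.

def encapsulate(customer_packet, nic_to_host):
    """Wrap the customer packet in an outer packet addressed to the
    physical host where the destination NIC currently lives."""
    return {
        "outer_src": nic_to_host[customer_packet["src"]],
        "outer_dst": nic_to_host[customer_packet["dst"]],
        "payload": customer_packet,  # customer addresses are untouched
    }

def decapsulate(outer_packet):
    return outer_packet["payload"]

# The fabric tracks which physical host each NIC currently lives on.
nic_to_host = {"10.0.1.4": "host-a", "10.0.1.5": "host-b"}

packet = {"src": "10.0.1.4", "dst": "10.0.1.5", "data": "hello"}
wrapped = encapsulate(packet, nic_to_host)
print(wrapped["outer_dst"])            # host-b
print(decapsulate(wrapped) == packet)  # True
```

If a VM moves to another host, only the `nic_to_host` mapping changes; your customer addressing, and therefore your routing and security model, is untouched.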

From our perspective, the packet routes directly from source to destination. This is why you can’t ping a default gateway – it’s not there because it plays no role in routing because: the virtual network does not exist.

I want you to repeat this:

Packets go directly from source to destination

Two Most Powerful Pieces Of Knowledge

If you remember …

  • Virtual networks do not exist and
  • Packets go directly from source to destination

… then you are a huge way along the road of mastering Azure networking. You’ve broken free from string theory (cable-oriented networking) and into quantum physics (software-defined networking). You’ll understand that segmenting networks into subnets for security reasons makes no sense. You will appreciate that placing “shared services” in the hub offers no performance gain (and either breaks your security model or makes it way more complicated).