In this post, I will explain how you can enable Virtual Network (VNet) Flow Logs at scale using a built-in Azure Policy.
Background
Flow logging plays an essential role in Azure networking by recording every flow (and more). Typical uses include:
Troubleshooting: Verify that packets reach a destination or pass through an appliance, check whether traffic is allowed by an NSG, and more!
Security: Search for threats by pushing the data into a SIEM, like Microsoft Sentinel, and provide a history of connectivity to investigate a breach.
Auditing: Have a history of what happened on the network.
There is also a potential performance-monitoring and cross-charging use, based on the recorded throughput data, that I’ve not dug into yet.
Many of you might have used NSG Flow Logs. Those are deprecated now with an end-of-life date of September 30, 2027. The replacement is VNet Flow Logs, which records more data and requires less configuration – once per VNet instead of once per NSG.
But there is a catch! Modern, zero-trust, Cloud Adoption Framework-compliant designs use many VNets. Each application/workload gets a landing zone, and a landing zone will include a dedicated VNet for every networked workload, probably deployed as a spoke in a hub-and-spoke architecture. A modest organisation might have 50+ VNets with few free admin hours to do configurations. A large, agile organisation might have an ever-increasing huge collection of VNets and struggle with consistency.
Enter Azure Policy
Some security officers and IT staff resist one of the key traits of a cloud: self-service. They see it as insecure and try to lock it down. All that happens, eventually, is that the business gets ticked off that they didn’t get the cloud, and they take their vengeance out on the security officers and/or IT staff that failed to deliver the agile compute and data platform that the business expected – I’ve seen that happen a few times!
Instead, organisations should use the tools that provide a balance between security/control and self-service. One perfect example of this is Azure Policy, which provides curated guardrails against insecure or non-compliant deployments or configurations. For example, you can ban the association of Public IP Addresses with NICs, which the compute marketing team has foisted on everyone via the default options in a virtual machine deployment.
Using Azure Policy With VNet Flow Logs
Our problem:
We will have some/many VNets that we need to deploy Flow Logging to. We might know some of the VNets, but there are many to configure. We need a consistent deployment. We may also have many VNets being created by other parties, either internal or external to our organisation.
This sounds like a perfect scenario for Azure Policy. And we happen to have a built-in policy to deploy VNet Flow Logging called Configure virtual networks to enforce workspace, storage account and retention interval for Flow logs and Traffic Analytics.
The policy takes 5 mandatory parameters:
Virtual Networks Region: A single Azure region that contains the Virtual Networks that will be targeted by this policy.
Storage Account: The storage account that will temporarily store the Flow Logs in blob format. It must be in the same region as the VNets.
Network Watcher: Network Watcher must be configured in the same region as the VNets.
Workspace Resource ID: A Log Analytics Workspace will store the Traffic Analytics data that can be accessed using KQL for queries, visualisations, exported to Microsoft Sentinel, and more.
Workspace Region: The workspace can be in any region. The Workspace can be used for other tasks and with other assignment instances of this policy.
What if you have VNets across three regions? Simple:
Deploy 1 central Workspace.
Deploy 3 Storage Accounts, 1 per region.
Assign the policy 3 times, once per region.
You will collect VNet Flow Logs from all VNets. The data will be temporarily stored in region-specific Storage Accounts. Eventually, all the data will reside in a single Log Analytics Workspace, providing you with a single view of all VNet flows.
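To illustrate the shape of that rollout, here is a minimal Python sketch that builds the parameter set for each regional assignment. All resource names and parameter keys here are illustrative placeholders, not the built-in policy’s actual parameter names; substitute your own regions and resource IDs.

```python
# Sketch: one policy assignment per region, all feeding one central workspace.
# Every name below is a placeholder for illustration only.
REGIONS = ["northeurope", "westeurope", "swedencentral"]
WORKSPACE_ID = ("/subscriptions/<sub-id>/resourceGroups/p-logging"
                "/providers/Microsoft.OperationalInsights/workspaces/law-central")

def assignment_parameters(region: str) -> dict:
    """Build the five mandatory parameters for one regional assignment."""
    return {
        "vnetRegion": region,                              # VNets targeted by this assignment
        "storageId": ("/subscriptions/<sub-id>/resourceGroups/p-logging"
                      "/providers/Microsoft.Storage/storageAccounts/"
                      f"saflowlogs{region}"),              # one storage account per region
        "networkWatcherName": f"NetworkWatcher_{region}",  # must be in the same region
        "workspaceResourceId": WORKSPACE_ID,               # single central workspace
        "workspaceRegion": "northeurope",                  # workspace can be in any region
    }

assignments = [assignment_parameters(r) for r in REGIONS]
```

Three assignments, three region-specific storage accounts, one shared workspace: that is the whole multi-region pattern.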
Customisation
It took a little troubleshooting to get this working. The first step was to configure a remediation identity during the assignment. Using the GUID of the identity, I was able to grant permanent reader rights to a Management Group that contained all the subscriptions with VNets.
Troubleshooting was conducted using the Activity Log in various subscriptions, and the JSON logs were dumped into regular Copilot for quick interpretation. ChatGPT or another model would probably do just as good a job.
The next issue was the Traffic Analytics collection interval. In a manual/coded deployment, one can set it to every 10 or 60 minutes. I prefer the 10-minute option for quicker access (it’s still up to 25 minutes of latency). The parameter for this setting is optional. When I enabled that parameter in the assignment, the save got stuck in a permanent (and commonly reported) “verifying” state without ever saving the change. My solution was to create a copy of the policy and change the parameter’s default value from 60 to 10. Job done!
In The Real World
Azure Policy has one failing – it has a huge and unpredictable run interval. There is a serious lag between something being deployed and a mandated deployIfNotExists task running. But this is one of the scenarios where, in the real world, we want it to eventually be correct. Nothing will break if VNet Flow Logs are not enabled for a few hours. And the savings of not having to do this enablement manually are worth the wait.
If You Liked This?
Did you like this topic? Would you like to learn more about designing secure Azure networks, built with zero-trust? If so, then join me on October 20-21 2025 (scheduled for Eastern time zones) for my Cloud Mechanix course, Designing Secure Azure Networks.
I found out yesterday that I was awarded my 18th annual Most Valuable Professional (MVP) award by Microsoft, continuing with the Azure Networking expertise.
It’s been an interesting year since last July, when I received my 17th award. My amount of billable work (the KPI for any consultant) with my then-employer was zero for a long time. I started thinking that the end would eventually come, so I started on a plan B: my own company.
I started my company, Cloud Mechanix, 7 years ago as a side-gig to my previous job. I used personal time to write custom Azure training and to deliver it at in-person classes. That first year was incredible – I still remember squeezing 22 people into a meeting room in a London hotel that I’d hoped to get 10 people into! Things went well and the feedback was awesome. I’d started to write new content … and then the world changed. I changed my day-job. The COVID-19 pandemic happened. And my wife and I welcomed twin girls into the world. There was no time for a side-gig!
I did a little bit with Cloud Mechanix during the lockdown but I didn’t have the time to put in a sustained effort. Then last year, the world started changing again. The twins were 4, in their second year of pre-school, and quite happy to entertain themselves. The pandemic was a distant memory but our way of working had changed quite a bit. And my day-job went from too much work to no work. I’ve been around long enough to develop a nose for redundancy. My spidey-sense tingles long before anyone else discusses the topic. I talked with my wife and we decided that I had more time to invest in my company, Cloud Mechanix, and my MVP activities.
I started to write new content, focusing first on what I’m best known for these days (Azure Networking) and on another in-demand course (Azure for small-medium businesses). I delivered the Azure Firewall Deep Dive course online, both as an open sign-up class and privately. I’ve done the Azure Operations for Small/Medium Businesses class in-person 3 times so far this year for a Microsoft distributor (the attendees were employees of Microsoft partners).
Meanwhile I’ve applied for and spoken at a number of Microsoft community/conference events. I’ve been invited to talk on a number of podcasts – which are always enjoyable … poor Ned and Kyler probably didn’t know what they were in for when I talked non-stop about Azure networking for 39 minutes without stopping to breathe. And I wrote a series of blog posts on Azure network design/security to explain why trying to implement on-premises designs makes no sense and why the resulting complexity breaks the desired goal of better security – simplicity actually offers more security!
The expected happened in June. I was made redundant. I wasn’t sad – I knew that it was coming and I had a plan. The agreed terms meant that I was free from June 28th with no restrictions. I had decided that I would not go job hunting. I have a job; I’m the Managing Director, trainer, and consultant with Cloud Mechanix. Yes, I am going all-in on my own company, and it has expanded into consulting on Azure, including (but not limited to):
Cloud strategy
Reviews
Security
Migration
System design & build
Cloud Adoption by Mentorship
Small/Medium business
Assisting Microsoft partners
Things have started well. I have a decent sales pipeline. I have completed two small gigs. And I have developed new training content: Designing Secure Azure Networks.
Back to the award! I’m on the Costa Blanca in Spain with my family for 4 weeks. Cloud Mechanix HQ has temporarily relocated from Ireland for 2 weeks and then I’m on vacation for 2 weeks. I’m spending my time doing some pre-sales stuff (things are going well) and writing some stuff that I will be sharing soon. I was working yesterday afternoon, thinking about going to the pool with the kids, and got to thinking “what day/date is it?” – that’s how you know you’re relaxed! I asked my wife and she said that it was July 10th! Wait – isn’t that what the MVPs call “F5 day”, the day that we find out if we are renewed or not? I checked Teams and confirmed that it was indeed F5 day. Usually we get the emails at 4PM Irish time, making it 5PM Spanish time. I’d decided I was going to the pool. My phone was in a bag on a bench and I kept an eye on the time. Then from 5PM, I checked my email every few minutes until … there it was:
Year number 18 had begun! To be honest, this was the first time in years that I wasn’t that worried. I had written quite a bit of blog content. I’d done a number of online and in-person things. I also had (I hope) great interactions with the Azure product group. I felt that the contributions were there … and they are still coming.
I’ve been doing quite a bit this week. It’s the start of something bigger but I hope that the first part will be ready in the coming days – it depends on that pre-sales pipeline and testing results … ooooh it’s technical!
I have two confirmed future events with TechMentor in the USA where I’m doing a panel, breakout sessions, and a post-con all-day class at:
Microsoft HQ 2025 in Redmond, Washington, on August 11-15.
Orlando, Florida, on November 16-21.
I have applied for a number of other events in Europe too. If you’re interested then:
In this post, I will show how to use Azure Virtual Network Manager (AVNM) to enforce peering and routing policies in a zero-trust hub-and-spoke Azure network. The goal will be to deliver ongoing consistency of the connectivity and security model, reduce operational friction, and ensure standardisation over time.
Quick Overview
AVNM has evolved, and continues to evolve, from something that I considered overpriced and under-featured into something that, with its recently updated pricing, I would want to deploy first in my networking architecture. In summary, AVNM offers:
Network/subnet discovery and grouping
IP Address Management (IPAM)
Connectivity automation
Routing automation
There is (and will be) more to AVNM, but I want to focus on the above features because together they simplify the task of building out Azure platform and application landing zones.
The Environment
One can manage virtual networks using static groups but that ignores the fact that The Cloud is a dynamic and agile place. Developers, operators, and (other) service providers will be deploying virtual networks. Our goal will be to discover and manage those networks. An organisation might be simple, and there will be a one-size-fits-all policy. However, we might need to engineer for complexity. We can reduce that complexity by organising:
Adopting the Cloud Adoption Framework and Zero Trust recommendations of 1 subscription/virtual network per workload.
Organising subscriptions (workloads) using Management Groups.
Designing a Management Group hierarchy based on policy/RBAC inheritance instead of basing it on an organisation chart.
Using tags to denote roles for virtual networks.
I have built a demo lab where I am creating a hub & spoke in the form of a virtual data centre (an old term used by Microsoft). This concept will use a hub to connect and segment workloads in an Azure region. Based on Route Table limitations, the hub will support up to 400 networked workloads placed in spoke virtual networks. The spokes will be peered to the hub.
A Management Group has been created for dub01. All subscriptions for the hub and workloads in the dub01 environment will be placed into the dub01 Management Group.
Each workload will be classified based on security, compliance, and any other requirements that the organisation may have. Three policies have been predefined and named gold, silver, and bronze. Each of these classifications has a Management Group inside dub01, called dub01gold, dub01silver, and dub01bronze. Workloads are placed into the appropriate Management Group based on their classification and are subject to Azure Policy initiatives that are assigned to dub01 (regional policies) and to the classification Management Groups.
You can see two subscriptions above. The platform landing zone, p-dub01, is going to be the hub for the network architecture. It has therefore been classified as gold. The workload (application landing zone) called p-demo01 has been classified as silver and is placed in the appropriate Management Group. Both gold and silver workloads should be networked and use private networking only where possible, meaning that p-demo01 will have a spoke virtual network for its resources. Spoke virtual networks in dub01 will be connected to the hub virtual network in p-dub01.
Keep in mind that no virtual networks exist at this time.
AVNM Resource
AVNM is based on an Azure resource and subresources for the features/configurations. The AVNM resource is deployed with a management scope; this means that a single AVNM resource can be created to manage a certain scope of virtual networks. One can centrally manage all virtual networks. Or one can create many AVNM resources to delegate management (and the cost) of managing various sets of virtual networks.
I’m going to keep this simple and use one AVNM resource as most organisations that aren’t huge will do. I will place the AVNM resource in a subscription at the top of my Management Group hierarchy so that it can offer centralised management of many hub-and-spoke deployments, even if we only plan to have 1 now; plans change! This also allows me to have specialised RBAC for managing AVNM.
Note that AVNM can manage virtual networks across many regions so my AVNM resource will, for demonstration purposes, be in West Europe while my hub and spoke will be in North Europe. I have enabled the Connectivity, Security Admin, and User-Defined Routing features.
AVNM has one or more management scopes. This is a central AVNM for all networks, so I’m setting the Tenant Root Group as the top of the scope. In a lab, you might use a single subscription or a dedicated Management Group.
Defining Network Groups
We use Network Groups to assign a single configuration to many virtual networks at once. There are two membership types:
Static: You add/remove members to or from the group
Dynamic: You use a friendly wizard to define an Azure Policy to automatically find virtual networks and add/remove them for you. Keep in mind that Azure Policy might take a while to discover virtual networks because of how irregularly it runs. However, once added, the configuration deployment is immediately triggered by AVNM.
There are also two kinds of member resources in a group:
Virtual networks: The virtual network and contained subnets are subject to the policy. Virtual networks may be static or dynamic members.
Subnets: Only the subnet is targeted by the configuration. Subnets are only static members.
Keep in mind that something like peering only targets a virtual network and User-Defined Routes target subnets.
I want to create a group to target all virtual networks in the dub01 scope. This group will be the basis for configuring any virtual network (except the hub) to be a secured spoke virtual network.
I created a Network Group called dub01spokes with a member type of Virtual Networks.
I then opened the Network Group and configured dynamic membership using this Azure Policy editor:
Any discovered virtual network that is not in the p-dub01 subscription and is in North Europe will be automatically added to this group.
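The membership condition can be expressed as a simple predicate. This is a hypothetical Python mirror of the logic, purely to make the rule explicit – the real condition is an Azure Policy rule evaluated by the platform, and it matches on the subscription rather than a friendly name:

```python
def is_dub01_spoke(vnet: dict) -> bool:
    """Mirror of the dynamic-membership condition: any virtual network in
    North Europe that is NOT in the hub subscription joins dub01spokes."""
    HUB_SUBSCRIPTION = "p-dub01"  # illustrative; the policy uses the subscription ID
    return (vnet["location"] == "northeurope"
            and vnet["subscription"] != HUB_SUBSCRIPTION)
```

Any virtual network that satisfies this predicate is added to the group; anything that stops satisfying it is removed.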
The resulting policy is visible in Azure Policy with a category of Azure Virtual Network Manager.
IP Address Management
I’ve been using an approach of assigning a /16 to all virtual networks in a hub & spoke for years. This approach blocks the prefix in the organisation and guarantees IP capacity for all workloads in the future. It also simplifies routing and firewall rules. For example, a single route will be needed in other hubs if we need to interconnect multiple hub-and-spoke deployments.
I can reserve this capacity in AVNM IP Address Management. You can see that I have reserved 10.1.0.0/16 for dub01:
Every virtual network in dub01 will be created from this pool.
Creating The Hub Virtual Network
I’m going to save some time/money here by creating a skeleton hub. I won’t deploy a routing NVA/Virtual Network Gateway, so I won’t be able to share it with the spokes later. I also won’t deploy a firewall, but the private address that the firewall would use will be 10.1.0.4.
I’m going to deploy a virtual network to use as the hub. I can use Bicep, Terraform, PowerShell, AZ CLI, or the Azure Portal. The important thing is that I refer to the IP address pool (above) when assigning an address prefix to the new virtual network. A check box called Allocate Using IP Address Pools opens a blade in the Azure Portal. Here you can select the Address Pool to take a prefix from for the new virtual network. All I have to do is select the pool and then use a subnet mask to decide how many addresses to take from the pool (/22 for my hub).
Note that the only time that I’ve had to ask a human for an address was when I created the pool. I can create virtual networks with non-conflicting addresses without any friction.
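The pool behaviour can be approximated with Python’s `ipaddress` module. This is a rough sketch of the allocation logic, assuming the 10.1.0.0/16 pool from above and a /22 for the hub; in reality, AVNM IPAM is the allocator and tracks usage for you:

```python
import ipaddress

# The dub01 pool reserved in AVNM IPAM.
pool = ipaddress.ip_network("10.1.0.0/16")

allocated: list[ipaddress.IPv4Network] = []

def allocate(new_prefix: int) -> ipaddress.IPv4Network:
    """Hand out the next free, non-overlapping prefix from the pool -
    a rough approximation of what AVNM IPAM does for us."""
    for candidate in pool.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(used) for used in allocated):
            allocated.append(candidate)
            return candidate
    raise RuntimeError("pool exhausted")

hub = allocate(22)     # /22 for the hub virtual network
spoke1 = allocate(24)  # spokes take whatever they need - this one gets 10.1.4.0/24
```

Every prefix handed out is guaranteed to be inside the /16 and to not conflict with previous allocations, which is exactly the friction-free behaviour described above.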
Create Connectivity Configuration
A Connectivity Configuration is a method of connecting virtual networks. We can implement:
Hub-spoke peering: A traditional peering between a hub and a spoke, where the spoke can use the Virtual Network Gateway/Azure Route Server in the hub.
Mesh: A mesh using a Connected Group (full mesh peering between all virtual networks). This is used to minimise latency between workloads with the understanding that a hub firewall will not have the opportunity to do deep inspection (performance over security).
Hub & spoke with mesh: The targeted VNets are meshed together for interconnectivity. They will route through the hub to communicate with the outside world.
I will create a Connectivity Configuration for a traditional hub-and-spoke network. This means that:
I don’t need to add code for VNet peering to my future templates.
No matter who deploys a VNet in the scope of dub01, they will get peered with the hub. My design will be implemented, regardless of their knowledge or their willingness to comply with the organisation’s policies.
I created a new Connectivity Configuration called dub01spokepeering.
In Topology I set the type to hub-and-spoke. I selected my hub virtual network from the p-dub01 subscription as the hub Virtual Network. I then selected the group of networks that I want to peer with the hub: dub01spokes. I can configure the peering connections; normally I would select Hub As Gateway here, but I don’t have a Virtual Network Gateway or an Azure Route Server in the hub, so the box is greyed out.
I am not enabling inter-spoke connectivity using the above configuration – AVNM has a few tricks, and this is one of them, where it uses Connected Groups to create a mesh of peering in the fabric. Instead, I will be using routing (later) via a hub firewall for secure transitive connectivity, so I leave Enable Connectivity Within Network Group blank.
Did you notice the checkbox to delete any pre-existing peering configurations? If a peering isn’t to the hub, then I’m removing it so nobody uses their rights to bypass my networking design.
I completed the wizard and executed the deployment against the North Europe region. I know that there is nothing to configure, but this “cleans up” the GUI.
Create Routing Configuration
Folks who have heard me discuss network security in Azure should have learned that the most important part of running a firewall in Azure is routing. We will configure routing in the spokes using AVNM. The hub firewall subnet(s) will have full knowledge of all other networks by design:
Spokes: Using system routes generated by peering.
Remote networks: Using BGP routes. The VPN Local Network Gateway creates BGP routes in the Azure Virtual Networks for “static routes” when BGP is not used in VPN tunnels. Azure Route Server will peer with NVA routers (SD-WAN, for example) to propagate remote site prefixes using BGP into the Azure Virtual Networks.
The spokes routing design is simple:
A Route Table will be created for each subnet in the spoke Virtual Networks. Route Tables are free resources, and this per-subnet design allows customised routing for specific scenarios, such as VNet-integrated PaaS resources that require dedicated routes.
A single User-Defined Route (UDR) forces traffic leaving a spoke Virtual Network to pass through the hub firewall, where firewall rules will deny all traffic by default.
Traffic inside the Virtual Network will flow by default (directly from source to destination) and be subject to NSG rules, depending on support by the source and destination resource types.
The spoke subnets will be configured not to accept BGP routes from the hub; this is to prevent the spoke from bypassing the hub firewall when routing to remote sites via the Virtual Network Gateway/NVA.
I created a Routing Configuration called dub01spokerouting. In this Routing Configuration I created a Rule Collection called dub01spokeroutingrules.
A User-Defined Route, known as a Routing Rule, was created called everywhere:
The new UDR will override (deactivate) the System route to 0.0.0.0/0 via Internet and set the hub firewall as the new default next hop for traffic leaving the Virtual Network.
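To see why the UDR wins, here is a simplified Python model of Azure’s route selection: longest prefix match first, and on a tie in prefix length, user-defined routes beat BGP routes, which beat system routes. The addresses are hypothetical, matching the 10.1.0.4 firewall used in this lab:

```python
import ipaddress

# Simplified effective routes for a spoke subnet after AVNM deploys the rule.
PRECEDENCE = {"User": 0, "BGP": 1, "System": 2}  # lower wins on a prefix-length tie

routes = [
    {"prefix": "0.0.0.0/0",   "source": "System", "next_hop": "Internet"},
    {"prefix": "10.1.0.0/22", "source": "System", "next_hop": "VNetPeering"},
    {"prefix": "0.0.0.0/0",   "source": "User",   "next_hop": "10.1.0.4"},  # hub firewall
]

def next_hop(destination: str) -> str:
    """Pick a route the way Azure does: longest prefix, then source precedence."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    best = min(matches,
               key=lambda r: (-ipaddress.ip_network(r["prefix"]).prefixlen,
                              PRECEDENCE[r["source"]]))
    return best["next_hop"]
```

Internet-bound traffic now resolves to the firewall, while traffic to the peered hub still uses the more specific system route. This is why the portal shows the system 0.0.0.0/0 route as Invalid once the UDR lands.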
Here you can see the Routing Collection containing the Routing Rule:
Note that Enable BGP Route Propagation is left unchecked and that I have selected dub01spokes as my target.
And here you can see the new Routing Configuration:
Completed Configurations
I now have two configurations completed and configured:
The Connectivity Configuration will automatically peer in-scope Virtual Networks with the hub in p-dub01.
The Routing Configuration will automatically configure routing for in-scope Virtual Network subnets to use the p-dub01 firewall as the next hop.
Guess what? We have just created a Zero Trust network! All that’s left is to set up spokes with their NSGs and a WAF/WAFs for HTTPS workloads.
Deploy Spoke Virtual Networks
We will create spoke Virtual Networks from the IPAM block just like we did with the hub. Here’s where the magic is going to happen.
The evaluation-style Azure Policy assignments that are created by AVNM run approximately every 30 minutes. That means a new Virtual Network won’t be discovered straight after creation – but it will be discovered not long after. A signal is sent to AVNM to update group memberships based on added or removed Virtual Networks, depending on the scope of each group’s Azure Policy. Configurations are deployed or removed immediately after a Virtual Network is added to or removed from the group.
To demonstrate this, I created a new spoke Virtual Network in p-demo01. I created a new Virtual Network called p-demo01-net-vnet in the resource group p-demo01-net:
You can see that I used the IPAM address block to get a unique address space from the dub01 /16 prefix. I added a subnet called CommonSubnet with a /28 prefix. What you don’t see is that I configured the following for the subnet in the subnet wizard:
Private networking, to proactively disable implied public IP addresses for SNAT.
As you can see, the Virtual Network has not been configured by AVNM yet:
We will have to wait for Azure Policy to execute – or we can force a scan to run against the resource group of the new spoke Virtual Network:
Az CLI: az policy state trigger-scan --resource-group <resource group name>
PowerShell: Start-AzPolicyComplianceScan -ResourceGroupName <resource group name>
You could add a command like above into your deployment code if you wished to trigger automatic configuration.
This forced process is not exactly quick either! Six minutes after I forced a policy evaluation, I saw that AVNM was informed about the new Virtual Network:
I returned to AVNM and checked out the Network Groups. The dub01spokes group has a new member:
You can see that a Connectivity Configuration was deployed. Note that the summary doesn’t have any information on Routing Configurations – that’s an oversight by the AVNM team, I guess.
The Virtual Network does have a peering connection to the hub:
The routing has been deployed to the subnet:
A UDR has been created in the Route Table:
Over time, more Virtual Networks are added and I can see from the hub that they are automatically configured by AVNM:
Summary
I have done presentations on AVNM and demonstrated the above configurations in 40 minutes at community events. You could deploy the configurations in under 15 minutes. You can also create them using code! With this setup we can take control of our entire Azure networking deployment – and I didn’t even show you the Admin Rules feature for essential “NSG” rules (they aren’t NSG rules but use the same underlying engine to execute before NSG rules).
Want To Learn More?
Check out my company, Cloud Mechanix, where I share this kind of knowledge through:
Consulting services for customers and Microsoft partners using a build-with approach.
Custom-written and ad-hoc Azure training.
Through these services, I can educate your team and bring great Azure solutions to your organisation.
I had the pleasure of chatting with Ned Bellavance and Kyler Middleton on Day Two DevOps one evening recently to discuss the basics of Azure networking, using my line “Azure Virtual Networks Do Not Exist”. I think I talked nearly non-stop for 40 minutes. Tune in and you’ll hear my explanation of why so many people get so much wrong in Azure networking/security.
How do you plan a hub & spoke architecture? Based on much of what I have witnessed, I think very few people do any planning at all. In this post, I will explain some essential things to plan and how to plan them.
Rules of Engagement
Microsoft has shared some concepts in the Well-Architected Framework (simplicity) and the documentation for networking & Zero Trust (micro-segmentation, resilience, and isolation).
The hub & spoke will contain networks in a single region, following these concepts:
Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.
Management Groups
I agree with many things in the Cloud Adoption Framework “Enterprise Scale” and I disagree with some other things.
I agree that we should use Management Groups to organise subscriptions based on Policy architecture and role-based access control (RBAC – granting access to subscriptions via Entra groups).
I agree that each workload (CAF calls them landing zones) should have a dedicated subscription – this simplifies operations and governance like you wouldn’t believe.
I can see why they organise workloads based on their networking status:
Corporate: Workloads that are internal only and are connected to the hub for on-premises connectivity. No public IP addresses should be allowed where technically feasible.
Online: Workloads that are online only and are not permitted to be connected to the hub.
Hybrid: This category is missing from CAF and many have added it themselves – WAN and Internet connectivity are usually not binary exclusive OR decisions.
I don’t like how Enterprise Scale buckets all of those workloads into a single grouping because it fails to acknowledge that a truly large enterprise will have many ownership footprints in a single tenant.
I also don’t like how Enterprise Scale merges all hubs into a single subscription or management group. Yes, many organisations have central networking teams. Large organisations may have many networking teams. I like to separate hub resources (not feasible with Virtual WAN) into different subscriptions and management groups for true scaling and governance simplicity.
Here is an example of how one might achieve this. I am going to have two hub & spoke deployments in this example:
DUB01: Located in Azure North Europe
AMS01: Located in Azure West Europe
Some of you might notice that I have been inspired by Microsoft’s data centre naming for the naming of these regional footprints. The reasons are:
Naming regions after “North Europe” or “East US” is messy when you think about naming network footprints in East US2, West US2, and so on.
Microsoft has already done the work for us. The Dublin (North Europe) region data centres are called DUB05-DUB15 and Microsoft uses AMS01, etc for Middenmeer (West Europe).
A single virtual network may have up to 500 peers. Once we hit 500 peers then we need to deploy another hub & spoke footprint in the region. The naming allows DUB02, DUB03, etc.
The change from CAF Enterprise Scale is subtle but look how instantly more scalable and isolated everything is. A truly large organisation can delegate duties as necessary.
If an identity responsible for the AMS01 hub & spoke is compromised, the DUB01 hub & spoke is untouched. Resources are in dedicated subscriptions so the blast area of a subscription compromise is limited too.
There is also a logical placement of the resources based on ownership/location.
You don’t need to recreate policy – you can add more associations to your initiatives.
If an enterprise currently has a single networking team, their IDs are simply added to more groups as new hub & spoke deployments are added.
The only connection that will exist between DUB01 and AMS01 is a global VNet peering connection between the hubs. All traffic between DUB01 and AMS01 must route via the firewalls in the hubs. This will require some user-defined routing, and we want to keep this as simple as possible.
For example, the firewall subnet in DUB01 must have a route(s) to all prefixes in AMS01 via the firewall in the hub of AMS01. The more prefixes there are in AMS01, the more routes we must add to the Route Table associated with the firewall subnet in the hub of DUB01. So we will keep this very simple.
Each hub & spoke will be created from a single IP prefix allocation:
DUB01: All virtual networks in DUB01 will be created from 10.1.0.0/16.
AMS01: All virtual networks in AMS01 will be created from 10.2.0.0/16.
You might have noticed that Azure Virtual Network Manager uses a default of /16 for an IP address block in the IPAM feature – how convenient!
That means I only have to create one route in the DUB01 firewall subnet to reach all virtual networks in AMS01:
Name: AMS01
Prefix: 10.2.0.0/16
Next Hop Type: VirtualAppliance
Next Hop IP Address: The IP address of the AMS01 firewall
A similar route will be created in AMS01 firewall subnet to reach all virtual networks in DUB01:
Name: DUB01
Prefix: 10.1.0.0/16
Next Hop Type: VirtualAppliance
Next Hop IP Address: The IP address of the DUB01 firewall
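The payoff of the /16-per-footprint allocation is that one summary route covers every spoke, present and future. A quick sketch with Python's `ipaddress` module demonstrates this (the spoke prefixes are illustrative examples, not from the text):

```python
# Sketch: verifying that one summary UDR per footprint covers every spoke
# VNet, given the /16-per-footprint allocation described above.
import ipaddress

footprints = {
    "DUB01": ipaddress.ip_network("10.1.0.0/16"),
    "AMS01": ipaddress.ip_network("10.2.0.0/16"),
}

# Hypothetical spoke VNets carved from the AMS01 allocation.
ams01_spokes = [ipaddress.ip_network(p) for p in ("10.2.0.0/24", "10.2.1.0/26")]

# The single route in the DUB01 firewall subnet reaches them all,
# no matter how many spokes AMS01 grows to contain.
assert all(s.subnet_of(footprints["AMS01"]) for s in ams01_spokes)
print("One UDR (10.2.0.0/16 -> AMS01 firewall) covers all AMS01 spokes")
```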
Honestly, that is all that is required. I’ve been doing it for years. It’s beautifully simple.
The firewall(s) are in total control of the flows. This design means that neither location is dependent on the other. Neither AMS01 nor DUB01 trust each other. If a workload is compromised in AMS01 its reach is limited to whatever firewall/NSG rules permit traffic. With threat detection, flow logs, and other features, you might even discover an attack using a security information & event management (SIEM) system before it even has a chance to spread.
Workloads/Landing Zones
Every workload will have a dedicated subscription with the appropriate configurations, such as enabling budgets and configuring Defender for Cloud. Standards should be as automated as possible (Azure Policy). The exact configuration of the subscription should depend on the zone (e.g. corp or online).
When there is a virtual network requirement, then the virtual network will be as small as is required with some spare capacity. For example, a workload with a web VM and a SQL Server doesn’t need a /24 subnet!
Essential Workloads
Are you going to migrate legacy workloads to Azure? Are you going to run Citrix or Azure Virtual Desktop (AVD)? If so, then you are going to require domain controllers.
You might say “We have a policy of running a single ADDS site and our domain controllers are on-premises”. Lovely – at least it was when Windows Server 2003 came out. Remember that I want my services in Azure to be resilient and not to depend on other locations. What happens to all of your Azure services when the network connection to on-premises fails? Or what happens if on-premises goes up in a cloud of smoke? I will put domain controllers in Azure.
Then you might say “We will put domain controllers in DUB01 and AMS01 can use them”. What happens if DUB01 goes offline? That does happen from time to time. What happens if DUB01 is compromised? Not only will I put domain controllers in DUB01, but I will also put them in AMS01. They are low-end virtual machines and the cost will be minor. I’ll also do some good ADDS Sites & Services stuff to isolate as much as ADDS lets you:
Create subnets for each /16 IP prefix.
Create an ADDS site for AMS01 and another for DUB01.
Associate each site with the related subnet.
Create and configure replication links as required.
The placement and resilience of other things like DNS servers/Private DNS Resolver should be similar.
The firewall will be the next hop, by default (expect exceptions) for traffic leaving every virtual network. This will be configured for every subnet (expect exceptions) in every workload.
The firewall will be the glue that routes every spoke virtual network to each other and the outside world. The firewall rules will restrict which of those routes is possible and what traffic is possible – in all directions. Don’t be lazy and allow * to Internet; do you want to automatically enable malware to call home for further downloads or discovery/attack/theft instructions?
The firewall will be carefully chosen to ensure that it includes the features that your organisation requires. Too many organisations pick the cheapest firewall option. Few look at the genuine risks that they face and pick something that best defends against those risks. Allow/deny is not enough any more. Consider the features that pay careful attention to what must be allowed; the ports that you do open are the ones that attackers will use to compromise their victims.
Every subnet (expect exceptions) will have an NSG. That NSG will have a custom low-priority inbound rule to deny everything; this means that no traffic can enter a NIC (from anywhere, including the same subnet) without being explicitly allowed by a higher-priority rule.
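The deny-by-default pattern can be modelled with a tiny sketch. NSG rules are evaluated in priority order (lower number wins), so a custom priority-4000 DenyAll blocks anything not explicitly allowed above it. The rule shapes below are simplified for illustration, not the real ARM schema:

```python
# Sketch: simplified NSG inbound rule evaluation. Lower priority number wins;
# a custom low-priority deny-all blocks anything not explicitly allowed.
import ipaddress

rules = [  # (priority, action, source_prefix, dest_port or None for any)
    (100, "Allow", "10.1.0.0/16", 443),   # explicit allow: HTTPS from DUB01
    (4000, "Deny", "0.0.0.0/0", None),    # custom deny-everything rule
]

def evaluate(source: str, port: int) -> str:
    src = ipaddress.ip_address(source)
    for priority, action, prefix, rule_port in sorted(rules):
        if src in ipaddress.ip_network(prefix) and rule_port in (None, port):
            return action
    return "Deny"  # nothing matched at all

print(evaluate("10.1.2.3", 443))   # Allow
print(evaluate("10.1.2.3", 3389))  # Deny – even traffic from the same subnet
```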
“Web” (this covers a lot of HTTPS based services, excluding AVD) applications will not be published on the Internet using the hub firewall. Instead, you will deploy a WAF of some kind (or different kinds depending on architectural/business requirements). If you’re clever, and it is appropriate from a performance perspective, you might route that traffic through your firewall for inspection at layers 4-7 using TLS Inspection and IDPS.
Logging and Alerting
You have put all the barriers in place. There are two interesting quotes to consider. The first warns us that we must assume a penetration has already taken place or will take place.
Fundamentally, if somebody wants to get in, they’re getting in… accept that. What we tell clients is: Number one, you’re in the fight, whether you thought you were or not. Number two, you almost certainly are penetrated.
Michael Hayden, former Director of the NSA & CIA
The second warns us that attackers don’t think like defenders. We build walls expecting a linear attack. Attackers poke, explore, and prod, looking for any way, including very indirect routes, to get from A to B.
Biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.
John Lambert
Each of our walls offers some kind of monitoring. The firewall has logs, which ideally we can either monitor/alert from or forward to a SIEM.
Virtual Networks offer Flow Logs, which track traffic at the VNet level. VNet Flow Logs are superior to NSG Flow Logs because they catch more traffic (including Private Endpoints) and include more interesting data. This is more data that we can send to a SIEM.
Defender for Cloud creates data/alerts. Key Vaults do. Azure databases do. The list goes on and on. All of this is data that we can use to:
Detect an attack
Identify exploration
Uncover an expansion
Understand how an attack started and happened
And it amazes me how many organisations choose not to configure these features in any way at all.
Wrapping Up
There are probably lots of finer details to consider but I think that I have covered the essentials. When I get the chance, I’ll start diving into the fun detailed designs and their variations.
In this post, I am going to share a process for designing a hub virtual network for a hub & spoke secured virtual network deployment in Microsoft Azure.
The process I lay out in this document will not work for everyone. I think, based on experience, that very few organisations will find exceptions to this process.
What Is And Is Not In This Post
This post is going to focus on the process of designing a hub virtual network. You will not find a design here … that will come in a later post.
You will also not find any mention of Azure Virtual WAN. You DO NOT need to use Azure Virtual WAN to do SD-WAN, despite the claptrap on Microsoft documentation on this topic. Virtual WAN also:
Restricts your options on architecture, features, and network design.
Is a nightmare to troubleshoot because the underlying virtual network is hidden in a Microsoft tenant.
Rules Of Engagement
The hub will be your network core in a network stamp: a hub & spoke. The hub & spoke will contain networks in a single region, following these concepts:
Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.
A Hub Design Process
The core of our Azure network will have very little in the way of resources. What can be (not “must be”) included in that hub can be thought of as functions:
Site-to-site networking: VPN, ExpressRoute, and SD-WAN.
Point-to-site VPN: Enabling individuals to connect to the Azure networks using a VPN client on their device.
Firewall: Providing security for ingress, egress, and inter-workload communications.
Virtual Machines: Reduce costs of secured RDP/SSH by deploying Azure Bastion in the hub.
If we are doing a high-level design, there are two questions that we will ask about each of these functions:
Is the function required?
What technology will be used?
We won’t get into tiers/SKUs, features, or configurations just yet; that’s when we get into low-level or detailed design.
One can use the following flow chart to figure out what to use – it’s a bit of an eye test so you might need to open the image in another tab:
Site-to-Site (S2S) Networking
While it is very commonly used, not every organisation requires site-to-site connectivity to Azure.
For example, I had a migration customer that was (correctly) modernising to the “top tier” of cloud computing by migrating from legacy apps to SaaS. They wanted to re-implement an SD-WAN for over 100 offices to connect their new and small Azure footprint. I was the lead designer so I knew their connectivity requirements – they were going to use Azure Virtual Desktop (AVD) only to connect to their remaining legacy apps. AVD doesn’t need a site-to-site connection. I was able to save that organisation from entering into a costly managed SD-WAN services contract and instead focus on Internet connectivity – not long after, they shut down their Azure footprint when SaaS alternatives were found for the last legacy applications.
If we establish that site-to-site connectivity is required then we must ask the first question:
Are latency and SLA important?
If the answer to either of these items is “yes” then there is no choice: An ExpressRoute Virtual Network Gateway is required.
If the answer is no, then we are looking at some kind of VPN connectivity. We can ask another question to determine the type of solution:
Will there be a small number of VPN connections?
If a small number of VPN connections is required, the Azure VPN Virtual Network Gateway is suitable – consider the SKUs/sizes and complexities of management to determine what “a small number” is.
If you determine that the VPN Virtual Network Gateway is unsuitable then an SD-WAN network virtual appliance (NVA) should be used. Note that it would be recommended to deploy Azure Route Server with a third-party VPN/SD-WAN appliance to enable propagation of network prefixes:
Azure > SD-WAN
SD-WAN > Azure
You may find that you need one or more of the above solutions! For example:
Some ExpressRoute customers may opt to deploy a parallel VPN tunnel with an identical routing configuration over a completely different ISP. This enables automatic failover from ExpressRoute to VPN in the event of a circuit failure.
An SD-WAN customer may also have ExpressRoute for some offices/workloads where SLA or latency are important. Another consideration may be that one workload has other technical requirements that only ExpressRoute (Direct) can service such as very high throughput.
You have one more question to ask after you have picked the site-to-site component(s):
Will you require site-to-site transit through Azure via the site-to-site network connections?
In other words, should Remote Site A be able to route to Remote Site B using your Azure site-to-site connections? If the answer is yes then you must deploy Azure Route Server to enable that routing.
Point-To-Site (P2S) VPN
I personally have not deployed very much of this solution but I do hear it being discussed quite a bit. Some organisations must enable users (or external suppliers) to create a VPN connection from their individual devices to Azure. If this is required then you must ask:
Is the scenario(s) simple?
I’ve kept that vague because the problem is vague. There are two solutions with one being overly-simplistic in capabilities and the other being more fully-featured.
The Azure VPN Gateway (also used for site-to-site VPN) offers a highly available solution (as a managed Azure resource) for P2S VPN. It offers different configurations for authentication and device support. But it is very limited. For example, it has no routing rules to restrict which users get access to which networks. This means that if you grant network (firewall/NSG) access to one user via the VPN address pool, you must grant the same access to all users, which is clearly pretty poor if you have many types/roles of remote VPN clients (IT, developer of workload X, developer of workload Y, Vendor A, Vendor B, etc).
In such scenarios, one should consider a third-party NVA for point-to-site networking. Third-party NVAs may offer more features for P2S VPN than the VPN Virtual Network Gateway.
A P2S NVA may reside in the same hub as a VPN Virtual Network Gateway (and other S2S solutions).
It’s not in the diagram but you should also consider Entra Global Secure Access as an alternative to P2S VPN. The Private Network Connector would be deployed in a spoke(s), not the hub.
Firewall
Is a firewall required? The correct answer for anyone considering a hub & spoke architecture should be “of course it is”. But you might not like security, so we’ll ask that question anyway.
Once you determine that security is important to your employer, you must ask yourself:
Shall I use a native PaaS firewall?
The native PaaS solution in Azure is Azure Firewall. I have many technical reasons to prefer Azure Firewall over third-party alternatives. For consultants, a useful attribute of Azure Firewall is that you can skill up on one solution that you can implement/use/manage for many customers and projects (migrations) won’t face repeated delays as you wait on others to implement rules in third-party firewalls.
If you want to use a different firewall then you are free to do so.
If you are using Azure Firewall then there is a follow-up question if there will be S2S network connections:
Are the remote networks using non-RFC1918 address prefixes?
In other words, do the remote networks use address prefixes outside of:
192.168.0.0/16
172.16.0.0/12
10.0.0.0/8
If they do then Azure Firewall requires some configuration because traffic to non-RFC1918 prefixes is forced to the Internet by default – they are Internet addresses after all! You can statically configure the prefixes if they do not change. Or …
If you are using Azure Route Server
The prefixes can change a lot thanks to scenarios such as acquisition or rapid growth
… you can (in preview today) configure integration between Azure Firewall and Azure Route Server so the firewall dynamically learns the address prefixes from the remote networks.
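Classifying a remote prefix against the RFC1918 ranges is straightforward to check programmatically. A minimal sketch (the function name is my own) flags the prefixes that would need extra Azure Firewall configuration:

```python
# Sketch: flagging remote-site prefixes that fall outside RFC1918 and so
# need extra Azure Firewall configuration (static ranges, or Route
# Server-learned prefixes as described above).
import ipaddress

RFC1918 = [ipaddress.ip_network(p) for p in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def needs_extra_config(prefix: str) -> bool:
    net = ipaddress.ip_network(prefix)
    return not any(net.subnet_of(r) for r in RFC1918)

print(needs_extra_config("10.50.0.0/16"))     # False – private, handled by default
print(needs_extra_config("198.51.100.0/24"))  # True – treated as Internet by default
```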
Will any of the workloads in your spoke virtual networks have virtual machines?
You will have virtual machines even if you “ban” virtual machines – I guarantee that they will eventually appear for things like security solutions, self-hosted agents, Azure Virtual Desktop, AKS, and so on.
Unfortunately, many consider secure remote access (SSH/RDP) to be opening a port in the firewall for TCP 22/3389. That is not considered secure because those protocols can be and have been attacked. In the past, those who took security seriously used a dedicated “jump box” or “bastion host” to isolate vulnerable on-premises machines from assets in the data centre. We can use the same process with Azure Bastion where there is no IaaS requirement – we leverage Entra security features to authenticate the connection request and the guest OS credentials to verify VM access.
One can deploy Bastion in a spoke – that is perfectly valid for some scenarios. However, many important features are only in the paid-for SKUs so you might wish to deploy a shared Azure Bastion. Unfortunately, routing restrictions by Bastion prevent deploying a shared Bastion in a spoke, so we have no choice but to deploy a shared Azure Bastion in a hub. If you wish to share an Azure Bastion across workloads then it will be the final component in the hub.
If/when Azure Bastion supports route tables in the AzureBastionSubnet I will recommend moving shared Bastion deployments to a spoke – yes, I know that we can do that with Azure Virtual WAN but there are many things that we cannot do with Azure Virtual WAN.
You could consider a third-party alternative or a DIY bastion solution. If so, place that into a spoke because it will be compute-based.
Wrapping Up
As you can see, the high-level design of the hub is very simple.
There are few functions in it because when you understand Azure virtual networks, routing, and NSGs, then you understand that designing a secure network should not be complex. Complexity is the natural predator of manageability and dependable security. There is a little more detail when we get into a low-level or detailed design, but that’s a topic for another day.
In this post, I want to discuss the importance of designing and implementing micro-segmentation in Azure networks.
Repeating The Same Mistakes
In 2002-2003, the world was being hammered by malware. So much so, that Microsoft did a reset on their Windows development processes and effectively built a new version of Windows XP with Windows XP Service Pack 2. The main security feature of that release was the Windows Firewall – the purpose of this was to isolate each Windows machine in the network by default. It’s a pity that nearly every Windows admin then used Group Policy to disable the Windows Firewall!
Times have moved on and so have the bad guys. Malware isn’t just an anarchist or hobby activity. Malware is a billion-dollar business (ransomware/data theft) and a military activity. Naturally, defences have evolved .. wait .. no … most admins/consultants are still deploying networks that your Daddy/Mommy deployed 22 years ago but I’ll deal with that in another post.
Instead, I want to discuss a part of the defensive solution: micro-segmentation.
Assume Penetration
We must assume that the attacker will always find a way in. Not every attack will be by Sandra Bullock clicking some magical symbol on a website to penetrate the firewall. Most attacks have relatively simple vectors such as stealing a password, hash hijacking, or getting an accountant to open a PDF. Determined attackers aren’t just “driving by”; they will look for an entry. Maybe it’s malware in vendor software that you will deploy! Maybe it’s a vulnerability in open-source software that your developers will deploy via GitHub? Maybe a managed service provider’s Entra ID tenant has been penetrated and they have Lighthouse access to your Azure subscriptions? Each of those examples bypasses your firewall and any advanced scanning features that it may have. How do you stop them?
Micro-Segmentation
Let me conjure an image for you. A submarine is on patrol. It has a wartime mission. The submarine is always under orders to continue that mission. The submarine is detected by the enemy and is attacked. The attack causes damage which creates a flood. If left unchecked, the flood will sink the ship. What happens? The crew is trained to isolate the flood by sealing the leaking compartment – doors are slammed, seals are locked, and the water is contained in that compartment. Sure, the sailors and ship functions in that compartment are dead, but the ship can continue its mission.
Minimize the extent of the damage and how fast it spreads.
Increase the difficulty of compromising your cloud footprint.
Let’s expand on that a little.
Be Ready
You will be ready for an attack because you assume that you already are under attack. You don’t wait to deploy security systems and configurations; you design them with your workloads. You deploy security with your workloads. You maintain security with your workloads.
Increase The Difficulty of Compromising Your Cloud Footprint
You should put in the defences that are appropriate to your actual risks and ability to install/manage. A bad example is a medical organisation choosing a more affordable firewall to save a few bucks – this is the sort of organisation that will be targeted.
Minimise The Extent of Damage
This can also be referred to as minimising the blast zone. You want to limit how much damage the bad guys cause, just like the submarine limited flooding to the damaged compartment. This means that we make it harder to get from any one point on the network to the next.
It’s one thing to put in the security defences, but you must also:
Enable/configure the security features: it shocks me how many organisations/consultants opt not to or don’t know how to enable essential features in their security solution.
Monitor your security systems: If we assume that the attacker will get in, then we should monitor our security features to detect and shut down the attack. Again, I’m shocked every time I see security features in Azure that have no logging or alerting enabled.
Applications are partitioned to different Azure Virtual Networks (VNets) and connected using a hub-spoke model
Microsoft uses the term “application”. I prefer the term “workload”. Some, like ITIL, might use the term “service”. A workload is a collection of resources that work together to provide a service to or for the organisation. Maybe it’s a bunch of Azure resources that create a retail site. Maybe it’s a CRM system. Maybe it’s an identity management & governance workload.
The pattern that Microsoft is recommending is one that I have been promoting through my employer for the last 6 years. Each workload gets a dedicated “small” virtual network. The workload VNet is peered with a hub (and only the hub by default). The hub firewall provides isolation and deeper inspection than NSGs can offer.
Step 4 tells us:
Fully distributed ingress/egress cloud micro-perimeters and deeper micro-segmentation
NSGs micro-segment the single or small set of subnet(s) in the VNet, restricting resource-to-resource connections to just what is required. Isolation is now done both centrally (the firewall) and at the NIC (NSGs). You should also consider network protections on PaaS resources such as Storage Accounts or Key Vaults.
If we revisit the submarine comparison, the workload-specific virtual network is one of the compartments in the boat. If there is a leak (an attack), the NSGs limit or slow down expansion in the subnet(s). The firewall isolates the workload/compartment from other workloads/compartments and the Internet by default to prevent command and control or downloads by the attacker. Deeper firewall inspection searches for attack patterns.
Don’t Forget Monitoring
Microsoft zero-trust has more than just networking. One other step I want to highlight is monitoring/alerting because it ties into the micro-segmentation features of networking. Consider the mechanisms we can put in place:
PaaS resource firewalls with logging
NSG with VNet Flow Logging
(Azure) Firewall with logging for firewall rules and deep inspection features (Azure Firewall has Threat Intelligence and IDPS).
Each of those barriers or detection systems can be thought of as a string with a bell on it. The attacker will tickle or trip over those strings. If the bell rings, we should be paying attention. When you fail to put in the barriers or configure monitoring then you don’t know that the attacker is there doing something – and we assume that the attacker will get in and do something – so aren’t we failing to do our job?
It’s Not Just Me Telling You
You can say “There goes Aidan, rattling on about micro-segmentation. Why should I listen to him?”. It would be one thing if it were just me sharing my opinion on Azure network security but what if others told you to do the same things?
Microsoft tells you to implement micro-segmentation. The US NSA tells you to do it. The Canadian Centre for Cyber Security tells you to do it. The UK NCSC tells you to do it. I could keep googling (binging, of course) national security agencies and I’d find the same recommendation with each result. If you are not implementing this security technique designed for today’s threats (not for the Blaster worm of 2003) then you are not only not doing your job but you are choosing to leave the door open for attackers; that could be viewed very poorly by employers, by shareholders, or by informed compliance auditors.
In this Azure Networking deep dive, I’m going to share some of my experience around planning the creation and association of Route Tables in Microsoft Azure.
Quick Recap
The purpose of a Route Table is to apply User-Defined Routes (UDRs). The Route Table is associated with a subnet. The UDRs in the Route Table are applied to the NICs in the subnet. The UDRs override System and/or BGP routes to force routes on outgoing packets to match your desired flows or security patterns.
Remember: There are no switches or default gateways in Azure; the NIC is the router and packets go directly from the source NIC to the destination NIC. A route can be used to alter that direct flow and force the packets through a desired next hop, such as a firewall, before continuing to the destination.
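The way UDRs override System and BGP routes can be sketched as a selection function: longest prefix match wins, and on a tie, User routes beat BGP routes beat System routes. The route entries below are illustrative, and real Azure effective-route evaluation has more nuances than this simplification:

```python
# Sketch: simplified effective-route selection. Longest prefix match wins;
# on a tie, User (UDR) routes beat BGP routes beat System routes.
import ipaddress

PRECEDENCE = {"User": 0, "BGP": 1, "System": 2}

routes = [  # (source, prefix, next hop) – illustrative values
    ("System", "0.0.0.0/0", "Internet"),
    ("System", "10.1.0.0/16", "VnetLocal"),
    ("User",   "0.0.0.0/0", "VirtualAppliance 10.1.0.4"),  # force via firewall
]

def effective_route(dest: str) -> str:
    ip = ipaddress.ip_address(dest)
    matches = [(src, ipaddress.ip_network(pfx), hop)
               for src, pfx, hop in routes
               if ip in ipaddress.ip_network(pfx)]
    matches.sort(key=lambda m: (-m[1].prefixlen, PRECEDENCE[m[0]]))
    return matches[0][2]

print(effective_route("8.8.8.8"))   # VirtualAppliance 10.1.0.4 – the UDR beats System
print(effective_route("10.1.2.3"))  # VnetLocal – the longer prefix beats the 0.0.0.0/0 UDR
```

Note the second result: the 0.0.0.0/0 UDR does not override the more specific VNet route, which is exactly the kind of detail that trips people up.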
Route Table Association
A Route Table is associated with one or more subnets. The purpose of this is to cause the UDRs of the Route Table to be deployed to the NICs that are connected to the subnet(s).
Technically speaking, there is nothing wrong with associating a single Route Table with more than one subnet. But I would question the wisdom of this practice.
1:N Association
The concept here is that one creates a single Route Table that will be used across many subnets. The desire is to reduce effort – there is no cost saving because Route Tables are free:
You create a Route Table
You add all the required UDRs for your subnets
You associate the Route Table with the subnets
It all sounds good until you realise:
That individual subnets can require different routes. For example, a simple subnet containing some compute might only require a route for 0.0.0.0/0 to use a firewall as a next hop. On the other hand, a subnet containing VNet-integrated API Management might require 60+ routes. Your security model at this point can become complicated, unpredictable, and contradictory.
Centrally managing network resources, such as Route Tables, for sharing and “quality control” contradicts one of the main purposes of The Cloud: self-service. Watch how quickly the IT staff that the business does listen to (the devs) rebel against what you attempt to force upon them! Cloud is how you work, not where you work.
Certain security models won’t work.
1:1 Association
The purpose of 1:1 association is to:
Enable granular routing configuration; routes are generated for each subnet depending on the resource/networking/security requirements of the subnet.
Enable self-service for developers/operators.
The downside is that you can end up with a lot of Route Tables – keep in mind that some people create too many subnets. One might argue that this is a lot of effort but I would counter that by saying:
I can automate the creation of Route Tables using several means including infrastructure-as-code (IaC), Azure Policy, or even Azure Virtual Network Manager (with its new per-VNet pricing model).
Most subnets will have just one UDR: 0.0.0.0/0 via the firewall.
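The automation is trivial precisely because most Route Tables look identical. A minimal sketch of the generation step (the `rt-<vnet>-<subnet>` naming convention and firewall IP are illustrative assumptions, not from the text):

```python
# Sketch: generating a 1:1 Route Table definition per subnet. The naming
# convention and firewall IP below are hypothetical; the output mirrors the
# fields a UDR needs (prefix, next hop type, next hop IP).
FIREWALL_IP = "10.1.0.4"  # hypothetical hub firewall address

def route_table_for(vnet: str, subnet: str) -> dict:
    """Build the definition of a per-subnet Route Table holding the
    one UDR that most subnets need: 0.0.0.0/0 via the firewall."""
    return {
        "name": f"rt-{vnet}-{subnet}",
        "routes": [{
            "name": "default-via-firewall",
            "addressPrefix": "0.0.0.0/0",
            "nextHopType": "VirtualAppliance",
            "nextHopIpAddress": FIREWALL_IP,
        }],
    }

rt = route_table_for("vnet-dub01-app1", "snet-web")
print(rt["name"])  # rt-vnet-dub01-app1-snet-web
```

In practice this dictionary would be fed into IaC (Bicep/Terraform parameters) or created by policy, but the point is the same: the 1:1 model costs almost nothing once generated.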
What Do I Do & Recommend?
I use the approach of 1:1 association. Each subnet, subject to support, gets its own Route Table. The Route Table is named after the VNet/subnet and is associated only with that subnet.
I’ve been using that approach for as long as I can remember. It was formalised 6 years ago and it has worked at scale. As I stated, it’s no effort because the creation/association of the Route Tables is automated. The real benefit is the predictability of the resulting security model.
In this post, I want to explain why routing is so important in Microsoft Azure. Without truly understanding routing, and implementing predictable and scalable routing, you do not have a secure network. What one needs to understand is that routing is the security cabling of Azure.
My Favourite Interview Question
Now and then, I am asked to do a technical interview of a new candidate at my employer. I enjoy doing technical interviews because you get to have a deep tech chat with someone who is on their career journey. Sometimes it’s a hopeful youngster who is still new to the business but demonstrates an ability and a desire to learn – they’re a great find, by the way. Sometimes it’s a veteran that you learn something from. And sometimes, they fall into the trap of discussing my favourite Azure topic: routing.
Before I continue, I should warn potential interviewees that the thing I dislike most in a candidate is when they talk about things that “happened while I was there” and then they claim to be experts in that stuff.
The candidate will say “I deployed a firewall in Azure”. The little demon on my shoulder says “ask them, ask them, ASK THEM!”. I can’t help myself – “How did you make traffic go through the firewall?”. The wrong answer here is: “it just did”.
Look at that beauty. You’ve got Azure networks in the middle (hub) and the right (spoke). And on the left is the remote network connected by some kind of site-to-site networking. The deployment even has the rarely used and pricey Network SKU of DDoS protection. Fantastic! Security is important!
And to re-emphasise that security is important, the firewall (it doesn’t matter what brand you choose in this scenario) is slap-bang in the middle of the whole thing. Not only is that firewall important, but all traffic will have to go through it – nothing happens in that network without the firewall controlling it.
Except, that the firewall is seeing absolutely no traffic at all.
Everything is a VM in the platform, including NVA routers and Virtual Network Gateways (2 VMs).
Packets always route directly from the source NIC to the destination NIC.
In our above firewall scenario, let’s consider two routes:
Traffic from a client in the remote site to an Azure service in the spoke.
A response from the service in the Azure spoke to the client in the remote site.
The client sends traffic from the remote site across the site-to-site connection. The physical part of that network is the familiar flow that you’d see in tracert. Things change once that packet hits Azure. The site-to-site connection terminates in the NVA/virtual network gateway. Now the packet needs to route to the service in the spoke. The scenario is that the NVA/virtual network gateway is the source (in Azure networking) and the spoke service is the destination. The packet leaves the NIC of the NVA/virtual network gateway and routes (via the underlying physical Azure network) directly to the NIC of one of the load-balanced VMs in the spoke. The packet did not route through the firewall. The packet did not go through a default gateway. The packet did not go across some virtual peering wire. Repeat it after me:
Packets route directly from source to destination.
Now for the response. The VM in the spoke is going to send a response. Where will that response go? You might say “The firewall is in the middle of the diagram, Aidan. It’s obvious!”. Remember:
Packets route directly from source to destination.
In this scenario, the destination is the NVA/virtual network gateway. The packet will leave the VM in the spoke and arrive at the NIC of the NVA/virtual network gateway.
It doesn’t matter how pretty your Visio is (Draw.io is a million times better, by the way – thanks for the tip, Haakon). It doesn’t matter what your intention was. Packets … route directly from source to destination.
User-Defined Routes – Right?
You might be saying, “Duh, Aidan, User-Defined Routes (UDRs) in Route Tables will solve this”. You’re sort of on the right track – maybe even mostly there. But I know from talking to many people over the years that they completely overlook the two (I’d argue three) other sources of routes in Azure. Those other routes are playing a role here that you’re not appreciating, and if you do not configure your UDRs/Route Tables correctly you’ll either change nothing or break your network.
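Azure’s documented route selection behaviour can be modelled in a few lines. This is an illustrative Python sketch, not an Azure API: the longest matching prefix wins, and when two routes cover the same prefix, a user-defined route beats a BGP route, which beats a system route. The addresses and next hops below are made up for the example.

```python
import ipaddress

# Tie-breaker when prefixes are equally specific:
# user-defined routes beat BGP routes, which beat system routes.
SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}

def select_route(routes, destination):
    """Pick the effective route for a destination IP.

    routes: list of (prefix, source, next_hop) tuples.
    Longest-prefix match wins; ties are broken by route source.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, source, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if dest not in net:
            continue
        key = (net.prefixlen, -SOURCE_RANK[source])
        if best is None or key > best[0]:
            best = (key, next_hop)
    return best[1] if best else None  # None: no route, traffic dropped

# Hypothetical route table for a hub VNet:
routes = [
    ("0.0.0.0/0", "system", "Internet"),
    ("10.0.0.0/8", "system", "VNetPeering"),
    ("10.1.1.0/24", "user", "10.0.0.4"),  # UDR forcing spoke traffic via a firewall NVA
]
```

With these routes, traffic to `10.1.1.10` hits the /24 UDR and goes via the firewall at `10.0.0.4`, while traffic to `10.2.0.5` only matches the /8 system route and bypasses the firewall entirely – exactly the trap described above.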
Routing Is The Security Cabling of Azure
In the on-premises world, we use cables to connect network appliances. You can’t get from one top-of-rack switch/VLAN to another without going through a default gateway. That default gateway can be a switch, a switch core, a router, or a firewall. Connections are made possible via cables. Just like water flow is controlled by pipes, packets can only transit cables that you lay down.
If you read my Azure Virtual Networks Do Not Exist post then you should understand that the NICs in a VNet, or in peered VNets, form a mesh that can route directly to each other. There is no virtual network cabling; this means that we need to control the flows via some other means, and that means is routing.
One must understand the end state, how routing works, and how to manipulate routing to end up in the desired end state. That’s the obvious bit – but what is often overlooked is that the resulting security model should be scalable, manageable, and predictable.
A Greek phalanx, protected by a shield wall made up of many individuals working under one instruction as a unit – like an NSG.
Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).
Port ACLs
In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a feature that customers mostly overlooked at the time, called Port ACLs. I liked Port ACLs; they were mostly unknown, could only be managed using PowerShell, and made for great demo content in some TechEd/Ignite sessions that I did back in the day.
Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.
The concept of Port ACLs was that they gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.
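The allow/deny logic described here can be sketched in Python. This is a conceptual model, not how the vSwitch actually implements it: rules are checked in ascending priority-number order (a lower number is a higher priority, as in NSGs) and the first match wins; anything unmatched is denied.

```python
def evaluate_inbound(rules, protocol, port):
    """Evaluate inbound rules in priority order; first match wins.

    rules: list of (priority, protocol, port, action) tuples,
    where a lower priority number is processed first.
    """
    for _, rule_proto, rule_port, action in sorted(rules):
        if rule_proto in (protocol, "Any") and rule_port in (port, "Any"):
            return action
    return "Deny"  # nothing matched: default deny

# The HTTPS-only web service from the example above:
inbound = [
    (4096, "Any", "Any", "Deny"),  # low-priority catch-all deny
    (100, "TCP", 443, "Allow"),    # high-priority HTTPS allow
]
```

With these two rules, `evaluate_inbound(inbound, "TCP", 443)` returns `Allow` and everything else – RDP, SSH, UDP – falls through to the catch-all `Deny`, regardless of what the guest OS is listening on.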
Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.
A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.
NSG and Admin Rules Are Port ACLs
Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.
What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.
A Little About NSG Rules
This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.
A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (deny all by default) to control egress traffic.
When I create an NSG, I can associate it with:
A NIC: Only that NIC is affected
A subnet: All NICs, including VNet-integrated PaaS resources and Private Endpoints, are affected
The association is simply a management scaling feature. When you associate an NSG with a subnet, the rules are not processed at the subnet.
Associating an NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.
This means:
Inbound rules prevent traffic from entering the virtual machine.
Outbound rules prevent traffic from leaving the virtual machine.
Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.
If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.
The subnet NSG is processed first for inbound traffic.
The NIC NSG is processed first for outbound traffic.
Keep it simple, stupid (the KISS principle).
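When both a subnet NSG and a NIC NSG exist, the processing order above can be modelled like this. It is an illustrative Python sketch (Azure exposes no such API): both NSGs must allow a flow, inbound traffic hits the subnet NSG first, outbound traffic hits the NIC NSG first, and a deny at either stage drops the flow before the other NSG ever sees it.

```python
def effective_verdict(subnet_nsg, nic_nsg, direction, flow):
    """Return (verdict, stage). Both NSGs must allow the flow.

    subnet_nsg/nic_nsg: callables mapping a flow to "Allow" or "Deny".
    Inbound processes the subnet NSG first; outbound the NIC NSG first.
    """
    stages = (
        [("subnet", subnet_nsg), ("nic", nic_nsg)]
        if direction == "inbound"
        else [("nic", nic_nsg), ("subnet", subnet_nsg)]
    )
    for name, nsg in stages:
        if nsg(flow) == "Deny":
            return ("Deny", name)  # dropped here; the later NSG never sees it
    return ("Allow", None)

# Hypothetical example: the subnet NSG blocks RDP, the NIC NSG allows everything.
subnet_nsg = lambda flow: "Deny" if flow == "RDP" else "Allow"
nic_nsg = lambda flow: "Allow"
```

In this example, inbound RDP is denied at the subnet stage even though the NIC NSG would have allowed it – which is why two layers of NSGs make troubleshooting harder and a single subnet association keeps things predictable.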
Micro-Segmentation
As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:
We zone using NSG rules.
Virtual networks and their subnets do not exist!
The only time we need to create additional subnets is when there are compatibility issues – for example, NSG/Route Table association constraints, or a PaaS resource that requires a dedicated subnet.
Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.