Building A Hub & Spoke Using Azure Virtual Network Manager

In this post, I will show how to use Azure Virtual Network Manager (AVNM) to enforce peering and routing policies in a zero-trust hub-and-spoke Azure network. The goal will be to deliver ongoing consistency of the connectivity and security model, reduce operational friction, and ensure standardisation over time.

Quick Overview

AVNM is a tool that continues to evolve. It has gone from something that I considered overpriced and under-featured to something that, with its recently updated pricing, I would want to deploy first in my networking architecture. In summary, AVNM offers:

  • Network/subnet discovery and grouping
  • IP Address Management (IPAM)
  • Connectivity automation
  • Routing automation

There is (and will be) more to AVNM, but I want to focus on the above features because together they simplify the task of building out Azure platform and application landing zones.

The Environment

One can manage virtual networks using static groups, but that ignores the fact that The Cloud is a dynamic and agile place. Developers, operators, and (other) service providers will be deploying virtual networks, and our goal is to discover and manage those networks. An organisation might be simple, in which case a one-size-fits-all policy will do. However, we might need to engineer for complexity, and we can reduce that complexity by:

  • Adopting the Cloud Adoption Framework and Zero Trust recommendations of 1 subscription/virtual network per workload.
  • Organising subscriptions (workloads) using Management Groups.
  • Designing a Management Group hierarchy based on policy/RBAC inheritance instead of basing it on an organisation chart.
  • Using tags to denote roles for virtual networks.

I have built a demo lab where I am creating a hub & spoke in the form of a virtual data centre (an old term used by Microsoft). This concept will use a hub to connect and segment workloads in an Azure region. Based on Route Table limitations, the hub will support up to 400 networked workloads placed in spoke virtual networks. The spokes will be peered to the hub.

A Management Group has been created for dub01. All subscriptions for the hub and workloads in the dub01 environment will be placed into the dub01 Management Group.

Each workload will be classified based on security, compliance, and any other requirements that the organisation may have. Three policies have been predefined and named gold, silver, and bronze. Each of these classifications has a Management Group inside dub01, called dub01gold, dub01silver, and dub01bronze. Workloads are placed into the appropriate Management Group based on their classification and are subject to Azure Policy initiatives that are assigned to dub01 (regional policies) and to the classification Management Groups.

You can see two subscriptions above. The platform landing zone, p-dub01, is going to be the hub for the network architecture. It has therefore been classified as gold. The workload (application landing zone) called p-demo01 has been classified as silver and is placed in the appropriate Management Group. Both gold and silver workloads should be networked and use private networking only where possible, meaning that p-demo01 will have a spoke virtual network for its resources. Spoke virtual networks in dub01 will be connected to the hub virtual network in p-dub01.

Keep in mind that no virtual networks exist at this time.

AVNM Resource

AVNM is based on an Azure resource and subresources for the features/configurations. The AVNM resource is deployed with a management scope; this means that a single AVNM resource can be created to manage a certain scope of virtual networks. One can centrally manage all virtual networks. Or one can create many AVNM resources to delegate management (and the cost) of managing various sets of virtual networks.

I’m going to keep this simple and use one AVNM resource as most organisations that aren’t huge will do. I will place the AVNM resource in a subscription at the top of my Management Group hierarchy so that it can offer centralised management of many hub-and-spoke deployments, even if we only plan to have 1 now; plans change! This also allows me to have specialised RBAC for managing AVNM.

Note that AVNM can manage virtual networks across many regions so my AVNM resource will, for demonstration purposes, be in West Europe while my hub and spoke will be in North Europe. I have enabled the Connectivity, Security Admin, and User-Defined Routing features.

AVNM has one or more management scopes. This is a central AVNM for all networks, so I’m setting the Tenant Root Group as the top of the scope. In a lab, you might use a single subscription or a dedicated Management Group.
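If you prefer to define this in code rather than the portal, a minimal Bicep sketch of such an AVNM resource could look like the following; the resource name, management group ID, and API version are placeholders/assumptions, so adjust them to your environment.

// Sketch: a central AVNM instance scoped to a management group.
resource avnm 'Microsoft.Network/networkManagers@2024-05-01' = {
  name: 'avnm-central'                     // illustrative name
  location: 'westeurope'
  properties: {
    networkManagerScopes: {
      managementGroups: [
        '/providers/Microsoft.Management/managementGroups/<tenant-root-group-id>'
      ]
    }
    // Features enabled for this manager; 'Routing' requires an API version that includes UDR management.
    networkManagerScopeAccesses: [
      'Connectivity'
      'SecurityAdmin'
      'Routing'
    ]
  }
}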

Defining Network Groups

We use Network Groups to assign a single configuration to many virtual networks at once. There are two membership types:

  • Static: You add/remove members to or from the group
  • Dynamic: You use a friendly wizard to define an Azure Policy to automatically find virtual networks and add/remove them for you. Keep in mind that Azure Policy might take a while to discover virtual networks because of how irregularly it runs. However, once added, the configuration deployment is immediately triggered by AVNM.

There are also two kinds of member resources in a group:

  • Virtual networks: The virtual network and contained subnets are subject to the policy. Virtual networks may be static or dynamic members.
  • Subnets: Only the subnet is targeted by the configuration. Subnets are only static members.

Keep in mind that something like peering targets only a virtual network, while User-Defined Routes target subnets.

I want to create a group to target all virtual networks in the dub01 scope. This group will be the basis for configuring any virtual network (except the hub) to be a secured spoke virtual network.

I created a Network Group called dub01spokes with a member type of Virtual Networks.

I then opened the Network Group and configured dynamic membership using this Azure Policy editor:

Any discovered virtual network that is not in the p-dub01 subscription and is in North Europe will be automatically added to this group.

The resulting policy is visible in Azure Policy with a category of Azure Virtual Network Manager.
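If you wanted to author that membership policy yourself, a hedged Bicep sketch might look like this; the Microsoft.Network.Data mode and addToNetworkGroup effect are the AVNM-specific parts, while the names, parameters, and exact conditions are assumptions based on my wizard settings (the wizard also creates the policy assignment for you).

targetScope = 'managementGroup'        // for example, the dub01 Management Group

param networkGroupId string            // resource ID of the dub01spokes Network Group (assumed parameter)
param hubSubscriptionId string         // the p-dub01 subscription GUID to exclude (assumed parameter)

resource dub01SpokesMembers 'Microsoft.Authorization/policyDefinitions@2023-04-01' = {
  name: 'avnm-dub01spokes-members'     // illustrative name
  properties: {
    displayName: 'AVNM dynamic membership - dub01spokes'
    mode: 'Microsoft.Network.Data'     // the AVNM-specific policy mode
    policyRule: {
      if: {
        allOf: [
          {
            field: 'type'
            equals: 'Microsoft.Network/virtualNetworks'
          }
          {
            field: 'location'
            equals: 'northeurope'
          }
          {
            field: 'id'
            notContains: hubSubscriptionId   // exclude virtual networks in the hub subscription
          }
        ]
      }
      then: {
        effect: 'addToNetworkGroup'
        details: {
          networkGroupId: networkGroupId
        }
      }
    }
  }
}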

IP Address Management

I’ve been using an approach of assigning a /16 to all virtual networks in a hub & spoke for years. This approach blocks the prefix in the organisation and guarantees IP capacity for all workloads in the future. It also simplifies routing and firewall rules. For example, a single route will be needed in other hubs if we need to interconnect multiple hub-and-spoke deployments.

I can reserve this capacity in AVNM IP Address Management. You can see that I have reserved 10.1.0.0/16 for dub01:

Every virtual network in dub01 will be created from this pool.

Creating The Hub Virtual Network

I’m going to save some time/money here by creating a skeleton hub. I won’t deploy a routing NVA/Virtual Network Gateway, so I won’t be able to share it later. I also won’t deploy a firewall, but the private address of the firewall will be 10.1.0.4.

I’m going to deploy a virtual network to use as the hub. I can use Bicep, Terraform, PowerShell, AZ CLI, or the Azure Portal. The important thing is that I refer to the IP address pool (above) when assigning an address prefix to the new virtual network. A check box called Allocate Using IP Address Pools opens a blade in the Azure Portal. Here you can select the Address Pool to take a prefix from for the new virtual network. All I have to do is select the pool and then use a subnet mask to decide how many addresses to take from the pool (/22 for my hub).

Note that the only time that I’ve had to ask a human for an address was when I created the pool. I can create virtual networks with non-conflicting addresses without any friction.
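For reference, the skeleton hub is just an ordinary virtual network. Here is a minimal Bicep sketch, with the name and the /22 prefix hard-coded for illustration (in my deployment the prefix came from the IPAM pool).

resource hubVnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'p-dub01-net-vnet'            // assumed name for the hub
  location: 'northeurope'
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.1.0.0/22'                 // a /22 taken from the dub01 10.1.0.0/16 pool
      ]
    }
    subnets: [
      {
        name: 'AzureFirewallSubnet'   // a future firewall here would receive 10.1.0.4
        properties: {
          addressPrefix: '10.1.0.0/26'
        }
      }
    ]
  }
}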

Create Connectivity Configuration

A Connectivity Configuration is a method of connecting virtual networks. We can implement:

  • Hub-spoke peering: A traditional peering between a hub and a spoke, where the spoke can use the Virtual Network Gateway/Azure Route Server in the hub.
  • Mesh: A mesh using a Connected Group (full mesh peering between all virtual networks). This is used to minimise latency between workloads with the understanding that a hub firewall will not have the opportunity to do deep inspection (performance over security).
  • Hub & spoke with mesh: The targeted VNets are meshed together for interconnectivity. They will route through the hub to communicate with the outside world.

I will create a Connectivity Configuration for a traditional hub-and-spoke network. This means that:

  • I don’t need to add code for VNet peering to my future templates.
  • No matter who deploys a VNet in the scope of dub01, they will get peered with the hub. My design will be implemented, regardless of their knowledge or their willingness to comply with the organisation’s policies.

I created a new Connectivity Configuration called dub01spokepeering.

In Topology I set the type to hub-and-spoke. I select my hub virtual network from the p-dub01 subscription as the hub Virtual Network. I then select the group of networks that I want to peer with the hub, dub01spokes. I can configure the peering connections; normally I would select Hub As Gateway here, but I don’t have a Virtual Network Gateway or an Azure Route Server in the hub, so the box is greyed out.

I am not enabling inter-spoke connectivity using the above configuration – AVNM has a few tricks, and this is one of them, where it uses Connected Groups to create a mesh of peering in the fabric. Instead, I will be using routing (later) via a hub firewall for secure transitive connectivity, so I leave Enable Connectivity Within Network Group blank.

Did you notice the checkbox to delete any pre-existing peering configurations? If a peering doesn’t point to the hub then I’m removing it, so nobody can use their rights to bypass my networking design.

I completed the wizard and executed the deployment against the North Europe region. I know that there is nothing to configure, but this “cleans up” the GUI.
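For those who prefer code, here is a hedged Bicep sketch of roughly the same Connectivity Configuration; the resource names, parameter, and API version are assumptions.

param hubVnetId string   // resource ID of the hub virtual network in p-dub01 (assumed parameter)

resource avnm 'Microsoft.Network/networkManagers@2024-05-01' existing = {
  name: 'avnm-central'                 // assumed name from earlier
}

resource dub01spokes 'Microsoft.Network/networkManagers/networkGroups@2024-05-01' existing = {
  parent: avnm
  name: 'dub01spokes'
}

resource spokePeering 'Microsoft.Network/networkManagers/connectivityConfigurations@2024-05-01' = {
  parent: avnm
  name: 'dub01spokepeering'
  properties: {
    connectivityTopology: 'HubAndSpoke'
    hubs: [
      {
        resourceId: hubVnetId
        resourceType: 'Microsoft.Network/virtualNetworks'
      }
    ]
    appliesToGroups: [
      {
        networkGroupId: dub01spokes.id
        groupConnectivity: 'None'      // no spoke-to-spoke mesh; inter-spoke traffic will route via the hub firewall
        useHubGateway: 'False'         // no Virtual Network Gateway/Route Server in this skeleton hub
        isGlobal: 'False'
      }
    ]
    deleteExistingPeering: 'True'      // remove peerings that would bypass the design
    isGlobal: 'False'
  }
}
// The configuration must still be deployed (committed) to the target region(s), as done in the wizard.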

Create Routing Configuration

Folks who have heard me discuss network security in Azure should have learned that the most important part of running a firewall in Azure is routing. We will configure routing in the spokes using AVNM. The hub firewall subnet(s) will have full knowledge of all other networks by design:

  • Spokes: Using system routes generated by peering.
  • Remote networks: Using BGP routes. The VPN Local Network Gateway creates BGP routes in the Azure Virtual Networks for “static routes” when BGP is not used in VPN tunnels. Azure Route Server will peer with NVA routers (SD-WAN, for example) to propagate remote site prefixes using BGP into the Azure Virtual Networks.

The spokes routing design is simple:

  • A Route Table will be created for each subnet in the spoke Virtual Networks. Route Tables are free resources, and this design allows customised routing for specific scenarios, such as VNet-integrated PaaS resources that require dedicated routes.
  • A single User-Defined Route (UDR) forces traffic leaving a spoke Virtual Network to pass through the hub firewall, where firewall rules will deny all traffic by default.
  • Traffic inside the Virtual Network will flow by default (directly from source to destination) and be subject to NSG rules, depending on support by the source and destination resource types.
  • The spoke subnets will be configured not to accept BGP routes from the hub; this is to prevent the spoke from bypassing the hub firewall when routing to remote sites via the Virtual Network Gateway/NVA.

I created a Routing Configuration called dub01spokerouting. In this Routing Configuration I created a Rule Collection called dub01spokeroutingrules.

A User-Defined Route, known as a Routing Rule, was created called everywhere:

The new UDR will override (deactivate) the System route to 0.0.0.0/0 via Internet and set the hub firewall as the new default next hop for traffic leaving the Virtual Network.
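The effective result on each spoke subnet is equivalent to a Route Table like this minimal Bicep sketch; the name is illustrative because AVNM creates and manages the actual resources.

resource spokeSubnetRt 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-spoke-subnet'                // illustrative name
  location: 'northeurope'
  properties: {
    disableBgpRoutePropagation: true     // spokes do not accept BGP routes from the hub
    routes: [
      {
        name: 'everywhere'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.1.0.4'   // the hub firewall
        }
      }
    ]
  }
}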

Here you can see the Routing Collection containing the Routing Rule:

Note that Enable BGP Route Propagation is left unchecked and that I have selected dub01spokes as my target.

And here you can see the new Routing Configuration:

Completed Configurations

I now have two configurations completed and configured:

  • The Connectivity Configuration will automatically peer in-scope Virtual Networks with the hub in p-dub01.
  • The Routing Configuration will automatically configure routing for in-scope Virtual Network subnets to use the p-dub01 firewall as the next hop.

Guess what? We have just created a Zero Trust network! All that’s left is to set up spokes with their NSGs and a WAF/WAFs for HTTPS workloads.

Deploy Spoke Virtual Networks

We will create spoke Virtual Networks from the IPAM block just like we did with the hub. Here’s where the magic is going to happen.

The evaluation-style Azure Policy assignments that are created by AVNM will run approximately every 30 minutes. That means a new Virtual Network won’t be discovered straight after creation – but it will be discovered not long after. A signal will be sent to AVNM to update group memberships based on added or removed Virtual Networks, depending on the scope of each group’s Azure Policy. Configurations will be deployed or removed immediately after a Virtual Network is added to or removed from the group.

To demonstrate this, I created a new spoke Virtual Network in p-demo01. I created a new Virtual Network called p-demo01-net-vnet in the resource group p-demo01-net:

You can see that I used the IPAM address block to get a unique address space from the dub01 /16 prefix. I added a subnet called CommonSubnet with a /28 prefix. What you don’t see is that I configured the following for the subnet in the subnet wizard:

As you can see, the Virtual Network has not been configured by AVNM yet:

We will have to wait for Azure Policy to execute – or we can force a scan to run against the resource group of the new spoke Virtual Network:

  • Az CLI: az policy state trigger-scan --resource-group <resource group name>
  • PowerShell: Start-AzPolicyComplianceScan -ResourceGroupName <resource group name>

You could add a command like above into your deployment code if you wished to trigger automatic configuration.

This force process is not exactly quick either! 6 minutes after I forced a policy evaluation, I saw that AVNM was informed about a new Virtual Network:

I returned to AVNM and checked out the Network Groups. The dub01spokes group has a new member:

You can see that a Connectivity Configuration was deployed. Note that the summary doesn’t have any information on Routing Configurations – that’s an oversight by the AVNM team, I guess.

The Virtual Network does have a peering connection to the hub:

The routing has been deployed to the subnet:

A UDR has been created in the Route Table:

Over time, more Virtual Networks are added and I can see from the hub that they are automatically configured by AVNM:

Summary

I have done presentations on AVNM and demonstrated the above configurations in 40 minutes at community events. You could deploy the configurations in under 15 minutes. You can also create them using code! With this setup we can take control of our entire Azure networking deployment – and I didn’t even show you the Admin Rules feature for essential “NSG” rules (they aren’t NSG rules but use the same underlying engine to execute before NSG rules).

Want To Learn More?

Check out my company, Cloud Mechanix, where I share this kind of knowledge through:

  • Consulting services for customers and Microsoft partners using a build-with approach.
  • Custom-written and ad-hoc Azure training.

Together, we can educate your team and bring great Azure solutions to your organisation.

Day Two Devops – Azure VNets Don’t Exist

I had the pleasure of chatting with Ned Bellavance and Kyler Middleton on Day Two DevOps one evening recently to discuss the basics of Azure networking, using my line “Azure Virtual Networks Do Not Exist”. I think I talked nearly non-stop for 40 minutes 🙂 Tune in and you’ll hear my explanation of why many people get so much wrong in Azure networking/security.

Designing A Hub And Spoke Infrastructure

How do you plan a hub & spoke architecture? Based on much of what I have witnessed, I think very few people do any planning at all. In this post, I will explain some essential things to plan and how to plan them.

Rules of Engagement

Microsoft has shared some concepts in the Well-Architected Framework (simplicity) and the documentation for networking & Zero Trust (micro-segmentation, resilience, and isolation).

The hub & spoke will contain networks in a single region, following these concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

Management Groups

I agree with many things in the Cloud Adoption Framework “Enterprise Scale” and I disagree with some other things.

I agree that we should use Management Groups to organise subscriptions based on Policy architecture and role-based access control (RBAC – granting access to subscriptions via Entra groups).

I agree that each workload (CAF calls them landing zones) should have a dedicated subscription – this simplifies operations and governance like you wouldn’t believe.

I can see why they organise workloads based on their networking status:

  • Corporate: Workloads that are internal only and are connected to the hub for on-premises connectivity. No public IP addresses should be allowed where technically feasible.
  • Online: Workloads that are online only and are not permitted to be connected to the hub.
  • Hybrid: This category is missing from CAF and many have added it themselves – WAN and Internet connectivity are usually not binary exclusive OR decisions.

I don’t like how Enterprise Scale buckets all of those workloads into a single grouping because it fails to acknowledge that a truly large enterprise will have many ownership footprints in a single tenant.

I also don’t like how Enterprise Scale merges all hubs into a single subscription or management group. Yes, many organisations have central networking teams. Large organisations may have many networking teams. I like to separate hub resources (not feasible with Virtual WAN) into different subscriptions and management groups for true scaling and governance simplicity.

Here is an example of how one might achieve this. I am going to have two hub & spoke deployments in this example:

  • DUB01: Located in Azure North Europe
  • AMS01: Located in Azure West Europe

Some of you might notice that I have been inspired by Microsoft’s data centre naming for the naming of these regional footprints. The reasons are:

  • Naming regions after “North Europe” or “East US” is messy when you think about naming network footprints in East US2, West US2, and so on.
  • Microsoft has already done the work for us. The Dublin (North Europe) region data centres are called DUB05-DUB15 and Microsoft uses AMS01, etc for Middenmeer (West Europe).
  • A single virtual network may have up to 500 peers. Once we hit 500 peers then we need to deploy another hub & spoke footprint in the region. The naming allows DUB02, DUB03, etc.

The change from CAF Enterprise Scale is subtle but look how instantly more scalable and isolated everything is. A truly large organisation can delegate duties as necessary.

If an identity responsible for the AMS01 hub & spoke is compromised, the DUB01 hub & spoke is untouched. Resources are in dedicated subscriptions, so the blast radius of a subscription compromise is limited too.

There is also a logical placement of the resources based on ownership/location.

You don’t need to recreate policy – you can add more assignments for your existing initiatives.

If an enterprise currently has a single networking team, their IDs are simply added to more groups as new hub & spoke deployments are added.

IP Planning

One of the key principles in the design is simplicity: keep it simple stupid (KISS). I’m going to jump ahead a little here and give you a peek into the future. We will implement “Network segmentation: Many ingress/egress cloud micro-perimeters with some micro-segmentation” from the Azure zero-trust guidance.

The only connection that will exist between DUB01 and AMS01 is a global VNet peering connection between the hubs. All traffic between DUB01 and AMS01 must route via the firewalls in the hubs. This will require some user-defined routing and we want to keep this as simple as possible.

For example, the firewall subnet in DUB01 must have a route(s) to all prefixes in AMS01 via the firewall in the hub of AMS01. The more prefixes there are in AMS01, the more routes we must add to the Route Table associated with the firewall subnet in the hub of DUB01. So we will keep this very simple.

Each hub & spoke will be created from a single IP prefix allocation:

  • DUB01: All virtual networks in DUB01 will be created from 10.1.0.0/16.
  • AMS01: All virtual networks in AMS01 will be created from 10.2.0.0/16.

You might have noticed that Azure Virtual Network Manager uses a default of /16 for an IP address block in the IPAM feature – how convenient!

That means I only have to create one route in the DUB01 firewall subnet to reach all virtual networks in AMS01:

  • Name: AMS01
  • Prefix: 10.2.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the AMS01 firewall

A similar route will be created in the AMS01 firewall subnet to reach all virtual networks in DUB01 (a code sketch of the DUB01 side follows after this list):

  • Name: DUB01
  • Prefix: 10.1.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the DUB01 firewall
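As a hedged Bicep sketch, the DUB01 side of that routing looks something like this; the Route Table name and the firewall IP parameter are assumptions.

param ams01FirewallIp string                 // private IP of the AMS01 hub firewall (assumed parameter)

resource dub01FirewallRt 'Microsoft.Network/routeTables@2023-09-01' existing = {
  name: 'rt-dub01-hub-firewall'              // illustrative name for the DUB01 firewall subnet's Route Table
}

resource toAms01 'Microsoft.Network/routeTables/routes@2023-09-01' = {
  parent: dub01FirewallRt
  name: 'AMS01'
  properties: {
    addressPrefix: '10.2.0.0/16'             // everything in AMS01
    nextHopType: 'VirtualAppliance'
    nextHopIpAddress: ams01FirewallIp
  }
}
// The AMS01 side mirrors this with a DUB01 route for 10.1.0.0/16 via the DUB01 firewall.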

Honestly, that is all that is required. I’ve been doing it for years. It’s beautifully simple.

The firewall(s) are in total control of the flows. This design means that neither location is dependent on the other. Neither AMS01 nor DUB01 trusts the other. If a workload is compromised in AMS01, its reach is limited to whatever firewall/NSG rules permit traffic. With threat detection, flow logs, and other features, you might even discover an attack using a security information & event management (SIEM) system before it has a chance to spread.

Workloads/Landing Zones

Every workload will have a dedicated subscription with the appropriate configurations, such as enabling budgets and configuring Defender for Cloud. Standards should be as automated as possible (Azure Policy). The exact configuration of the subscription should depend on the zone (corporate, online, or hybrid).

When there is a virtual network requirement, then the virtual network will be as small as is required with some spare capacity. For example, a workload with a web VM and a SQL Server doesn’t need a /24 subnet!

Essential Workloads

Are you going to migrate legacy workloads to Azure? Are you going to run Citrix or Azure Virtual Desktop (AVD)? If so, then you are going to require domain controllers.

You might say “We have a policy of running a single ADDS site and our domain controllers are on-premises”. Lovely, at least it was when Windows Server 2003 came out. Remember that I want my services in Azure to be resilient and not to depend on other locations. What happens to all of your Azure services when the network connection to on-premises fails? Or what happens if on-premises goes up in a cloud of smoke? I will put domain controllers in Azure.

Then you might say “We will put domain controllers in DUB01 and AMS01 can use them”. What happens if DUB01 goes offline? That does happen from time to time. What happens if DUB01 is compromised? Not only will I put domain controllers in DUB01, but I will also put them in AMS01. They are low end virtual machines and the cost will be minor. I’ll also do some good ADDS Sites & Services stuff to isolate as much as ADDS lets you:

  • Create subnets for each /16 IP prefix.
  • Create an ADDS site for AMS01 and another for DUB01.
  • Associate each site with the related subnet.
  • Create and configure replication links as required.

The placement and resilience of other things like DNS servers/Private DNS Resolver should be similar.

And none of those things will go in the hub!

Micro-Segmentation

The hub will be our transit network, providing:

  • Site-to-site connectivity, if required.
  • Point-to-site connectivity, if required.
  • A firewall for security and routing purposes.
  • A shared Azure Bastion, if required.

The firewall will be the next hop, by default (expect exceptions) for traffic leaving every virtual network. This will be configured for every subnet (expect exceptions) in every workload.

The firewall will be the glue that routes every spoke virtual network to each other and the outside world. The firewall rules will restrict which of those routes is possible and what traffic is possible – in all directions. Don’t be lazy and allow * to Internet; do you want to automatically enable malware to call home for further downloads or discovery/attack/theft instructions?

The firewall will be carefully chosen to ensure that it includes the features that your organisation requires. Too many organisations pick the cheapest firewall option. Few look at the genuine risks that they face and pick something that best defends against those risks. Allow/deny is not enough any more. Consider the features that pay careful attention to what must be allowed; the ports that you do allow are the very ones that attackers use to compromise their victims.

Every subnet (expect exceptions) will have an NSG. That NSG will have a custom low-priority inbound rule to deny everything; this means that no traffic can enter a NIC (from anywhere, including the same subnet) without being explicitly allowed by a higher priority rule.
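As a hedged Bicep sketch, that per-subnet NSG pattern looks something like this; the names, the example allow rule, and the priorities are illustrative.

resource workloadNsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-workload-subnet'       // illustrative name
  location: 'northeurope'
  properties: {
    securityRules: [
      {
        name: 'AllowHttpsInbound'   // an example higher-priority allow rule for this workload
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'VirtualNetwork'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '443'
        }
      }
      {
        name: 'DenyAllInbound'      // the custom low-priority deny that overrides the permissive defaults
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}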

“Web” (this covers a lot of HTTPS based services, excluding AVD) applications will not be published on the Internet using the hub firewall. Instead, you will deploy a WAF of some kind (or different kinds depending on architectural/business requirements). If you’re clever, and it is appropriate from a performance perspective, you might route that traffic through your firewall for inspection at layers 4-7 using TLS Inspection and IDPS.

Logging and Alerting

You have put all the barriers in place. There are two interesting quotes to consider. The first warns us that we must assume a penetration has already taken place or will take place.

Fundamentally, if somebody wants to get in, they’re getting in…accept that. What we tell clients is: Number one, you’re in the fight, whether you thought you were or not. Number two, you almost certainly are penetrated.

Michael Hayden Former Director of NSA & CIA

The second warns us that attackers don’t think like defenders. We build walls expecting a linear attack. Attackers poke, explore, and prod, looking for any way, including very indirect routes, to get from A to B.

Biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.

John Lambert

Each of our walls offers some kind of monitoring. The firewall has logs, which ideally we can either monitor/alert from or forward to a SIEM.

Virtual Networks offer Flow Logs, which track traffic at the VNet level. VNet Flow Logs are superior to NSG Flow Logs because they catch more traffic (for example, Private Endpoint traffic) and include more interesting data. This is more data that we can send to a SIEM.

Defender for Cloud creates data/alerts. Key Vaults do. Azure databases do. The list goes on and on. All of this is data that we can use to:

  • Detect an attack
  • Identify exploration
  • Uncover an expansion
  • Understand how an attack started and happened

And it amazes me how many organisations choose not to configure these features in any way at all.

Wrapping Up

There are probably lots of finer details to consider but I think that I have covered the essentials. When I get the chance, I’ll start diving into the fun detailed designs and their variations.

How Many Azure Route Tables Should I Have?

In this Azure Networking deep dive, I’m going to share some of my experience around planning the creation and association of Route Tables in Microsoft Azure.

Quick Recap

The purpose of a Route Table is to apply User-Defined Routes (UDRs). The Route Table is associated with a subnet. The UDRs in the Route Table are applied to the NICs in the subnet. The UDRs override System and/or BGP routes to force routes on outgoing packets to match your desired flows or security patterns.

Remember: There are no subnets or default gateways in Azure; the NIC is the router and packets go directly from the source NIC to the destination NIC. A route can be used to alter that direct flow and force the packets through a desired next hop, such as a firewall, before continuing to the destination.

Route Table Association

A Route Table is associated with one or more subnets. The purpose of this is to cause the UDRs of the Route Table to be deployed to the NICs that are connected to the subnet(s).

Technically speaking, there is nothing wrong with associating a single Route Table with more than one subnet. But I would question the wisdom of this practice.

1:N Association

The concept here is that one creates a single Route Table that will be used across many subnets. The desire is to reduce effort – there is no cost saving because Route Tables are free:

  1. You create a Route Table
  2. You add all the required UDRs for your subnets
  3. You associate the Route Table with the subnets

It all sounds good until you realise:

  • That individual subnets can require different routes. For example a simple subnet containing some compute might only require a route for 0.0.0.0/0 to use a firewall as a next hop. On the other hand, a subnet containing VNet-integrated API Management might require 60+ routes. Your security model at this point can become complicated, unpredictable, and contradictory.
  • Centrally managing network resources, such as Route Tables, for sharing and “quality control” contradicts one of the main purposes of The Cloud: self-service. Watch how quickly the IT staff that the business does listen to (the devs) rebel against what you attempt to force upon them! Cloud is how you work, not where you work.
  • Certain security models won’t work.

1:1 Association

The purpose of 1:1 association is to:

  • Enable granular routing configuration; routes are generated for each subnet depending on the resource/networking/security requirements of the subnet.
  • Enable self-service for developers/operators.

The downside is that you can end up with a lot of Route Tables – keep in mind that some people create too many subnets. One might argue that this is a lot of effort, but I would counter that by saying:

  • I can automate the creation of Route Tables using several means including infrastructure-as-code (IaC), Azure Policy, or even Azure Virtual Network Manager (with its new per-VNet pricing model).
  • Most subnets will have just one UDR: 0.0.0.0/0 via the firewall.

What Do I Do & Recommend?

I use the approach of 1:1 association. Each subnet, subject to support, gets its own Route Table. The Route Table is named after the VNet/subnet and is associated only with that subnet.

I’ve been using that approach for as long as I can remember. It was formalised 6 years ago and it has worked at scale. As I stated, it’s no effort because the creation/association of the Route Tables is automated. The real benefit is the predictability of the resulting security model.
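As a hedged sketch of what that automation can look like in Bicep, assuming a simple array of subnet names; all names and parameters here are illustrative.

param vnetName string = 'p-demo01-net-vnet'   // illustrative VNet name
param location string = 'northeurope'
param firewallIp string = '10.1.0.4'
param subnetNames array = [
  'CommonSubnet'
]

// One Route Table per subnet, named after the VNet/subnet, each with the single default UDR.
resource subnetRouteTables 'Microsoft.Network/routeTables@2023-09-01' = [for subnetName in subnetNames: {
  name: 'rt-${vnetName}-${subnetName}'
  location: location
  properties: {
    disableBgpRoutePropagation: true
    routes: [
      {
        name: 'everywhere'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: firewallIp
        }
      }
    ]
  }
}]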

Routing Is The Security Cabling of Azure

In this post, I want to explain why routing is so important in Microsoft Azure. Without truly understanding routing, and implementing predictable and scaleable routing, you do not have a secure network. What one needs to understand is that routing is the security cabling of Azure.

My Favourite Interview Question

Now and then, I am asked to do a technical interview of a new candidate at my employer. I enjoy doing technical interviews because you get to have a deep tech chat with someone who is on their career journey. Sometimes it is a hopeful youngster who is still new to the business but demonstrates an ability and a desire to learn – they’re a great find, by the way. Sometimes it’s a veteran that you learn something from. And sometimes, they fall into the trap of discussing my favourite Azure topic: routing.

Before I continue, I should warn potential interviewees that the thing I dislike most in a candidate is when they talk about things that “happened while I was there” and then they claim to be experts in that stuff.

The candidate will say “I deployed a firewall in Azure”. The little demon on my shoulder says “ask them, ask them, ASK THEM!”. I can’t help myself – “How did you make traffic go through the firewall?”. The wrong answer here is: “it just did”.

The Visio Firewall Fallacy

I love diagrams like this one:

Look at that beauty. You’ve got Azure networks in the middle (hub) and on the right (spoke). And on the left is the remote network connected by some kind of site-to-site networking. The deployment even has the rarely used and pricey Network SKU of DDoS protection. Fantastic! Security is important!

And to re-emphasise that security is important, the firewall (it doesn’t matter what brand you choose in this scenario) is slap-bang in the middle of the whole thing. Not only is that firewall important, but all traffic will have to go through it – nothing happens in that network without the firewall controlling it.

Except, that the firewall is seeing absolutely no traffic at all.

Packets Route Directly From Source To Destination

At this point, I’d like you to (re-)read my post, Azure Virtual Networks Do Not Exist. There I explained two things:

  • Everything is a VM in the platform, including NVA routers and Virtual Network Gateways (2 VMs).
  • Packets always route directly from the source NIC to the destination NIC.

In our above firewall scenario, let’s consider two routes:

  • Traffic from a client in the remote site to an Azure service in the spoke.
  • A response from the service in the Azure spoke to the client in the remote site.

The client sends traffic from the remote site across the site-to-site connection. The physical part of that network is the familiar flow that you’d see in tracert. Things change once that packet hits Azure. The site-to-site connection terminates in the NVA/virtual network gateway. Now the packet needs to route to the service in the spoke. The scenario is that the NVA/virtual network gateway is the source (in Azure networking) and the spoke service is the destination. The packet leaves the NIC of the NVA/virtual network gateway and routes directly (via the underlying physical Azure network) to the NIC of one of the load-balanced VMs in the spoke. The packet did not route through the firewall. The packet did not go through a default gateway. The packet did not go across some virtual peering wire. Repeat it after me:

Packets route directly from source to destination.

Now for the response. The VM in the spoke is going to send a response. Where will that response go? You might say “The firewall is in the middle of the diagram, Aidan. It’s obvious!”. Remember:

Packets route directly from source to destination.

In this scenario, the destination is the NVA/virtual network gateway. The packet will leave the VM in the spoke and appear in the NIC of the NVA/virtual network gateway.

It doesn’t matter how pretty your Visio is (Draw.io is a million times better, by the way – thanks for the tip, Haakon). It doesn’t matter what your intention was. Packets … route directly from source to destination.

User-Defined Routes – Right?

You might be saying, “Duh, Aidan, User-Defined Routes (UDRs) in Route Tables will solve this”. You’re sort of on the right track – maybe even mostly there. But I know from talking to many people over the years that they completely overlook that there are two (I’d argue three) other sources of routes in Azure. Those other routes are playing a role here that you’re not appreciating, and if you do not configure your UDRs/Route Tables correctly you’ll either change nothing or break your network.

Routing Is The Security Cabling of Azure

In the on-premises world, we use cables to connect network appliances. You can’t get from one top-of-rack switch/VLAN to another without going through a default gateway. That default gateway can be a switch, a switch core, a router, or a firewall. Connections are made possible via cables. Just like water flow is controlled by pipes, packets can only transit cables that you lay down.

If you read my Azure Virtual Networks Do Not Exist post then you should understand that NICs in a VNet or in peered VNets are a mesh of NICs that can route directly to each other. There is no virtual network cabling; this means that we need to control the flows via some other means and that means is routing.

One must understand the end state, how routing works, and how to manipulate routing to end up in the desired end state. That’s the obvious bit – but often overlooked is that the resulting security model should be scaleable, manageable, and predictable.

How Do Network Security Groups Work?

A Greek Phalanx, protected by a shield wall made up of many individuals working under 1 instruction as a unit – like an NSG.

Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).

Port ACLs

In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a feature, mostly overlooked by customers at the time, called Port ACLs. I liked Port ACLs; they were mostly unknown, could only be managed using PowerShell, and made for great demo content in some TechEd/Ignite sessions that I did back in the day.

Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.

The concept of Port ACLs was that they gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.

Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.

A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.

NSG and Admin Rules Are Port ACLs

Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.

What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.

A Little About NSG Rules

This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.

A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (deny all by default) to control egress traffic.

When I create an NSG, I can associate it with:

  • A NIC: Only that NIC is affected
  • A subnet: All NICs, including VNet-integrated PaaS resources and Private Endpoints, are affected

The association is simply a management scaling feature. When you associate an NSG with a subnet, the rules are not processed at the subnet.

Tip: virtual networks do not exist!

Associating a NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.

This means:

  • Inbound rules prevent traffic from entering the virtual machine.
  • Outbound rules prevent traffic from leaving the virtual machine.

Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.
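In code, subnet association is just a reference on the subnet; a minimal hedged Bicep sketch (the VNet, subnet, and NSG names plus the prefix are assumptions) looks like this.

resource workloadNsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' existing = {
  name: 'nsg-p-demo01-commonsubnet'   // illustrative NSG name
}

resource spokeVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: 'p-demo01-net-vnet'
}

resource commonSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-09-01' = {
  parent: spokeVnet
  name: 'CommonSubnet'
  properties: {
    addressPrefix: '10.1.4.0/28'      // illustrative prefix
    networkSecurityGroup: {
      id: workloadNsg.id              // the rules are pushed to every NIC in this subnet
    }
  }
}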

If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.

If you hate yourself and everyone around you, then use both options at the same time:

  • The subnet NSG is processed first for inbound traffic.
  • The NIC NSG is processed first for outbound traffic.

Keep it simple, stupid (the KISS principle).

Micro-Segmentation

As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:

  • We zone using NSG rules.
  • Virtual networks and their subnets do not exist!

The only time we need to create additional subnets is when there are compatibility issues, such as NSG/Route Table association, or when a PaaS resource requires a dedicated subnet.

Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.

Azure Virtual Networks Do Not Exist

I see many bad designs where people bring cable-oriented designs from physical locations into Azure. I hear lots of incorrect assumptions when people are discussing network designs. For example: “I put shared services in the hub because they will be closer to their clients”. Or my all-time favourite: people assuming that ingress traffic from site-to-site connections will go through a hub firewall because it’s in the middle of their diagram. All this is because of one common mistake – people don’t realise that Azure Virtual Networks do not exist.

Some Background

Azure was designed to be a multi-tenant cloud capable of hosting an “unlimited” number of customers.

I realise that “unlimited” easily leads us to jokes about endless capacity issues 🙂

Traditional hosting (and I’ve worked there) is based on good old fashioned networking. There’s a data centre. At the heart of the data centre network there is a network core. The entire facility has a prefix/prefixes of address space. Every time that a customer is added, the network administrators carve out a prefix for the new customer. It’s hardly self-service and definitely not elastic.

Azure settled on VXLAN to enable software-defined networking – a process where customer networking could be layered upon the physical networks of Microsoft’s physical data centres/global network.

Falling Into The Myth

Virtual Networks make life easy. Do you want a network? There’s no need to open a ticket. You don’t need to hear from a snarky CCIE who snarls when you ask for a /16 address prefix as if you’ve just asked for Flash “RAID10” from a SAN admin. No; you just open up the Portal/PowerShell/VS Code and you deploy a network of whatever size you want. A few seconds later, it’s there and you can start connecting resources to the new network.

You fire up a VM and you get an address from “DHCP”. It’s not really DHCP. How can it be when Azure virtual networks do not support broadcasts or multicasts? You log into that VM, have networking issues, and troubleshoot like you learned how to:

  1. Ping the local host
  2. Ping the default gateway – oh! That doesn’t work.
  3. Traceroute to a remote address – oh! That doesn’t work either.

And then you start to implement stuff just like you would in your own data centre.

How “It Works”

Let me start by stating that I do not know how the Azure fabric works under the covers. Microsoft aren’t keen on telling us how the sausage is made. But I know enough to explain the observable results.

When you click the Create button at the end of the Create Virtual Machine wizard, an operator is given a ticket, they get clearance from data center security, they grab some patch leads, hop on a bike, and they cycle as fast as they can to the patch panel that you have been assigned in Azure.

Wait … no … that was a bad stress dream. But really, from what I see/hear from many people, they think that something like that happens. Even if that was “virtual”, the whole thing just would not be scaleable.

Instead, I want you to think of an Azure virtual network as a Venn diagram. The process of creating a virtual network instructs the Azure fabric that any NIC that connects to the virtual network can route to any other NIC in the virtual network.

Two things here:

  1. You should take “route” as meaning a packet can go from the source NIC to the destination NIC. It doesn’t mean that it will make it through either NIC – we’ll cover that topic in another post soon.
  2. Almost everything in Azure is a virtual machine at some level. For example, “serverless” Functions run in Microsoft-managed VMs in a Microsoft tenant; Microsoft surfaces the Function functionality to you in your tenant. If you connect PaaS services (like ASE or SQL Managed Instance) to your virtual network then there will be NICs that connect to a subnet.

Connecting a NIC to a virtual network adds the new NIC to the Venn Diagram. The Azure fabric now knows that this new NIC should be able to route to other NICs in the same virtual network and all the previous NICs can route to it.

Adding Virtual Network Peering

Now we create a second virtual network. We peer those virtual networks and then … what happens now? Does a magic/virtual pipe get created? Nope – it’s fault tolerant so two magic/virtual lines connect the virtual networks? Nope. It’s Venn diagram time again.

The Azure fabric learns that the NICs in Virtual Network 1 can now route to the NICs in Virtual Network 2 and vice versa. That’s all. There is no magic connection. From a routing/security perspective, the NICs in Virtual Network 1 are no different to the NICs in Virtual Network 2. You’ve just created a bigger mesh from (at least) two address prefixes.

Repeat after me:

Virtual networks do not exist

How Do Packets Travel?

OK Aidan (or “Joe” if you arrived here from Twitter), how the heck do packets get from one NIC to another?

Let’s melt some VMware fanboy brains – that sends me to my happy place. Azure is built using Windows Server Hyper-V; the same Hyper-V that you get with commercially available Windows Server. Sure, Azure layers a lot of stuff on top of the hypervisor, but if you dig down deep enough, you will find Hyper-V.

Virtual machines, yours or those run by Microsoft, are connected to a virtual switch on the host. The virtual switch is connected to a physical ethernet port on the host. The host is addressable on the Microsoft physical network.

You come along and create a virtual network. The fabric knows to track NICs that are being connected. You create a virtual machine and connect it to the virtual network. Azure will place that virtual machine on one host. As far as you are concerned, that virtual machine has an address from your network.

You then create a second VM and connect it to your virtual network. Azure places that machine on a different host – maybe even in a different data centre. The fabric knows that both virtual machines are in the same virtual network so they should be able to reach each other.

You’ve probably used a 10.something address, like most other customers, so how will your packets stay in your virtual network and reach the other virtual machine? We can thank software-defined networking for this.

Let’s use the addresses of my above diagram for this explanation. The source VM has a customer IP address of 10.0.1.4. It is sending a packet to the destination VM with a customer address of 10.0.1.5. The packet leaves the source NIC, 10.0.1.4, and reaches the host’s virtual switch. This is where the magic happens.

The packet is encapsulated, changing the destination address to that of the destination virtual machine’s host. Imagine you are sending a letter (remember those?) to an apartment number. It’s not enough to say “Apartment 1”; you have to include other information to encapsulate it. That’s what the fabric enables by tracking where your NICs are hosted. Encapsulation wraps the customer packet up in an Azure packet that is addressed to the host’s address, capable of travelling over the Microsoft Global Network – supporting single virtual networks and peered (even globally) virtual networks.

The packet routes over the Microsoft physical network unbeknownst to us. It reaches the destination host, and the encapsulation is removed at the virtual switch. The customer packet is dropped into the memory of the destination virtual machine and Bingo! the transmission is complete.

From our perspective, the packet routes directly from source to destination. This is why you can’t ping a default gateway – it’s not there because it plays no role in routing because: the virtual network does not exist.

I want you to repeat this:

Packets go directly from source to destination

Two Most Powerful Pieces Of Knowledge

If you remember …

  • Virtual networks do not exist and
  • Packets go directly from source to destination

… then you are a huge way along the road of mastering Azure networking. You’ve broken free from string theory (cable-oriented networking) and into quantum physics (software-defined networking). You’ll understand that segmenting networks into subnets for security reasons makes no sense. You will appreciate that placing “shared services” in the hub offers no performance gain (and either breaks your security model or makes it way more complicated).

Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR]connection error: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR]connection error: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded
[ERROR] SSH connection err: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

  • The virtual machine should not be reachable on the Internet.
  • Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft which told me to allow TCP 60000-60001 from Azure Load Balancer to Virtual Network:

I also added my customary DenyAll rule at priority 4000 – the built-in Deny rule doesn’t deny all that much!

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

  1. Deploy Log Analytics and a Storage Account for logs
  2. Enable VNet Flow Logs with Traffic Analytics on the staging subnet (where the build is happening) NSG
  3. Recreate the problem (do a new build)
  4. Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:

Now when I test I see the status switch from Running, to Distributing, to Success.
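For reference, here is a hedged Bicep sketch of the kind of inbound rule set that the staging subnet’s NSG ended up with; the rule names, priorities, and the broad VirtualNetwork sources are illustrative – in practice you should narrow them to the Private Endpoint and proxy VM addresses.

resource aibStagingNsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-aib-staging'                     // illustrative name
  location: 'northeurope'
  properties: {
    securityRules: [
      {
        name: 'AllowAzureLoadBalancerToProxy' // the rule from the Microsoft guidance
        properties: {
          priority: 400
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'AzureLoadBalancer'
          sourcePortRange: '*'
          destinationAddressPrefix: 'VirtualNetwork'
          destinationPortRange: '60000-60001'
        }
      }
      {
        name: 'AllowPrivateEndpointToProxy'   // covers the Private Endpoint (ACI) to proxy VM traffic
        properties: {
          priority: 410
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'VirtualNetwork'
          sourcePortRange: '*'
          destinationAddressPrefix: 'VirtualNetwork'
          destinationPortRange: '60000-60001'
        }
      }
      {
        name: 'AllowProxyToBuildVm'           // TCP 5986 (WinRM) for Windows builds; use 22 for Linux
        properties: {
          priority: 420
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'VirtualNetwork'
          sourcePortRange: '*'
          destinationAddressPrefix: 'VirtualNetwork'
          destinationPortRange: '5986'
        }
      }
      {
        name: 'DenyAllInbound'                // my customary low-priority deny
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}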

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I needed to create rules to allow that traffic.

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are now many more networks and subnets than there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

  • Network groups: A way to identify virtual networks or subnets that we want to manage.
  • Connectivity configurations: Manage how multiple virtual networks are connected.
  • Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
  • Routing configurations: Deploy user-defined routes by policy.
  • Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

  1. Identify a collection of networks/subnets you want to configure by creating a Network Group.
  2. Build a configuration, such as connectivity, security admin rules, or routing.
  3. Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).
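
To make that concrete, here is a rough PowerShell sketch of those three steps for a hub-and-spoke connectivity configuration. All of the resource names are invented, and I’m writing the cmdlet/parameter names from memory, so treat this as a sketch to verify against the Az.Network documentation rather than copy-and-paste code:

  # Sketch only - resource names are invented and cmdlet/parameter names are from memory
  # 1. A network group inside an existing network manager
  $group = New-AzNetworkManagerGroup -Name "ng-spokes" `
    -NetworkManagerName "avnm-demo" -ResourceGroupName "rg-avnm"

  # 2. A hub-and-spoke connectivity configuration that targets the group
  $hubVNet = Get-AzVirtualNetwork -Name "vnet-hub" -ResourceGroupName "rg-hub"
  $hub = New-AzNetworkManagerHub -ResourceId $hubVNet.Id -ResourceType "Microsoft.Network/virtualNetworks"
  $appliesTo = New-AzNetworkManagerConnectivityGroupItem -NetworkGroupId $group.Id
  $config = New-AzNetworkManagerConnectivityConfiguration -Name "cc-hub-and-spoke" `
    -NetworkManagerName "avnm-demo" -ResourceGroupName "rg-avnm" `
    -ConnectivityTopology "HubAndSpoke" -Hub $hub -AppliesToGroup $appliesTo

  # 3. Deploy (commit) the configuration to a region
  Deploy-AzNetworkManagerCommit -Name "avnm-demo" -ResourceGroupName "rg-avnm" `
    -TargetLocation "northeurope" -ConfigurationId $config.Id -CommitType "Connectivity"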

Network Groups

Network groups are the part of AVNM that makes configuration scale. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have a large collection of targets for one configuration deployment.

Network Groups can be:

  • Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
  • Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag that network. Azure Policy will discover the network and inform AVNM that a new group member was found – any configurations applied to the group will then be deployed to the new network automatically. That sounds pretty nice, right?
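
For the curious, the policy definition that backs a dynamic group looks roughly like the following. The tag name and network group ID here are invented, but the interesting parts are the Microsoft.Network.Data mode and the addToNetworkGroup effect:

  # Hypothetical tag and network group ID - the shape of the rule is what matters here
  $networkGroupId = "/subscriptions/<sub-id>/resourceGroups/rg-avnm/providers/Microsoft.Network/networkManagers/avnm-demo/networkGroups/ng-spokes"

  $rule = @{
    if = @{
      allOf = @(
        @{ field = "type"; equals = "Microsoft.Network/virtualNetworks" },
        @{ field = "tags['role']"; equals = "spoke" }
      )
    }
    then = @{
      effect  = "addToNetworkGroup"
      details = @{ networkGroupId = $networkGroupId }
    }
  } | ConvertTo-Json -Depth 10

  New-AzPolicyDefinition -Name "avnm-ng-spokes-members" -Mode "Microsoft.Network.Data" -Policy $rule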

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

  • Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
  • Full mesh: Every virtual network is connected directly to every other virtual network.
  • Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! Mesh uses something different, called a Connected Group, that can handle the scale of many interconnected virtual networks. One configuration inter-connects all of the virtual networks without having to create one-to-one peerings between them.

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs, but they are always processed before NSG rules. You can create a rule or a set of rules and deploy them to a Network Group. All of the NICs will be updated, and your rules will always be processed first.

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.
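
Another check I find handy is dumping the effective security rules of a NIC via Network Watcher; the deployed Security Admin Rules should appear there alongside the NSG rules, although I’d verify that in your own environment. The NIC and resource group names below are assumptions:

  # Requires the VM to be running; names are assumptions
  Get-AzEffectiveNetworkSecurityGroup -NetworkInterfaceName "vm-app01-nic" -ResourceGroupName "rg-app01" |
    Select-Object -ExpandProperty EffectiveSecurityRules |
    Format-Table Name, Direction, Access, Priority, DestinationPortRange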

Routing Configurations

This is one of the newer features in AVNM; its absence was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this, but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.
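
For context, this is the manual pattern that routing configurations automate: a Route Table with a default route to the hub firewall, associated with every spoke subnet. The names, prefixes, and firewall IP below are invented for illustration:

  # The manual approach that AVNM routing configurations automate (invented names/IPs)
  $rt = New-AzRouteTable -Name "rt-spoke" -ResourceGroupName "rg-spoke" -Location "northeurope"
  $rt | Add-AzRouteConfig -Name "default-to-firewall" -AddressPrefix "0.0.0.0/0" `
    -NextHopType VirtualAppliance -NextHopIpAddress "10.0.0.4" | Set-AzRouteTable | Out-Null

  # Associate the Route Table with a spoke subnet - repeated for every subnet, which is the part that doesn't scale
  $vnet = Get-AzVirtualNetwork -Name "vnet-spoke" -ResourceGroupName "rg-spoke"
  Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "snet-app" `
    -AddressPrefix "10.1.0.0/24" -RouteTable $rt | Set-AzVirtualNetwork | Out-Null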

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

  • One Route Table shared across the virtual networks.
  • One Route Table shared by all subnets in a virtual network.
  • One Route Table per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations become generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in the Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit of bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago, I added a question for the audience to some of my presentations on Azure networking. I asked who was an ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said “the ‘server admins’ are going to understand what I will teach more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

  • Appliances such as routers decide what turn to take at a junction point
  • Firewalls either block or allow packets
  • Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

  • External
  • Management
  • Site-to-site connectivity
  • DMZ
  • Internal
  • Secure zone

Each port connects to a subnet that is a certain network. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that a server in the DMZ network must pass through the firewall, via the cable to the firewall, to get to another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber and having to open a support call to create a network or a subnet. Maybe you need to wait 3 days while some operators plug in cables and run Cisco commands. Or maybe you need to wait weeks because they’ve run out of switch capacity and have to order more. Is this the hosting of the 2000s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.
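
For example, a new virtual network with a subnet is one self-service command away (the names and prefixes below are my own):

  # Self-service networking: no ticket, no cable, just a command
  $subnet = New-AzVirtualNetworkSubnetConfig -Name "snet-app" -AddressPrefix "10.1.0.0/24"
  New-AzVirtualNetwork -Name "vnet-demo" -ResourceGroupName "rg-demo" `
    -Location "northeurope" -AddressPrefix "10.1.0.0/16" -Subnet $subnet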

This is because the physical network of Azure is overlaid with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because it has nothing to do with how packets route at the physical layer – that’s handled by traditional networking and is a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

  • Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
  • The hosts for those virtual machines are running (drumroll please) Hyper-V, which, as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.

This is why you cannot implement GRE networking in Azure.

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to that last question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it as being like a Venn diagram: the circles define what can talk to what. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Once you implement peering, a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly (using encapsulation/decapsulation) to any destination in VNet2, and vice versa, without going through some resource in the virtual networks.

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why using tools like test-netconnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor are very important.
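
For example, instead of tracing hops that are not there, test the actual flow (using the destination address from the earlier example):

  # Test the TCP flow end-to-end rather than looking for intermediate hops
  Test-NetConnection -ComputerName 10.10.1.4 -Port 443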

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we apply what I’ve explained above, we’ll understand what will really happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the destination NIC in one of the spokes, as if they were teleported.

To get the flow that you require for security purposes you need to understand Azure routing and either implement the flow via BGP or User-Defined Routing.

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is their purpose? Oh, I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So what do those NICs in the firewall do? They don’t isolate. They don’t add performance, because the VM size controls overall NIC throughput and speed. They just complicate!

The real reason for all those NICs is to simulate the eth0, eth1, etc. that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall uses a single IP address on the virtual network (via the Standard tier load balancer, though you might notice each compute instance IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets really get from A to B, then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.