Designing A Hub And Spoke Infrastructure

How do you plan a hub & spoke architecture? Based on much of what I have witnessed, I think very few people do any planning at all. In this post, I will explain some essential things to plan and how to plan them.

Rules of Engagement

Microsoft has shared some concepts in the Well-Architected Framework (simplicity) and the documentation for networking & Zero Trust (micro-segmentation, resilience, and isolation).

The hub & spoke will contain networks in a single region, following these concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

Management Groups

I agree with many things in the Cloud Adoption Framework “Enterprise Scale” and I disagree with some other things.

I agree that we should use Management Groups to organise subscriptions based on Policy architecture and role-based access control (RBAC – granting access to subscriptions via Entra groups).

I agree that each workload (CAF calls them landing zones) should have a dedicated subscription – this simplifies operations and governance like you wouldn’t believe.

I can see why they organise workloads based on their networking status:

  • Corporate: Workloads that are internal only and are connected to the hub for on-premises connectivity. No public IP addresses should be allowed where technically feasible.
  • Online: Workloads that are online only and are not permitted to be connected to the hub.
  • Hybrid: This category is missing from CAF and many have added it themselves – WAN and Internet connectivity are usually not binary exclusive OR decisions.

I don’t like how Enterprise Scale buckets all of those workloads into a single grouping because it fails to acknowledge that a truly large enterprise will have many ownership footprints in a single tenant.

I also don’t like how Enterprise Scale merges all hubs into a single subscription or management group. Yes, many organisations have central networking teams. Large organisations may have many networking teams. I like to separate hub resources (not feasible with Virtual WAN) into different subscriptions and management groups for true scaling and governance simplicity.

Here is an example of how one might achieve this. I am going to have two hub & spoke deployments in this example:

  • DUB01: Located in Azure North Europe
  • AMS01: Located in Azure West Europe

Some of you might notice that I have been inspired by Microsoft’s data centre naming for the naming of these regional footprints. The reasons are:

  • Naming regions after “North Europe” or “East US” is messy when you think about naming network footprints in East US2, West US2, and so on.
  • Microsoft has already done the work for us. The Dublin (North Europe) region data centres are called DUB05-DUB15 and Microsoft uses AMS01, etc for Middenmeer (West Europe).
  • A single virtual network may have up to 500 peers. Once we hit 500 peers then we need to deploy another hub & spoke footprint in the region. The naming allows DUB02, DUB03, etc.
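As a rough sketch of how that naming scales, the snippet below works out how many footprints a region needs and what they would be called. It assumes the 500-peering limit quoted above; the function names and the `reserved_hub_peerings` parameter (peerings held back on the hub for hub-to-hub links) are my own illustrations, not anything from Azure.

```python
import math

PEERING_LIMIT = 500  # peerings per virtual network, as quoted above


def footprints_needed(spoke_count, reserved_hub_peerings=0):
    """Number of hub & spoke footprints required for a count of spokes."""
    usable = PEERING_LIMIT - reserved_hub_peerings
    return max(1, math.ceil(spoke_count / usable))


def footprint_names(site_code, spoke_count, reserved_hub_peerings=0):
    """Names such as DUB01, DUB02 for each footprint in a region."""
    count = footprints_needed(spoke_count, reserved_hub_peerings)
    return [f"{site_code}{n:02d}" for n in range(1, count + 1)]


print(footprint_names("DUB", 120))
print(footprint_names("DUB", 501))
```

The first call needs a single footprint; the second spills over into a second one because a hub cannot peer with 501 spokes.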

The change from CAF Enterprise Scale is subtle but look how instantly more scalable and isolated everything is. A truly large organisation can delegate duties as necessary.

If an identity responsible for the AMS01 hub & spoke is compromised, the DUB01 hub & spoke is untouched. Resources are in dedicated subscriptions so the blast area of a subscription compromise is limited too.

There is also a logical placement of the resources based on ownership/location.

You don’t need to recreate policy – you can add more associations to your initiatives.

If an enterprise currently has a single networking team, their IDs are simply added to more groups as new hub & spoke deployments are added.

IP Planning

One of the key principles in the design is simplicity: keep it simple stupid (KISS). I’m going to jump ahead a little here and give you a peek into the future. We will implement “Network segmentation: Many ingress/egress cloud micro-perimeters with some micro-segmentation” from the Azure zero-trust guidance.

The only connection that will exist between DUB01 and AMS01 is a global VNet peering connection between the hubs. All traffic between DUB01 and AMS01 must route via the firewalls in the hubs. This will require some user-defined routing and we want to keep this as simple as possible.

For example, the firewall subnet in DUB01 must have a route(s) to all prefixes in AMS01 via the firewall in the hub of AMS01. The more prefixes there are in AMS01, the more routes we must add to the Route Table associated with the firewall subnet in the hub of DUB01. So we will keep this very simple.

Each hub & spoke will be created from a single IP prefix allocation:

  • DUB01: All virtual networks in DUB01 will be created from 10.1.0.0/16.
  • AMS01: All virtual networks in AMS01 will be created from 10.2.0.0/16.
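A minimal sketch of how one might sanity-check this plan with Python's standard `ipaddress` module: confirm that the footprint blocks never overlap, and work out which footprint a proposed virtual network prefix belongs to. The helper names and the sample VNet prefixes are hypothetical.

```python
import ipaddress

# One block per hub & spoke footprint, as listed above.
FOOTPRINTS = {
    "DUB01": ipaddress.ip_network("10.1.0.0/16"),
    "AMS01": ipaddress.ip_network("10.2.0.0/16"),
}


def overlapping_footprints(footprints):
    """Return pairs of footprint names whose blocks overlap (should be empty)."""
    names = sorted(footprints)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if footprints[a].overlaps(footprints[b])
    ]


def footprint_of(vnet_prefix):
    """Return the footprint that fully contains a VNet prefix, or None."""
    vnet = ipaddress.ip_network(vnet_prefix)
    for name, block in FOOTPRINTS.items():
        if vnet.subnet_of(block):
            return name
    return None


print(overlapping_footprints(FOOTPRINTS))
print(footprint_of("10.1.4.0/26"))
```

A prefix that returns `None` does not belong in either footprint and would break the one-route-per-footprint simplicity.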

You might have noticed that Azure Virtual Network Manager uses a default of /16 for an IP address block in the IPAM feature – how convenient!

That means I only have to create one route in the DUB01 firewall subnet to reach all virtual networks in AMS01:

  • Name: AMS01
  • Prefix: 10.2.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the AMS01 firewall

A similar route will be created in AMS01 firewall subnet to reach all virtual networks in DUB01:

  • Name: DUB01
  • Prefix: 10.1.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the DUB01 firewall

Honestly, that is all that is required. I’ve been doing it for years. It’s beautifully simple.
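To show why the single-prefix allocation keeps routing small, here is a sketch that generates the firewall subnet routes for any footprint: one route per remote footprint, no matter how many spokes exist. The firewall IP addresses are hypothetical placeholders and the dictionary keys are just descriptive labels, not a specific Azure API schema.

```python
import ipaddress

# Firewall IPs below are hypothetical, for illustration only.
FOOTPRINTS = {
    "DUB01": {"prefix": "10.1.0.0/16", "firewall_ip": "10.1.0.4"},
    "AMS01": {"prefix": "10.2.0.0/16", "firewall_ip": "10.2.0.4"},
}


def firewall_subnet_routes(local, footprints):
    """Routes for the firewall subnet of `local`: one per remote footprint."""
    routes = []
    for name, info in sorted(footprints.items()):
        if name == local:
            continue
        ipaddress.ip_network(info["prefix"])  # raises ValueError if malformed
        routes.append({
            "name": name,
            "address_prefix": info["prefix"],
            "next_hop_type": "VirtualAppliance",
            "next_hop_ip_address": info["firewall_ip"],
        })
    return routes


for route in firewall_subnet_routes("DUB01", FOOTPRINTS):
    print(route)
```

Adding a third footprint adds one entry to the dictionary and one route to every other firewall subnet; adding a spoke adds nothing.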

The firewall(s) are in total control of the flows. This design means that neither location is dependent on the other. Neither AMS01 nor DUB01 trust each other. If a workload is compromised in AMS01 its reach is limited to whatever firewall/NSG rules permit traffic. With threat detection, flow logs, and other features, you might even discover an attack using a security information & event management (SIEM) system before it even has a chance to spread.

Workloads/Landing Zones

Every workload will have a dedicated subscription with the appropriate configurations, such as enabling budgets and configuring Defender for Cloud. Standards should be as automated as possible (Azure Policy). The exact configuration of the subscription should depend on the zone (corporate, online, or hybrid).

When there is a virtual network requirement, then the virtual network will be as small as is required with some spare capacity. For example, a workload with a web VM and a SQL Server doesn’t need a /24 subnet!
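A quick sketch of that sizing arithmetic. It relies on two Azure facts: five addresses are reserved in every subnet, and /29 is the smallest subnet you can create. The function name and the `spare` parameter are my own, and some Azure services impose larger minimum subnet sizes that this does not account for.

```python
AZURE_RESERVED_PER_SUBNET = 5   # addresses Azure reserves in every subnet
MIN_HOST_BITS = 3               # /29 is the smallest subnet Azure allows


def smallest_prefix_length(required_ips, spare=0):
    """Smallest subnet prefix length that fits the required IPs plus spare."""
    needed = required_ips + spare + AZURE_RESERVED_PER_SUBNET
    # (needed - 1).bit_length() is ceil(log2(needed)) without float rounding.
    host_bits = max(MIN_HOST_BITS, (needed - 1).bit_length())
    return 32 - host_bits


# A web VM and a SQL Server, with room for two more NICs.
print(smallest_prefix_length(2, spare=2))
```

That workload fits in a /28, which leaves most of a /24 free for fifteen other workloads of the same size.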

Essential Workloads

Are you going to migrate legacy workloads to Azure? Are you going to run Citrix or Azure Virtual Desktop (AVD)? If so, then you are going to require domain controllers.

You might say “We have a policy of running a single ADDS site and our domain controllers are on-premises”. Lovely, at least it was when Windows Server 2003 came out. Remember that I want my services in Azure to be resilient and not to depend on other locations. What happens to all of your Azure services when the network connection to on-premises fails? Or what happens if on-premises goes up in a cloud of smoke? I will put domain controllers in Azure.

Then you might say “We will put domain controllers in DUB01 and AMS01 can use them”. What happens if DUB01 goes offline? That does happen from time to time. What happens if DUB01 is compromised? Not only will I put domain controllers in DUB01, but I will also put them in AMS01. They are low end virtual machines and the cost will be minor. I’ll also do some good ADDS Sites & Services stuff to isolate as much as ADDS lets you:

  • Create subnets for each /16 IP prefix.
  • Create an ADDS site for AMS01 and another for DUB01.
  • Associate each site with the related subnet.
  • Create and configure replication links as required.

The placement and resilience of other things like DNS servers/Private DNS Resolver should be similar.

And none of those things will go in the hub!

Micro-Segmentation

The hub will be our transit network, providing:

  • Site-to-site connectivity, if required.
  • Point-to-site connectivity, if required.
  • A firewall for security and routing purposes.
  • A shared Azure Bastion, if required.

The firewall will be the next hop, by default (expect exceptions) for traffic leaving every virtual network. This will be configured for every subnet (expect exceptions) in every workload.

The firewall will be the glue that routes every spoke virtual network to each other and the outside world. The firewall rules will restrict which of those routes is possible and what traffic is possible – in all directions. Don’t be lazy and allow * to Internet; do you want to automatically enable malware to call home for further downloads or discovery/attack/theft instructions?

The firewall will be carefully chosen to ensure that it includes the features that your organisation requires. Too many organisations pick the cheapest firewall option. Few look at the genuine risks that they face and pick something that best defends against those risks. Allow/deny is not enough any more. Consider the features that pay careful attention to what must be allowed; these are the firewall ports that attackers are using to compromise their victims.

Every subnet (expect exceptions) will have an NSG. That NSG will have a custom low-priority inbound rule to deny everything; this means that no traffic can enter a NIC (from anywhere, including the same subnet) without being explicitly allowed by a higher priority rule.
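To illustrate why that custom deny rule matters, here is a simplified model of inbound NSG evaluation: rules are processed in priority order (lowest number first) and the first match wins. The rule names, prefixes, and priorities are hypothetical, and the built-in AllowVnetInBound rule is approximated with a single prefix rather than the real VirtualNetwork service tag.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int             # lower number is evaluated first
    source: str               # CIDR prefix, or "*" for any source
    dest_port: Optional[int]  # None means any port
    access: str               # "Allow" or "Deny"


RULES = [
    Rule("AllowHttpsFromFirewall", 1000, "10.1.0.0/26", 443, "Allow"),
    Rule("DenyAllCustom", 4000, "*", None, "Deny"),
    # Stand-ins for two of the built-in default rules.
    Rule("AllowVnetInBound", 65000, "10.1.0.0/16", None, "Allow"),
    Rule("DenyAllInBound", 65500, "*", None, "Deny"),
]


def evaluate_inbound(rules, source_ip, dest_port):
    """Return (access, rule name) for the first rule matching the packet."""
    src = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        src_ok = rule.source == "*" or src in ipaddress.ip_network(rule.source)
        port_ok = rule.dest_port is None or rule.dest_port == dest_port
        if src_ok and port_ok:
            return rule.access, rule.name
    return "Deny", "no-match"


print(evaluate_inbound(RULES, "10.1.0.4", 443))
print(evaluate_inbound(RULES, "10.1.5.10", 3389))
```

Remove `DenyAllCustom` from the list and the second packet, RDP from a neighbouring machine, is allowed by the default virtual network rule. That is exactly the lateral movement the custom rule is there to stop.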

“Web” (this covers a lot of HTTPS based services, excluding AVD) applications will not be published on the Internet using the hub firewall. Instead, you will deploy a WAF of some kind (or different kinds depending on architectural/business requirements). If you’re clever, and it is appropriate from a performance perspective, you might route that traffic through your firewall for inspection at layers 4-7 using TLS Inspection and IDPS.

Logging and Alerting

You have placed all the barriers in place. There are two interesting quotes to consider. The first warns us that we must assume a penetration has already taken place or will take place.

Fundamentally, if somebody wants to get in, they’re getting in…accept that. What we tell clients is: Number one, you’re in the fight, whether you thought you were or not. Number two, you almost certainly are penetrated.

Michael Hayden Former Director of NSA & CIA

The second warns us that attackers don’t think like defenders. We build walls expecting a linear attack. Attackers poke, explore, and prod, looking for any way, including very indirect routes, to get from A to B.

Biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.

John Lambert

Each of our walls offers some kind of monitoring. The firewall has logs, which ideally we can either monitor/alert from or forward to a SIEM.

Virtual Networks offer Flow Logs which track traffic at the VNet level. VNet Flow Logs are superior to NSG Flow Logs because they catch more traffic (Private Endpoint) and include more interesting data. This is more data that we can send to a SIEM.

Defender for Cloud creates data/alerts. Key Vaults do. Azure databases do. The list goes on and on. All of this is data that we can use to:

  • Detect an attack
  • Identify exploration
  • Uncover an expansion
  • Understand how an attack started and happened

And it amazes me how many organisations choose not to configure these features in any way at all.

Wrapping Up

There are probably lots of finer details to consider but I think that I have covered the essentials. When I get the chance, I’ll start diving into the fun detailed designs and their variations.

Designing An Azure Hub Virtual Network

In this post, I am going to share a process for designing a hub virtual network for a hub & spoke secured virtual network deployment in Microsoft Azure.

The process I lay out in this document will not work for everyone. I think, based on experience, that very few organisations will find exceptions to this process.

What Is And Is Not In This Post

This post is going to focus on the process of designing a hub virtual network. You will not find a design here … that will come in a later post.

You will also not find any mention of Azure Virtual WAN. You DO NOT need to use Azure Virtual WAN to do SD-WAN, despite the claptrap on Microsoft documentation on this topic. Virtual WAN also:

  • Restricts your options on architecture, features, and network design.
  • Is a nightmare to troubleshoot because the underlying virtual network is hidden in a Microsoft tenant.

Rules Of Engagement

The hub will be your network core in a network stamp: a hub & spoke. The hub & spoke will contain networks in a single region, following these concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

A Hub Design Process

The core of our Azure network will have very little in the way of resources. What can be (not “must be”) included in that hub can be thought of as functions:

  • Site-to-site networking: VPN, ExpressRoute, and SD-WAN.
  • Point-to-site VPN: Enabling individuals to connect to the Azure networks using a VPN client on their device.
  • Firewall: Providing security for ingress, egress, and inter-workload communications.
  • Virtual Machines: Reduce costs of secured RDP/SSH by deploying Azure Bastion in the hub.

If we are doing a high-level design, we have two questions that we will ask about each of these functions:

  • Is the function required?
  • What technology will be used?

We won’t get into tiers/SKUs, features, or configurations just yet; that’s when we get into low-level or detailed design.

One can use the following flow chart to figure out what to use – it’s a bit of an eye test so you might need to open the image in another tab:

Site-to-Site (S2S) Networking

While it is very commonly used, not every organisation requires site-to-site connectivity to Azure.

For example, I had a migration customer that was (correctly) modernising to the “top tier” of cloud computing by migrating from legacy apps to SaaS. They wanted to re-implement an SD-WAN for over 100 offices to connect their new and small Azure footprint. I was the lead designer so I knew their connectivity requirements – they were going to use Azure Virtual Desktop (AVD) only to connect to their remaining legacy apps. AVD doesn’t need a site-to-site connection. I was able to save that organisation from entering into a costly managed SD-WAN services contract and instead focus on Internet connectivity – not long after, they shut down their Azure footprint when SaaS alternatives were found for the last legacy applications.

If we establish that site-to-site connectivity is required then we must ask the first question:

Are latency and SLA important?

If the answer to either of these items is “yes” then there is no choice: An ExpressRoute Virtual Network Gateway is required.

If the answer is no, then we are looking at some kind of VPN connectivity. We can ask another question to determine the type of solution:

Will there be a small number of VPN connections?

If a small number of VPN connections is required, the Azure VPN Virtual Network Gateway is suitable – consider the SKUs/sizes and complexities of management to determine what “a small number” is.

If you determine that the VPN Virtual Network Gateway is unsuitable then an SD-WAN network virtual appliance (NVA) should be used. Note that it would be recommended to deploy Azure Route Server with a third-party VPN/SD-WAN appliance to enable propagation of network prefixes:

  • Azure > SD-WAN
  • SD-WAN > Azure

You may find that you need one or more of the above solutions! For example:

  • Some ExpressRoute customers may opt to deploy a parallel VPN tunnel with an identical routing configuration over a completely different ISP. This enables automatic failover from ExpressRoute to VPN in the event of a circuit failure.
  • An SD-WAN customer may also have ExpressRoute for some offices/workloads where SLA or latency are important. Another consideration may be that one workload has other technical requirements that only ExpressRoute (Direct) can service such as very high throughput.

You have one more question to ask after you have picked the site-to-site component(s):

Will you require site-to-site transit through Azure via the site-to-site network connections?

In other words, should Remote Site A be able to route to Remote Site B using your Azure site-to-site connections? If the answer is yes then you must deploy Azure Route Server to enable that routing.
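The site-to-site questions above can be sketched as a small decision function. This is only my reading of the flow in this section, taking a single path through it; as noted above, a real design may combine more than one of these components. The function and parameter names are hypothetical.

```python
def s2s_components(required,
                   latency_or_sla_important=False,
                   small_number_of_vpn_connections=True,
                   transit_between_sites=False):
    """Return the hub components suggested by the site-to-site questions."""
    if not required:
        return []

    components = []
    if latency_or_sla_important:
        components.append("ExpressRoute Virtual Network Gateway")
    elif small_number_of_vpn_connections:
        components.append("VPN Virtual Network Gateway")
    else:
        components.append("SD-WAN NVA")
        # Recommended alongside a third-party appliance for prefix propagation.
        components.append("Azure Route Server")

    # Remote Site A reaching Remote Site B through Azure needs Route Server.
    if transit_between_sites and "Azure Route Server" not in components:
        components.append("Azure Route Server")
    return components


print(s2s_components(True, latency_or_sla_important=True))
print(s2s_components(True, small_number_of_vpn_connections=False))
```

Writing the questions down this way makes it obvious how few inputs a high-level design really needs at this stage.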

Point-To-Site (P2S) VPN

I personally have not deployed very much of this solution but I do hear it being discussed quite a bit. Some organisations must enable users (or external suppliers) to create a VPN connection from their individual devices to Azure. If this is required then you must ask:

Is the scenario(s) simple?

I’ve kept that vague because the problem is vague. There are two solutions with one being overly-simplistic in capabilities and the other being more fully-featured.

The Azure VPN Gateway (also used for site-to-site VPN) offers a highly available (Azure resource) solution for P2S VPN. It offers different configurations for authentication and device support. But it is very limited. For example, it has no routing rules to restrict which users get access to which networks. This means that if you grant network (firewall/NSG) access to one user via the VPN address pool, you must grant the same access to all users, which is clearly pretty poor if you have many types/roles of remote VPN clients (IT, developer of workload X, developer of workload Y, Vendor A, Vendor B, etc).

In such scenarios, one should consider a third-party NVA for point-to-site networking. Third-party NVAs may offer more features for P2S VPN than the VPN Virtual Network Gateway.

A P2S NVA may reside in the same hub as a VPN Virtual Network Gateway (and other S2S solutions).

It’s not in the diagram but you should also consider Entra Global Secure Access as an alternative to P2S VPN. The Private Network Connector would be deployed in a spoke(s), not the hub.

Firewall

Is a firewall required? The correct answer for anyone considering a hub & spoke architecture should be “of course it is”. But you might not like security, so we’ll ask that question anyway.

Once you determine that security is important to your employer, you must ask yourself:

Shall I use a native PaaS firewall?

The native PaaS solution in Azure is Azure Firewall. I have many technical reasons to prefer Azure Firewall over third-party alternatives. For consultants, a useful attribute of Azure Firewall is that you can skill up on one solution that you can implement/use/manage for many customers, and projects (migrations) won’t face repeated delays as you wait on others to implement rules in third-party firewalls.

If you want to use a different firewall then you are free to do so.

If you are using Azure Firewall then there is a follow-up question if there will be S2S network connections:

Are the remote networks using non-RFC1918 address prefixes?

In other words, do the remote networks use address prefixes outside of:

  • 192.168.0.0/16
  • 172.16.0.0/12
  • 10.0.0.0/8

If they do then Azure Firewall requires some configuration because traffic to non-RFC1918 prefixes is forced to the Internet by default – they are Internet addresses after all! You can statically configure the prefixes if they do not change. Or …

  • If you are using Azure Route Server
  • The prefixes can change a lot thanks to scenarios such as acquisition or rapid growth

… you can (in preview today) configure integration between Azure Firewall and Azure Route Server so the firewall dynamically learns the address prefixes from the remote networks.
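Answering the RFC1918 question is easy to automate. The sketch below flags any remote prefix that is not fully inside the three private ranges listed above; those are the prefixes that would need the extra firewall configuration. The sample prefixes are made up. Note that I compare against the three ranges explicitly rather than using `ipaddress`'s `is_private` property, which covers more than RFC1918.

```python
import ipaddress

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def non_rfc1918_prefixes(remote_prefixes):
    """Return the remote prefixes that fall outside the RFC1918 ranges."""
    flagged = []
    for prefix in remote_prefixes:
        network = ipaddress.ip_network(prefix)
        if not any(network.subnet_of(block) for block in RFC1918):
            flagged.append(prefix)
    return flagged


# Hypothetical prefixes learned from remote sites.
print(non_rfc1918_prefixes(["10.10.0.0/16", "172.32.0.0/16", "100.64.0.0/10"]))
```

Here `172.32.0.0/16` is flagged because the private range ends at 172.31.255.255, a mistake that is easy to make by eye.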

Virtual Machines

Do not put compute in the hub!

This scenario asks:

Will any of the workloads in your spoke virtual networks have virtual machines?

You will have virtual machines even if you “ban” virtual machines – I guarantee that they will eventually appear for things like security solutions, self-hosted agents, Azure Virtual Desktop, AKS, and so on.

Unfortunately, many consider secure remote access (SSH/RDP) to be opening a port in the firewall for TCP 22/3389. That is not considered secure because those protocols can be and have been attacked. In the past, those who took security seriously used a dedicated “jump box” or “bastion host” to isolate vulnerable on-premises machines from assets in the data centre. We can use the same process with Azure Bastion where there is no IaaS requirement – we leverage Entra security features to authenticate the connection request and the guest OS credentials to verify VM access.

One can deploy Bastion in a spoke – that is perfectly valid for some scenarios. However, many important features are only in the paid-for SKUs so you might wish to deploy a shared Azure Bastion. Unfortunately, routing restrictions by Bastion prevent deploying a shared Bastion in a spoke, so we have no choice but to deploy a shared Azure Bastion in a hub. If you wish to share an Azure Bastion across workloads then it will be the final component in the hub.

If/when Azure Bastion supports route tables in the AzureBastionSubnet I will recommend moving shared Bastion deployments to a spoke – yes, I know that we can do that with Azure Virtual WAN but there are many things that we cannot do with Azure Virtual WAN.

You could consider a third-party alternative or a DIY bastion solution. If so, place that into a spoke because it will be compute-based.

Wrapping Up

As you can see, the high-level design of the hub is very simple.

There are few functions in it because when you understand Azure virtual networks, routing, and NSGs, then you understand that designing a secure network should not be complex. Complexity is the natural predator of manageability and dependable security. There is a little more detail when we get into a low-level or detailed design, but that’s a topic for another day.

Micro-Segmentation Security In Azure Networks

In this post, I want to discuss the importance of designing and implementing micro-segmentation in Azure networks.

Repeating The Same Mistakes

In 2002-2003, the world was being hammered by malware. So much so, that Microsoft did a reset on their Windows development processes and effectively built a new version of Windows XP with Windows XP Service Pack 2. The main security feature of that release was the Windows Firewall – the purpose of this was to isolate each Windows machine in the network by default. It’s a pity that nearly every Windows admin then used Group Policy to disable the Windows Firewall!

Times have moved on and so have the bad guys. Malware isn’t just an anarchist or hobby activity. Malware is a billion-dollar business (ransomware/data theft) and a military activity. Naturally, defences have evolved .. wait .. no … most admins/consultants are still deploying networks that your Daddy/Mommy deployed 22 years ago but I’ll deal with that in another post.

Instead, I want to discuss a part of the defensive solution: micro-segmentation.

Assume Penetration

We must assume that the attacker will always find a way in. Not every attack will be by Sandra Bullock clicking some magical symbol on a website to penetrate the firewall. Most attacks have relatively simple vectors such as stealing a password, hash hijacking, or getting an accountant to open a PDF. Determined attackers aren’t just “driving by”; they will look for an entry. Maybe it’s malware in vendor software that you will deploy! Maybe, it’s a vulnerability in open-source software that your developers will deploy via GitHub? Maybe a managed service provider’s Entra ID tenant has been penetrated and they have Lighthouse access to your Azure subscriptions? Each of those examples bypasses your firewall and any advanced scanning features that it may have. How do you stop them?

Micro-Segmentation

Let me conjure an image for you. A submarine is on patrol. It has a wartime mission. The submarine is always under orders to continue that mission. The submarine is detected by the enemy and is attacked. The attack causes damage which creates a flood. If left unchecked, the flood will sink the ship. What happens? The crew is trained to isolate the flood by sealing the leaking compartment – doors are slammed, seals are locked, and the water is contained in that compartment. Sure, the sailors and ship functions in that compartment are dead, but the ship can continue its mission.

That is a way to visualise micro-segmentation.

Microsoft Zero-Trust

Microsoft has a relatively small collection of documentation on zero-trust architecture for Azure. There are 3 useful bullet points:

  • Be ready to handle attacks before they happen.
  • Minimize the extent of the damage and how fast it spreads.
  • Increase the difficulty of compromising your cloud footprint.

Let’s expand on that a little.

Be Ready

You will be ready for an attack because you assume that you already are under attack. You don’t wait to deploy security systems and configurations; you design them with your workloads. You deploy security with your workloads. You maintain security with your workloads.

Increase The Difficulty of Compromising Your Cloud Footprint

You should put in the defences that are appropriate to your actual risks and ability to install/manage. A bad example is a medical organisation choosing a more affordable firewall to save a few bucks – this is the sort of organisation that will be targeted.

Minimise The Extent of Damage

This can also be referred to as minimising the blast zone. You want to limit how much damage the bad guys cause, just like the submarine limited flooding to the damaged compartment. This means that we make it harder to get from any one point on the network to the next.

It’s one thing to put in the security defences, but you must also:

  • Enable/configure the security features: it shocks me how many organisations/consultants opt not to or don’t know how to enable essential features in their security solution.
  • Monitor your security systems: If we assume that the attacker will get in, then we should monitor our security features to detect and shut down the attack. Again, I’m shocked every time I see security features in Azure that have no logging or alerting enabled.

Microsoft lays out a path to zero-trust where step number one is network segmentation. The basic pattern is laid out:

Applications are partitioned to different Azure Virtual Networks (VNets) and connected using a hub-spoke model

Microsoft uses the term “application”. I prefer the term “workload”. Some, like ITIL, might use the term “service”. A workload is a collection of resources that work together to provide a service to or for the organisation. Maybe it’s a bunch of Azure resources that create a retail site. Maybe it’s a CRM system. Maybe it’s an identity management & governance workload.

The pattern that Microsoft is recommending is one that I have been promoting through my employer for the last 6 years. Each workload gets a dedicated “small” virtual network. The workload VNet is peered with a hub (and only the hub by default). The hub firewall provides isolation and deeper inspection than NSGs can offer.

Step 4 tells us:

Fully distributed ingress/egress cloud micro-perimeters and deeper micro-segmentation

NSGs micro-segment the single or small set of subnet(s) in the VNet, restricting resource-to-resource connections to just what is required. Isolation is now done centrally and at the NIC, thanks to NSGs. You should also consider network protections on PaaS resources such as Storage Accounts or Key Vaults.

If we revisit the submarine comparison, the workload-specific virtual network is one of the compartments in the boat. If there is a leak (an attack), the NSGs limit or slow down expansion in the subnet(s). The firewall isolates the workload/compartment from other workloads/compartments and the Internet by default to prevent command and control or downloads by the attacker. Deeper firewall inspection searches for attack patterns.

Don’t Forget Monitoring

Microsoft zero-trust has more than just networking. One other step I want to highlight is monitoring/alerting because it ties into the micro-segmentation features of networking. Consider the mechanisms we can put in place:

  • PaaS resource firewalls with logging
  • NSG with VNet Flow Logging
  • (Azure) Firewall with logging for firewall rules and deep inspection features (Azure Firewall has Threat Intelligence and IDPS).

Each of those barriers or detection systems can be thought of as a string with a bell on it. The attacker will tickle or trip over those strings. If the bell rings, we should be paying attention. When you fail to put in the barriers or configure monitoring then you don’t know that the attacker is there doing something – and we assume that the attacker will get in and do something – so aren’t we failing to do our job?

It’s Not Just Me Telling You

You can say “There goes Aidan, rattling on about micro-segmentation. Why should I listen to him?”. It would be one thing if it were just me sharing my opinion on Azure network security but what if others told you to do the same things?

Microsoft tells you to implement micro-segmentation. The US NSA tells you to do it. The Canadian Centre for Cyber Security tells you to do it. The UK NCSC tells you to do it. I could keep googling (binging, of course) national security agencies and I’d find the same recommendation with each result. If you are not implementing this security technique designed for today’s threats (not for the Blaster worm of 2003) then you are not only not doing your job but you are choosing to leave the door open for attackers; that could be viewed very poorly by employers, by shareholders, or by informed compliance auditors.

How Many Azure Route Tables Should I Have?

In this Azure Networking deep dive, I’m going to share some of my experience around planning the creation and association of Route Tables in Microsoft Azure.

Quick Recap

The purpose of a Route Table is to apply User-Defined Routes (UDRs). The Route Table is associated with a subnet. The UDRs in the Route Table are applied to the NICs in the subnet. The UDRs override System and/or BGP routes to force routes on outgoing packets to match your desired flows or security patterns.

Remember: There are no subnets or default gateways in Azure; the NIC is the router and packets go directly from the source NIC to the destination NIC. A route can be used to alter that direct flow and force the packets through a desired next hop, such as a firewall, before continuing to the destination.
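To make the recap concrete, here is a minimal Python sketch of how a NIC might pick a route. It assumes the commonly documented behaviour: the longest matching prefix wins, and when prefixes are equal a UDR beats a BGP route, which beats a System route. The Route class, the next hop labels, and the address ranges are all made up for illustration; this is a model, not Azure's implementation.

```python
import ipaddress
from dataclasses import dataclass

# Assumed tie-break when prefixes are equal: User (UDR) beats BGP, BGP beats System.
SOURCE_RANK = {"User": 0, "BGP": 1, "System": 2}


@dataclass(frozen=True)
class Route:
    prefix: str    # destination address prefix, e.g. "10.0.0.0/24"
    source: str    # "User", "BGP" or "System"
    next_hop: str  # illustrative label only


def effective_route(routes, destination_ip):
    """Return the route a NIC would use for destination_ip, or None."""
    ip = ipaddress.ip_address(destination_ip)
    candidates = [r for r in routes if ip in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    # Longest prefix wins; the route source only breaks ties.
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, SOURCE_RANK[r.source]),
    )


# Hypothetical spoke NIC: its own VNet, a peered hub, and the default Internet route.
SYSTEM_ROUTES = [
    Route("10.1.0.0/24", "System", "VirtualNetwork"),
    Route("10.0.0.0/24", "System", "VNetPeering"),
    Route("0.0.0.0/0", "System", "Internet"),
]
FIREWALL_UDR = Route("0.0.0.0/0", "User", "Firewall 10.0.0.4")

print(effective_route(SYSTEM_ROUTES, "8.8.8.8").next_hop)
print(effective_route(SYSTEM_ROUTES + [FIREWALL_UDR], "8.8.8.8").next_hop)
print(effective_route(SYSTEM_ROUTES + [FIREWALL_UDR], "10.0.0.20").next_hop)
```

In this model the three lines print Internet, Firewall 10.0.0.4, and VNetPeering. The last one is the interesting result: a 0.0.0.0/0 UDR does not capture traffic to the peered hub because the System route for the peering has a longer prefix.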

Route Table Association

A Route Table is associated with one or more subnets. The purpose of this is to cause the UDRs of the Route Table to be deployed to the NICs that are connected to the subnet(s).

Technically speaking, there is nothing wrong with associating a single Route Table with more than one subnet. But I would question the wisdom of this practice.

1:N Association

The concept here is that one creates a single Route Table that will be used across many subnets. The desire is to reduce effort – there is no cost saving because Route Tables are free:

  1. You create a Route Table
  2. You add all the required UDRs for your subnets
  3. You associate the Route Table with the subnets

It all sounds good until you realise:

  • That individual subnets can require different routes. For example a simple subnet containing some compute might only require a route for 0.0.0.0/0 to use a firewall as a next hop. On the other hand, a subnet containing VNet-integrated API Management might require 60+ routes. Your security model at this point can become complicated, unpredictable, and contradictory.
  • Centrally managing network resources, such as Route Tables, for sharing and “quality control” contradicts one of the main purposes of The Cloud: self-service. Watch how quickly the IT staff that the business does listen to (the devs) rebel against what you attempt to force upon them! Cloud is how you work, not where you work.
  • Certain security models won’t work.

1:1 Association

The purpose of 1:1 association is to:

  • Enable granular routing configuration; routes are generated for each subnet depending on the resource/networking/security requirements of the subnet.
  • Enable self-service for developers/operators.

The downside is that you can end up with a lot of Route Tables – keep in mind that some people create too many subnets. One might argue that this is a lot of effort but I would counter that by saying:

  • I can automate the creation of Route Tables using several means including infrastructure-as-code (IaC), Azure Policy, or even Azure Virtual Network Manager (with its new per-VNet pricing model).
  • Most subnets will have just one UDR: 0.0.0.0/0 via the firewall.

What Do I Do & Recommend?

I use the approach of 1:1 association. Each subnet, subject to support, gets its own Route Table. The Route Table is named after the VNet/subnet and is associated only with that subnet.
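As a rough illustration of why the 1:1 approach is cheap to automate, here is a Python sketch that generates one Route Table description per subnet. The naming pattern, the "associatedSubnet" key, and the firewall IP are hypothetical choices of mine; the route property names simply mirror the ones used for routes in ARM/Bicep. In real life this logic would live in your IaC, not in a script like this.

```python
def route_table_for(vnet_name, subnet_name, firewall_ip, extra_routes=()):
    """Describe one Route Table that is dedicated to a single subnet."""
    routes = [
        {
            "name": "default-via-firewall",
            "addressPrefix": "0.0.0.0/0",
            "nextHopType": "VirtualAppliance",
            "nextHopIpAddress": firewall_ip,
        }
    ]
    # Subnets with special needs get their extra routes here.
    routes.extend(extra_routes)
    return {
        "name": f"{vnet_name}-{subnet_name}-rt",              # hypothetical naming pattern
        "associatedSubnet": f"{vnet_name}/{subnet_name}",     # hypothetical key, for illustration
        "routes": routes,
    }


def route_tables_for(vnet_name, subnet_names, firewall_ip):
    """One Route Table per subnet: the 1:1 association."""
    return [route_table_for(vnet_name, name, firewall_ip) for name in subnet_names]


tables = route_tables_for("app1-vnet", ["frontend", "backend"], "10.0.0.4")
for table in tables:
    print(table["name"], "->", table["associatedSubnet"])
```

Each subnet ends up with its own table and, in the common case, a single route. A subnet with unusual requirements can be given extra routes without touching any other subnet.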

I’ve been using that approach for as long as I can remember. It was formalised 6 years ago and it has worked at scale. As I stated, it’s no effort because the creation/association of the Route Tables is automated. The real benefit is the predictability of the resulting security model.

Routing Is The Security Cabling of Azure

In this post, I want to explain why routing is so important in Microsoft Azure. Without truly understanding routing, and implementing predictable and scaleable routing, you do not have a secure network. What one needs to understand is that routing is the security cabling of Azure.

My Favourite Interview Question

Now and then, I am asked to do a technical interview of a new candidate at my employer. I enjoy doing technical interviews because you get to have a deep tech chat with someone who is on their career journey. Sometimes it is a hopeful youngster who is still new to the business but demonstrates an ability and a desire to learn – they’re a great find by the way. Sometimes it’s a veteran that you learn something from. And sometimes, they fall into the trap of discussing my favourite Azure topic: routing.

Before I continue, I should warn potential interviewees that the thing I dislike most in a candidate is when they talk about things that “happened while I was there” and then they claim to be experts in that stuff.

The candidate will say “I deployed a firewall in Azure”. The little demon on my shoulder says “ask them, ask them, ASK THEM!”. I can’t help myself – “How did you make traffic go through the firewall?”. The wrong answer here is: “it just did”.

The Visio Firewall Fallacy

I love diagrams like this one:

Look at that beauty. You’ve got Azure networks in the middle (hub) and the right (spoke). And on the left is the remote network connected by some kind of site-to-site networking. The deployment even has the rarely used and pricey Network SKU of DDoS protection. Fantastic! Security is important!

And to re-emphasise that security is important, the firewall (it doesn’t matter what brand you choose in this scenario) is slap-bang in the middle of the whole thing. Not only is that firewall important, but all traffic will have to go through it – nothing happens in that network without the firewall controlling it.

Except that the firewall is seeing absolutely no traffic at all.

Packets Route Directly From Source To Destination

At this point, I’d like you to (re-)read my post, Azure Virtual Networks Do Not Exist. There I explained two things:

  • Everything is a VM in the platform, including NVA routers and Virtual Network Gateways (2 VMs).
  • Packets always route directly from the source NIC to the destination NIC.

In our above firewall scenario, let’s consider two routes:

  • Traffic from a client in the remote site to an Azure service in the spoke.
  • A response from the service in the Azure spoke to the client in the remote site.

The client sends traffic from the remote site across the site-to-site connection. The physical part of that network is the familiar flow that you’d see in tracert. Things change once that packet hits Azure. The site-to-site connection terminates in the NVA/virtual network gateway. Now the packet needs to route to the service in the spoke. The scenario is that the NVA/virtual network gateway is the source (in Azure networking) and the spoke service is the destination. The packet leaves the NIC of the NVA/virtual network gateway and routes directly (via the underlying physical Azure network) to the NIC of one of the load-balanced VMs in the spoke. The packet did not route through the firewall. The packet did not go through a default gateway. The packet did not go across some virtual peering wire. Repeat it after me:

Packets route directly from source to destination.

Now for the response. The VM in the spoke is going to send a response. Where will that response go? You might say “The firewall is in the middle of the diagram, Aidan. It’s obvious!”. Remember:

Packets route directly from source to destination.

In this scenario, the destination is the NVA/virtual network gateway. The packet will leave the VM in the spoke and appear in the NIC of the NVA/virtual network gateway.

It doesn’t matter how pretty your Visio is (Draw.io is a million times better, by the way – thanks for the tip, Haakon). It doesn’t matter what your intention was. Packets … route directly from source to destination.
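If it helps, here is a tiny Python model of the two flows described above. It only looks at UDRs that are applied to the sender and deliberately ignores System and BGP routes, which matter a lot in a real design. The names, address ranges, and the idea of keying Route Tables by sender are my own simplifications.

```python
import ipaddress


def path(source, destination, destination_ip, udrs_by_source):
    """List the hops from source to destination in this simplified model.

    udrs_by_source maps a sender name to {prefix: next_hop_name}.
    """
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in udrs_by_source.get(source, {}).items()
        if ip in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return [source, destination]  # no UDR applies: straight from NIC to NIC
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return [source, hop, destination]


SPOKE_VM_IP = "10.1.0.4"           # hypothetical spoke workload
REMOTE_CLIENT_IP = "192.168.1.10"  # hypothetical on-premises client

no_udrs = {}
with_udrs = {
    "gateway": {"10.1.0.0/24": "firewall"},  # Route Table on the gateway's subnet
    "spoke-vm": {"0.0.0.0/0": "firewall"},   # Route Table on the spoke subnet
}

for label, udrs in (("no UDRs", no_udrs), ("with UDRs", with_udrs)):
    print(label, path("gateway", "spoke-vm", SPOKE_VM_IP, udrs))
    print(label, path("spoke-vm", "gateway", REMOTE_CLIENT_IP, udrs))
```

Without UDRs, both directions are two-hop lists: the firewall in the diagram never appears. The firewall only shows up in a path when the sender of that packet has a route that forces it there, and each direction is decided independently by its own sender.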

User-Defined Routes – Right?

You might be saying, “Duh, Aidan, User-Defined Routes (UDRs) in Route Tables will solve this”. You’re sort of on the right track – maybe even mostly there. But I know from talking to many people over the years that they completely overlook that there are two (I’d argue three) other sources of routes in Azure. Those other routes are playing a role here that you’re not appreciating and if you do not configure your UDRs/Route Tables correctly you’ll either change nothing or break your network.

Routing Is The Security Cabling of Azure

In the on-premises world, we use cables to connect network appliances. You can’t get from one top-of-rack switch/VLAN to another without going through a default gateway. That default gateway can be a switch, a switch core, a router, or a firewall. Connections are made possible via cables. Just like water flow is controlled by pipes, packets can only transit cables that you lay down.

If you read my Azure Virtual Networks Do Not Exist post then you should understand that NICs in a VNet or in peered VNets are a mesh of NICs that can route directly to each other. There is no virtual network cabling; this means that we need to control the flows via some other means and that means is routing.

One must understand the end state, how routing works, and how to manipulate routing to end up in the desired end state. That’s the obvious bit – but often overlooked is that the resulting security model should be scaleable, manageable, and predictable.

How Do Network Security Groups Work?

A Greek Phalanx, protected by a shield wall made up of many individuals working under 1 instruction as a unit – like an NSG.

Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).

Port ACLs

In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a mostly overlooked-by-then-customers feature called Port ACLs. I liked Port ACLs; it was mostly unknown, could only be managed using PowerShell and made for great demo content in some TechEd/Ignite sessions that I did back in the day.

Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.

The concept of Port ACLs was that it gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.
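The HTTPS example can be sketched in a few lines of Python. This is only a model of first-match rule processing; I am borrowing the NSG convention that a lower priority number is processed earlier, and the Rule class, the priority numbers, and the final fallback are assumptions for illustration rather than how Hyper-V stores Port ACLs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int         # lower number = processed earlier (NSG convention)
    protocol: str         # "TCP", "UDP" or "*" for any
    port: Optional[int]   # None means any port
    action: str           # "Allow" or "Deny"


def evaluate(rules, protocol, port):
    """Return the action of the first rule that matches the inbound flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.protocol in ("*", protocol) and rule.port in (None, port):
            return rule.action
    return "Deny"  # assumed fallback; the explicit deny-all below catches everything anyway


inbound_rules = [
    Rule(priority=4000, protocol="*", port=None, action="Deny"),   # low priority: block all
    Rule(priority=100, protocol="TCP", port=443, action="Allow"),  # high priority: HTTPS
]

print(evaluate(inbound_rules, "TCP", 443))
print(evaluate(inbound_rules, "TCP", 3389))
```

The first call returns Allow and the second returns Deny. Nothing inside the guest OS takes part in that decision, which is the whole point.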

Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.

A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.

NSG and Admin Rules Are Port ACLs

Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.

What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.

A Little About NSG Rules

This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.

A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (deny all by default) to control egress traffic.

When I create a NSG I can associate it with:

  • A NIC: Only that NIC is affected
  • A subnet: All NICs, including VNet-integrated PaaS resources and Private Endpoints, are affected

The association is simply a management scaling feature. When you associate a NSG with a subnet the rules are not processed at the subnet.

Tip: virtual networks do not exist!

Associating a NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.

This means:

  • Inbound rules prevent traffic from entering the virtual machine.
  • Outbound rules prevent traffic from leaving the virtual machine.

Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.

If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.

If you hate yourself and everyone around you, then use both options at the same time:

  • The subnet NSG is processed first for inbound traffic.
  • The NIC NSG is processed first for outbound traffic.
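To see why using both associations hurts, here is a small Python sketch of the processing order. Each NSG is reduced to a set of allowed ports, which is a big simplification of real NSG rules, and the function name and trace format are mine.

```python
def evaluate(direction, subnet_allowed_ports, nic_allowed_ports, port):
    """Return (verdict, trace) for a flow that meets both a subnet NSG and a NIC NSG."""
    stages = [("subnet NSG", subnet_allowed_ports), ("NIC NSG", nic_allowed_ports)]
    if direction == "Outbound":
        stages.reverse()  # the NIC NSG is processed first on the way out
    trace = []
    for name, allowed_ports in stages:
        verdict = "Allow" if port in allowed_ports else "Deny"
        trace.append((name, verdict))
        if verdict == "Deny":
            return "Deny", trace  # the flow must pass both NSGs
    return "Allow", trace


subnet_nsg = {443}       # hypothetical: only HTTPS
nic_nsg = {443, 22}      # hypothetical: HTTPS and SSH

print(evaluate("Inbound", subnet_nsg, nic_nsg, 22))
print(evaluate("Outbound", subnet_nsg, nic_nsg, 22))
```

The NIC NSG says SSH is fine, yet the flow is denied in both directions because the subnet NSG also has to agree. Now imagine troubleshooting that across hundreds of NICs.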

Keep it simple, stupid (the KISS principle).

Micro-Segmentation

As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:

  • We zone using NSG rules.
  • Virtual networks and their subnets do not exist!

The only time we need to create additional subnets is when there are compatibility issues such as NSG/Route table association or a PaaS resource requires a dedicated subnet.

Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.

Azure Firewall Deep Dive Training

I’ll tell you about my new virtual training course on Azure Firewall and share some schedule information in this post.

Background

I’ve been talking about Azure Firewall for years. I’ve done lots of sessions at user groups and conferences. I’ve done countless handovers with customers and colleagues. One of my talking points is that I reckoned that I could teach someone with a little Azure/networking knowledge everything there is to know about Azure Firewall in 2 days. And that’s what I decided to do!

I was updating one of my sessions earlier in the year when I realised that it was pretty much the structure of a training course. Instead of me just listing out features or barely discussing architecture to squeeze it into a 45-60 minute-long session, I could take the time to dive deep and share all that I know or could research.

The Course

I produced a 2-day course that could be taught in-person, but my primary vector is virtual/online – it’s hard to get a bunch of people from all over into one place and there is also a cost to me in hosting a physical event that would increase the cost of the course. I decided that virtual was best, with an option of doing it in person if a suitable opportunity arose.

The course content is delivered using a combination of presentation and demo. Presentation lets me explain the what’s, why’s and so on. Demonstration lets me show you how.

The demo lab is built from a Bicep deployment, based on Azure Verified Modules (AVM). A hub & spoke network architecture is created with an Application Gateway, a simple VM workload, and a simple App Services (Private Endpoint) workload. The demonstrations follow a “hands-on guide”; this guide is written as if this was a step-by-step hands-on course, instructing the reader exactly which button to click and what/where to type. Each exercise builds on the last, eventually resulting in a secure network architecture with all of the security, monitoring, and management bells and whistles.

Why did I opt for demo instead of hands-on? Hands-on works for in-person classes. But you cannot assist in the same way when people struggle. In addition, waiting for attendees to complete labs would add another day (and cost) to the class.

Before a class, I share all of the content that I use:

  • System requirements and setup instructions.
  • The Bicep files for the demo lab.
  • The hands-on lab instructions.
  • The PowerPoint.
  • And a few more useful bits.

I always update content – for example, my first run of this class was during Microsoft Ignite 2024 and I added a few bits from the news. Therefore I share the updated content with attendees after the course.

The First Run

I ran the class for the first time earlier this week, November 20-21, 2024. Attendees from all around Europe joined me for 2 days. At first they were quiet. Online is tough for speakers like me because I look for visual feedback on how I’m doing. But then the questions started coming – people were interested in what I was saying. Interaction also makes the class more interesting for me – sometimes you get comments that cover things you didn’t originally include and everyone benefits – I updated the course with one such item at the end of day 1!

I shared a 4-question anonymous survey to learn what people thought. The feedback was awesome.

Feedback

This course was previously run in November 2024 for a European audience. The survey feedback was as follows:

How would you rate this course?

  • Excellent: 83%
  • Good: 17%

Was This Course Worth Your Time?

  • Yes: 100%

Would you recommend this course to others?

  • Yes: 100%

Some of the comments:

“I think it is a very good introduction to Azure Firewall, but it goes beyond foundational concepts so medium-experienced admins will also get value from this. I like the sections on architecture and explanations of routing and DNS. I think this course will enable people to do a good job more than for example az 700 because of the more practical approach. You are good at explaining the material”.

“Just what I wanted from a Deep dive course.”

“Perfectly delivered. Crystal clear content and very well explained”.

Future Classes

I have this class scheduled for two more runs, each timed for different parts of the world:

The classes are ultra-affordable. A few hundred Euros/dollars gets you custom content based on real-world usage. I did find a virtual 2-day course on Palo Alto firewalls that cost $1700! You’ll also find that I run early-bird registration costs and discounts for more than 1 booking. If you have a large group (5+) then we might be able to figure out a lower rate 🙂

More To Come

More classes are coming! I have an old one to reinvent based on lots of experience over the years and at least 1 new one to write from scratch. Watch out for more!

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are now many more networks and subnets than there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

  • Network groups: A way to identify virtual networks or subnets that we want to manage.
  • Connectivity configurations: Manage how multiple virtual networks are connected.
  • Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
  • Routing configurations: Deploy user-defined routes by policy.
  • Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

  1. Identify a collection of networks/subnets you want to configure by creating a Network Group.
  2. Build a configuration, such as connectivity, security admin rules, or routing.
  3. Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Part of a scalable configuration feature of AVNM is network groups. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have a large collection of targets for one configuration deployment.

Network Groups can be:

  • Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
  • Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?
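The selection logic behind a dynamic group is easy to picture. Here is a minimal Python sketch that filters a list of virtual networks by tags and region. In AVNM the real mechanism is an Azure Policy definition, not code like this; the VNet names, tags, and function name are all hypothetical.

```python
def dynamic_members(vnets, required_tags, regions):
    """Return the names of virtual networks that match every tag and sit in one of the regions."""
    members = []
    for vnet in vnets:
        tags = vnet.get("tags", {})
        tags_match = all(tags.get(key) == value for key, value in required_tags.items())
        if vnet["location"] in regions and tags_match:
            members.append(vnet["name"])
    return members


# Hypothetical inventory of virtual networks.
vnets = [
    {"name": "app1-vnet", "location": "northeurope", "tags": {"network": "spoke", "env": "prod"}},
    {"name": "app2-vnet", "location": "westeurope", "tags": {"network": "spoke", "env": "prod"}},
    {"name": "hub-vnet", "location": "northeurope", "tags": {"network": "hub"}},
    {"name": "untagged-vnet", "location": "northeurope"},
]

print(dynamic_members(vnets, {"network": "spoke"}, {"northeurope"}))
```

Only app1-vnet qualifies here. A new spoke that is created tomorrow in North Europe with the same tag would match the same query without anyone editing the group.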

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

  • Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
  • Full mesh: Every virtual network is connected directly to every other virtual network.
  • Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! It’s something different that can handle the scale of many virtual networks being interconnected called a Connected Group. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.
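A quick bit of arithmetic shows why 1-1 peering would not scale for a mesh. This Python sketch counts connections between pairs of virtual networks; it assumes one connection per pair and ignores the fact that a peering is configured from both sides.

```python
def hub_and_spoke_links(spokes):
    """Each spoke connects only to the hub."""
    return spokes


def full_mesh_links(networks):
    """Every virtual network connects to every other one: n * (n - 1) / 2 pairs."""
    return networks * (networks - 1) // 2


for count in (5, 20, 100):
    print(count, hub_and_spoke_links(count), full_mesh_links(count))
```

With 20 spokes a hub-and-spoke needs 20 connections, while 20 meshed networks would need 190 and 100 meshed networks would need 4,950. One configuration for the whole group is a lot more appealing than that.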

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs but they are always processed first. You can create a rule or a set of rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.
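Here is a rough Python sketch of that "processed first" behaviour. It assumes the three actions as I understand them from the AVNM documentation: Deny stops the flow, Always Allow permits it regardless of NSG rules, and Allow hands the decision over to the NSG. Rules are simplified to a priority, a port, and an action; the dictionary layout and function names are mine.

```python
def first_match(rules, port):
    """Return the first rule, in priority order, that matches the port."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["port"] is None or rule["port"] == port:
            return rule
    return None


def evaluate(admin_rules, nsg_rules, port):
    """Return (verdict, decided_by) with security admin rules processed before NSG rules."""
    admin = first_match(admin_rules, port)
    if admin is not None and admin["action"] in ("Deny", "AlwaysAllow"):
        verdict = "Deny" if admin["action"] == "Deny" else "Allow"
        return verdict, "security admin rule"
    # No admin rule matched, or it was a plain "Allow": the NSG decides.
    nsg = first_match(nsg_rules, port)
    if nsg is not None:
        return nsg["action"], "NSG rule"
    return "Deny", "no matching rule"  # assumed fallback for this model


admin_rules = [{"priority": 10, "port": 3389, "action": "Deny"}]
nsg_rules = [
    {"priority": 100, "port": 3389, "action": "Allow"},  # a local exception someone added
    {"priority": 200, "port": 443, "action": "Allow"},
]

print(evaluate(admin_rules, nsg_rules, 3389))
print(evaluate(admin_rules, nsg_rules, 443))
```

The NSG tries to allow RDP, but the admin rule wins and the flow is denied. HTTPS has no matching admin rule so the NSG gets to decide.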

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

  • One is shared with virtual networks.
  • One is shared with all subnets in a virtual network.
  • One per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations go generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!
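To put a number on it, here is the arithmetic as a short Python sketch. It uses the $73 per subscription per month figure quoted above; the subscription counts are made-up examples.

```python
PRICE_PER_SUBSCRIPTION_PER_MONTH = 73  # USD, the figure quoted in this post


def avnm_monthly_cost(subscriptions, price=PRICE_PER_SUBSCRIPTION_PER_MONTH):
    """Monthly cost when every subscription is charged the flat rate."""
    return subscriptions * price


for subscriptions in (5, 50, 200):  # hypothetical estate sizes
    monthly = avnm_monthly_cost(subscriptions)
    print(subscriptions, monthly, monthly * 12)
```

Five subscriptions come to $365 per month. Two hundred subscriptions, which is not unusual with one subscription per workload, come to $14,600 per month or $175,200 per year.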

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit of bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed between the other cloud and a VNet-based hub in Azure – always my preference because of the scalability, security/governance concepts, and the superiority over Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer now had two clouds, connected as follows:

  • The other vendor via a leased line.
  • Azure via SD-WAN.
  • And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client had ended months previously. I got a message early one morning from a colleague: the client was having a serious networking issue and could I get online? The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other cloud were unavailable:

  • App integration and services with the on-premises data centre.
  • App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route Server is to enable branch-to-branch connectivity between “on-premises” locations that are connected via ExpressRoute/VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the above Microsoft diagram makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

  • The “on-premises” location via ExpressRoute
  • The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.
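To make the effect of that one setting concrete, here is a minimal sketch of the idea in Python. It is a toy model, not the Azure API: the class, the peer names, and every prefix are made up for illustration. The only behaviour it tries to capture is that a peer always hears about the virtual network prefixes, and only hears about routes learned from the other peers when branch-to-branch is switched on.

```python
from dataclasses import dataclass, field


@dataclass
class RouteServerModel:
    """Toy model of a route server (hypothetical, not the Azure API)."""
    vnet_prefixes: list[str]
    branch_to_branch: bool = False
    # Routes learned from each peer, keyed by a made-up peer name.
    learned: dict[str, list[str]] = field(default_factory=dict)

    def learn(self, peer: str, prefixes: list[str]) -> None:
        """Record the prefixes that a peer advertises to us."""
        self.learned[peer] = list(prefixes)

    def advertised_to(self, peer: str) -> list[str]:
        """Return the prefixes that would be advertised to one peer."""
        # Every peer always hears about the virtual network prefixes.
        routes = list(self.vnet_prefixes)
        if self.branch_to_branch:
            # With branch-to-branch on, routes learned from the OTHER
            # peers are passed along too.
            for other, prefixes in self.learned.items():
                if other != peer:
                    routes.extend(prefixes)
        return routes


hub = RouteServerModel(vnet_prefixes=["10.100.0.0/16"])
hub.learn("meraki-nva", ["192.168.0.0/16"])      # SD-WAN sites (made-up prefix)
hub.learn("expressroute-gw", ["172.16.0.0/12"])  # other cloud (made-up prefix)

before = hub.advertised_to("meraki-nva")
hub.branch_to_branch = True
after = hub.advertised_to("meraki-nva")

print(before)
print(after)
```

In this model the Meraki peer only sees the Azure prefix before the change, and sees the other cloud's prefix as well afterwards, which is the gap we wanted to fill.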

The Results

I had not heard anything in a while. This morning I heard that the client was happy with the fix. In fact, user experience was faster.

Go back to the original diagram before Azure and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route is a “backhaul”:

  1. SD-WAN to central data centre
  2. Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

  1. SD-WAN to Azure
  2. A very short distance hop over ExpressRoute

Latency was reduced quite a bit so user experience improved. On the other hand, latency between the on-premises data centre and the other cloud has increased because the SD-WAN is a new hop, but at least the path is available. The original leased line is still down after a few weeks – this is not the fault of the client!

Some Considerations

Ideally one would have two leased lines in place for failover. That incurs costs and it was not possible. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the leased line is repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take its place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we do keep branch-to-branch connectivity then we need to consider “what is the best path”? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

  • (Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be used. This would offer sub-optimal routes because on-premises prefixes will be received from the other cloud.
  • Prefer VPN: Any routes received over VPN will be used. This would offer sub-optimal routes because other cloud prefixes will be received from on-premises.
  • Use AS path: Let the admin/network advertise a preferred path. This would offer the desired control – “use this path unless something goes wrong”.
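The three options above can be sketched as a simple selection function. This is my own simplified model of the behaviour, not Microsoft's implementation: the `Route` type, the `pick_route` function, the prefix, and the private AS numbers are all assumptions for illustration, and the SD-WAN NVA is lumped in with "vpn".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str               # "expressroute" or "vpn" (the NVA counts as "vpn" here)
    as_path: tuple[int, ...]  # AS numbers the advertisement has passed through


def pick_route(candidates: list[Route], preference: str) -> Route:
    """Pick one route for a prefix under a simplified preference rule."""
    if preference == "aspath":
        # Ignore where the route came from; the shortest AS path wins.
        return min(candidates, key=lambda r: len(r.as_path))
    # "expressroute" or "vpn": prefer that source, fall back to any route.
    preferred = [r for r in candidates if r.source == preference]
    pool = preferred or candidates
    return min(pool, key=lambda r: len(r.as_path))


# The data centre prefix (made up) is learned twice by Azure:
# directly over the SD-WAN, and re-advertised by the other cloud
# over ExpressRoute once the leased line is back.
candidates = [
    Route("10.50.0.0/16", "vpn", (65010,)),
    Route("10.50.0.0/16", "expressroute", (65020, 65010)),
]

default_choice = pick_route(candidates, "expressroute")
aspath_choice = pick_route(candidates, "aspath")

print(default_choice.source)
print(aspath_choice.source)
```

With the default preference, the model sends traffic for the data centre the long way around via the other cloud. With AS path, the shorter advertisement over the SD-WAN wins, and an admin could lengthen a path deliberately to make it the backup.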

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago I added a question for the audience to some of my presentations on Azure networking. I asked who was an ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said “the ‘server admins’ are going to understand what I will teach more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

  • Appliances such as routers decide what turn to take at a junction point
  • Firewalls either block or allow packets
  • Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

  • External
  • Management
  • Site-to-site connectivity
  • DMZ
  • Internal
  • Secure zone

Each port connects to a subnet that is a certain network. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that a server in the DMZ network must pass through the firewall, via the cable to the firewall, to get to another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber and having to open a support call to create a network or a subnet. Maybe you need to wait 3 days while some operators plug in cables and run Cisco commands. Or they need to order more switches because they’ve run out of capacity and you might need to wait weeks. Is this the hosting of the 2000s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.

This is because the physical network of Azure is overlaid with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because they actually have nothing to do with how packets route at the physical layer – that’s handled by traditional networking at the physical layer – but that’s a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

  • Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
  • The hosts for those virtual machines are running (drumroll please) Hyper-V, which as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.

This is why you cannot implement GRE networking in Azure.
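The “second envelope” idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Azure fabric is written: the types, the `HOST_OF` lookup table, the tenant names, and the host addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str        # customer-chosen address of the source NIC
    dst: str        # customer-chosen address of the destination NIC
    payload: bytes


@dataclass(frozen=True)
class Encapsulated:
    outer_src: str  # physical address of the host running the source
    outer_dst: str  # physical address of the host running the destination
    inner: Packet   # the original packet, untouched


# Hypothetical fabric lookup: (tenant, NIC address) -> physical host address.
# Two tenants use the same 10.x addresses without clashing.
HOST_OF = {
    ("tenant-a", "10.0.1.4"): "100.64.10.7",
    ("tenant-a", "10.10.1.4"): "100.64.22.3",
    ("tenant-b", "10.0.1.4"): "100.64.31.9",
    ("tenant-b", "10.10.1.4"): "100.64.40.2",
}


def encapsulate(tenant: str, packet: Packet) -> Encapsulated:
    """Put the letter into a second envelope addressed host-to-host."""
    return Encapsulated(
        outer_src=HOST_OF[(tenant, packet.src)],
        outer_dst=HOST_OF[(tenant, packet.dst)],
        inner=packet,
    )


def decapsulate(frame: Encapsulated) -> Packet:
    """Unpack the original packet at the destination host."""
    return frame.inner


original = Packet("10.0.1.4", "10.10.1.4", b"hello")
frame_a = encapsulate("tenant-a", original)
frame_b = encapsulate("tenant-b", original)

print(frame_a.outer_dst, frame_b.outer_dst)
print(decapsulate(frame_a) == original)
```

The same inner addresses end up on different physical hosts for the two tenants, and the guest never sees the outer envelope.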

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to the second question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it as being like a Venn diagram. Once you implement peering, the set of circles that defines what can talk to what gains a new circle. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Now a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly (using encapsulation/decapsulation) to any destination in VNet2 and vice versa without going through some resource in the virtual networks.
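Here is a rough sketch of that mapping in Python, assuming a much simpler structure than the real fabric uses. The `FabricMap` class and the NIC/VNet names are hypothetical. The point is that peering adds an entry to a map; it does not add a device or a link that packets must pass through.

```python
class FabricMap:
    """Toy version of the 'who may talk to whom' map (hypothetical)."""

    def __init__(self) -> None:
        self.vnet_of: dict[str, str] = {}          # NIC name -> VNet name
        self.peerings: set[frozenset[str]] = set()  # unordered VNet pairs

    def attach(self, nic: str, vnet: str) -> None:
        """Connect a NIC to a virtual network."""
        self.vnet_of[nic] = vnet

    def peer(self, vnet_a: str, vnet_b: str) -> None:
        """Peering only adds a new 'circle' around the two VNets."""
        self.peerings.add(frozenset((vnet_a, vnet_b)))

    def can_talk(self, src_nic: str, dst_nic: str) -> bool:
        """True if the NICs share a VNet or their VNets are peered."""
        a, b = self.vnet_of[src_nic], self.vnet_of[dst_nic]
        return a == b or frozenset((a, b)) in self.peerings


fabric = FabricMap()
fabric.attach("vm1-nic", "hub")
fabric.attach("vm2-nic", "spoke1")
fabric.attach("vm3-nic", "spoke2")

blocked_before = fabric.can_talk("vm1-nic", "vm2-nic")
fabric.peer("hub", "spoke1")
fabric.peer("hub", "spoke2")
allowed_after = fabric.can_talk("vm1-nic", "vm2-nic")
# Peering is not transitive: the spokes share a hub but no circle of their own.
spoke_to_spoke = fabric.can_talk("vm2-nic", "vm3-nic")

print(blocked_before, allowed_after, spoke_to_spoke)
```

The last check reflects the fact that VNet peering is not transitive: two spokes peered with the same hub still cannot talk to each other directly.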

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why tools like Test-NetConnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor are very important.
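If you are not on Windows, the same style of test (can I open a TCP connection to this address and port?) is easy to reproduce. The function below is a minimal sketch using only the Python standard library. The name `tcp_port_open` is my own, and it is an approximation of the idea rather than a replacement for Test-NetConnection or Network Watcher.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        # create_connection performs the full TCP handshake; the
        # context manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_port_open("10.10.1.4", 443)` would tell you whether the destination from the earlier example answers on HTTPS, which says far more about an Azure flow than a list of hops that are not really there.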

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we take what I’ve explained above, we’ll really understand what will happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the destination NIC in one of the spokes as if they were teleported.

To get the flow that you require for security purposes you need to understand Azure routing and either implement the flow via BGP or User-Defined Routing.
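A small sketch shows why a User-Defined Route has to be written carefully. Azure picks the route with the longest matching prefix and, when prefixes are equal, prefers user-defined routes over BGP routes over system routes. The code below models just that rule; the `RouteEntry` type, the function name, and all of the prefixes and next hops (including the firewall at 10.0.2.4) are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass

# Lower number wins when two prefixes are equally specific.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class RouteEntry:
    prefix: str
    next_hop: str
    source: str  # "user", "bgp" or "system"


def effective_next_hop(routes: list[RouteEntry], destination: str) -> str:
    """Longest prefix match first, then user > bgp > system on a tie."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    best = max(
        matches,
        key=lambda r: (
            ipaddress.ip_network(r.prefix).prefixlen,
            -SOURCE_PRIORITY[r.source],
        ),
    )
    return best.next_hop


# What the gateway's subnet knows by default: the peered spoke is direct.
routes = [
    RouteEntry("0.0.0.0/0", "Internet", "system"),
    RouteEntry("10.10.0.0/16", "VNetPeering", "system"),
]
default_hop = effective_next_hop(routes, "10.10.1.4")

# A broad UDR loses to the more specific system route for the spoke.
routes.append(RouteEntry("10.0.0.0/8", "10.0.2.4", "user"))
broad_udr_hop = effective_next_hop(routes, "10.10.1.4")

# A UDR that matches the spoke prefix exactly finally wins.
routes.append(RouteEntry("10.10.0.0/16", "10.0.2.4", "user"))
exact_udr_hop = effective_next_hop(routes, "10.10.1.4")

print(default_hop, broad_udr_hop, exact_udr_hop)
```

In this model the packet only reaches the firewall once the user-defined route is at least as specific as the system route created by peering.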

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is the purpose of them? Oh I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So those NICs in the firewall do what? They don’t isolate, they complicate! They aren’t for performance, because the VM size controls overall NIC throughput and speed. They don’t add performance, they complicate!

The real reason for all those NICs is to simulate eth0, eth1, etc that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall uses a single IP address on the virtual network (via the Standard tier load balancer, but you might notice each compute instance IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets really get from A to B then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.