There Is More To Azure Networking Than Connectivity & Security

This post will explain how a well-designed, secured, governed and managed network design plays a foundational role in digital transformation and cloud enablement.

Cloud Adoption Versus Cloud Migration

What? Aidan – I thought this was a post about Azure networking!

Yes, it is … but you’ll have to join me on this journey. Lately, I’ve been using the “we need to step back and think about why we’re doing any of this” line quite a bit. The context of that line changes, but the message remains consistent.

Why did we go to The Cloud (Azure in our case)? For many, the reason is something like “I was told to”, “we were leaving our old hosting company”, or “our hardware support ended”. Those reasons triggered what I call a cloud migration project. I’ve done a LOT of those projects – thanks to scope limitations in the engagement, forced either by poorly advised customers (that lead to restricted tenders) or salespeople who refused to have a larger conversation.

Many organisations with internal developers that do a cloud migration end up in the same situation 18-24 months later: developers refuse to deploy into “IT’s cloud”. This is because IT has recreated its old data centre in Azure, along with the restrictions, controls, and lack of trust. We were told “cloud is how you work, not where you work”, but not many people heard that message. We end up with situations where businesses have paid for Azure, but developers don’t get the Cloud; they get IT-driven and IT-restricted virtualisation in Azure.

Cloud Adoption is a change journey, as documented by the Cloud Adoption Framework. We are supposed to:

  1. Understand why the business (not IT) wants to use the Cloud
  2. Create a cloud strategy for the organisation
  3. Define and enable a new way of delivering cross-functional digital services.
  4. Do all the other technical stuff that we focus on, with the architecture based on the above.

Steps 1 and 2 (CAF Strategy and Plan) are the keys to cloud adoption success. In theory, if we do everything correctly:

  1. The developers want to adopt the new cloud environment because it enables their mission.
  2. The business sees a return on the investment with faster innovation of digital services.

Where Does Networking Come Into This?

Pretty much every customer I’ve dealt with wants to improve their security for business protection or to meet compliance requirements. That typically results in larger usage of Virtual Networks. Many customers end up recreating their data centre networks in Azure; they create 1 Virtual Network (spoke) for each VLAN:

  • DMZ
  • Regular zone
  • Secure Zone

Or maybe they have:

  • Dev
  • Test
  • Production

Each of these networks shares various traits:

  • A big virtual network with many subnets
  • Managed by the central IT infrastructure

I can go into all the security and complexity flaws that result from this too-common design pattern. But my focus is on cloud adoption in this post:

  • Developers are actively prevented from having network access/control. They rely on helpdesk tickets to get anything done – what happened to the essential cloud trait of “on-demand self-service”?
  • Subscriptions are filled with dozens of resource groups. Access is granted on a per-resource group granularity, which complicates and slows things down.
  • The desire for more security is gradually eroded due to operational complexity and constant delegation of rights with complicated granularity.

So, believe it or not, Azure networking is our canary in the mine. I have used, and I continue to use this reliable little bird to smell out operational/security failures in customers’ Azure environments.

Now, you know how I can detect adoption problems from the floor up. Next I want to explain how I can architect the Azure network to solve these issues.

Landing Zones

Let’s bend some minds. 8-ish years ago, I started working on a new “standard design” for my employer (a consulting company) with a fellow principal consultant. We both came to the table with a different subscription strategy from the usual one. The norm was that each of the above traditional spoke VNets would be aligned with a subscription each. That results in very few subscriptions, with demands for complicated role delegations, tagging, cost management, and so on. We switched to a 1 subscription/workload (application/service) approach; this new level of granularity:

  • Required 1 small Virtual Network where networking is required
  • Developer/operator role delegations are done once per subscription
  • Cost management is done per subscription (Budgets) with much less tagging for metadata
  • Easier operations with fewer mistakes through subscription selection in Azure Portal/PowerShell/CLI/etc. The resource groups in the subscription are related to only that workload.
  • The security boundary is much smaller. The access boundary is the single workload. Any VNet-based workloads must route via the hub firewall to reach any other workload, subject to rules and IDPS inspection.

Microsoft introduced the concept of landing zones a few years ago, which uses the same subscription/workload approach:

  • Platform landing zone: A subscription that offers shared infrastructure, such as a hub, a shared Application Gateway/WAF, Active Directory Domain Controllers, DNS, etc.
  • Application landing zone: A subscription that hosts a single application/service/workload.

Like with my approach, each landing zone has a Virtual Network (if required) that is:

  • Sized according to the workload architecture with some spare capacity.
  • Peered with the hub, with the egress path from the workload being via the hub firewall.
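
As a rough illustration of "sized according to the workload architecture with some spare capacity", the sketch below adds up the addresses each subnet needs, allows for the 5 addresses that Azure reserves in every subnet, and rounds up to a VNet prefix with headroom. The function names, the 25% spare-capacity default, and the sample host counts are assumptions for illustration, not a Microsoft rule.

```python
AZURE_RESERVED_PER_SUBNET = 5  # Azure reserves 5 addresses in every subnet


def subnet_prefix_length(hosts_needed: int) -> int:
    """Smallest IPv4 prefix length that fits the hosts plus Azure's reserved addresses."""
    total = hosts_needed + AZURE_RESERVED_PER_SUBNET
    bits = max(3, (total - 1).bit_length())  # /29 is the smallest subnet Azure allows
    return 32 - bits


def vnet_prefix_length(subnet_hosts: list[int], spare_fraction: float = 0.25) -> int:
    """Prefix length for a VNet holding every subnet plus spare capacity (assumed 25%)."""
    addresses = sum(2 ** (32 - subnet_prefix_length(h)) for h in subnet_hosts)
    # Integer ceiling of addresses * (1 + spare_fraction), avoiding float rounding surprises
    with_spare = addresses + -(-int(addresses * spare_fraction * 1000) // 1000)
    return 32 - (with_spare - 1).bit_length()


if __name__ == "__main__":
    # Hypothetical workload: three subnets needing 10, 50 and 20 addresses
    print(vnet_prefix_length([10, 50, 20]))
```

The point of the sketch is only that a landing zone VNet can be small and still leave room for one or two extra subnets later.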

Security & Governance

Let’s consider some things:

  • The business requires governance to manage IT and to ensure regulatory compliance.
  • IT security must protect the business, customers, vendors, etc.
  • We have many workloads/subscriptions.

We cannot have 1 policy for everything – sometimes we have business/operational reasons to have more-strict policies or less-strict policies. For example, we might require more Defender for Cloud features in some workloads or allow PaaS public endpoints in others.

Microsoft gave us Enterprise Scale around 5 years ago. This reference architecture (with supplied templated deployments) offers a subscription categorisation approach using Management Groups:

  • Corporate: Workloads that can connect to other networks.
  • Online: Workloads that have an online presence and should not connect to other workloads.

Azure Policy is used to enforce the standards for each Management Group.

I don’t know about you, but I have never seen such a binary requirement in the real world. I’ve seen many people discuss/use a third Management Group called Hybrid; they wonder how to build the policies to enforce the requirements.

In the real world, just about everything is shades of grey when it comes to connectivity. I’ve had ultra-secure workloads with web interfaces. I’ve had low-end workloads with high security. And I can guarantee you that sensitive workloads have compelling business reasons to be both online and integrated with traditional private-protocol connectivity.

I thought about this last year and came up with a different approach. We can use CAF’s operational methodologies to develop a tiered, documented, and implemented policy that aligns with the organisation’s governance, security, and management requirements. I suggested that we would have three tiers (names are irrelevant):

  • Gold: The strictest policies
  • Silver: Medium-level policies, containing the most workloads
  • Bronze: The most relaxed policies

The result is 3 Management Groups (above), each with Azure Policy automatically auditing/enforcing the designed and continuously improved requirements.

The new (CAF Plan) operational model would introduce a step to categorise the workload based on security risks, governance requirements, and management needs. Each workload would be placed in the correct Management Group with policies.

The policies give us automation and guardrails. For example, where appropriate, we can:

  • Restrict regions.
  • Ban public IP association with NICs
  • Disable public endpoints
  • Enable Defender for Cloud plans
  • Force VNet Flow Logging
  • Configure diagnostics settings
  • And much more

The key to this is momentum. My approach is “minimum viable product” (MVP). For example, I had a 30-minute call with a customer last year and designed their starter policies. Now they (should) run regular reviews to assess the policies/risks/requirements and expand the policies/implementations. We didn’t freeze for 2 years to build a policy. We got some essentials in place and we carried on with getting results for the business.

Now, let’s get back to networking!

At-Scale Network Configuration And Enforcement

Developers, operators, and (rival) service providers are empowered to build in the Azure environment with a new guardrail-protected landing zone approach. How do we ensure that their Virtual Networks are built correctly?

We can use Azure Virtual Network Manager (AVNM).

Note that the horrid per-subscription pricing for AVNM was replaced a long time ago. Please go back and reassess the pricing before you run away.

AVNM gives us policy-driven:

  • Discovery and grouping of Virtual Networks for granular policy assignments
  • Peering with a hub and mesh capabilities
  • Route Table deployment/association with User-Defined Routes (UDRs)
  • Security Admin Rules that are processed before NSG rules with override capabilities
  • IP Address Management (IPAM) to provide approved, non-repeating IP prefixes for new networks and to manage their lifecycle

In short, if you deploy a VNet, I can:

  • Get an approved IP prefix for the Virtual Network
  • Use Azure Policy to automatically configure/enforce things like VNet Flow Logs and DNS settings
  • Use AVNM to correctly connect, route, and secure your VNet
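
The IPAM idea of "approved, non-repeating IP prefixes" with a lifecycle can be pictured with Python's standard ipaddress module. This is a toy model only: it is not how AVNM is implemented and it calls no Azure API. The pool 10.0.0.0/16, the class name, and the workload names are hypothetical.

```python
import ipaddress


class PrefixPool:
    """Toy IPAM pool: hands out non-overlapping prefixes and reclaims them on release."""

    def __init__(self, pool_cidr: str):
        self.pool = ipaddress.ip_network(pool_cidr)
        self.allocated = {}  # workload name -> allocated network

    def allocate(self, workload: str, prefix_length: int):
        """Return the first free prefix of the requested size for a workload."""
        if workload in self.allocated:
            raise ValueError(f"{workload} already has {self.allocated[workload]}")
        for candidate in self.pool.subnets(new_prefix=prefix_length):
            if not any(candidate.overlaps(used) for used in self.allocated.values()):
                self.allocated[workload] = candidate
                return candidate
        raise ValueError(f"no free /{prefix_length} left in {self.pool}")

    def release(self, workload: str) -> None:
        """Return a workload's prefix to the pool when its VNet is retired."""
        self.allocated.pop(workload, None)


if __name__ == "__main__":
    pool = PrefixPool("10.0.0.0/16")
    print(pool.allocate("app1", 24))
    print(pool.allocate("app2", 23))
```

A linear scan is fine for a sketch; a real IPAM service also has to handle approvals, reservations for on-prem ranges, and concurrent requests.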

To quote Van Halen: “they got you coming in, and they got you going out”. I always did prefer “Van Hagar” 🙂

Summary

A legacy, cable-oriented, on-prem network in Azure indicates that the organisation has not modernised how digital services are created, operated, and delivered to the business. In short, the business is paying for the cloud but is getting remotely hosted Hyper-V.

We can enable modern collaborative working processes by modernising our designs. Using application landing zones will create a new form of granularity for all aspects of infrastructure, security, governance, and management. We can use the governance features to create the guardrails and some of the automations. We can use Azure Virtual Network Manager (AVNM) to ensure a good Virtual Network deployment.

If You Want To Learn More

Contact me via my consulting company, Cloud Mechanix, if you would like to learn how I can help you with this design pattern.

When To Add Subnets To An Azure Virtual Network

In this post, I want to explain the real reasons to add subnets to an Azure virtual network. This post is born out of frustration. I’ve seen post after post on social media, particularly on LinkedIn, where the poster has “Azure expert” in their description and is sharing advice from the year 2002 for cable-oriented (on-prem) networks.

The BS Advice

Consider the scenario below:

The above diagram shows us the commonly advised Virtual Network architecture for a 3-tier web app. There are 3 tiers. The poster will say:

Each tier should have its own subnet for security reasons. Each subnet will have an NSG.

So if we have web servers, app servers, and database servers, the logic is that the subnet + NSG combination provides security. The poster is half right:

  • The NSG does micro-segmentation of the machines.
  • The subnets do nothing.

Back To The Basics … Again

I want you to do this:

  1. Build a VNet with 2 subnets.
  2. Build 2 VMs, each attached to a different subnet.
  3. Log into one of the VMs.
  4. Run tracert to the second VM.

What will you see? The next and only hop is the second VM.

Ping the default gateway. What happens? Timeouts. The default gateway does not exist.

This is easily explained: Virtual Networks do not exist. Subnets do not exist.

Think of a Virtual Network as a Venn diagram. Our two virtual machines are in the same circle. That is an instruction to the Azure fabric to say:

These machines are permitted to route to each other

That’s how Coca-Cola and PepsiCo could both have Virtual Networks with overlapping address spaces in the same Azure room and not be able to talk to each other.

Note: This is functionality of VXLAN implemented through the Hyper-V switch extension capability that was introduced in Windows Server 2012.

Simple Example

Let us fix that simple example. We will first understand that NSGs offer segmentation. No matter how I associate an NSG, the rules are always applied on the host virtual switch port (in Hyper-V, that’s on the NIC). If a rule says “no” then that packet is automatically dropped. If a rule says “yes”, then the packet is permitted.

In the diagram below, we accept that subnets play no role in security segmentation. We have flattened the network to a single subnet. There is a single web server, app server, and database server – we will add complexity later:

This network is much simpler, right? And it offers no less security than the needlessly more complicated first example. An NSG is associated with the subnet. NSG rules allow the required traffic, and a low-priority rule denies all other traffic. Only the permitted traffic can enter any specific NIC.
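
To make the single-subnet argument concrete, here is a minimal model of priority-ordered, first-match rule evaluation, which is how NSG rules are processed (lowest priority number first). The IP addresses, ports, and rule set are hypothetical, and the model ignores protocols, direction, and the platform's default rules. In a real NSG the default AllowVnetInBound rule permits traffic inside the VNet, which is why the explicit low-priority deny matters.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    source: str            # CIDR prefix
    destination: str       # CIDR prefix
    port: Optional[int]    # None means any port
    allow: bool


def evaluate(rules, src_ip: str, dst_ip: str, port: int) -> bool:
    """Return True if the first matching rule, in priority order, allows the packet."""
    src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if (src in ipaddress.ip_network(rule.source)
                and dst in ipaddress.ip_network(rule.destination)
                and rule.port in (None, port)):
            return rule.allow
    return False  # nothing matched: treated as denied in this sketch


# Hypothetical flat subnet 10.0.1.0/24: web = .4, app = .5, database = .6
RULES = [
    Rule(100, "0.0.0.0/0", "10.0.1.4/32", 443, True),      # clients -> web
    Rule(110, "10.0.1.4/32", "10.0.1.5/32", 8080, True),   # web -> app
    Rule(120, "10.0.1.5/32", "10.0.1.6/32", 1433, True),   # app -> database
    Rule(4000, "0.0.0.0/0", "0.0.0.0/0", None, False),     # deny everything else
]
```

With these rules, the web server cannot reach the database directly even though both sit in the same subnet, because only the deny rule matches that flow.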

I’ve seen arguments that this will create complicated rules. Pah! I’ve built/migrated more apps than I care to remember. The rules for these apps are hardly ever that numerous.

Aidan, what if I am going to run a highly available application? Lucky for you if the code supports that (seriously!). Whether you’re using availability sets or availability zones (lucky you, these days), we will make a tiny design change.

We will create a (free) Application Security Group (ASG) for each tier. We will then use the ASG as the source and destination instead of the VM IP addresses.

Aidan, what if I’m going to use Virtual Machine Scale Sets (VMSS)? It’s no different: you add the ASG for the tier to the networking properties of the VMSS. Each created VMSS instance will automatically be associated with the ASG.
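
The benefit of ASGs is that rules name a group rather than addresses, so scaling a tier changes group membership and leaves the rules alone. The sketch below models only that idea; the ASG names, NIC names, and ports are hypothetical and nothing here calls Azure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsgRule:
    priority: int
    source_asg: str
    destination_asg: str
    port: int
    allow: bool


def allowed(rules, membership, src_nic: str, dst_nic: str, port: int) -> bool:
    """First matching rule wins; membership maps a NIC name to its set of ASG names."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (rule.source_asg in membership.get(src_nic, set())
                and rule.destination_asg in membership.get(dst_nic, set())
                and rule.port == port):
            return rule.allow
    return False  # no match: denied in this sketch


RULES = [
    AsgRule(100, "asg-web", "asg-app", 8080, True),
    AsgRule(110, "asg-app", "asg-db", 1433, True),
]
MEMBERSHIP = {
    "web-vm1-nic": {"asg-web"},
    "app-vm1-nic": {"asg-app"},
    "db-vm1-nic": {"asg-db"},
}
```

Adding a second web instance means adding one entry to the membership map; the rule list stays untouched, which mirrors what happens when a scale set instance joins the tier's ASG.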

When Should I Add Subnets?

There are several reasons why you should add subnets. I’ll list them first before I demonstrate them:

  • Azure requires it
  • Unique routing
  • Remote network sources
  • Scaling

Azure Requires It

There are scenarios when Azure requires a dedicated subnet. Some that I can immediately think of are:

  • Virtual Network Gateway
  • Azure Route Server
  • SQL Managed Instance (MI)
  • App Service Regional Virtual Network Integration
  • App Service Environment (ASE – App Service Isolated Tier) VNet injection

Let’s PaaS-ify the developers (see what I did there :D) and move from VMs to PaaS. We will replace the web servers with App Services and the database with SQL MI:

  • The web servers ran two apps, Web and API. Each will have a Private Endpoint for ingress traffic. The Private Endpoints can remain in the General Subnet.
  • The web servers must talk to the app servers (still VMs) over the VNet, so they will get Regional VNet Integration via the App Service Plan. This will require a dedicated subnet for egress only. This subnet will have no ingress.
  • SQL MI requires a dedicated subnet.

Unique Routing

Next-hop routing is always executed by the source Azure NIC. Every subnet has a collective set of routes to destination prefixes (network addresses). Those routes are propagated to the NICs in that subnet (subnets do not exist). The NICs decide the next hop for each packet, and the Azure network fabric sends the packet (by VXLAN) directly to the NIC of the next hop.

There may be a situation where you want to customise routing.

For the sake of consistency, I’m going to use our web app, but in a little wonky way. My app is expanding. Some more VMs are being added for custom processing. Those VMs are being added to the GeneralSubnet.

My wonky scenario is that the security team have decided that traffic from the App Servers to SQL MI must go through a firewall – that implies that the return traffic must also go through the firewall. No other traffic inside the app needs to go through the firewall. The firewall is deployed in a peered hub.

That means that I must split the GeneralSubnet into two source routing domains (subnets):

  1. GeneralSubnet: Containing the Private Endpoint NICs and my new custom processing VMs.
  2. AppServerSubnet: Containing only the app server VMs.

We will implement the desired via-firewall routing using User-Defined Routes in Route Tables:

  • AppServerSubnet: Go via the hub firewall to get to SqlMiSubnet.
  • SqlMiSubnet: Go via the hub firewall to get to AppServerSubnet.
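
Why the split is needed becomes clearer when you model how a NIC picks a next hop: the longest matching prefix wins, and on a tie a user-defined route beats a BGP route, which beats a system route. The sketch below shows that selection with made-up addresses (a 10.1.0.0/24 VNet, SqlMiSubnet at 10.1.0.64/27, a hub firewall at 10.0.0.4); the next-hop strings are illustrative labels only.

```python
import ipaddress
from dataclasses import dataclass

# Tie-break between routes with the same prefix length: user-defined, then BGP, then system
SOURCE_PREFERENCE = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    source: str  # "user", "bgp" or "system"


def next_hop(routes, destination_ip: str) -> str:
    """Pick the next hop by longest prefix match, then by route source preference."""
    dst = ipaddress.ip_address(destination_ip)
    matches = [r for r in routes if dst in ipaddress.ip_network(r.prefix)]
    if not matches:
        raise LookupError(f"no route to {destination_ip}")
    best = min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, SOURCE_PREFERENCE[r.source]),
    )
    return best.next_hop


SYSTEM_ROUTES = [
    Route("10.1.0.0/24", "VnetLocal", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
]
# GeneralSubnet keeps the system routes; AppServerSubnet adds one UDR for SqlMiSubnet
GENERAL_SUBNET_ROUTES = SYSTEM_ROUTES
APP_SERVER_SUBNET_ROUTES = SYSTEM_ROUTES + [Route("10.1.0.64/27", "10.0.0.4", "user")]
```

Because routes attach to a subnet and are applied by each NIC in it, the only way to give the app servers a different answer from the custom processing VMs for the same destination is to put them in different subnets.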

Remote Network Sources

So far, we have been using Application Security Groups (ASGs) to abstract IP addresses in our NSGs. ASGs are great, but they have restrictions:

  • Firewalls, including Azure Firewall, have no idea what an ASG is. You will have to use IP addresses as the sources in the firewall rules – possibly abstracted as IP Groups (Azure Firewall) or similar in third-party firewalls.
  • ASGs can only be used inside their parent subscription. You’re not going to be able to use them as sources in other workloads if you follow the subscription/workload approach of application landing zones.

Using an IP address(es) as a source is OK if the workload does not autoscale. What happens if your app tier/role uses autoscaling and addresses are mixed with addresses from other tiers/roles that should not have access to a remote resource?

There is only one way to solve this: break the source resource(s) out into their own subnet. I recently saw this one with a multi-subscription workload where there was going to be an Azure-hosted DevOps agent pool. Originally, the autoscaling pool was going to share a subnet with other VMs. However, I needed to grant HTTPS access to all other resources from the DevOps pool only. I couldn’t do that if the DevOps pool remained in a shared subnet. I split the pool into its own subnet and was able to use that subnet’s prefix as the source in the various firewall/NSG rules.

Scaling

There are two scaling scenarios that I can think of at the moment. In the first, you have some workload component that will autoscale. The autoscaling role/tier requires a large number of IPs that you want to dedicate to that role/tier. In this case, yes, you may dedicate a subnet to that role/tier.

The second scenario is that you have followed a good practice of deploying a relatively small VNet for your workload, with some spare capacity for additional subnets. However, the scope of the workload has changed significantly. The spare capacity will not be enough. You need to expand the VNet, so you add a second IP prefix to the VNet. This means that new IP capacity requires additional subnets from the new prefix.

In Summary

Every diagram for a new VNet in Azure should start very simply: 1 subnet. Do not follow the overly-simple advice from “Azure expert” LinkedIn posts that say “create a subnet for every tier to create a security boundary”. You absolutely do not need to do that! Most workloads, even in large enterprises, are incredibly simple, and one subnet will suffice. You should only add subnets when the need requires it, as documented above.

Virtual WAN Is Not Required For SD-WAN

Did you know that you do not need to use Virtual WAN to implement an SD-WAN with Azure? In fact, contrary to the recommendations from Microsoft, Virtual WAN might be the worst way to add Azure networks to an SD-WAN.

My History With Virtual WAN

You might think that the introduction of this post paints me as a complete hater who has never given Virtual WAN a chance. I have. In fact, I can point out features that some of my 1:1 feedback calls probably contributed to. I’ve implemented Virtual WAN with customers.

However, I’ve seen the problems. I’ve seen that the hype doesn’t always work. I’ve personally experienced the lack of troubleshooting capabilities that depended on my deep understanding of the hidden networking. I’ve seen colleagues struggle with the complexity. I’ve seen how some customers’ routing requirements cannot be met with Virtual WAN. And many architectural features that some organisations require cannot be deployed with Virtual WAN.

I concluded that my time with Virtual WAN was over during a proof of concept that I insisted a customer do. They had previously used Virtual WAN without a firewall. I was asked to build a new multi-region Azure environment (multiple hubs) with firewalls. I was not sure that it would go well – this was before routing intent was in preview. I tested and confirmed that Virtual WAN was not going to work; the customer implemented a Meraki SD-WAN using Virtual Network-based hubs and lost no functionality. In fact, they gained functionality.

In an older case, I convinced a customer to go with Virtual WAN. I regret this one. There was a lot of hype. They used Meraki. There was a solution from Meraki to integrate with the Virtual WAN VPN Gateway. We found bugs in the script and fixed them. But the most annoying thing about that solution was that every time the customer changed anything in the SD-WAN, every VPN tunnel to Azure was torn down and recreated. I heard recently that the customer is looking to remove SD-WAN. I don’t blame them, and I regret ever recommending it to them.

The Microsoft Claims

The Azure Cloud Adoption Framework incorrectly states the following:

Use a Virtual WAN topology if any of the following requirements apply to your organization:

  • Your organization intends to deploy resources across several Azure regions and requires global connectivity between virtual networks in these Azure regions and multiple on-premises locations.
  • Your organization intends to use a software-defined WAN (SD-WAN) deployment to integrate a large-scale branch network directly into Azure, or requires more than 30 branch sites for native IPSec termination.
  • You require transitive routing between a virtual private network (VPN) and Azure ExpressRoute. For example, if you use a site-to-site VPN to connect remote branches or a point-to-site VPN to connect remote users, you might need to connect the VPN to an ExpressRoute-connected DC through Azure.

I will burst those bubbles one by one.

Several Regions & Global Connectivity

Do you want to deploy across multiple regions? Not a problem. You can very easily do that with Virtual Network-based hubs. I’ve done it again and again.

Do you want to connect the spokes in different regions? Yup, also easy:

  • Build each hub-and-spoke from a single IP prefix.
  • Your spokes already route via the hub.
  • Peer the hubs.
  • Create User-Defined Routes in each firewall subnet (you will be using firewalls in this day and age) to route to remote hub-and-spoke IP prefixes via the remote hub firewalls.

Job done! The only additional steps were:

  • Peer the hubs
  • Add UDRs to each firewall subnet for each remote hub-and-spoke IP prefix
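
The "do it once" claim is easy to see when the routes are generated from a list of hubs: each firewall subnet gets one route per remote hub-and-spoke prefix, pointing at that remote hub's firewall. The hub names, prefixes, and firewall IPs below are hypothetical, and the output is plain data rather than a deployment; in Azure each entry would become a UDR with a virtual appliance next hop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hub:
    name: str
    prefix: str        # single prefix covering the hub and all of its spokes
    firewall_ip: str   # private IP address of the hub firewall


def firewall_subnet_udrs(hubs):
    """For each hub, list (remote prefix, remote firewall IP) routes for its firewall subnet."""
    return {
        local.name: [
            (remote.prefix, remote.firewall_ip)
            for remote in hubs
            if remote.name != local.name
        ]
        for local in hubs
    }


HUBS = [
    Hub("hub-westeurope", "10.1.0.0/16", "10.1.0.4"),
    Hub("hub-eastus", "10.2.0.0/16", "10.2.0.4"),
]

if __name__ == "__main__":
    for hub, routes in firewall_subnet_udrs(HUBS).items():
        print(hub, routes)
```

This only works cleanly because each hub-and-spoke is built from a single IP prefix; new spokes inside that prefix need no new inter-region routes.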

You do that once. Once!

How about connecting the remote sites? Simples: you connect them as usual.

There is some marketing material about how we can use the Microsoft WAN as the company WAN using vWAN. Yes, in theory. The concept is that the Microsoft Global WAN is amazing. You VPN from site A (let’s say Oslo, Norway) to a local Azure region and you VPN from site B (let’s say Houston, Texas) to a local Azure region. Then vWAN automatically enables Oslo <> Houston connectivity over the Microsoft Global Network. Yes, it does. And the performance should be amazing. I did a proof-of-concept in 2 hours with a customer. The performance of VPN directly between Oslo <> Houston was much better. Don’t buy the hype! Question it and test. And by the way, we can build this with VNets too – I was told by an MS partner that they did this solution between two sites on different continents years before vWAN existed.

SD-WAN

Microsoft suggests that you can only add Azure networks to an SD-WAN if you use Virtual WAN.

Here’s some truth. Under the covers, a vWAN hub is built on a traditional Virtual Network. Then you can use (don’t) a VPN Gateway or a third-party SD-WAN appliance for connectivity.

The list of partners supporting vWAN was greatly increased recently – I remember looking for Meraki support a few months ago, and it was not there (it is now). But guess what, I bet you that every one of those partners offers the exact same solution for Virtual Networks via the Marketplace. And I bet:

  • There are more partner options
  • There are no trade-offs
  • The resilience is just the same

I have done Azure/Meraki SD-WAN twice since the above customer X. In both cases, we went with the Azure Marketplace and Virtual Network. And in both cases, it was:

  • Dead simple to set up.
  • It worked the first time.

Transitive Routing

Virtual WAN is powered by a feature that is hidden unless you do an ARM export. That feature is a routing service that is quite similar (not exactly identical) to Azure Route Server. Did you know:

  • You can deploy Azure Route Server to a Virtual Network. The deployment is a next-next-next.
  • It can be easily BGP peered with a third-party networking appliance, including HA services – for example, HA Meraki gets seamless failover using AS PATH when coupled with Azure Route Server.
  • The Azure Route Server will learn remote site prefixes from the networking appliance/SD-WAN.
  • The Azure Route Server will advertise routes to the networking appliance/SD-WAN.

Azure Route Server BGP propagation is managed using the same VNet peering settings as Virtual Network Gateway.

There is a single checkbox (true/false property) to enable transitive routing between VPN/ExpressRoute remote sites. And that setting is amazing.

I signed in to work one day and was asked a question. I had built out the environment for a large customer with an HQ in Oslo:

  • Remote sites around the world with a Meraki SD-WAN.
  • Leased line to Oracle Cloud – the global sites backhauled through Oslo.
  • The VNet-based hub in Azure was added to the SD-WAN. All offices were connected directly to Azure via VPN.
  • Azure Route Server was added and peered to the Meraki SD-WAN.
  • Azure had an ExpressRoute connection (Oracle Cloud Interconnect) to Oracle Cloud.

An excavator had torn up the leased line to Oracle. The essential services in Oracle Cloud were unavailable. I was asked if the Azure connection to Oracle Cloud could be leveraged to get the business back online. I thought for 30 seconds and said, “Yes, give me 5 minutes”. Here’s what I did:

  1. I checked the box to enable transitive routing in Azure Route Server.
  2. I clicked Save/Apply and waited a few minutes for the update task.
  3. I asked the client to test.

And guess what? Contrary to the above CAF text, the client was back online. A few weeks later, I was told that not only did they get back online, but the SD-WAN connection to the VIRTUAL NETWORK-BASED hub in Azure gave the global branch offices lower latency connections than their backhaul through Oslo to Oracle Cloud. Whoda-thunk-it?

vWAN is PaaS

One of the arguments for the vWAN hub is that it pushes complexity down into the platform; it’s a PaaS sub-resource.

Yes, it’s a PaaS sub-resource. Is a well-designed hub complex? A hub should contain very few resources, based around:

  • Remote connectivity resource
  • Firewall
  • Maybe Azure Bastion

There’s not much more to a hub than that if you value security. What exactly am I saving with the more-expensive vWAN?

Limitations of vWAN

Let’s start with performance. A hub in Virtual WAN has a throughput limitation of 50 Gbps. I thought that was a theoretical limit … until I did a network review for a client a few years ago. They had a single workload that pushed 29 Gbps through the hub, 1 Gbps shy of the limit for a Standard tier Azure Firewall. I recommended an increase to the 100 Gbps Premium tier, but warned that the bottleneck was always going to be the vWAN hub.

The architectural limitations of vWAN are many – so many that I will miss some:

  • No VNet Flow Logs
  • Impossible to troubleshoot routing/connectivity in a real way
  • No support for Azure Bastion in the hub
  • No support for NAT Gateway for firewall egress traffic (SNAT port exhaustion)
  • Secured traffic between different secured (firewall) hubs requires Routing Intent
  • No Forced Tunnelling in Azure Firewall without Routing Intent
  • Routing Intent is overly simplistic – everything goes through the firewall
  • No support for IP Prefix for the firewall
  • Azure Firewall cannot use Route Server Integration (auto-configuration of non-RFC1918 usage in private networks)
  • Hub Route Tables are a complexity nightmare

Impossible Solution

Anyone who has deployed more than a couple of Azure networks has heard the following statement made regarding failing connections over site-to-site networking:

The Azure network is broken

A new site-to-site appliance or firewall has been placed in Azure, and the root cause of the issue is “never the remote network”.

Proving that the issue isn’t the firewall can be tricky. That’s because firewall appliances are black boxes. I updated my standard hub design last year to assist with this:

  • Add a subnet with identical routing configuration (BGP propagation and user-defined routes) as the (private) firewall subnet.
  • Add a low-spec B-series VM to this subnet with an autoshutdown. This VM is used only for diagnostics.

The design allows an Azure admin to log into the VM. The VM mimics the connectivity of the firewall and allows tests to be done against failing connections. If the test fails from the VM, it proves that the firewall is not at fault.
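
The kind of test run from the diagnostics VM can be as simple as a TCP connection attempt. The sketch below uses only Python's standard socket module; the function name and the result strings are assumptions for illustration, and any tool that can open a TCP connection would do the same job.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and report 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # something answered with a reset: the path works
    except socket.timeout:
        return "timeout"   # packets are probably being dropped somewhere on the path
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Hypothetical target: RDP on a server in the remote site
    print(tcp_probe("192.0.2.10", 3389))
```

A "refused" result is still useful evidence: it means packets reached the far end and came back, so routing on the Azure side is not the problem.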

No other compute resources are placed in the hub.

Here’s the gotcha. I can do this in a VNet hub. I cannot do this in a vWAN hub. The vWAN hub Virtual Network is in a Microsoft-managed tenant/subscription. You have no access, and you cannot troubleshoot it. You are entirely at the mercy of Azure support – and, sadly, we know how that process will go.

Virtual WAN In Summary

You do not need Virtual WAN for connectivity or SD-WAN. So why would one adopt it instead of VNet-based hubs, especially when you consider costs and the loss of functionality? I just do not understand (a) why Microsoft continues to push Virtual WAN and (b) why it continues to exist.

Tracing Packets in Azure Networks

While I’m on the topic of troubleshooting, I thought that I would add some tips on how to trace packets in Microsoft Azure.

The Problem

Here are a few scenarios, in descending order from most common, that I’ve been through over the years:

Remote Desktop To New VMs

You’ve just established a new site-to-site connection between a remote location and a (probably) new Azure network. The remote site admin complains:

No packets are getting through. Your Azure network is broken.

You know that everything in Azure is in good order, and you’re pretty sure the remote site firewall is blocking the traffic. However, many systems administrators jump to “the new Azure network is broken” when something doesn’t work – even if they configured their firewall to block that damned RDP traffic (it’s nearly always RDP in the first tests!).

Connecting to PaaS Services

You’ve deployed some PaaS services in Azure. Something can’t connect to them. The client might be in the same Virtual Network. Maybe it’s in another spoke that must route through a hub? Or maybe the client is in a remote site? The developer or operator is going to say:

We’re getting timeouts when we connect. Your network is broken.

So many things could be wrong here.

SSL Goes Wrong

SSL/PKI feels to me like a dinosaur technology that:

  • Most never learn
  • Few who did learn it ever completely mastered it (I’m here, I’d estimate)
  • Those of us who did learn it have forgotten most of it

And modern application/network security is built on this deck of cards (from a knowledge perspective). I’ve seen a few scenarios:

My app gets a weird response from the database when it attempts to connect.

That one’s probably because something is reverse proxying the connection and something is going wrong in the connection – see Application Rules NATing east-west connections in Azure Firewall, causing the client IP to change.

How about this one I saw recently:

My application is failing to connect to a remote server.

When I dug into it, I saw that the TLS handshake was failing and the TCP connection was cleanly terminated. A self-signed certificate was to blame. Other scenarios I’ve seen are where Linux-based appliances fail the same handshake because the server cert doesn’t contain the full certificate chain. Tip: Windows LIES to you when it shows the whole chain, which it self-builds from the trusted publishers store on the machine. Most appliances require the full chain in the cert, which many online CAs do not provide by default. You’d be amazed how many weeks are wasted and how many repeated discussions are had because of this.

But how have I proven this?

Complex Routing

Not everyone builds a simple hub-and-spoke. Sometimes there is a need for complexity. I had one of these a few years ago, where a customer required an ExpressRoute connection to a third-party data provider. The data provider mandated:

  • An ExpressRoute connection
  • The use of SNAT

The ExpressRoute Gateway doesn’t offer SNAT (unlike the VPN Gateway), so I had to conjure an interesting design. Luckily, I know Azure routing pretty well, and I tested this design in a lab. I was sure it would work – it did. But what if something went wrong? I would have had to troubleshoot what was happening.

The Need

What we need is:

  1. The ability to prove that packets are routing, to confirm that the infrastructure is doing its job.
  2. The ability to check how a PaaS resource has responded to connections.
  3. The ability to see inside those packets to investigate application-layer issues.

Folks, most of this is basic logging/querying. But there are a few tricks.

Packet Travel

I want to confirm that a packet reached A, then went to B, then got to the destination. For example:

  1. A packet from a remote client entered the hub and went through the firewall.
  2. It then routed across peering – GAH! More on this in a moment.
  3. And the packet routed through the destination spoke Virtual Network to reach the destination server – double GAH!

Before we proceed, I literally get session audiences to repeat each of the following 3 times to reinforce some basic knowledge:

Virtual Networks do not exist.

Subnets do not exist

Peering does not exist

Packets go directly from the source to the destination

This is why tracert is useless in Azure.

Understanding the above is halfway to mastering Azure networking. Please read this post before asking me questions or attempting to debate me on this topic of existence.

By the way, if you are using Azure Firewall, then the PowerShell Test-NetConnection cmdlet is useful only to generate logs. The result may not be the actual result. Azure Firewall feeds “200” results from application rules, even when denying traffic. I always advise: generate the traffic and then check the logs.

Back to the topic …

The basic tool we need is a log of a packet or flow (a series of packets in a “conversation” between the client and server). Fortunately, we have a few sources of those.

The first is your firewall. Azure Firewall’s diagnostics settings send logs to your preferred destination. I prefer Log Analytics. You might prefer Splunk or similar. Potatoe Potahtoh. In Azure Firewall, the “decision making logs” include:

  • Threat Intelligence (an under-appreciated and oh-so useful feature)
  • IDPS
  • Network rules
  • Application rules

Log Analytics has a built-in query to search all those logs in a union. I can search for any combination of source IP, source port (not typically useful), protocol, destination IP, and destination port (very useful).
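If you would rather build that search yourself, here is a minimal sketch in Python that assembles the KQL text. The table names (AZFWThreatIntel, AZFWIdpsSignature, AZFWNetworkRule, AZFWApplicationRule) and the column names (SourceIp, DestinationIp, DestinationPort, Protocol) are my assumptions based on the resource-specific log schema; check them against your own workspace. Note that not every table necessarily has every column, so a filter on a missing column will drop that table's rows from the results.

```python
from typing import Optional

# Assumed resource-specific Azure Firewall tables; check the names in your workspace.
FIREWALL_TABLES = [
    "AZFWThreatIntel",
    "AZFWIdpsSignature",
    "AZFWNetworkRule",
    "AZFWApplicationRule",
]


def build_firewall_query(
    source_ip: Optional[str] = None,
    destination_ip: Optional[str] = None,
    destination_port: Optional[int] = None,
    protocol: Optional[str] = None,
    hours: int = 1,
) -> str:
    """Build a KQL query that unions the decision-making logs and filters them."""
    lines = [
        "union withsource=LogTable " + ", ".join(FIREWALL_TABLES),
        f"| where TimeGenerated > ago({hours}h)",
    ]
    # Only add the filters that were supplied; the column names are assumptions.
    if source_ip:
        lines.append(f'| where SourceIp == "{source_ip}"')
    if destination_ip:
        lines.append(f'| where DestinationIp == "{destination_ip}"')
    if destination_port is not None:
        lines.append(f"| where DestinationPort == {destination_port}")
    if protocol:
        lines.append(f'| where Protocol =~ "{protocol}"')
    lines.append("| sort by TimeGenerated desc")
    return "\n".join(lines)


# Example: who tried RDP from 10.1.0.4 (a made-up address) in the last hour?
print(build_firewall_query(source_ip="10.1.0.4", destination_port=3389))
```

Paste the printed text into the Logs blade of the workspace. The LogTable column tells you which of the four logs made the decision.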

A third-party firewall has similar logs, often locked away in the precious grip of the firewall administrator. Sorry, I’m binge-watching Lord of the Rings, and I couldn’t help myself, firewall admins 🙂 Some firewalls can make those logs more available to other Azure operators. For example, the Palo Alto Cloud NGFW has the ability to route logs, via Application Insights, into Log Analytics, where queries, dashboards, and workbooks can share that data. Nice!

The firewall logs will show me:

  • If packets entered the firewall
  • If those packets were allowed or denied

The simple mention of a flow from a client to a server in the firewall log means that packets made it there:

  • A spoke routed via the firewall to another spoke or a remote site.
  • Packets from a remote site passed successfully over a site-to-site network connection.

The firewall log is often my first port of call. Sometimes, however, it doesn’t go deep enough. There have been a number of times where I’ve been told something along the lines of:

I can ping VM X in Azure, but I cannot make a HTTPS connection to it.

I know from experience that they have made a successful connectionless ping (ICMP). But they have failed to make a connection-oriented (TCP) HTTPS request. The stateful firewall is blocking a response to the connection request because it never saw the original SYN. Thank you to my 3rd-year networking lecturer – I can picture the guy demonstrating a luggable PC to us around 1993, but I don’t remember his name. Experience has taught me that:

  • A route for the spoke network prefix is missing from the GatewaySubnet, and the request is bypassing the firewall.
  • A Private Endpoint has added a /32 route to the GatewaySubnet (see network policies for Private Endpoint), and routing “longest prefix match” has chosen that system route over your User-Defined Route for the spoke prefix.
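The second bullet is easier to see with a worked example. This is a small sketch of longest prefix match in Python, not Azure's actual route selection code. The prefixes, addresses, and next-hop labels are made up for illustration, and it ignores the tie-break rules that apply when two routes have the same prefix length.

```python
import ipaddress


def pick_route(destination, routes):
    """Return the route with the longest prefix that contains the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    if not matches:
        return None
    # The most specific (longest) prefix wins, whatever the route source is.
    return max(matches, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)


# Hypothetical effective routes on the GatewaySubnet.
gateway_subnet_routes = [
    {"prefix": "10.1.0.0/16", "next_hop": "Firewall (User-Defined Route)"},
    {"prefix": "10.1.2.4/32", "next_hop": "Private Endpoint (system route)"},
]

for target in ("10.1.2.4", "10.1.2.5"):
    route = pick_route(target, gateway_subnet_routes)
    print(target, "->", route["next_hop"])
```

Traffic to the Private Endpoint address (10.1.2.4) matches both routes, and the /32 wins, so it skips the firewall. Traffic to its neighbour (10.1.2.5) only matches the /16 and goes through the firewall as intended.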

For these crazy situations, you need to dig a little deeper into the firewall logs. I cannot speak for third-party firewalls here. Azure Firewall doesn’t capture dropped connections such as these. For that deep dive, we need Flow Trace logs to be enabled. Note that:

  • Enabling the logs does not enable the feature; this must be enabled using PowerShell.
  • The logs will be very detailed – and expensive to ingest into your monitoring solution. Only leave this feature enabled while troubleshooting the issue – set a calendar entry to unset it.

I haven’t had the opportunity to use this one in the real world personally, but I wonder if I sent those JSON logs to blob storage, could I download them to Copilot and get a reasonable response to my queries? Note to self.

Did the packet traverse a Virtual Network? Now you should know that’s a dumb question. The Azure fabric takes packets from source NICs and drops them into destination NICs. The correct question is: Did a packet reach the destination NIC?

The correct solution to answer that question today is Virtual Network Flow Logs with Traffic Analytics.

The wrong answer is the deprecated NSG Flow Logs. Virtual Network Flow Logs are current and capture much better data, including Private Endpoints.

Flow Logs will tell me about:

  • Outbound flows – did a packet leave a client?
  • Inbound flows – did a packet reach a server?
  • NSG Rules – what rule allowed/denied a connection?

Now I know if a connection:

  • Left an Azure client
  • Reached an Azure server
  • Was allowed or denied by an NSG rule
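If you ever open the raw blobs instead of waiting for Traffic Analytics, each flow is recorded as a comma-separated tuple. Here is a minimal sketch of a parser, assuming the 13-field layout as I understand it from the VNet Flow Logs documentation (timestamp, source IP, destination IP, source port, destination port, protocol, direction, state, encryption, then four counters). Treat the field order and the codes as assumptions and verify them against a real log before relying on this.

```python
from dataclasses import dataclass

# Assumed code tables for the flow tuple fields.
PROTOCOLS = {"6": "TCP", "17": "UDP"}
DIRECTIONS = {"I": "inbound", "O": "outbound"}
STATES = {"B": "begin", "C": "continuing", "E": "end", "D": "denied"}


@dataclass
class Flow:
    timestamp_ms: int
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    direction: str
    state: str
    encryption: str
    packets_sent: int
    bytes_sent: int
    packets_received: int
    bytes_received: int


def parse_flow_tuple(raw: str) -> Flow:
    """Split one flow tuple string into named fields."""
    parts = raw.split(",")
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}")
    return Flow(
        timestamp_ms=int(parts[0]),
        src_ip=parts[1],
        dst_ip=parts[2],
        src_port=int(parts[3]),
        dst_port=int(parts[4]),
        protocol=PROTOCOLS.get(parts[5], parts[5]),
        direction=DIRECTIONS.get(parts[6], parts[6]),
        state=STATES.get(parts[7], parts[7]),
        encryption=parts[8],
        packets_sent=int(parts[9]),
        bytes_sent=int(parts[10]),
        packets_received=int(parts[11]),
        bytes_received=int(parts[12]),
    )


# A made-up tuple: an inbound HTTPS attempt that was denied.
flow = parse_flow_tuple("1663146003599,10.0.0.6,10.0.0.7,50000,443,6,I,D,NX,0,0,0,0")
print(flow.direction, flow.protocol, flow.dst_port, flow.state)
```

An inbound tuple at the server answers "did a packet reach the destination NIC?", and the state field answers "was it allowed or denied?".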

Flow Logs take time to generate:

  1. The logs will take 30+ seconds to be written to blob storage. Honestly, I’ve seen this take longer during the pandemic. I think MSFT might throttle monitoring when CPU usage is in high demand.
  2. Traffic Analytics is configured to run every 10 or 60 minutes. I prefer the 10-minute option.
  3. Log Analytics will take time to process the data. I was told many years ago to allow up to 15 minutes for NSG Flow Logs to be processed.

Between the firewall logs and the Virtual Network Flow Logs, I have visibility of the traffic. Or some of it.

PaaS Resources

A PaaS resource may be deployed with:

  • Public endpoint: Firewall or Virtual Network Flow Logs will show my traffic leaving my network, but not the last mile.
  • Private Endpoint: Private Endpoint NICs fool us, because the packet is sent directly by the fabric from the client NIC to the NIC of the machine hosting the PaaS resource instance. Virtual Network Flow Logs show us “connectability” but not the full connection.
  • VNet Injection and VNet Integration: The PaaS resources don’t really live in our Virtual Network. I know that it’s confusing.

Let me give you a working example. You have an App Service with VNet Integration that is attempting to talk to a Key Vault with a Private Endpoint. We can see the flows in the previously discussed logs. But are the packets really getting to the Key Vault? What happens when the App Service attempts to access a secret?

The only answer to this is to enable the diagnostics settings in the Key Vault. Querying those logs in Log Analytics, Splunk, etc, will tell you exactly what’s going on:

  • Was there a connection?
  • Was the connection successful?
  • Why did the connection fail?
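Those three questions can be answered mechanically once you have the log records. Below is a rough sketch in Python that reads exported Key Vault log records as dictionaries. The field names (operationName, callerIpAddress, properties.httpStatusCode) are my assumption of the raw log schema, and the explanations attached to each status code are my own generalisations, so verify both against your own logs.

```python
def classify_key_vault_event(event: dict) -> str:
    """Turn one Key Vault log record into a one-line verdict."""
    operation = event.get("operationName", "unknown operation")
    caller = event.get("callerIpAddress", "unknown caller")
    status = event.get("properties", {}).get("httpStatusCode")
    if status is None:
        verdict = "no HTTP status recorded"
    elif 200 <= status < 300:
        verdict = "succeeded"
    elif status == 401:
        verdict = "failed: the caller was not authenticated"
    elif status == 403:
        verdict = "failed: the caller was refused (permissions or Key Vault firewall)"
    elif status == 404:
        verdict = "failed: the object was not found"
    else:
        verdict = f"failed with HTTP {status}"
    return f"{operation} from {caller} {verdict}"


def explain(events: list) -> list:
    """Answer: was there a connection, did it succeed, and why not?"""
    if not events:
        # Nothing logged suggests the request never reached the Key Vault.
        return ["No events logged: the request did not reach the Key Vault"]
    return [classify_key_vault_event(e) for e in events]


# Made-up records for illustration.
sample = [
    {"operationName": "SecretGet", "callerIpAddress": "10.1.3.4",
     "properties": {"httpStatusCode": 403}},
    {"operationName": "SecretGet", "callerIpAddress": "10.1.3.4",
     "properties": {"httpStatusCode": 200}},
]
for line in explain(sample):
    print(line)
```

The empty case matters as much as the failures: if the flow logs show traffic but the Key Vault logged nothing, the problem is between the two.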

Packet Capture

Don’t get scared! I promise that packet capture is easier than ever now. I’ll explain later.

The results of a packet capture show you the contents (as much as encryption allows) of packets in a flow between a client and a server. This is super useful for investigating further. Let me explain two scenarios:

In the first scenario, we have proven that packets get from A to B, but the customer/developer/operator doesn’t accept that because their application is failing. If we know the packets get from client to server, then we know the error is further up the stack – it’s an application configuration or authoring issue. The only way to prove that to the other person is to show them the actual packets.

Network Watcher provides a feature called Packet Capture. The only place you need the free Wireshark client is on your PC to open the capture. Network Watcher will automatically add an Azure extension (agent) to the client/server VM, based on your Azure rights over that VM. You can capture all or filtered packets and save the resulting .CAP file to blob storage. Unfortunately, this ability is limited to VMs.

The second scenario is where we have a remote admin complaining about their failing RDP connection (it’s always this) over site-to-site networking. You’ve proven the traffic doesn’t reach the firewall/Azure VM. You know their firewall is blocking the outbound connection, but they won’t accept that. You have to prove that the traffic never crossed the site-to-site connection. You can enable packet capture on a VPN Virtual Network Gateway or a Virtual WAN VPN Gateway. This will ultimately prove that packets never got across the tunnel, and the remote admin must face the mirror.

Back to the scary part about packet captures. Who the heck can read those things? Not many of us can. I understand some basics, such as control flags like SYN, SYN-ACK, and RST. But what would I do if I had to really understand a packet capture? Enter Copilot or another AI:

  1. In Wireshark, click File > Export Packet Dissections > As JSON and select “Packet Range: All packets” and “Packet Format: JSON”. JSON is nice for AI to parse.
  2. Upload the capture to your AI and ask it your questions.

You’ll get an answer that you can work with. I used this recently for an application issue to help a (good guy) developer get to the root cause of an issue.
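If you want a quick sanity check before involving an AI, the same JSON export can be summarised with a few lines of Python. This is a sketch that assumes the export is a list of packets with the TCP layer under `_source` > `layers` > `tcp` and the flags stored as a hex string in `tcp.flags`; that matches the exports I would expect from Wireshark, but check one packet in your own file first.

```python
import json

# TCP control flag bits.
SYN, RST, ACK = 0x02, 0x04, 0x10


def summarise_tcp_flags(packets):
    """Count connection attempts, replies, and resets in exported packets."""
    summary = {"syn": 0, "syn_ack": 0, "rst": 0}
    for packet in packets:
        tcp = packet.get("_source", {}).get("layers", {}).get("tcp")
        if not tcp:
            continue  # not a TCP packet
        flags = int(tcp.get("tcp.flags", "0x0"), 16)
        if flags & RST:
            summary["rst"] += 1
        if flags & SYN:
            if flags & ACK:
                summary["syn_ack"] += 1
            else:
                summary["syn"] += 1
    return summary


# With a real export: packets = json.load(open("capture.json"))
# Here, two made-up packets: a SYN that is answered by a RST+ACK.
packets = json.loads("""[
  {"_source": {"layers": {"tcp": {"tcp.flags": "0x0002"}}}},
  {"_source": {"layers": {"tcp": {"tcp.flags": "0x0014"}}}}
]""")
print(summarise_tcp_flags(packets))
```

As a rule of thumb, SYNs with no SYN-ACKs suggest the request or the reply is being lost somewhere, while RSTs suggest something actively refused or tore down the connection.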

By the way:

  • ExpressRoute does not offer packet capture, but Traffic Collector provides a Flow Log experience.
  • Azure Firewall with a Management NIC (recommended by me for the last 1.5+ years) has packet capture.

Some Other Tricks

Network Watcher can be useful for doing some basic diagnostics:

  • IP flow verify: Checks whether a specific traffic flow would be allowed or denied by NSG rules.
  • NSG diagnostics: Analyse NSG rules across hops to identify which rule permits or blocks traffic.
  • Next hop: Identifies the next routing hop a packet will take from a selected VM.
  • Effective security rules: Displays the combined, active security rules applied to a network interface after all NSGs are merged.
  • VPN troubleshoot: Diagnoses issues with Azure VPN gateways and site-to-site or point-to-site tunnels.
  • Packet capture: Captures packet data directly from a VM’s network stack for deep traffic analysis.
  • Connection troubleshoot: Tests end-to-end connectivity between a VM and a target to identify routing or NSG issues.

Connection Troubleshoot is especially nice:

We can send a bunch of probe packets from a source to a destination and see if the connection was successful. If not, the tool gives you some indication why – keep in mind that remote destinations will result in vague failure reasoning because Azure doesn’t control remote locations.
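To illustrate the idea of probing (and only the idea, because this is not how Network Watcher implements it), here is a minimal Python sketch that attempts a few TCP connections and records whether each one succeeded and how long it took. The host and port in the example are placeholders.

```python
import socket
import time


def tcp_probe(host, port, attempts=3, timeout=2.0):
    """Try to open a TCP connection several times and time each attempt."""
    results = []
    for _ in range(attempts):
        started = time.perf_counter()
        try:
            # A completed three-way handshake counts as success.
            with socket.create_connection((host, port), timeout=timeout):
                elapsed_ms = (time.perf_counter() - started) * 1000
                results.append({"ok": True, "latency_ms": elapsed_ms, "error": None})
        except OSError as error:
            # Timeouts and refusals both land here; keep the reason.
            results.append({"ok": False, "latency_ms": None, "error": str(error)})
    return results


if __name__ == "__main__":
    # Placeholder target: replace with the server and port you are testing.
    for result in tcp_probe("10.1.2.4", 443, attempts=3, timeout=2.0):
        print(result)
```

Remember the earlier warning: with Azure Firewall application rules in the path, a successful probe does not prove the traffic was allowed. Generate the traffic, then check the logs.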

The sources can be:

  • VMs and VM Scale Sets: Using the Network Watcher extension.
  • Application Gateway: Great for figuring out those pesky backend health issues and proving that the CA-provided cert (lacking the complete trust chain) is the cause of the failure.
  • Bastion Host: Bastion-to-VM connections can be a head-wrecker.

If there is a connection that is working but you consider to be critical, then I recommend using Connection Monitor in Network Watcher:

  • Works with any mix of Arc agents (non-Azure VMs) and Azure VMs – consider those remote site connections!
  • Model application connections.
  • Tests success and speed (latency).
  • Can trigger an alert/Action Group.

I used this a few years ago for a SaaS company that was using Proximity Placement Groups as a part of their need to minimise latency. I wanted proof of the platform performance, just in case. My colleague who wrote the Terraform for modelling the application in Connection Monitor probably didn’t like me for requiring this 😉 I started seeing alerts one day, so I let the customer know that I was opening a support ticket with Microsoft. We found out that there was a physical issue with one network appliance, and Microsoft fixed it. Wow – not only were we monitoring our infrastructure and the application’s networking, but we were monitoring Azure’s physical network too!

Last Tool

The last tool is you, not Copilot. Honestly, Azure Copilot is not good at this stuff. I’ve tested it in my build labs, and it hasn’t a clue (thankfully for us IT pros). You need a combination of:

  • Experience: What’s most common?
  • Intuition: Listen to the customer – did they just mention a cert, for example?
  • Knowledge: Understanding how Azure networks function is critical – did you know that not setting the network policies for Private Endpoint in your subnet causes asymmetric routing through the firewall?

Using your tools will better prepare you to use the above Azure tools.

If You Liked This …

Maybe you liked this post and are wondering: “Could Aidan help me?” Maybe I can through my company, Cloud Mechanix. Whether you need a review, design something, figure out some issue, do a large deployment, or figure out why the cloud is not working for your organisation, I can help – and other things too. Cloud Mechanix works with large and small organisations and service providers throughout Europe. Check out the site, and contact me if you are interested.

Interpretation of The Azure Cloud Adoption Framework

In this post, I will explain how I have interpreted the Cloud Adoption Framework for Microsoft Azure and how I apply it with my company, Cloud Mechanix.

Taking Theory Into Practice

In my last post, I explained two things:

  1. The value of the Cloud Adoption Framework (CAF)
  2. It is never too late to apply the CAF

I strongly believe in the value of the CAF, mostly because:

  • I’ve seen what happens when an organisation rushes into an IT-driven cloud migration project.
  • The CAF provides a process to avoid the issues caused by that rush.

The CAF does have an issue – it is not opinionated. The CAF has lots of discussion, but can be light on direction. That’s why I have slightly tweaked the CAF to:

  • Take into account what I believe an organisation should do.
  • Include the deliverables of each phase.
  • Indicate the dependencies and flow between the phases.
  • Highlight where there will be continuous improvement after the adoption project is complete.

The Cloud Mechanix CAF

Here is a diagram of the Cloud Mechanix version of the Azure Cloud Adoption Framework:

Cloud Mechanix Azure Cloud Adoption Framework

There are two methodologies:

  • Foundational
  • Operational

Foundational Methodology

There are four phases in the Foundational Methodology:

  • Strategy
  • Plan
  • Ready
  • Adopt

Strategy

The Strategy phase is the key to making the necessary changes in the organisation. When an IT (infrastructure) manager starts a migration project:

  • They have little to no knowledge of the organisation-wide needs of IT services.
  • No influence outside their department – particularly with other departments/divisions/teams – to make changes.
  • Possibly have little interest in any process/organisational/tool changes to how IT services are delivered.

The process will run sequentially as follows:

| Task | Description | Deliverable |
|---|---|---|
| Define Strategy Team | Select the members who will participate in this phase. They should know the organisational needs/strategy. They must have authority to speak for the organisation. | A team that will review and publish the Cloud Strategy. |
| Determine Motivations, Mission, and Objectives | Identify and rank the organisation’s reasons to adopt the cloud. Create a mission statement to summarise the project. Define objectives to accomplish the mission statement/motivations and assign “definitions of success”. | Ranked motivations. A mission statement. Objectives with KPIs. |
| Assess Cloud Adoption Strategy | Review the existing cloud adoption strategy, if one exists. | A review of the cloud strategy, contrasting it with the identified motivations, mission statement, and objectives. |
| Write Cloud Strategy | A cloud strategy document will be created using the gathered information. This will record the information and provide a high-level plan, with timelines for the rest of the cloud adoption project. | A non-technical document that can be read and understood by members of the organisation. |
| Inform Strategy | The Cloud Strategy will be published. A clear communication from the Strategy Team will inform all staff of the mission statement and objectives, authorising the necessary changes. | A clear communication that will be understood by all staff. Note that the steps to produce and publish this strategy will be repeated on a regular basis to keep the cloud strategy up-to-date. |
| Assemble Operations Teams | The leadership of the Operational Framework tracks will be selected and authorised to perform their project duties. | The team leaders will initiate their tracks, based on instructions from the Cloud Strategy. |

The Cloud Strategy is the primary parameter for the tracks in the Operational Framework and the Plan phase of the Foundational Framework.

Plan

The Plan phase is primarily focused on designing the organisational changes to how holistic IT services (not just IT infrastructure) are delivered.

| Task | Description | Deliverable |
|---|---|---|
| Azure Foundational Training | The entry level of Azure training should be delivered to any staff participating in the Plan/Ready phases of the project. | The AZ-900 equivalent of knowledge should be learned by the staff members. |
| Plan Migration | An assessment of workloads should begin for any workloads that are candidates for migration to the cloud. This is optional, depending on the Cloud Strategy. | A detailed migration plan for each workload. |
| Define Operating Model | Define the new way that IT services (not just infrastructure) will be delivered. | An authorised plan for how IT services will be delivered in Azure. The operating model will be a parameter for the Design task in the Govern/Secure/Manage tracks in the Operational Methodology. |
| Cloud Centre of Excellence | A “special forces” team will be created to be the early adopters of Azure. They will be the first learners/users and will empower/teach other users over time. | A list of cross-functional IT staff with the necessary roles to deliver the operational model. |
| Process, Tools, People, and Skills | The processes for delivering the new operational model will be defined. The tools that will be used for the operational model will be tested, selected, and acquired. People will be identified for roles and reorganised (actually or virtually) as required. Skills gaps will be identified and resolved through training/acquisition. | The necessary changes to deliver the operational model will be planned and documented. Skills will be put in place to deliver the operational model. |
| Document Adoption Plan | A plan will be created to: (1) deploy the new tools, (2) build platform landing zones, and (3) prepare for Adopt. | An adoption plan is created and published to the agreed scope. |

The Adoption Plan will be the primary parameter for the Ready phase.

Ready

The purpose of Ready is to:

  1. Get the tooling in place.
  2. Prepare the platform landing zones to enable application landing zones.

There is a co-dependency between Ready and the Operational Methodology. The Operational Methodology will:

  • Require the tooling to deploy the governance, security and management features, especially if an infrastructure-as-code approach will be used.
  • Provide the governance, security, and management systems that will be required for the platform landing zones.

This means that there is a required ordering:

  1. Govern, Secure, and Manage must design their features.
  2. Ready must prepare the tooling.
  3. Govern, Secure, and Manage will deploy their features.
  4. Ready can continue.

| Task | Description | Deliverable |
|---|---|---|
| Deploy Process & Tools | The tools and processes for the operating model will be deployed and made ready. | This is required to enable Govern, Secure, and Manage to deploy their features. |
| Deploy Platform Landing Zones | Landing zones for features such as hubs, domain controllers, DNS, shared Web Application Firewalls, and so on, will be deployed. | The infrastructure features that are required by application landing zones will be prepared. |
| Operate Platform Landing Zones | Each platform landing zone is operated in accordance with the Well-Architected Framework. | Continuous improvement for performance, reliability, cost, management, and functionality. |

The platform landing zones are a technical delivery parameter for the Adopt phase.

Adopt

The nature of Adopt will be shaped by the cloud strategy. For example, an organisation might choose to do a simple migration because of a technical motivation. Another organisation might decide to build new applications in The Cloud, while keeping old ones in on-premises hosting. Another might choose to focus entirely on market disruption by innovating new services. No one strategy is right, and a blend may be used. All of this is dictated by the mission statement and objectives that are defined during Strategy.

| Task | Description | Deliverable |
|---|---|---|
| Migrate | A structured process will migrate the applications based on the migration plan generated during Plan. | An application landing zone for each migrated application. |
| Modernise | Applications are rearchitected/rebuilt based on the migration plan generated during Plan. | An application landing zone for each migrated application. |
| Build | New applications are built in Azure. | An application landing zone is created for each workload. |
| Innovate | New services to disrupt the market are researched, developed, and put into production. | An innovation process will eventually generate an application landing zone for each new service. |
| Operate Application Landing Zones | Each application landing zone is operated in accordance with the Well-Architected Framework. | Continuous improvement for performance, reliability, cost, management, and functionality. |

Operational Methodology

The Operational Methodology must not be overlooked; this is because the three tracks, running in parallel with the Foundational Methodology, will perform necessary functions to design and continuously operate/improve systems to protect the organisation.

The three tracks, each with identical tasks, are:

  • Govern: Build, maintain, and improve governance systems.
  • Secure: Build, maintain, and improve security systems.
  • Manage: Build, maintain, and improve systems guidelines and management systems.

This approach assigns ownership of the Well-Architected Framework pillars to the three tracks.

  • Govern: Cost optimisation
  • Secure: Security
  • Manage: Reliability, operational excellence, and performance efficiency

Each track has a separate team with:

  • A leader
  • Stakeholders
  • Architect
  • Implementors

Each is a separate track, but there is much crossover. For example, Azure Policy is perceived as a governance solution. However, Azure Policy might be used:

  • By Govern to apply compliance requirements.
  • By Secure to harden the Azure resources.
  • By Manage to automate desired systems configurations.

The inheritance model for Azure Policy is Management Groups, so all three tracks will need to collaborate to design a governance architecture. For this reason, the architect should reside in each team. The implementors may also be common.

| Task | Description | Deliverable |
|---|---|---|
| Assess | Perform an assessment of the current/future requirements and risks. | A risk assessment with a statement of measurable objectives. |
| Author Policy | A new policy is written, or an existing policy is updated to enforce the objectives from the assessment. | A policy document is written and published. |
| Design | A solution to implement the policy is designed. The goal is to automate as much of the policy as possible. Remaining exceptions should be clearly documented and communicated with guidelines. | High-level and low-level design documentation for the technical implementation. Clearly written and communicated guidelines for other requirements. |
| Deploy | This depends on Deploy Process & Tools from Ready. Deploy the technical solution. | The technical Azure (platform landing zones) and any third-party resources are deployed to implement governance, security, and management based on the published policies. |
| Operate | The systems are run and maintained. | Continuous improvement for performance, reliability, cost, management, and functionality. The Deploy Platform Landing Zone(s) in Ready can proceed. |

Note that Govern, Secure and Manage should never finish. They should deliver a minimum viable product (MVP) to quickly enable Ready with a baseline of governance, security, and management best practices, as defined by the organisation. A regular review process will assess the policy versus new risks/requirements/experience. This will start a new cycle of continuous improvement.

This approach should be the method used for continuous risk assessment in IT Security or compliance. If this is true, then the new Azure process can be blended with those processes.

Final Thoughts

The partners of a 3- or 4-letter consulting franchise do not have to get rich from your cloud journey. The Cloud Adoption Framework does not have to be a process that generates tens of thousands of pages of reports that will never be read. The focus of this approach is to:

  1. Enable cloud adoption.
  2. Use a rapid light-touch approach that avoids change friction.

For example, a Cloud Strategy workshop can be completed in 1.5 days. A high-level design for a minimum viable security policy can be discussed in under 1 day. The Cloud Strategy will, and should, evolve. The IT Security policy will evolve with regular (risk) assessments.

If You Like This Approach …

As I stated, this is the approach that I use with Cloud Mechanix. The focus is on results, including speed and correct delivery. This process can be done during the cloud journey, or it can be done afterwards if you realise that the cloud is not working for your organisation. Contact Cloud Mechanix if you would like to learn how I can facilitate your experience of the Cloud Adoption Framework.

Enabling Virtual Network Flow Logs At Scale

In this post, I will explain how you can enable Virtual Network (VNet) Flow Logs at scale using a built-in Azure Policy.

Background

Flow logging plays an essential role in Azure networking by recording every flow (and more):

  • Troubleshooting: Verify that packets get somewhere or pass through an appliance. Check if traffic is allowed by an NSG. And more!
  • Security: Search for threats by pushing the data into a SIEM, like Microsoft Sentinel, and provide a history of connectivity to investigate a penetration.
  • Auditing: Have a history of what happened on the network.

There is a potential performance and cross-charging use that I’ve not dug into yet, by using the throughput data that is recorded.

Many of you might have used NSG Flow Logs. Those are deprecated now with an end-of-life date of September 30, 2027. The replacement is VNet Flow Logs, which records more data and requires less configuration – once per VNet instead of once per NSG.

But there is a catch! Modern, zero-trust, Cloud Adoption Framework-compliant designs use many VNets. Each application/workload gets a landing zone, and a landing zone will include a dedicated VNet for every networked workload, probably deployed as a spoke in a hub-and-spoke architecture. A modest organisation might have 50+ VNets with little free admin hours to do configurations. A large, agile organisation might have an ever-increasing huge collection of VNets and struggle with consistency.

Enter Azure Policy

Some security officers and IT staff resist one of the key traits of a cloud: self-service. They see it as insecure and try to lock it down. All that happens, eventually, is that the business gets ticked off that they didn’t get the cloud, and they take their vengeance out on the security officers and/or IT staff that failed to deliver the agile compute and data platform that the business expected – I’ve seen that happen a few times!

Instead, organisations should use the tools that provide a balance between security/control and self-service. One perfect example of this is Azure Policy, which provides curated guardrails against insecure or non-compliant deployments or configurations. For example, you can ban the association of Public IP Addresses with NICs, which the compute marketing team has foisted on everyone via the default options in a virtual machine deployment.

Using Azure Policy With VNet Flow Logs

Our problem:

We will have some/many VNets that we need to deploy Flow Logging to. We might know some of the VNets, but there are many to configure. We need a consistent deployment. We may also have many VNets being created by other parties, either internal or external to our organisation.

This sounds like a perfect scenario for Azure Policy. And we happen to have a built-in policy to deploy VNet Flow Logging called Configure virtual networks to enforce workspace, storage account and retention interval for Flow logs and Traffic Analytics.

The policy takes 5 mandatory parameters:

  • Virtual Networks Region: A single Azure region that contains the Virtual Networks that will be targeted by this policy.
  • Storage Account: The storage account that will temporarily store the Flow Logs in blob format. It must be in the same region as the VNets.
  • Network Watcher: Network Watcher must be configured in the same region as the VNets.
  • Workspace Resource ID: A Log Analytics Workspace will store the Traffic Analytics data that can be accessed using KQL for queries, visualisations, exported to Microsoft Sentinel, and more.
  • Workspace Region: The workspace can be in any region. The Workspace can be used for other tasks and with other assignment instances of this policy.

What if you have VNets across three regions? Simple:

  1. Deploy 1 central Workspace.
  2. Deploy 3 Storage Accounts, 1 per region.
  3. Assign the policy 3 times, once per region.

You will collect VNet Flow Logs from all VNets. The data will be temporarily stored in region-specific Storage Accounts. Eventually, all the data will reside in a single Log Analytics Workspace, providing you with a single view of all VNet flows.
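To make the three-region plan concrete, here is a minimal sketch of the assignment inputs: one shared Workspace and one Storage Account and Network Watcher per region. The region list, the placeholder IDs, and the parameter keys are all assumptions for illustration; they are not the built-in policy's real parameter names, so check the definition before reusing them.

```python
# Hypothetical regions and placeholder IDs; the parameter keys below are
# illustrative labels, not the built-in policy's actual parameter names.
REGIONS = ["northeurope", "westeurope", "swedencentral"]


def build_assignments(regions, workspace_id, workspace_region):
    """Return one assignment description per region, all sharing one Workspace."""
    assignments = []
    for region in regions:
        assignments.append({
            "assignment_name": f"vnet-flow-logs-{region}",
            "parameters": {
                "vnet_region": region,
                # The Storage Account and Network Watcher must be in the VNet region.
                "storage_account": f"<storage-account-id-in-{region}>",
                "network_watcher": f"<network-watcher-in-{region}>",
                # The Workspace is shared and can be in any region.
                "workspace_resource_id": workspace_id,
                "workspace_region": workspace_region,
            },
        })
    return assignments


plan = build_assignments(REGIONS, "<central-workspace-id>", "northeurope")
for item in plan:
    print(item["assignment_name"], item["parameters"]["storage_account"])
```

The point of the structure is that only the region-bound values change between assignments, which keeps the three assignments consistent.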

Customisation

It took a little troubleshooting to get this working. The first element was to configure remediation identity during the assignment. Using the GUID of the identity, I was able to grant permanent reader rights to a Management Group that contained all the subscriptions with VNets.

Troubleshooting was conducted using the Activity Log in various subscriptions, and the JSON logs were dumped into regular Copilot to facilitate quick interpretation. ChatGPT or another assistant would probably do as good a job.

The next issue was the Traffic Analytics collection interval. In a manual/coded deployment, one can set it to every 10 or 60 minutes. I prefer the 10-minute option for quicker access (it’s still up to 25 minutes of latency). The parameter for this setting is optional. When I enabled that parameter in the assignment, the save went into a permanent (commonly reported) verifying action without saving the change. My solution was to create a copy of the policy and to change the default option of the parameter from 60 to 10. Job done!
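The "copy the policy and change the default" workaround can be sketched as a small transformation of the definition JSON. The definition here is a cut-down stand-in and the parameter name timeInterval is an assumption; export the real built-in definition and use whatever the interval parameter is actually called.

```python
import copy

# Cut-down stand-in for an exported policy definition. The parameter name
# "timeInterval" and its allowed values are assumptions for illustration.
builtin_definition = {
    "displayName": "Flow logs policy (stand-in)",
    "parameters": {
        "timeInterval": {
            "type": "String",
            "allowedValues": ["10", "60"],
            "defaultValue": "60",
        },
    },
}


def customise_default(definition, parameter_name, new_default):
    """Return a copy of the definition with a new default for one parameter."""
    custom = copy.deepcopy(definition)  # never modify the exported original
    parameter = custom["parameters"][parameter_name]
    allowed = parameter.get("allowedValues")
    if allowed is not None and new_default not in allowed:
        raise ValueError(f"{new_default!r} is not an allowed value: {allowed}")
    parameter["defaultValue"] = new_default
    custom["displayName"] = definition["displayName"] + " (custom default)"
    return custom


custom_definition = customise_default(builtin_definition, "timeInterval", "10")
print(custom_definition["parameters"]["timeInterval"]["defaultValue"])
```

The custom copy is then saved as a new definition and assigned instead of the built-in one, leaving the optional parameter untouched in the assignment.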

In The Real World

Azure Policy has one failing – it has a huge and unpredictable run interval. There is a serious lag between something being deployed and a mandated deployIfNotExists task running. But this is one of the scenarios where, in the real world, we want it to eventually be correct. Nothing will break if VNet Flow Logs are not enabled for a few hours. And the savings of not having to do this enablement manually are worth the wait.

If You Liked This?

Did you like this topic? Would you like to learn more about designing secure Azure networks, built with zero-trust? If so, then join me on October 20-21 2025 (scheduled for Eastern time zones) for my Cloud Mechanix course, Designing Secure Azure Networks.

18th Microsoft Most Valuable Professional Award

I found out yesterday that I was awarded my 18th annual Most Valuable Professional (MVP) award by Microsoft, continuing with the Azure Networking expertise.

It’s been an interesting year since last July, when I received my 17th award. My amount of billable work (the KPI for any consultant) with my then-employer was zero for a long time. I started thinking that the end would eventually come, so I started on plan B: my own company.

I started my company, Cloud Mechanix, 7 years ago as a side-gig to my previous job. I used personal time to write custom-Azure training and to deliver it at in-person classes. That first year was incredible – I still remember squeezing 22 people into a meeting room in a London hotel that I’d hoped to get 10 people into! Things went well and the feedback was awesome. I’d started to write new content … and then the world changed. I changed my day-job. The COVID19 pandemic happened. And my wife and I welcomed twin girls into the world. There was no time for a side-gig!

I did a little bit with Cloud Mechanix during the lockdown but I didn’t have the time to put a sustained effort in. Then last year, the world started changing again. The twins were 4, in their second year of pre-school, and quite happy to entertain themselves. The pandemic was a distant memory but our way of working had changed quite a bit. And my day-job went from too much work to no work. I’ve been around long enough to develop a sense of redundancy smell. My spidey-sense tingles long before anyone else discusses the topic. I talked with my wife and we decided that I had more time to invest in my company, Cloud Mechanix, and my MVP activities.

I started to write new content, focusing first on what I’m best known for these days (Azure Networking) and on another in-demand course (Azure for small-medium businesses). I did the Azure Firewall Deep Dive course online for anyone to sign up and privately. I’ve done the Azure Operations for Small/Medium Businesses class in-person 3 times so far this year for a Microsoft distributor (the attendees were employees of Microsoft partners).

Meanwhile I’ve applied for and spoken at a number of Microsoft community/conference events. I’ve been invited to talk on a number of podcasts – which are always enjoyable … poor Ned and Kyler probably didn’t know what they were in for when I talked about Azure networking for 39 minutes without stopping to breathe. And I wrote a series of blog posts on Azure network design/security to explain why trying to implement on-premises designs makes no sense and the resulting complexity breaks the desired goal of better security – simplicity actually offers more security!

The expected happened in June. I was made redundant. I wasn’t sad – I knew that it was coming and I had a plan. The agreed terms meant that I was free from June 28th with no restrictions. I had decided that I would not go job hunting. I have a job; I’m the Managing Director, trainer, and consultant with Cloud Mechanix. Yes, I am going out with my own company and it has expanded into consulting on Azure, including (but not limited to):

  • Cloud strategy
  • Reviews
  • Security
  • Migration
  • System design & build
  • Cloud Adoption by Mentorship
  • Small/Medium business
  • Assisting Microsoft partners

Things have started well. I have a decent sales pipeline. I have completed two small gigs. And I have developed new training content: Designing Secure Azure Networks.

Back to the award! I’m at the Costa Blanca in Spain with my family for 4 weeks. Cloud Mechanix HQ has temporarily relocated from Ireland for 2 weeks and then I’m on vacation for 2 weeks. I’m spending my time doing some pre-sales stuff (things are going well) and writing some stuff that I will be sharing soon 🙂 I was working yesterday afternoon and thinking about going to the pool with the kids, and got to thinking “what day/date is it?” – how one knows that they are relaxed! I asked my wife and she said that it was July 10th! Wait – isn’t that what the MVPs call “F5 day”, the day that we find out if we are renewed or not? I checked Teams and confirmed that it was indeed F5 day. Usually we get the emails at 4PM Irish time, making it 5PM Spanish time. I’d decided I was going to the pool. My phone was in a bag on a bench and I kept an eye on the time. Then from 5PM, I checked my email every few minutes until … there it was:

Year number 18 had begun! To be honest, this was the first time in years that I wasn’t that worried. I had written quite a bit of blog content. I’d done a number of online and in-person things. I also had (I hope) great interactions with the Azure product group. I felt that the contributions were there … and they are still coming.

I’ve been doing quite a bit this week. It’s the start of something bigger but I hope that the first part will be ready in the coming days – it depends on that pre-sales pipeline and testing results … ooooh it’s technical!

I have two confirmed future events with TechMentor in the USA where I’m doing a panel, breakout sessions, and a post-con all-day class at:

  • Microsoft HQ 2025 in Redmond, Washington, on August 11-15.
  • Orlando, Florida, on November 16-21.

I have applied for a number of other events in Europe too. If you’re interested then:

  • See my profile on Sessionize for speaking at events
  • Check out my blog posts here for podcast subject matter.
  • Check out Cloud Mechanix to see how I can help you with your Azure journey
  • Follow me on my socials to see what I’m chatting about.

Building A Hub & Spoke Using Azure Virtual Network Manager

In this post, I will show how to use Azure Virtual Network Manager (AVNM) to enforce peering and routing policies in a zero-trust hub-and-spoke Azure network. The goal will be to deliver ongoing consistency of the connectivity and security model, reduce operational friction, and ensure standardisation over time.

Quick Overview

AVNM is a tool that has been evolving and continues to evolve from something that I considered overpriced and under-featured, to something that I would want to deploy first in my networking architecture with its recently updated pricing. In summary, AVNM offers:

  • Network/subnet discovery and grouping
  • IP Address Management (IPAM)
  • Connectivity automation
  • Routing automation

There is (and will be) more to AVNM, but I want to focus on the above features because together they simplify the task of building out Azure platform and application landing zones.

The Environment

One can manage virtual networks using static groups but that ignores the fact that The Cloud is a dynamic and agile place. Developers, operators, and (other) service providers will be deploying virtual networks. Our goal will be to discover and manage those networks. An organisation might be simple, in which case a one-size-fits-all policy will do. However, we might need to engineer for complexity. We can reduce that complexity by:

  • Adopting the Cloud Adoption Framework and Zero Trust recommendations of 1 subscription/virtual network per workload.
  • Organising subscriptions (workloads) using Management Groups.
  • Designing a Management Group hierarchy based on policy/RBAC inheritance instead of basing it on an organisation chart.
  • Using tags to denote roles for virtual networks.

I have built a demo lab where I am creating a hub & spoke in the form of a virtual data centre (an old term used by Microsoft). This concept will use a hub to connect and segment workloads in an Azure region. Based on Route Table limitations, the hub will support up to 400 networked workloads placed in spoke virtual networks. The spokes will be peered to the hub.

A Management Group has been created for dub01. All subscriptions for the hub and workloads in the dub01 environment will be placed into the dub01 Management Group.

Each workload will be classified based on security, compliance, and any other requirements that the organisation may have. Three policies have been predefined and named gold, silver, and bronze. Each of these classifications has a Management Group inside dub01, called dub01gold, dub01silver, and dub01bronze. Workloads are placed into the appropriate Management Group based on their classification and are subject to Azure Policy initiatives that are assigned to dub01 (regional policies) and to the classification Management Groups.

You can see two subscriptions above. The platform landing zone, p-dub01, is going to be the hub for the network architecture. It has therefore been classified as gold. The workload (application landing zone) called p-demo01 has been classified as silver and is placed in the appropriate Management Group. Both gold and silver workloads should be networked and use private networking only where possible, meaning that p-demo01 will have a spoke virtual network for its resources. Spoke virtual networks in dub01 will be connected to the hub virtual network in p-dub01.
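The inheritance that this hierarchy gives us can be sketched in a few lines. The Management Group and subscription names come from the lab; the initiative names are made up for illustration, and anything above dub01 in the hierarchy is not modelled.

```python
# Management Group hierarchy from the lab. The parent of dub01 is not modelled.
PARENT = {
    "dub01": None,
    "dub01gold": "dub01",
    "dub01silver": "dub01",
    "dub01bronze": "dub01",
}

SUBSCRIPTION_GROUP = {"p-dub01": "dub01gold", "p-demo01": "dub01silver"}

# Hypothetical initiative names, assigned at each Management Group.
ASSIGNMENTS = {
    "dub01": ["regional-initiative"],
    "dub01gold": ["gold-initiative"],
    "dub01silver": ["silver-initiative"],
    "dub01bronze": ["bronze-initiative"],
}


def effective_initiatives(subscription):
    """List initiatives inherited by a subscription, top of the hierarchy first."""
    chain = []
    group = SUBSCRIPTION_GROUP[subscription]
    while group is not None:
        chain.append(group)
        group = PARENT[group]
    inherited = []
    for management_group in reversed(chain):
        inherited.extend(ASSIGNMENTS.get(management_group, []))
    return inherited


print(effective_initiatives("p-dub01"))
print(effective_initiatives("p-demo01"))
```

Moving a subscription to a different classification Management Group changes what it inherits without touching any assignment.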

Keep in mind that no virtual networks exist at this time.

AVNM Resource

AVNM is based on an Azure resource and subresources for the features/configurations. The AVNM resource is deployed with a management scope; this means that a single AVNM resource can be created to manage a certain scope of virtual networks. One can centrally manage all virtual networks. Or one can create many AVNM resources to delegate management (and the cost) of managing various sets of virtual networks.

I’m going to keep this simple and use one AVNM resource as most organisations that aren’t huge will do. I will place the AVNM resource in a subscription at the top of my Management Group hierarchy so that it can offer centralised management of many hub-and-spoke deployments, even if we only plan to have 1 now; plans change! This also allows me to have specialised RBAC for managing AVNM.

Note that AVNM can manage virtual networks across many regions so my AVNM resource will, for demonstration purposes, be in West Europe while my hub and spoke will be in North Europe. I have enabled the Connectivity, Security Admin, and User-Defined Routing features.

AVNM has one or more management scopes. This is a central AVNM for all networks, so I’m setting the Tenant Root Group as the top of the scope. In a lab, you might use a single subscription or a dedicated Management Group.

Defining Network Groups

We use Network Groups to assign a single configuration to many virtual networks at once. There are two kinds of membership:

  • Static: You add/remove members to or from the group
  • Dynamic: You use a friendly wizard to define an Azure Policy to automatically find virtual networks and add/remove them for you. Keep in mind that Azure Policy might take a while to discover virtual networks because of how irregularly it runs. However, once added, the configuration deployment is immediately triggered by AVNM.

There are two kinds of members in a group:

  • Virtual networks: The virtual network and contained subnets are subject to the policy. Virtual networks may be static or dynamic members.
  • Subnets: Only the subnet is targeted by the configuration. Subnets are only static members.

Keep in mind that something like peering only targets a virtual network and User-Defined Routes target subnets.

I want to create a group to target all virtual networks in the dub01 scope. This group will be the basis for configuring any virtual network (except the hub) to be a secured spoke virtual network.

I created a Network Group called dub01spokes with a member type of Virtual Networks.

I then opened the Network Group and configured dynamic membership using this Azure Policy editor:

Any discovered virtual network that is not in the p-dub01 subscription and is in North Europe will be automatically added to this group.
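The membership condition is simple enough to express as a predicate. This is only a sketch of the logic that the wizard captures, not the policy that AVNM generates; the hub virtual network name and the West Europe example are assumptions, while p-demo01-net-vnet is the spoke created later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualNetwork:
    name: str
    subscription: str
    location: str


HUB_SUBSCRIPTION = "p-dub01"
TARGET_LOCATION = "northeurope"


def is_dub01_spoke(vnet):
    """True when the VNet is outside the hub subscription and in North Europe."""
    return vnet.subscription != HUB_SUBSCRIPTION and vnet.location == TARGET_LOCATION


candidates = [
    VirtualNetwork("p-dub01-net-vnet", "p-dub01", "northeurope"),    # hub (name assumed)
    VirtualNetwork("p-demo01-net-vnet", "p-demo01", "northeurope"),  # spoke
    VirtualNetwork("p-other-net-vnet", "p-other", "westeurope"),     # wrong region (assumed)
]

members = [vnet.name for vnet in candidates if is_dub01_spoke(vnet)]
print(members)
```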

The resulting policy is visible in Azure Policy with a category of Azure Virtual Network Manager.

IP Address Management

I’ve been using an approach of assigning a /16 to all virtual networks in a hub & spoke for years. This approach blocks the prefix in the organisation and guarantees IP capacity for all workloads in the future. It also simplifies routing and firewall rules. For example, a single route will be needed in other hubs if we need to interconnect multiple hub-and-spoke deployments.

I can reserve this capacity in AVNM IP Address Management. You can see that I have reserved 10.1.0.0/16 for dub01:

Every virtual network in dub01 will be created from this pool.

Creating The Hub Virtual Network

I’m going to save some time/money here by creating a skeleton hub. I won’t deploy a router NVA/Virtual Network Gateway so I won’t be able to share it later. I also won’t deploy a firewall, but the private address of the firewall will be 10.1.0.4.

I’m going to deploy a virtual network to use as the hub. I can use Bicep, Terraform, PowerShell, AZ CLI, or the Azure Portal. The important thing is that I refer to the IP address pool (above) when assigning an address prefix to the new virtual network. A check box called Allocate Using IP Address Pools opens a blade in the Azure Portal. Here you can select the Address Pool to take a prefix from for the new virtual network. All I have to do is select the pool and then use a subnet mask to decide how many addresses to take from the pool (/22 for my hub).

Note that the only time that I’ve had to ask a human for an address was when I created the pool. I can create virtual networks with non-conflicting addresses without any friction.
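To illustrate why a pool removes the friction, here is a minimal first-fit allocator built on Python's ipaddress module. It is a sketch of the idea only and makes no claim about how AVNM actually picks prefixes; the /24 spoke size is an assumption.

```python
import ipaddress


class AddressPool:
    """Hands out non-overlapping prefixes from one reserved block (first fit)."""

    def __init__(self, cidr):
        self.pool = ipaddress.ip_network(cidr)
        self.allocated = []

    def allocate(self, prefix_length):
        # Walk the aligned candidates of the requested size and take the
        # first one that does not overlap an existing allocation.
        for candidate in self.pool.subnets(new_prefix=prefix_length):
            if not any(candidate.overlaps(used) for used in self.allocated):
                self.allocated.append(candidate)
                return candidate
        raise RuntimeError(f"No free /{prefix_length} left in {self.pool}")


dub01 = AddressPool("10.1.0.0/16")
hub_prefix = dub01.allocate(22)    # 10.1.0.0/22, which contains 10.1.0.4
spoke_prefix = dub01.allocate(24)  # 10.1.4.0/24
print(hub_prefix, spoke_prefix)
```

Nobody has to look up a spreadsheet: the caller states the size they need and the pool guarantees that the result does not conflict with anything already handed out.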

Create Connectivity Configuration

A Connectivity Configuration is a method of connecting virtual networks. We can implement:

  • Hub-spoke peering: A traditional peering between a hub and a spoke, where the spoke can use the Virtual Network Gateway/Azure Route Server in the hub.
  • Mesh: A mesh using a Connected Group (full mesh peering between all virtual networks). This is used to minimise latency between workloads with the understanding that a hub firewall will not have the opportunity to do deep inspection (performance over security).
  • Hub & spoke with mesh: The targeted VNets are meshed together for interconnectivity. They will route through the hub to communicate with the outside world.

I will create a Connectivity Configuration for a traditional hub-and-spoke network. This means that:

  • I don’t need to add code for VNet peering to my future templates.
  • No matter who deploys a VNet in the scope of dub01, they will get peered with the hub. My design will be implemented, regardless of their knowledge or their willingness to comply with the organisation’s policies.

I created a new Connectivity Configuration called dub01spokepeering.

In Topology I set the type to hub-and-spoke. I select my hub virtual network from the p-dub01 subscription as the hub Virtual Network. I then select my group of networks that I want to peer with the hub by selecting the dub01spokes group. I can configure the peering connections; here I would normally select Hub As Gateway, but I don’t have a Virtual Network Gateway or an Azure Route Server in the hub, so the box is greyed out.

I am not enabling inter-spoke connectivity using the above configuration – AVNM has a few tricks, and this is one of them, where it uses Connected Groups to create a mesh of peering in the fabric. Instead, I will be using routing (later) via a hub firewall for secure transitive connectivity, so I leave Enable Connectivity Within Network Group blank.

Did you notice the checkbox to delete any pre-existing peering configurations? If it isn’t peered to the hub then I’m removing it so nobody uses their rights to bypass my networking design.

I completed the wizard and executed the deployment against the North Europe region. I know that there is nothing to configure, but this “cleans up” the GUI.

Create Routing Configuration

Folks who have heard me discuss network security in Azure should have learned that the most important part of running a firewall in Azure is routing. We will configure routing in the spokes using AVNM. The hub firewall subnet(s) will have full knowledge of all other networks by design:

  • Spokes: Using system routes generated by peering.
  • Remote networks: Using BGP routes. The VPN Local Network Gateway creates BGP routes in the Azure Virtual Networks for “static routes” when BGP is not used in VPN tunnels. Azure Route Server will peer with NVA routers (SD-WAN, for example) to propagate remote site prefixes using BGP into the Azure Virtual Networks.

The spokes routing design is simple:

  • A Route Table will be created for each subnet in the spoke Virtual Networks. This design for these free resources will allow customised routing for specific scenarios, such as VNet-integrated PaaS resources that require dedicated routes.
  • A single User-Defined Route (UDR) forces traffic leaving a spoke Virtual Network to pass through the hub firewall, where firewall rules will deny all traffic by default.
  • Traffic inside the Virtual Network will flow by default (directly from source to destination) and be subject to NSG rules, depending on support by the source and destination resource types.
  • The spoke subnets will be configured not to accept BGP routes from the hub; this is to prevent the spoke from bypassing the hub firewall when routing to remote sites via the Virtual Network Gateway/NVA.

I created a Routing Configuration called dub01spokerouting. In this Routing Configuration I created a Rule Collection called dub01spokeroutingrules.

A User-Defined Route, known as a Routing Rule, was created called everywhere:

The new UDR will override (deactivate) the System route to 0.0.0.0/0 via Internet and set the hub firewall as the new default next hop for traffic leaving the Virtual Network.
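The override behaviour follows from how Azure selects a route: the longest matching prefix wins and, when prefixes tie, a user-defined route beats a BGP route, which beats a system route. Here is a small sketch of that selection for a spoke subnet. The spoke prefix (10.1.4.0/24) and hub prefix (10.1.0.0/22) are assumptions for illustration; 10.1.0.4 is the firewall address used in this post.

```python
import ipaddress

# On a prefix-length tie, the lower number wins: user-defined, then BGP, then system.
SOURCE_PRIORITY = {"User": 0, "BGP": 1, "System": 2}

# Assumed effective routes for a subnet in a spoke peered with the hub.
routes = [
    {"prefix": "10.1.4.0/24", "source": "System", "next_hop": "VnetLocal"},
    {"prefix": "10.1.0.0/22", "source": "System", "next_hop": "VNetPeering"},
    {"prefix": "0.0.0.0/0", "source": "System", "next_hop": "Internet"},
    {"prefix": "0.0.0.0/0", "source": "User", "next_hop": "VirtualAppliance 10.1.0.4"},
]


def select_route(destination, route_list):
    """Pick the route by longest prefix match, then by route source."""
    address = ipaddress.ip_address(destination)
    matches = [r for r in route_list if address in ipaddress.ip_network(r["prefix"])]
    return min(
        matches,
        key=lambda r: (
            -ipaddress.ip_network(r["prefix"]).prefixlen,
            SOURCE_PRIORITY[r["source"]],
        ),
    )


for target in ["8.8.8.8", "10.1.8.4", "10.1.4.5"]:
    print(target, "->", select_route(target, routes)["next_hop"])
```

Internet-bound traffic and traffic to another spoke (10.1.8.4 in this sketch) both fall through to the user-defined default route and land on the firewall, while traffic inside the Virtual Network still uses the more specific system route.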

Here you can see the Routing Collection containing the Routing Rule:

Note that Enable BGP Route Propagation is left unchecked and that I have selected dub01spokes as my target.

And here you can see the new Routing Configuration:

Completed Configurations

I now have two configurations completed and configured:

  • The Connectivity Configuration will automatically peer in-scope Virtual Networks with the hub in p-dub01.
  • The Routing Configuration will automatically configure routing for in-scope Virtual Network subnets to use the p-dub01 firewall as the next hop.

Guess what? We have just created a Zero Trust network! All that’s left is to set up spokes with their NSGs and a WAF/WAFs for HTTPS workloads.

Deploy Spoke Virtual Networks

We will create spoke Virtual Networks from the IPAM block just like we did with the hub. Here’s where the magic is going to happen.

The evaluation-style Azure Policy assignments that are created by AVNM will run approximately every 30 minutes. That means a new Virtual Network won’t be discovered straight after creation – but they will be discovered not long after. A signal will be sent to AVNM to update group memberships based on added or removed Virtual Networks, depending on the scope of each group’s Azure Policy. Configurations will be deployed or removed immediately after a Virtual Network is added or removed from the group.

To demonstrate this, I created a new spoke Virtual Network in p-demo01. I created a new Virtual Network called p-demo01-net-vnet in the resource group p-demo01-net:

You can see that I used the IPAM address block to get a unique address space from the dub01 /16 prefix. I added a subnet called CommonSubnet with a /28 prefix. What you don’t see is that I configured the following for the subnet in the subnet wizard:

As you can see, the Virtual Network has not been configured by AVNM yet:

We will have to wait for Azure Policy to execute – or we can force a scan to run against the resource group of the new spoke Virtual Network:

  • Az CLI: az policy state trigger-scan --resource-group <resource group name>
  • PowerShell: Start-AzPolicyComplianceScan -ResourceGroupName <resource group name>

You could add a command like the above into your deployment code if you wished to trigger automatic configuration.
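As a rough sketch of what that could look like in a Python deployment script, the helper below builds the same Az CLI command and runs it on demand. It assumes that the Azure CLI is installed and already signed in; the resource group name is the one used in this lab.

```python
import subprocess


def build_trigger_scan_command(resource_group):
    """Build the Az CLI arguments for an on-demand policy scan of one resource group."""
    return ["az", "policy", "state", "trigger-scan", "--resource-group", resource_group]


def trigger_scan(resource_group):
    """Run the scan. Requires the Azure CLI to be installed and signed in."""
    subprocess.run(build_trigger_scan_command(resource_group), check=True)


command = build_trigger_scan_command("p-demo01-net")
print(" ".join(command))
# Call trigger_scan("p-demo01-net") as the last step of the spoke deployment.
```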

This forced process is not exactly quick either! 6 minutes after I forced a policy evaluation, I saw that AVNM was informed about a new Virtual Network:

I returned to AVNM and checked out the Network Groups. The dub01spokes group has a new member:

You can see that a Connectivity Configuration was deployed. Note that the summary doesn’t have any information on Routing Configurations – that’s an oversight by the AVNM team, I guess.

The Virtual Network does have a peering connection to the hub:

The routing has been deployed to the subnet:

A UDR has been created in the Route Table:

Over time, more Virtual Networks are added and I can see from the hub that they are automatically configured by AVNM:

Summary

I have done presentations on AVNM and demonstrated the above configurations in 40 minutes at community events. You could deploy the configurations in under 15 minutes. You can also create them using code! With this setup we can take control of our entire Azure networking deployment – and I didn’t even show you the Admin Rules feature for essential “NSG” rules (they aren’t NSG rules but use the same underlying engine to execute before NSG rules).

Want To Learn More?

Check out my company, Cloud Mechanix, where I share this kind of knowledge through:

  • Consulting services for customers and Microsoft partners using a build-with approach.
  • Custom-written and ad-hoc Azure training.

Together, we can educate your team and bring great Azure solutions to your organisation.

Day Two DevOps – Azure VNets Don’t Exist

I had the pleasure of chatting with Ned Bellavance and Kyler Middleton on Day Two DevOps one evening recently to discuss the basics of Azure networking, using my line “Azure Virtual Networks Do Not Exist”. I think I talked non-stop for nearly 40 minutes 🙂 Tune in and you’ll hear my explanation of why many people get so much wrong in Azure networking/security.

Designing A Hub And Spoke Infrastructure

How do you plan a hub & spoke architecture? Based on much of what I have witnessed, I think very few people do any planning at all. In this post, I will explain some essential things to plan and how to plan them.

Rules of Engagement

Microsoft has shared some concepts in the Well-Architected Framework (simplicity) and the documentation for networking & Zero Trust (micro-segmentation, resilience, and isolation).

The hub & spoke will contain networks in a single region, following concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

Management Groups

I agree with many things in the Cloud Adoption Framework “Enterprise Scale” and I disagree with some other things.

I agree that we should use Management Groups to organise subscriptions based on Policy architecture and role-based access control (RBAC – granting access to subscriptions via Entra groups).

I agree that each workload (CAF calls them landing zones) should have a dedicated subscription – this simplifies operations and governance like you wouldn’t believe.

I can see why they organise workloads based on their networking status:

  • Corporate: Workloads that are internal only and are connected to the hub for on-premises connectivity. No public IP addresses should be allowed where technically feasible.
  • Online: Workloads that are online only and are not permitted to be connected to the hub.
  • Hybrid: This category is missing from CAF and many have added it themselves – WAN and Internet connectivity are usually not binary exclusive OR decisions.

I don’t like how Enterprise Scale buckets all of those workloads into a single grouping because it fails to acknowledge that a truly large enterprise will have many ownership footprints in a single tenant.

I also don’t like how Enterprise Scale merges all hubs into a single subscription or management group. Yes, many organisations have central networking teams. Large organisations may have many networking teams. I like to separate hub resources (not feasible with Virtual WAN) into different subscriptions and management groups for true scaling and governance simplicity.

Here is an example of how one might achieve this. I am going to have two hub & spoke deployments in this example:

  • DUB01: Located in Azure North Europe
  • AMS01: Located in Azure West Europe

Some of you might notice that I have been inspired by Microsoft’s data centre naming for the naming of these regional footprints. The reasons are:

  • Naming regions after “North Europe” or “East US” is messy when you think about naming network footprints in East US2, West US2, and so on.
  • Microsoft has already done the work for us. The Dublin (North Europe) region data centres are called DUB05-DUB15 and Microsoft uses AMS01, etc for Middenmeer (West Europe).
  • A single virtual network may have up to 500 peers. Once we hit 500 peers then we need to deploy another hub & spoke footprint in the region. The naming allows DUB02, DUB03, etc.

The change from CAF Enterprise Scale is subtle but look how instantly more scalable and isolated everything is. A truly large organisation can delegate duties as necessary.

If an identity responsible for the AMS01 hub & spoke is compromised, the DUB01 hub & spoke is untouched. Resources are in dedicated subscriptions so the blast area of a subscription compromise is limited too.

There is also a logical placement of the resources based on ownership/location.

You don’t need to recreate policy – you can add more associations to your initiatives.

If an enterprise currently has a single networking team, their IDs are simply added to more groups as new hub & spoke deployments are added.

IP Planning

One of the key principles in the design is simplicity: keep it simple stupid (KISS). I’m going to jump ahead a little here and give you a peek into the future. We will implement “Network segmentation: Many ingress/egress cloud micro-perimeters with some micro-segmentation” from the Azure zero-trust guidance.

The only connection that will exist between DUB01 and AMS01 is a global VNet peering connection between the hubs. All traffic between DUB01 and AMS01 must route via the firewalls in the hubs. This will require some user-defined routing and we want to keep this as simple as possible.

For example, the firewall subnet in DUB01 must have a route(s) to all prefixes in AMS01 via the firewall in the hub of AMS01. The more prefixes there are in AMS01, the more routes we must add to the Route Table associated with the firewall subnet in the hub of DUB01. So we will keep this very simple.

Each hub & spoke will be created from a single IP prefix allocation:

  • DUB01: All virtual networks in DUB01 will be created from 10.1.0.0/16.
  • AMS01: All virtual networks in AMS01 will be created from 10.2.0.0/16.

You might have noticed that Azure Virtual Network Manager uses a default of /16 for an IP address block in the IPAM feature – how convenient!

That means I only have to create one route in the DUB01 firewall subnet to reach all virtual networks in AMS01:

  • Name: AMS01
  • Prefix: 10.2.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the AMS01 firewall

A similar route will be created in AMS01 firewall subnet to reach all virtual networks in DUB01:

  • Name: DUB01
  • Prefix: 10.1.0.0/16
  • Next Hop Type: VirtualAppliance
  • Next Hop IP Address: The IP address of the DUB01 firewall

Honestly, that is all that is required. I’ve been doing it for years. It’s beautifully simple.
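The two routes can be generated from the prefix allocations, which is a handy way to see that the number of routes depends on the number of footprints and not on the number of spokes. The firewall IP addresses and the sample AMS01 spoke prefix below are placeholders that I have assumed for the sketch.

```python
import ipaddress

# One prefix per hub & spoke footprint. The firewall IP addresses are assumed.
FOOTPRINTS = {
    "DUB01": {"prefix": "10.1.0.0/16", "firewall_ip": "10.1.0.4"},
    "AMS01": {"prefix": "10.2.0.0/16", "firewall_ip": "10.2.0.4"},
}


def firewall_subnet_routes(local_footprint):
    """Routes for the local firewall subnet: one per remote footprint."""
    return [
        {
            "name": name,
            "prefix": footprint["prefix"],
            "next_hop_type": "VirtualAppliance",
            "next_hop_ip": footprint["firewall_ip"],
        }
        for name, footprint in FOOTPRINTS.items()
        if name != local_footprint
    ]


dub01_routes = firewall_subnet_routes("DUB01")
print(dub01_routes)

# Any spoke created from the AMS01 block is already covered by the one route.
new_ams01_spoke = ipaddress.ip_network("10.2.7.0/24")  # hypothetical spoke prefix
covered = new_ams01_spoke.subnet_of(ipaddress.ip_network(dub01_routes[0]["prefix"]))
print(covered)
```

Adding a third footprint adds one entry to the dictionary and one route to each firewall subnet; adding a spoke changes nothing.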

The firewall(s) are in total control of the flows. This design means that neither location is dependent on the other. Neither AMS01 nor DUB01 trust each other. If a workload is compromised in AMS01 its reach is limited to whatever firewall/NSG rules permit traffic. With threat detection, flow logs, and other features, you might even discover an attack using a security information & event management (SIEM) system before it even has a chance to spread.

Workloads/Landing Zones

Every workload will have a dedicated subscription with the appropriate configurations, such as enabling budgets and configuring Defender for Cloud. Standards should be as automated as possible (Azure Policy). The exact configuration of the subscription should depend on the zone (corporate, online, or hybrid).

When there is a virtual network requirement, then the virtual network will be as small as is required with some spare capacity. For example, a workload with a web VM and a SQL Server doesn’t need a /24 subnet!

Essential Workloads

Are you going to migrate legacy workloads to Azure? Are you going to run Citrix or Azure Virtual Desktop (AVD)? If so, then you are going to require domain controllers.

You might say “We have a policy of running a single ADDS site and our domain controllers are on-premises”. Lovely, at least it was when Windows Server 2003 came out. Remember that I want my services in Azure to be resilient and not to depend on other locations. What happens to all of your Azure services when the network connection to on-premises fails? Or what happens if on-premises goes up in a cloud of smoke? I will put domain controllers in Azure.

Then you might say “We will put domain controllers in DUB01 and AMS01 can use them”. What happens if DUB01 goes offline? That does happen from time to time. What happens if DUB01 is compromised? Not only will I put domain controllers in DUB01, but I will also put them in AMS01. They are low end virtual machines and the cost will be minor. I’ll also do some good ADDS Sites & Services stuff to isolate as much as ADDS lets you:

  • Create subnets for each /16 IP prefix.
  • Create an ADDS site for AMS01 and another for DUB01.
  • Associate each site with the related subnet.
  • Create and configure replication links as required.
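The effect of that subnet-to-site mapping can be sketched in a few lines of Python. This is only a model of the lookup, not how ADDS is configured: it assumes a client is placed in the site whose subnet is the most specific match for its IP address, and the dictionary and function names are hypothetical.

```python
from ipaddress import ip_address, ip_network

# One ADDS subnet per location prefix, each associated with its own site.
SITE_SUBNETS = {
    ip_network("10.1.0.0/16"): "DUB01",
    ip_network("10.2.0.0/16"): "AMS01",
}


def site_for(client_ip):
    """Return the site for a client IP, preferring the most specific subnet."""
    addr = ip_address(client_ip)
    matches = [(net, site) for net, site in SITE_SUBNETS.items() if addr in net]
    if not matches:
        return None  # no subnet defined for this address
    return max(matches, key=lambda m: m[0].prefixlen)[1]


ams_client_site = site_for("10.2.14.7")
dub_client_site = site_for("10.1.200.20")
```

With this mapping, a machine in AMS01 looks for a domain controller in AMS01 first and does not depend on DUB01 being available.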

The placement and resilience of other things like DNS servers/Private DNS Resolver should be similar.

And none of those things will go in the hub!

Micro-Segmentation

The hub will be our transit network, providing:

  • Site-to-site connectivity, if required.
  • Point-to-site connectivity, if required.
  • A firewall for security and routing purposes.
  • A shared Azure Bastion, if required.

The firewall will be the next hop, by default (expect exceptions) for traffic leaving every virtual network. This will be configured for every subnet (expect exceptions) in every workload.

The firewall will be the glue that routes every spoke virtual network to each other and the outside world. The firewall rules will restrict which of those routes is possible and what traffic is possible – in all directions. Don’t be lazy and allow * to Internet; do you want to automatically enable malware to call home for further downloads or discovery/attack/theft instructions?

The firewall will be carefully chosen to ensure that it includes the features that your organisation requires. Too many organisations pick the cheapest firewall option. Few look at the genuine risks that they face and pick something that best defends against those risks. Allow/deny is not enough any more. Consider the features that pay careful attention to what must be allowed; these are the firewall ports that attackers are using to compromise their victims.

Every subnet (expect exceptions) will have an NSG. That NSG will have a custom low-priority inbound rule to deny everything; this means that no traffic can enter a NIC (from anywhere, including the same subnet) without being explicitly allowed by a higher priority rule.
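Here is a simplified Python model of why that custom deny rule matters. NSG rules are evaluated in priority order, lowest number first, and the first match wins. The rule names, priorities 1000 and 4000, and the addresses are hypothetical, and I approximate the built-in AllowVnetInBound rule (priority 65000) with a plain 10.0.0.0/8 prefix instead of the VirtualNetwork service tag.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int         # lower number means higher priority
    source: str           # CIDR prefix; "0.0.0.0/0" matches any source
    port: Optional[int]   # None matches any destination port
    allow: bool


def evaluate(rules, source_ip, port):
    """Return the first rule that matches, walking rules in priority order."""
    addr = ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if addr in ip_network(rule.source) and rule.port in (None, port):
            return rule
    return None


RULES = [
    Rule("AllowHttpsFromFirewall", 1000, "10.1.0.0/26", 443, True),
    Rule("DenyAllCustom", 4000, "0.0.0.0/0", None, False),
    # Stand-in for the default rule that would otherwise allow VNet traffic.
    Rule("AllowVnetInBound", 65000, "10.0.0.0/8", None, True),
]

from_firewall = evaluate(RULES, "10.1.0.4", 443)
from_neighbour = evaluate(RULES, "10.1.5.10", 1433)
```

Without DenyAllCustom, the second flow would fall through to the default rule and be allowed. With it, a compromised neighbour in the same virtual network gets nowhere unless a higher priority rule says otherwise.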

“Web” (this covers a lot of HTTPS based services, excluding AVD) applications will not be published on the Internet using the hub firewall. Instead, you will deploy a WAF of some kind (or different kinds depending on architectural/business requirements). If you’re clever, and it is appropriate from a performance perspective, you might route that traffic through your firewall for inspection at layers 4-7 using TLS Inspection and IDPS.

Logging and Alerting

You have put all the barriers in place. There are two interesting quotes to consider. The first warns us that we must assume a penetration has already taken place or will take place.

Fundamentally, if somebody wants to get in, they’re getting in…accept that. What we tell clients is: Number one, you’re in the fight, whether you thought you were or not. Number two, you almost certainly are penetrated.

Michael Hayden Former Director of NSA & CIA

The second warns us that attackers don’t think like defenders. We build walls expecting a linear attack. Attackers poke, explore, and prod, looking for any way, including very indirect routes, to get from A to B.

Biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.

John Lambert

Each of our walls offers some kind of monitoring. The firewall has logs, which ideally we can either monitor/alert from or forward to a SIEM.

Virtual Networks offer Flow Logs which track traffic at the VNet level. VNet Flow Logs are superior to NSG Flow Logs because they catch more traffic (Private Endpoint) and include more interesting data. This is more data that we can send to a SIEM.

Defender for Cloud creates data/alerts. Key Vaults do. Azure databases do. The list goes on and on. All of this is data that we can use to:

  • Detect an attack
  • Identify exploration
  • Uncover an expansion
  • Understand how an attack started and happened

And it amazes me how many organisations choose not to configure these features in any way at all.

Wrapping Up

There are probably lots of finer details to consider but I think that I have covered the essentials. When I get the chance, I’ll start diving into the fun detailed designs and their variations.