Why The Classic DMZ/Secure Zone Design Is Worthless in Azure

I see many people implementing classic network security designs in Azure. Maybe there's a DMZ and an internal virtual network. Maybe they split Production, Test, and Dev into three virtual networks. Possibly, they do a common government implementation – what Norway calls "Secure Zone". I'm going to explain to you why these network designs offer very little security.

I have written this post as a contribution to Azure Spring Clean 2025. Please head over and check out the other content.

Essential Reading

This post is part of a series that I've been writing over several weeks. If you have not read my previous posts then I recommend that you do. I can tell that many people assume certain things about Azure networking based on designs that I have witnessed. You must understand the "how does it really work" stuff before you go any further.

A Typical Azure Network Design

Most of the designs that I have encountered in Azure, in my day job and as a community person who “gets around”, are very much driven by on-premises network designs. Two exceptions are:

  • What I see produced by my colleagues at work.
  • Those using Enterprise Scale from the Microsoft Cloud Adoption Framework – not that I recommend implementing this, but that’s a whole other conversation!

What I mostly observe is what I like to call “big VNets”. The customer will call it lots of different things but it essentially boils down to a hub-and-spoke design that features a few large virtual networks that are logically named:

  • Dev, Test, and Production
  • DMZ and private
  • Internal and Secure

Workload: A collection of resources that provide a service. For example, an App Service, some Functions, a Redis cache, and a database might make a retail system. The collection of resources is a workload, united in their task to provide a service for the organisation.

You get the idea. There are a few spoke virtual networks that are each peered to a hub.

The hub is a transit network, enabling connectivity between each of the big VNets – or "isolating them completely", except where they don't (quite real, thanks to business-required integrations or to making the transition from testing to production easier for developers). The hub provides routing to Azure/the Internet and to remote locations via site-to-site networking.

If we drill down into the logical design we can see the many subnets in each spoke virtual network. Those subnets are logically divided in some way. Some might do it based on security zones – they don’t understand NSGs. Some might have one subnet per workload – they don’t know that subnets do not exist. Each subnet has an NSG and a Route Table. The NSG “micro-segments” the subnet. The Route Table forces traffic from the subnet to the firewall – the logic here can vary.
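For reference, both of those associations are properties of the subnet itself. Here is a minimal Bicep sketch of the association pattern – every name and prefix below is an assumption for illustration:

```bicep
// An existing NSG and Route Table (assumed names)...
resource nsgWorkload 'Microsoft.Network/networkSecurityGroups@2023-09-01' existing = {
  name: 'nsg-workload1'
}

resource rtWorkload 'Microsoft.Network/routeTables@2023-09-01' existing = {
  name: 'rt-workload1'
}

// ...associated with one subnet in a "big VNet"
resource vnetProd 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'vnet-production'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.10.0.0/16' ] }
    subnets: [
      {
        name: 'snet-workload1'
        properties: {
          addressPrefix: '10.10.1.0/24'
          networkSecurityGroup: { id: nsgWorkload.id } // NSG rules still process on the NICs
          routeTable: { id: rtWorkload.id }            // UDRs apply to all NICs in the subnet
        }
      }
    ]
  }
}
```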

Routing & Subnet Design

Remember three things for me:

  • Virtual networks and subnets do not exist – packets go directly from sender to receiver in the software-defined network.
  • Routing is our cabling when designing network security.
  • The year is 2025, not 2003 (before Windows XP Service Pack 2 introduced Windows Firewall to the world).

There might be two intents for routing in the legacy design:

  • Each virtual network will be isolated from the others via the hub firewall.
  • Each subnet will be isolated from the others via the hub firewall.

Big VNet Network Isolation

Do you remember 2003? Kid Rock and Sheryl Crow still sang to each other. Avril Lavigne was relevant (Canada, you’re not getting out of this!). The Black Eyed Peas wanted to know where the love was because malware was running wild on vulnerable Windows networks.

I remember a Microsoft security expert wandering around a TechEd Europe hall, shouting at us that network security was something that had to be done throughout the network. The edge firewall was like the shell of an egg – once you got inside (and it didn’t matter how) then you had all that gooey goodness without any barriers.

A year later, Microsoft released Windows XP Service Pack 2 to general availability. This was such a rewrite that many considered it a new OS, not a Service Pack – what the kids today call a feature update, a cumulative update, or an annual release. One of the new features was Windows Firewall, which was on by default and blocked stuff from getting into our machines unless we wanted that stuff. And what did every Windows admin do? They used Group Policy to turn Windows Firewall off across the network. So malware continued, became more professional, and became ransomware.

Folks, it’s been 21 years. It’s time to harden those networks – let the firewall do what it can do and micro-segment those networks. Microsoft tells you to do it. The US NSA tells you to do it. The Canadian Centre for Cyber Security tells you to do it. The UK NCSC tells you to do it. Maybe, just maybe, they know more about this stuff than those of you who like gooey network insides?

Big VNet Subnet Isolation

The goal here is to force any traffic that is leaving a subnet to use the hub firewall as the next hop. In my below example, if traffic wants to get from Subnet 1 to Subnet 2, it must first pass through the firewall in the hub. A Route Table is created with a collection of User-Defined Routes (UDR) such as shown below.

Each UDR uses Longest Prefix Match to force traffic destined to the other subnets to route via the firewall. You don't see it in the diagram, but there would also be a route to 0.0.0.0/0 via the firewall, covering any prefix outside of this virtual network, except the hub (where Longest Prefix Match selects the System route created by peering with the hub).
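As a hedged Bicep sketch of such a Route Table – the subnet prefixes and the firewall IP are assumptions for illustration:

```bicep
// Route Table for Subnet 1: traffic to the other subnets, and everything
// outside the VNet, must hop via the hub firewall (assumed to be 10.0.1.4)
resource rtSubnet1 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-subnet1'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'to-subnet2-via-firewall'
        properties: {
          addressPrefix: '10.10.2.0/24' // assumed prefix of Subnet 2
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4'
        }
      }
      {
        name: 'to-subnet3-via-firewall'
        properties: {
          addressPrefix: '10.10.3.0/24' // assumed prefix of Subnet 3
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4'
        }
      }
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4'
        }
      }
    ]
  }
}
```

Now multiply this by the number of subnets: every new subnet means a new Route Table plus a new UDR in every existing Route Table.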

Along comes the business and they demand another workload or whatever. A new subnet is required. So you add that subnet. It's been a rough Friday and the demand came right before you went home. You weren't thinking straight and… hmm… maybe you forgot to update the routing.

Oh, it's only one Route Table for Subnet 4, right? Em, no; you need to add a Route Table to Subnet 4 with routes for Subnets 1-3 and 0.0.0.0/0. But that only affects traffic leaving Subnet 4.

What you forget is that routing works in two directions. Subnets 1-3 each require a UDR for Subnet 4, otherwise traffic from Subnets 1-3 will route directly to Subnet 4 and the deeper inspection of the firewall won't see the traffic. Worse, you probably broke TCP communications because you set up an asymmetric route, and the stateful hub firewall will block responses from Subnet 4 to Subnets 1-3.

Imagine this Production VNet with 20, 30, or 100 subnets. This routing update is going to be like manual patching – which never happens.

One of the biggest lessons I can share in secure network design is KISS: keep it simple, stupid. Routing should be simple, and routing should be predictable when there is expansion/change, because routing is your cabling for enforcing or bypassing network security.

Network Security Group Design

As a consultant, I often have a kickoff meeting with a customer where they stress how important security is. I agree – it's critical. And then I get to see their network or their plans. At this point, I shouldn't be surprised but I always am. Some "expert" who passed an Azure certification exam or three implements a big VNet design. And the NSGs – wow!

What you’ll observe is:

  • They implement subnets as security zones, when the only security zoning in Azure is the NSG. NSG rules, processed on the NIC, are how we allow/deny incoming or outgoing traffic at the most basic level. In the end, there are too many subnets in an already crowded big VNet.
  • The NSGs use lots of * (any) in the sources and destinations, leading to all sorts of traffic being allowed from many locations.
  • They think that they are blocking all incoming traffic by default but don’t understand what the default rule 65000 does – it lets every routable network (Azure & remote) in.
  • They open up all traffic inside the subnet – who cares if some malware gets in via devops or a consultant who uploads it via a copy/paste in RDP?

And they’ll continue to stress the importance of security.

Shared Resources In The Hub

This one makes me want to scream. And to be fair, Microsoft play a role in encouraging this madness – shame on you, Microsoft!

The only things that should be in your hub are:

  • Virtual Network Gateways
  • Third-party routers and Azure Route Server
  • The firewall
  • Maybe a shared Azure Bastion with appropriately minimised RBAC rights

That’s it! Nothing else!

Don’t put DNS servers here. Don’t put a “really important database” in the hub. Don’t put domain controllers in the hub. Repeat after me:

I will not place shared resources in the hub

Everything is a shared resource. Just about every workload shares with other workloads. Should all shared resources go in the hub? What goes in the spokes now?

“Why?” you may ask. Remember:

  • By default, everything goes straight from source to destination
  • Routing is our way to force traffic through a firewall
  • When you peer two VNets, a new System route enables direct connectivity between NICs in the two VNets.

People assume that a 0.0.0.0/0 route includes everything, but Longest Prefix Match overrides that route when more specific routes exist – such as the System route created by peering with the hub. So, if you place a critical database in the hub, spokes will have direct connectivity to that database without going through the firewall and any advanced inspection/filtering services that it can offer – and vice versa. In other words:

  • You opened up every port on the critical resource to every resource in every spoke.
  • You created an open bridge between every spoke.

And the fact is that putting something in the hub doesn’t make it “more shared” (how is it less shared than something in a spoke?) or faster (software-defined networking treats two NICs in peered VNets as if they were in the same VNet).

Those clinging to putting things in the hub will then want more routes and more complexity. What happens when the organisation goes international and adds hub & spoke deployments in other regions? What should be a simple “1 peering & 1 route” solution between two hubs will expand into routes for each hub subnet containing compute.

Everything is shared – that’s modern computing. Place your workloads into spokes, whether they are file shares, databases, domain controllers, or DNS servers/Private Resolvers. They will work perfectly well and your network will function, be more secure, simpler to manage/understand, and the security model will be more predictable.

Wrapping Up

This is a long post. There is a good chance that I just spat in the face of your cute lil' baby Azure network. I will be showing you alternatives in future posts, building up the solution a little at a time. Until then, KISS … keep it simple, stupid!

How Does Azure Routing Work?

Here comes yet another “How does it work” post on Azure networking. I have observed many folks who assume that routing in Azure works one way, but are shocked to learn that there are more layers than they anticipated. In this post, I will explain how routing really works in Azure networking.

The Misconception

I will start by revisiting a Microsoft diagram that I previously used for a discussion on the importance of routing in network security.

The challenge with the above architecture is to make traffic flow through the firewall. Most people will answer that User-Defined Routes (UDRs) via Route Tables are required. Yes, that is true. But they fail to understand that two (I would argue three) other sources of routes are also present in this diagram. The lack of that additional knowledge may impact this simple scenario. And I know for certain that if this scenario were the typical mid-large organisation, then the lack of knowledge would become:

  • An operational issue
  • A security issue
  • A troubleshooting issue
  • A connectivity issue

The NIC Is The Router

One of my first posts in this series was “Azure Virtual Networks Do Not Exist“. In that post, I explained that all traffic routes directly from the source NIC to the destination NIC. There is no subnet, no default gateway, and no virtual network. Instead, a virtual network is a mapping of a mesh connectivity between all NICs in that virtual network. When you peer virtual networks, the mapping expands to mesh all NICs in the peered virtual networks.

Where does routing happen if there is no default gateway or subnet? The answer (just like the answer to "where are NSG rules processed?") is that the NIC is the router.

Remember that everything is a virtual machine, including “serverless computing”, somewhere in the platform.

If packets travel directly from source to destination, then there is no router appliance between the source and the destination. That means that the source must be its own router.

Some Basic Routing Theory

A route is an instruction: if you want to get to address A then go to place X. X might be the destination, or it might be the first hop to get to the destination.

For example, I might have a remote network of 192.168.0.0/16. I have an Azure App Service that wants to use a site-to-site connection to reach out to a server with an address of 192.168.1.10. A route might say:

  • Prefix: 192.168.0.0/16
  • Next Hop Type: Virtual Network Gateway (VPN or ExpressRoute)

The NIC of the App Service will learn that route (see BGP later). Packets from the App Service will go directly to the NIC(s) of the Virtual Network Gateway and then route over VPN/ExpressRoute to 192.168.1.10.

Maybe I will manipulate that route a little to force egress traffic through a firewall. My firewall will have an internal IP address of 10.0.1.4. I can introduce a route (see User-Defined Routes later) of:

  • Prefix: 192.168.0.0/16
  • Next Hop Type: Virtual Appliance
  • Next Hop IP Address: 10.0.1.4

Now packets to 192.168.1.10 will go to my firewall. It's important now that the firewall has a route to 192.168.0.0/16 – normally it would have one by default in a hub & spoke design.
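Here is a minimal Bicep sketch of that override (the prefix and firewall IP come from the example above):

```bicep
// A UDR that beats the BGP/System route to the remote site, forcing
// egress to on-premises through the firewall at 10.0.1.4
resource rtEgress 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-spoke-egress'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'onprem-via-firewall'
        properties: {
          addressPrefix: '192.168.0.0/16'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4'
        }
      }
    ]
  }
}
```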

The second piece of knowledge to have is that there must be a route for the response. There is no implied return route. Either a human or the network must implement that return route. And it’s really important that the return route is the same as the egress route; stateful firewalls will block TCP responses when they have not permitted the requests – this is one of those “you’ll learn it the hard way” things when dealing with site-to-site connections and firewalls.

The Laws Of Azure Routing

I will revisit this at the end, but here’s what you need to know when you are designing/troubleshooting routing in Azure:

  1. Route source priority
  2. Longest prefix match

Law 1: Route Source Priority

You might know that User-Defined Routes (UDRs) exist. But there are two (or three) other sources of routes and they each have a priority.

System Routes

The first source of routes that is always there is System (or Default) routes. System routes are created when you create or configure a virtual network. For example, every subnet in a brand-new virtual network has many system routes out of the box. The major routes we are concerned with are:

  • Route(s) to the address prefix(es) of the virtual network to route directly (VirtualNetwork) to the destination NICs.
  • A route to send all other traffic to the Internet (including Azure).

Yes, I am leaving out a bunch of other system routes that are implemented to protect Microsoft 365 from hacking but I want to keep this simple.

Another important System route is what is created when you peer two virtual networks. A route is created in each of the peered virtual networks to state that the next hop to the new neighbour is via peering. This is a human-friendly message; what it means is that the NICs in the connected peer are now part of the local virtual network’s mesh – packets from local NICs will route directly to NICs in the peered virtual network.

BGP Routes

Border Gateway Protocol (BGP) is a mechanism where one routing appliance shares its knowledge of routes with neighbours. For example, a router in Dublin might say “If you want to get to any NICs in Dublin then come to me”. A router in Paris might hear that message and relay it by saying “I know how to get to Dublin so if you want to get to Dublin, come to me”. A router in Munich might pick up that relay from Paris and advertise locally that it knows how to get to Dublin. A PC in Munich wants to send a packet to a NIC in Dublin. The Munich network says that the route to Dublin is via the router in Munich, so the flow of packets will be:

Munich PC > Munich router > Paris router > Dublin router > Dublin NIC

Azure implements BGP in two scenarios:

  • Site-to-site networking
  • Azure Route Server

You must configure BGP when using ExpressRoute for remote site connections. You optionally configure BGP when configuring a VPN tunnel. What most people don't realise is that you will still have BGP routes with a BGP-less VPN tunnel, thanks to the Local Network Gateway, which generates BGP routes for the remote site prefixes. In the case of site-to-site networking, BGP routes originate at the GatewaySubnet and propagate to all other subnets in the virtual network and (by default) to all peered virtual networks/subnets.

The other scenario is Azure Route Server (ARS) – and this includes Virtual WAN, where the hub router plays the same role; Azure Route Server originated in Virtual WAN. ARS can peer with other appliances, such as a router Network Virtual Appliance (NVA), and exchange routes with it:

  • Routes of remote connected networks are learned from the NVA and propagated to the Azure hub/spokes. The hub/spokes now know that the route to the remote networks is to use the router as the next hop (not your firewall!).
  • The prefixes of the hub/spokes are shared with the NVA to enable remote networks to know how to get to them.

User-Defined Routes (UDRs)

This is the one kind of route that we can directly manage as Azure architects/administrators/operators. A resource called a Route Table is created. The Route Table is associated with a subnet and applies its settings to all NICs in the subnet. There are two important things we can use the Route Table for:

  • Disable BGP Propagation: We can disable inward BGP route propagation to the associated subnet. This means that we can prevent traffic to remote sites from bypassing our firewall by using the Virtual Network Gateway/NVA as the next hop.
  • User-Defined Routes: We can implement routes that force traffic in ways that we want (see the Bicep sketch after the next list).

UDRs have several possible next hops for packets:

  • Virtual Appliance: A router or firewall – you additionally specify the IP address of the virtual appliance NIC to use.
  • Internet: Including the Internet and Azure
  • Virtual Network Gateway: An Azure site-to-site connection in the virtual network or shared with the virtual network via peering.
  • Virtual Network: Send packets to the same virtual network.
  • None: The packets are dropped at the source NIC and are never transmitted – a useful security feature.
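Putting the two capabilities together, here is a minimal Bicep sketch of a spoke Route Table that disables inward BGP propagation and sends the default route to a firewall (the firewall IP is an assumption):

```bicep
resource rtSpoke 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-spoke-workload'
  location: resourceGroup().location
  properties: {
    // Stop BGP routes (e.g. to on-premises prefixes) from reaching the NICs
    // in the associated subnet, so traffic cannot bypass the firewall by
    // using the Virtual Network Gateway/NVA as the next hop
    disableBgpRoutePropagation: true
    routes: [
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4' // assumed firewall private IP
        }
      }
    ]
  }
}
```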

Hidden Programmed Routes

You won't find these in any official documentation on routing, but they do exist, and you'll learn about them either by accident or by educated observation of behaviour.

Microsoft will sometimes introduce a system route to fix an issue: if you do X, a route will be programmed. Unfortunately, this route (probably a type of System route) cannot be observed in any way because no diagnostics tools exist for the affected subnet.

One example of this is Private Endpoint. When you create a subnet, network policies for Private Endpoint are disabled by default. This causes a chain of things to happen:

  • UDRs are ignored by Private Endpoints in the subnet
  • Each Private Endpoint in the subnet will create its own /32 (the IP address of the Private Endpoint is the destination prefix) System route in the virtual network and in directly peered virtual networks. This means that a /32 route for the Private Endpoint is added to the GatewaySubnet of the hub or spoke, depending on your design.

That GatewaySubnet System route has broken the spirit of many Azure admins over the years. You can’t see it and, from our perspective, it shouldn’t exist. The result was that traffic from on-premises to Private Endpoints went directly to the Private Endpoint, even if we set up a UDR to force traffic to the spoke virtual network to go via the firewall. This is because of the second law of routing: Longest Prefix Match.
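The control that eventually arrived for this is worth knowing about: enable Private Endpoint network policies on the subnet and NSGs/UDRs start to apply to Private Endpoint traffic. A minimal Bicep sketch of that subnet setting – the names and prefixes are assumptions:

```bicep
resource vnetData 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'vnet-spoke-data'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.20.0.0/16' ] }
    subnets: [
      {
        name: 'snet-privateendpoints'
        properties: {
          addressPrefix: '10.20.1.0/24'
          // 'Disabled' was the long-time default; 'Enabled' makes NSG rules
          // and UDRs take effect for Private Endpoints in this subnet
          privateEndpointNetworkPolicies: 'Enabled'
        }
      }
    ]
  }
}
```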

Route Deactivation

We have established that there are three sources of routes (four, if you count the hidden programmed routes). What happens if two or three of them create routes to the same prefix? That can happen; in fact, you will probably make it happen if you want to force traffic through a firewall.

Let’s imagine a scenario where there are 3 routes to 192.168.0.0/16 from:

  • System
  • BGP
  • UDR

What happens? The fabric handles this automatically and applies a prioritisation rule to deactivate the routes from lesser sources. The priority is as follows:

  1. UDR: Routes that you explicitly create in Azure will deactivate routes from BGP & System to the same prefix. UDR beats BGP & System.
  2. BGP: Routes that are created by admins/networks in other locations will deactivate routes from System to the same prefix. BGP beats System.
  3. System: System routes are Azure-generated and are beaten by BGP and UDR routes to the same prefix.

Let's consider a simple/common example. We have a virtual network with a subnet. If you want to see this in action, add a VM to the subnet, power it up, open the Azure NIC resource, and go to Effective Routes (wait 30 seconds). Without doing anything to the subnet/virtual network, a System route will be created for all NICs in the subnet:

  • Prefix: 0.0.0.0/0
  • Next Hop Type: Internet

What that means is that any traffic that doesn't match a more specific route will be sent to the Internet.

Let’s say that I want to force that traffic through a firewall appliance with an IP address of 10.0.1.4. I can associate a new Route Table to the subnet and add a UDR to the subnet:

  • Prefix: 0.0.0.0/0
  • Next Hop Type: Virtual Appliance
  • Next Hop IP Address: 10.0.1.4

Two routes to 0.0.0.0/0 are present. Which one will be used? That decision is already made. The System route to 0.0.0.0/0 is automatically deactivated by the fabric as soon as a higher-priority (BGP or UDR) route to the same prefix is added to the subnet. The only active route to 0.0.0.0/0 in that subnet is my UDR via the firewall.

Law 2: Longest Prefix Match

There is another scenario where there may be multiple route options. A packet might be destined to an IP address and multiple active routes might be applicable. In this case, Azure applies “Longest Prefix Match” – you can think of it as the best matching route. This one is best explained with an example.

Let's say a packet is going to 10.10.10.4. However, the source NIC has 3 possible routes that could apply:

  • System: 0.0.0.0/0 via Internet
  • BGP: 10.10.10.0/24 via Virtual Network Gateway
  • UDR: 10.0.0.0/8 via a firewall

All of the routes are active because the prefixes are different. Which one is chosen? Tip: Route priority (UDR/BGP/System) is irrelevant now.

I don't know the internal mechanics of this, but I suspect that an AND operation is done using the destination address and the route prefix. Remember that each octet in a 32-bit IP address is 8 bits:

Here is the calculation for the System route, which sums to 0 bits:

| | Octet 1 | Octet 2 | Octet 3 | Octet 4 |
| --- | --- | --- | --- | --- |
| Route Prefix | 0 | 0 | 0 | 0 |
| Destination | 10 | 10 | 10 | 4 |
| AND Bits | 0 | 0 | 0 | 0 |

Here is the calculation for the BGP route, which sums to 24 bits:

| | Octet 1 | Octet 2 | Octet 3 | Octet 4 |
| --- | --- | --- | --- | --- |
| Route Prefix | 10 | 10 | 10 | 0 |
| Destination | 10 | 10 | 10 | 4 |
| AND Bits | 8 | 8 | 8 | 0 |

Here is the calculation for the UDR route, which sums to 8 bits:

| | Octet 1 | Octet 2 | Octet 3 | Octet 4 |
| --- | --- | --- | --- | --- |
| Route Prefix | 10 | 0 | 0 | 0 |
| Destination | 10 | 10 | 10 | 4 |
| AND Bits | 8 | 0 | 0 | 0 |

Which route is the best match? The BGP route is because it has the longest prefix match to the destination IP address.

Review: The Laws of Azure Routing

Now you’ve learned how Azure routes are generated, how they are prioritised, and how they are chosen when a packet is sent. Let’s summarise the laws of Azure routing:

  1. Route Source Priority: When there are routes to the same prefix, BGP beats System, and UDR beats BGP & System.
  2. Longest Prefix Match: When multiple routes can be used to send a packet to a destination, the route with the longest bit match will be selected.
  3. It’s Always DNS: Ask any Windows admin – when routing isn’t the cause of issues, then it’s DNS 🙂

How Many Subnets Do I Need In An Azure Virtual Network?

You’re designing a new virtual network in Azure. You’re going to have three different security zones in your application. How many subnets do you need? I will help you understand why many of you gave the incorrect answer.

Back To Basics

In a previous post, I explained that virtual networks do not exist. Therefore, subnets do not exist. That's why you cannot ping a default gateway. Packets do not leave a source NIC and route via a default gateway to enter another subnet. Packets go from the source NIC, disappear into the physical network of Azure, and reappear at the destination NIC, whether that is on the same host, in the same data centre, in a neighbouring data centre, or on the other side of the world. Say it after me:

Subnets do not exist.

If packets go straight from source to destination, what is the logic of creating subnets to isolate resources?

Why Did We Segment Networks Using Subnets?

In the on-premises world, there are many reasons to segment a network. A common reason was to control the size of broadcast/multicast domains. That’s not an issue in Azure because virtual networks do not support broadcasts/multicasts.

From a security perspective, we segmented networks because we needed to isolate zones using a central firewall. The firewall is a central resource. A network runs from a top-of-rack switch to an Ethernet interface in the firewall. That subnet uses the firewall to route to other subnets, possibly using the same cable (VLANs) or using different cables to other top-of-rack switches.

Earlier I asked you to imagine a workload with three security zones. Let’s call them:

  • Web
  • Application
  • Database

That’s not too crazy. My security model requires me to ensure:

  • Internet users can only reach the web servers on HTTPS
  • The application servers can only be talked to by the web servers.
  • The database servers can only be talked to by the application servers.

How would I create that? I’d set up three VLANs or subnets. Each VLAN would use a default gateway which is either the firewall or uses the firewall as a next hop to reach other VLANs. The firewall would then enforce my security intent, ensuring that only desired traffic could enter a VLAN to reach the required machines.

This design works perfectly well in on-premises cable-oriented networks because the networks (physical or virtual) are connected via cable(s) running to the firewall.

Bringing Cable-Oriented Designs To Azure

There is no finger-pointing here – I still have nightmares about an early Azure design I did where I created a VNet diagram with somewhere between 10 and 20 subnets. We all learn, and I'm hoping you learn from my mistakes.

Using the same requirements as before for our workload, we can produce the below design … based on cable-oriented patterns.

We create a single virtual network broken into 3 subnets. Each subnet has VMs for each role in the application. We then isolate each of the machines using NSGs.

That seems perfect, right? It is secure. Traffic will get from A to B. If we implement the rules correctly, then only the correct traffic will flow. But this design does display a lack of understanding.

Remember: packets go directly from source to destination. There is no default gateway. If an NSG that is processing rules on an Application Server NIC is allowing or denying traffic, then what is the point of the subnet? The subnet is not doing the segmentation; the NSG is doing the segmentation.

How Can We Segment Networks In Azure?

The most basic segmentation method in an Azure virtual network is the Network Security Group (NSG). While the previous Azure diagram is not technically wrong, the below diagram displays a better understanding of the underlying technology:

In this design, we are accepting that neither the virtual network nor the subnet exist. We are using rules in the NSG to isolate each tier of the application:

Look at the below NSG to see how this isolation can be done with a very simple example:

The NSG denies all traffic by default (rule 4000). Then the only traffic permitted is what we modelled previously using subnets. The rules are processed on the NICs, so the only way traffic enters a VM is if it is compliant with the above NSG.

Yes, I could use groups of IP addresses, or better, Application Security Groups that make the rules more readable and allow aggregation/abstraction of NICs & IP addresses.
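To make it concrete, here is a hedged Bicep sketch of such an NSG, assuming web servers at 10.10.0.4-5, an application server at 10.10.0.10, and a database server at 10.10.0.20 – all in one subnet:

```bicep
resource nsgThreeTier 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-workload'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'AllowInternetToWebHttps'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'Internet'
          sourcePortRange: '*'
          destinationAddressPrefixes: [ '10.10.0.4', '10.10.0.5' ] // web servers
          destinationPortRange: '443'
        }
      }
      {
        name: 'AllowWebToApp'
        properties: {
          priority: 200
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefixes: [ '10.10.0.4', '10.10.0.5' ]
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.0.10' // application server
          destinationPortRange: '443'            // assumed application port
        }
      }
      {
        name: 'AllowAppToSql'
        properties: {
          priority: 300
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.10.0.10'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.0.20' // database server
          destinationPortRange: '1433'
        }
      }
      {
        name: 'DenyAllInbound' // the micro-segmentation rule: deny by default
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}
```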

So Why Do We Create Subnets In Azure?

The primary reason is quite boring: technical requirements. Let me adjust my design a little. The database is going to be implemented using SQL Managed Instance instead of a VM. In the original VM-only design, there were no impediments to using a single subnet. SQL Managed Instance changes the technical requirements because it must be connected to a dedicated subnet.
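The dedicated subnet requirement is expressed as a delegation. A minimal Bicep sketch, with assumed names and an assumed prefix:

```bicep
resource vnetWorkload 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'vnet-workload'
  location: resourceGroup().location
  properties: {
    addressSpace: { addressPrefixes: [ '10.30.0.0/16' ] }
    subnets: [
      {
        // SQL Managed Instance requires its own delegated subnet
        name: 'snet-sqlmi'
        properties: {
          addressPrefix: '10.30.1.0/27'
          delegations: [
            {
              name: 'sqlmi-delegation'
              properties: {
                serviceName: 'Microsoft.Sql/managedInstances'
              }
            }
          ]
        }
      }
    ]
  }
}
```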

That’s a simple example. A different example might be that I must use different address prefixes – see an older post by me on using a Linux VM as a NAT gateway where the VM has an internal NIC on a regularly addressed subnet and a second NIC in a subnet that is addressed based on NAT requirements.

Another example might be that you need to create custom routes for different NICs to the same prefix. For example, some NICs will go via your firewall to 0.0.0.0/0. Other NICs might go to “None” (a blackhole that drops packets) for traffic going to 0.0.0.0/0. The only way to implement that is to have one subnet for each Route Table. I’m not going to dive into routing here – let’s save that for another day.
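As a quick sketch of that second Route Table – the "blackhole" one, which you would associate with its own subnet:

```bicep
// Default route is dropped at the source NIC; these NICs get no Internet egress
resource rtNoEgress 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-no-egress'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'blackhole-default'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'None' // packets are dropped, never transmitted
        }
      }
    ]
  }
}
```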

Taking This Bigger

I am eventually going to explain enough things so I can show you why the classic Azure “big VNet” likely called production, test, or dev, is both an operational and security nightmare. But the above content, along with my other recent posts, are just part of the puzzle. Watch out for more content coming soon.

Beware Of The Default Rules In Network Security Groups

The Network Security Group (NSG) is the primary mechanism for segmenting a subnet in Microsoft Azure. NSGs are commonly implemented. Unfortunately, people assume quite a bit about NSGs, and I want to tackle that by explaining why you need to be aware of the default rules in Network Security Groups.

The Assumption

Let’s say I have an extremely simple workload consisting of:

  • A virtual machine acting as a web server.
  • A virtual machine acting as a database server.

Yes, this could be VNet-connected PaaS services and have the same issues, but I want the example to be as clear as possible.

I want to lock down and protect that subnet so I will create an NSG and associate it with the subnet. Traffic from the outside world is blocked, right? Now, I will create an inbound rule to allow HTTPS/TCP 443 from client external addresses to the web server.

| Name | Source | Protocol | Port | Destination | Action |
| --- | --- | --- | --- | --- | --- |
| AllowWeb | <clients> | TCP | 443 | Web VM | Allow |

The logic I expect is:

  1. Allow web traffic from the clients to the web server.
  2. Allow SQL traffic from the web server to the database server in the same subnet.
  3. Everything else is blocked.

I check the Inbound Rules in the NSG and I can see my custom rules and the built-in default rules. This confirms my logic, right?

All is well, until one day, every computer in the office has a ransomware demand screen and both of my Azure VMs are offline. Now my boss is screaming at me because the business’s online application is not selling our products to customers.

Where It All Went Wrong

Take a look at the default rules in the above screenshot. Rule 65500 denies all traffic. That’s what we want; block all traffic where a higher priority rule doesn’t allow it. That’s the rule that we were banking on to protect our Azure workload.

But take a look at rule 65000. That rule allows all traffic from VirtualNetwork to VirtualNetwork. We have assumed that VirtualNetwork means the virtual network containing the subnet that the NSG is associated with – in other words, the virtual network that we are working on.

You are in for a bigger surprise than a teddy bear picnic in your woods if you research the definition of VirtualNetwork:

The virtual network address space (all IP address ranges defined for the virtual network), all connected on-premises address spaces, peered virtual networks, virtual networks connected to a virtual network gateway, the virtual IP address of the host, and address prefixes used on user-defined routes. This tag might also contain default routes.

In summary, this means that VirtualNetwork contains:

  • The prefixes in your virtual network
  • Any peered virtual networks
  • Any remote networks connected by site-to-site networking
  • Any network prefixes that you have referenced in user-defined routes in your subnets.

Or, pretty much every network you can route to/from. And that’s how the ransomware got from someone’s on-premises PC into the virtual network. The on-premises networks were connected with the Azure virtual network by VPN. The built-in 65000 rule allowed all traffic from on-premises. There was nothing to block the ransomware from spreading to the Azure VMs from the on-premises network.

Solving This Problem

There are a few ways to solve this issue. I'll show you a couple. I am a believer in true micro-segmentation to create trust-no-one networks. The goal here is that no traffic is allowed anywhere on any Azure network without a specific rule to permit flows that are required by the business/technology.

The logic of the below is that all traffic will be denied by default, including traffic inside the subnet.

Remember, all NSG rules are processed at the NIC, no matter how the NSG is associated.

I have added a low-priority (4000) rule to deny everything that is not allowed in the higher-priority rules. That will affect all traffic from any source, including sources in the same virtual network or subnet.

By the way, the above is the sort of protection that many national cyber security agencies are telling people to implement to stop modern threats – not just the threats of 2003.

I know that some of you will prefer to treat the NSG as an edge defence, allowing all traffic inside the virtual network. You can do that too. Here’s an example of that:

My rule at 3900 allows all traffic inside the address prefix of the virtual network. The following rule, 4000, denies everything, which means that anything from outside the network (other than traffic allowed by higher-priority rules, such as rule 100) will be blocked.
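Here is a hedged Bicep sketch of that "edge style" rule pair, assuming a virtual network prefix of 10.10.0.0/16:

```bicep
resource nsgEdgeStyle 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-edge-style'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'AllowIntraVnet' // the VNet's own prefix, NOT the VirtualNetwork tag!
        properties: {
          priority: 3900
          direction: 'Inbound'
          access: 'Allow'
          protocol: '*'
          sourceAddressPrefix: '10.10.0.0/16'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.0.0/16'
          destinationPortRange: '*'
        }
      }
      {
        name: 'DenyAllInbound' // blocks peered VNets, VPN, and everything else
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}
```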

The Lesson

Don't assume anything. You now know that VirtualNetwork means everything that can route to your virtual network. Similarly, the Internet service tag includes the Internet and Microsoft Azure!

Azure Virtual Networks Do Not Exist

In this post, I want to share the most important thing that you should know when you are designing connectivity and security solutions in Microsoft Azure: Azure virtual networks do not exist.

A Fiction Of Your Own Mind

I understand why Microsoft has chosen to use familiar terms and concepts with Azure networking. It’s hard enough for folks who have worked exclusively with on-premises technologies to get to grips with all of the (ongoing) change in The Cloud. Imagine how bad it would be if we ripped out everything they knew about networking and replaced it with something else.

In a way, that's exactly what happens when you use Azure's networking. It is most likely very different to what you have previously used. Azure is a multi-tenant cloud. Countless thousands of tenants are signed up and using a single global physical network. If we want to avoid all the pains of traditional hosting and enable self-service, then something different has to be done to abstract the underlying physical network. Microsoft has used VXLAN to create software-defined networking; this means that an Azure customer can create their own networks with address spaces that have nothing to do with the underlying physical network. The Azure fabric tracks what is running where and which NICs can talk to each other, and forwards packets as required.

In Azure, everything is either a physical (rare) or a virtual (most common) machine. This includes all the PaaS resources and even those so-called serverless resources. When you drill down far enough in the platform, you will find a machine with an operating system and a NIC. That NIC is connected to a network of some kind, either an Azure-hosted one (in the platform) or a virtual network that you created.

The NIC Is The Router

The above image is from a slide I use quite often in my Azure networking presentations. I use it to get a concept across to the audience.

Every virtual machine (except for Azure VMware Solution) is hosted on a Hyper-V host, and remember that most PaaS services are hosted in virtual machines. In the image, there are two virtual machines that want to talk to each other. They are connected to a common virtual network that uses a customer-defined prefix of 10.0.0.0/8.

The source VM sends a packet to 10.10.1.5. The packet exits the VM's guest OS and hits the Azure NIC. The NIC is connected to a virtual switch in the host – did you know that in Hyper-V, the switch port is a part of the NIC to enable consistent processing no matter what host the VM is moved to? The virtual switch encapsulates the packet to enable transmission across the physical network – the physical network has no idea about the customer's prefix of 10.0.0.0/8. How could it? I'd guess that 80% of customers use all or some of that prefix. Encapsulation hides the customer-defined source and destination addresses. The Azure fabric knows where the customer's destination (10.10.1.5) is running, so it uses the physical destination host's address in the encapsulated packet.

Now the packet is free to travel across the physical Azure network – across the rack, data centre, region, or even the global network – to reach the destination host. There, the packet moves up the stack, is decapsulated, and is dropped into the NIC of the destination VM, where things like NSG rules (how the NSG is associated doesn't matter) are processed.

Here’s what you need to learn here:

  1. The packet went directly from source to destination at the customer level. Sure it travelled along a Microsoft physical network but we don’t see that. We see that the packet left the source NIC and arrived directly at the destination NIC.
  2. Each NIC is effectively its own router.
  3. Each NIC is where NSG rules are processed: source NIC for outbound rules and destination NIC for inbound rules.

The Virtual Network Does Not Exist

Have you ever noticed that every Azure subnet has a default gateway that you cannot ping?

In the above example, no packets travelled across a virtual network. There were no magical wires. Packets didn’t go to a default gateway of the source subnet, get routed to a default gateway of a destination subnet and then to the destination NIC. You might have noticed in the diagram that the source and destination were on different peered virtual networks. When you peer a virtual network, an operator is not sent sprinting into the Azure data centres to install patch cables. There is no mysterious peering connection.

This is the beauty and simplicity of Azure networking in action. When you create a virtual network, you are simply stating:

Anything connected to this network can communicate with each other.

Why do we create subnets? In the past, subnets were for broadcast control. We used them for network isolation. In Azure:

  • We can isolate items from each other in the same subnet using NSG rules.
  • We don’t have broadcasts – they aren’t possible.

Our reasons for creating subnets are greatly reduced, and so are our subnet counts. We create subnets when there is a technical requirement – for example, an Azure Bastion requires a dedicated subnet. We should end up with much simpler, smaller virtual networks.

How To Think of Azure Networks

I cannot say that I know how the underlying Azure fabric works. But I can imagine it pretty well. I think of it simply as a mapping system. And I explain it using Venn diagrams.

Here’s an example of a single virtual network with some connected Azure resources.

Connecting these resources to the same virtual network is an instruction to the fabric to say: "Let these things be able to route to each other". When the App Service (with VNet Integration) wants to send a packet to the virtual machine, the NIC of the App Service's underlying machine will send the packets directly to the NIC of the destination VM.

Two more virtual networks, blue and green, are created. Note that none of the virtual networks are connected/peered. Resources in the black network can talk only to each other. Resources in the blue network can talk only to each other. Resources in the green network can talk only to each other.

Now we will introduce some VNet peering:

  • Black <> Blue
  • Black <> Green

As I stated earlier, no virtual cables are created. Instead, the fabric has created new mappings. These new mappings enable new connectivity:

  • Black resources can talk with blue resources
  • Black resources can talk with green resources.

However, green resources cannot talk directly to blue resources – this would require routing to be enabled via the black network with the current peering configuration.
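For illustration, peering in Bicep is just a declaration of that mapping – no cabling involved. A minimal sketch of one direction (VNet names assumed; a matching peering is required on the Blue side):

```bicep
resource blackVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: 'vnet-black'
}

resource blueVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: 'vnet-blue'
}

resource blackToBlue 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-09-01' = {
  parent: blackVnet
  name: 'peer-black-to-blue'
  properties: {
    remoteVirtualNetwork: { id: blueVnet.id }
    allowVirtualNetworkAccess: true // NICs in both VNets join one mesh mapping
    allowForwardedTraffic: false    // no transit: Green still cannot reach Blue
  }
}
```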

I can implement isolation within the VNets using NSG rules. If I want further inspection and filtering from a firewall appliance then I can deploy one and force traffic to route via it using BGP or User-Defined Routing.

Wrapping Up

The above simple concept is the biggest barrier that many people face when it comes to good Azure network design. If you grasp the facts that virtual networks do not exist and that packets route directly from source to destination, then you are well on your way to designing well-connected/secured networks and being able to troubleshoot them.

If You Liked This Article

If you liked this article, then why don't you check out my custom Azure training with my company, Cloud Mechanix. My next course is Azure Firewall Deep Dive, a two-day virtual course where I go through how to design and implement Azure Firewall, including every feature. This two-day course runs on February 12/13, timed for (but not limited to) European attendees.

Will International Events Impact Cloud Computing?

You must have been hiding under a rock if you haven’t noticed how cloud computing has become the default in IT. I have started to wonder about the future of cloud computing. Certain international events have the potential to disrupt cloud computing in a major way. I’m going to play out two scenarios in this post and illustrate what the possible problems may be.

Bear In The East

Russia expanded their conflict with Ukraine in February 2022. This was the largest signal so far that the leadership of Russia wanted to expand their post-Soviet borders to include some of the former USSR nations. The war in Ukraine is taking much longer than expected and has chewed up the Russian military, thanks to the determination of the Ukrainian people. However, we know that Russia has eyes elsewhere.

The Baltic nations (Lithuania, Latvia and Estonia) provide a potential land link between Russia and the Baltic Sea. North of those nations is Finland, a country with a long & wild border with Russia – and also one with a history of conflict with Russia. Finland (and Sweden) has recognised the potential of this expanded threat by joining NATO.

If you read “airport thrillers” like me, then you’ll know that Sweden has an island called Gotland in the Baltic Sea. It plays a huge strategic role in controlling that sea. If Russia were to take that island, they could prevent resupply via the Baltic Sea to the Baltic countries and Finland, leaving only air, land, and the long route up North – speaking of which …

Norway also shares a land border with Russia to the north of Finland. The northern Norwegian coast faces the main route from Murmansk (a place I attacked many times when playing the old Microprose F-19 game). Murmansk is the home of the Russian Northern Fleet. Its route to the Atlantic runs north of the Norwegian coast and south between Iceland and Ireland.

In the Arctic is Svalbard, a group of islands that is host to polar bears and some pretty tough people. These islands are also eyed up by Russia – I'm told that it's not unusual to hear stories of some kind of espionage there.

So Russia could move west and attack. What would happen then?

Nordic Azure Regions

There are several Azure regions in the Nordics:

  • Norway East, paired with Norway West
  • Sweden Central, paired with Sweden South
  • One is "being built" in Espoo, Finland, just outside the capital, Helsinki.

Norway West is a small facility that is hosted in a third-party data centre and is restricted to a few customers.

I say "being built" about the Finnish region because I suspect that it's been active for a while with selected customers. Not long after the announcement of the region (2022), I had a nationally strategic customer tell me that the local Microsoft data centre salesperson was telling them to stop deploying in Azure West Europe (Netherlands) and to start using the new Finnish region.

FYI: the local Microsoft data centre salesperson has a target of selling only the local Azure region. The local subsidiary has to make a usage commitment to HQ before a region is approved. Adoption in another part of Azure doesn’t contribute to this target.

I remember this conversation because it was not long after tanks rolled into Ukraine and talk of Finland joining NATO began heating up. I asked my customer: “Let’s say you place nationally critical services into the new Finnish region. What is one of the first things that Russia will send missiles to?” Yes, they will aim to shut down any technology and communications systems first … including Azure regions. All the systems hosted in Espoo will disappear in a flaming pile of debris. I advised the customer that if I were them, I would continue to use cloud regions that were as far away as possible while still meeting legal requirements.

Norway's situation is worse. Their local and central governments have to comply with a data placement law, which prevents the placement of certain data outside of Norway. If you're using Azure, you have no choice: you must use Norway East, which is in urban Oslo (the capital, on the south coast). Private enterprises can choose any of the European regions (they typically take West Europe/Netherlands, paired with North Europe/Ireland) so they have a form of disaster recovery (I'll come back to this topic later). However, Norway East users cannot replicate into Norway West – the Stavanger-located region is only available to a select few (allegedly three) customers and it is very small.

FYI: restricted access paired regions are not unusual in Azure.

Disaster Recovery

So a hypersonic missile just took out my Azure region – what do I do next? In an ideal world, all of your data was replicated in another location. Critical systems were already built with redundant replicas. Other systems can be rebuilt by executing pipelines with another Azure region selected.

Let’s shoot all of that down, shall we?

So I have used Norway East. And I've got a bunch of PaaS data storage systems. Many of those storage systems (such as Azure Backup Recovery Services vaults) are built on blob storage. Blob storage offers geo-redundancy, which is restricted to the paired region. If my data storage can only replicate to the paired region and there is no paired region available to me, then there is no replication option. You will need to bake your own replication system.

Some compute/data resource types offer replication in any region. For example, Cosmos DB can replicate to other regions, but that comes with potential sync/latency issues. Azure VMs offer Azure Site Recovery, which enables replication to any region. This is where I expect the "cloud native" types to shout "GitOps!", but they always seem to focus only on compute and forget things like data – no, we won't be putting massive data stores in an AKS container 🙂

Has anyone not experienced capacity issues in an Azure region in the last few years? There are probably many causes for that so we won’t go down that rabbit hole. But a simple task of deploying a new AVD worker pool or a firewall with zone resilience commonly results in a failure because the region doesn’t have capacity. What would happen if Norway East disappeared and all of the tenants started to failover/redeploy to other European regions? Let’s just say that there would be massive failures everywhere.

Orange Man In The West

Greenland is an autonomous territory of the Kingdom of Denmark, an EU member state. US president-elect, Donald Trump, has been sabre-rattling about Greenland recently. He wants the US to take it over by either economic (trade war) or military means.

If the USA goes into a trade war with Denmark, then it will go into a trade war with all of the EU. Neither side will win. If the tech giants continue to personally support Donald Trump then I can imagine the EU retaliating against them. Considering that Microsoft, Amazon, and Google are American companies, sanctions against those companies would be bad – the cost of cloud computing could rocket and make it unviable.

If the USA invaded Greenland (a NATO ally by virtue of being a Danish territory) then it would lead to a very unpleasant situation between NATO/EU and the USA. One could imagine that American companies would be shunned, not just emotionally but also legally. That would end Azure, AWS, and Google in the EU.

So how would one recover from losing their data and compute platform? It’s not like you can just live migrate a petabyte data lake or a workload based on Azure Functions.

The Answer

I don't have a good answer. I know of an organisation that had an "only do VMs in Azure" policy. I remember being dumbfounded at the time. They explained that it was for support reasons. But looking back on it, they abstracted themselves from Azure by use of an operating system. They could simply migrate/restore their VMs to another location if necessary – on-prem, another cloud, another country. They are not tied to the cloud platform, the location, or the hardware. But they do lose so many of the benefits of using the cloud.

I expect someone will say "use on-prem for DR". OK, so you'll build a private cloud, at huge expense, and let it sit there doing nothing on the off-chance that it might be used. If I were in that situation, then I wouldn't be using Azure/etc. at all!

I've been wondering for a while if the EU could fund/sponsor the creation of an IT sector in Europe that is independent from the USA. It would need an operating system, productivity software, and a cloud platform. We don't have any tech giants as big or as cash-rich as Microsoft in the EU, so this would have to be sponsored. I also think that it would have to be a collaboration. My fear is that it would be bogged down in bureaucracy and have a heavy Germany/France-first influence. But I am looking at the news every day and realising that we need to consider a non-USA solution.

Wrapping Up

I'm all doom and gloom today. Maybe it's all of the negativity in the news that is bringing me down. I see continued war in Ukraine, Russia attacking infrastructure in the Baltic Sea, and threats from the USA. The world has changed and we all will need to start thinking about how we act in it.

Lots of Speaking Activity

After a quiet few pandemic years with no in-person events and the arrival of twins, my in-person presentation activity was minimal. My activity has started to increase, and there have been plenty of recent events and more are scheduled for the near future.

The Recent Past

Experts Live Netherlands 2024

It was great to return to The Netherlands to present at Experts Live Netherlands. Many European IT workers know the Experts Live brand; Isidora Maurer (MVP) has nurtured & shepherded this conference brand over the years, starting with a European-wide conference and then working with others to branch it out to localised events that give more people a chance to attend. I presented at this event a few years ago but personal plans prevented me from submitting again until this year. And I was delighted to be accepted as a speaker.

Hosted in Nieuwegein, just a short train ride from Schiphol airport in Amsterdam (Dutch public transport is amazing), the conference featured a packed expo hall and many keen attendees. I presented my "Azure Firewall: The Legacy Firewall Killer" session to a standing-room-only room.

TechMentor Microsoft HQ 2024

The first conference I attended was WinConnections 2004 in Lake Las Vegas. That conference changed my career. I knew that TechMentor had become something like that – the quality of the people I knew who had presented at the event in the past was superb. I had the chance to submit some sessions this time around and was happy to have 3 accepted, including a pre-conference all-day seminar.

I worked my tail off on that “pre-con”. It’s an expansion of one of my favourite sessions that many events are scared of, probably because they think it’s too niche or too technical: “Routing – The Virtual Cabling of Secure Azure Networking”. Expanding a 1 hour session to a full day might seem daunting but I had to limit how much content I included! Plus I had to make this a demo session. I worked endless hours on a Bicep deployment to build a demo lab for the attendees. This was necessary because it would take too long to build by hand. I had issues with Azure randomly failing and with version stuff changing inside Microsoft. As one might expect, the demo gods were not kind on the day and I quickly had to pivot from hands-on labs to demos. While the questions were few during the class, there were lots of conversations during the breaks and even on the following days.

My second session was “Azure Firewall: The Legacy Firewall Killer” – this is a popular session. I like doing this topic because it gives me a chance to crack a few jokes – my family will groan at that thought!

My final session was the one that I was most worried about. "Your Azure Migration Project Is Doomed To FAIL" had never been accepted by any event before. I think the title might seem negative, but it's meant to be funny. The content is based on my experience dealing with mid-large organisations that never quite understand the difference between cloud migration and cloud adoption. I explain this through several fictional stories. There is liberal use of images from Unsplash and opportunities for some laughter. This was the session that I was least confident in, but it worked.

TechMentor puts a lot of effort into mixing the attendees and the presenters. On the first night, attendees and presenters went to a local pizza place/bar and sat in small booths. We had to talk to each other. The booth that I was at featured people from all over the USA with different backgrounds. People came and went, but we talked and were the last to leave. On the second day, lunch was an organised affair where each presenter was host to a table. Attendees could grab lunch and sit with a presenter to discuss what was on their minds. I knew that migrations were a hot topic. And I also knew that some of those attendees were either doing their first migration or first re-attempt at a migration. I was able to tune my session a tiny bit to the audience and it hit home. I think the best thing about this was the attention I saw in the room, the verbal feedback that I heard just after the session, and the folks who came up to talk to me after.

A Break

I brought my family to the airport the day before I flew to TechMentor. They were going to Spain for 4 weeks and I joined them a few days later after a l-o-n-g Seattle-Los Angeles-Dublin-Alicante journey (I really should have stayed one extra night in Seattle and taken the quicker 1-hop via Iceland).

33+ Celsius heat, sunshine, a pool, and a relaxed atmosphere in a residential town (we didn't go to a "hotel town") made it a great place to work for a week and then take two weeks of vacation.

I went running most mornings, doing 5-7 KMs. I enjoy getting up early in places like this, picking a route to explore on a map, and hitting the streets to discover the locality and places to go with my family. It’s so different to home, where I have just two routes with footpaths that I can use.

Coming home was a shock. Ireland isn’t the sunniest or the warmest place in the world, but it feels like mid-winter at the moment. I think I acclimatised to Spain as much as a pasty Irish person can. This morning I even had to put a jacket on and run a couple of KMs to let my legs warm up before picking up the pace.

Upcoming Events

There are three confirmed events coming up:

Nieuwegein (Netherlands) September 11: Azure Fest 2025

I return to this Dutch city in a few days to do a new session “Azure Virtual Network Manager”. I’ve been watching this product develop since the private preview. It’s not quite ready (pricing is hopefully being fixed) but it could be a complete game changer for managing complex Azure networks for secure/compliant PaaS and IaaS deployments. I’ll discuss and demo the product, sharing what I like and don’t like.

Dublin (Ireland) October 7: Microsoft Azure Community & AI Day

Organised by Nicolas Chang (MVP) this event will feature a long list of Irish MVPs discussing Azure and AI in a rapid-fire set of short sessions. I don’t think that the event page has gone live yet so watch out for it. I will be presenting the “Azure Virtual Network Manager” again at this event.

TBA: Nordics

I’ve confirmed my speaking slots for 2 sessions at an event that has not publicly announced the agenda yet. I look forward to heading north and sharing some of my experiences.

My Sessions

If you are curious, then you can see my Sessionize public profile here, which is where you’ll see my collection of available sessions.

Network Rules Versus Application Rules for Internal Traffic

This post is about using either Network Rules or Application Rules in Azure Firewall for internal traffic. I’m going to discuss a common scenario, a “problem” that can occur, and how you can deal with that problem.

The Rules Types

There are three kinds of rules in Azure Firewall:

  • DNAT Rules: Control traffic originating from the Internet, directed to a public IP address attached to Azure Firewall, and translated to a private IP Address/port in an Azure virtual network. This is implicitly applied as a Network Rule. I rarely use DNAT Rules – most modern applications are HTTP/S and enter the virtual network(s) via an Application Gateway/WAF.
  • Application Rules: Control traffic going to HTTP, HTTPS, or MSSQL (including Azure database services).
  • Network Rules: Control anything going anywhere.

The Scenario

We have an internal client, which could be:

  • On-premises over private networking
  • Connecting via point-to-site VPN
  • Connected to a virtual network, either the same as the Azure Firewall or to a peered virtual network.

The client needs to connect, with SSL/TLS authentication, to a server. The server is connected to another virtual network/subnet. The route to the server goes through the Azure Firewall. I’ll complicate things by saying that the server is a PaaS resource with a Private Endpoint – this doesn’t affect the core problem but it makes troubleshooting more difficult 🙂

NSG rules and firewall rules have been accounted for and created. The essential connection is either HTTPS or MSSQL and is implemented as an Application Rule.

The Problem

The client attempts a connection to the server. You end up with some type of application error stating that there was either a timeout or a problem with SSL/TLS authentication.

You begin to troubleshoot:

  • Azure Firewall shows the traffic is allowed.
  • NSG Flow Logs show nothing – you panic until you remember/read that Private Endpoints do not generate flow logs – I told you that I’d complicate things 🙂 You can consider VNet Flow Logs to try to capture this data and then you might discover the cause.

You experiment and discover two things:

  • If you disconnect the NSG from the subnet, then the connection is allowed. Hmm – the rules are correct, so the traffic should be allowed: traffic from the client prefix(es) is permitted to the server IP address/port. The NSG rules match the firewall rules.
  • You change the Application Rule to a Network Rule (with the NSG still associated and unchanged) and the connection is allowed.

So, something is going on with the Application Rules.

The Root Cause

In this case, the Application Rule is SNATing the connection. In other words, when the connection is relayed from the Azure Firewall instance to the server, the source IP is no longer that of the client; the source IP address is a compute instance in the AzureFirewallSubnet.

That is why:

  • The connection works when you remove the NSG
  • The connection works when you use a Network Rule with the NSG – the Network Rule does not SNAT the connection.

Solutions

There are two solutions to the problem:

Using Application Rules

If you want to continue to use Application Rules, then the fix is to modify the NSG rule: change the source IP prefix(es) to be the AzureFirewallSubnet. Here’s a minimal Bicep sketch of that modified rule – the names, prefixes, and port are my assumptions, not from any real deployment:
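```bicep
// Hypothetical example: allow the traffic that was SNATed by an Application Rule.
// 10.0.1.0/26 is assumed to be the AzureFirewallSubnet prefix.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-04-01' = {
  name: 'nsg-server'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'AllowHttpsFromAzureFirewallSubnet'
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.1.0/26' // the source is now the firewall, not the client
          sourcePortRange: '*'
          destinationAddressPrefix: '10.1.0.4' // the server's private IP (assumption)
          destinationPortRange: '443'
        }
      }
    ]
  }
}
```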

The downsides to this are:

  • The NSG rules are inconsistent with the Azure Firewall rules.
  • The NSG rules are no longer restricting traffic to documented approved clients.

Using Network Rules

My preference is to use Network Rules for all inbound and east-west traffic. Yes, we lose some of the “Layer-7” features but we still have core features, including IDPS in the Premium SKU.

Contrary to using Application Rules:

  • The NSG rules are consistent with the Azure Firewall rules.
  • The NSG rules restrict traffic to the documented approved clients.

When To Use Application Rules?

In my sessions/classes, I teach:

  • Use DNAT rules for the rare occasion where Internet clients will connect to Azure resources via the public IP address of Azure Firewall.
  • Use Application Rules for outbound connections to the Internet, including to Azure resources via public endpoints, through the Azure Firewall.
  • Use Network Rules for everything else.

This approach limits “weird sh*t” errors like the one described above and means that NSG rules are effectively clones of the Azure Firewall rules, with some additional rules to control traffic inside of the Virtual Network/subnet. A sketch of the approach follows.
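To illustrate that teaching, here’s a hedged Bicep sketch of a Firewall Policy rule collection group – every name, prefix, port, and FQDN below is an assumption, not a recommendation for your environment:

```bicep
// Assumes an existing Firewall Policy called 'afwp-hub'.
resource fwPolicy 'Microsoft.Network/firewallPolicies@2023-04-01' existing = {
  name: 'afwp-hub'
}

resource exampleRules 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2023-04-01' = {
  parent: fwPolicy
  name: 'rcg-example'
  properties: {
    priority: 200
    ruleCollections: [
      {
        ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
        name: 'rc-allow'
        priority: 100
        action: { type: 'Allow' }
        rules: [
          {
            // East-west traffic: a Network Rule, so the client IP is preserved (no SNAT)
            ruleType: 'NetworkRule'
            name: 'Client-To-Sql'
            ipProtocols: [ 'TCP' ]
            sourceAddresses: [ '192.168.10.0/24' ]
            destinationAddresses: [ '10.1.0.68' ]
            destinationPorts: [ '1433' ]
          }
          {
            // Outbound to the Internet: an Application Rule with FQDN filtering
            ruleType: 'ApplicationRule'
            name: 'Workload-To-Vendor-Api'
            protocols: [ { protocolType: 'Https', port: 443 } ]
            sourceAddresses: [ '10.1.0.0/24' ]
            targetFqdns: [ 'api.example.com' ]
          }
        ]
      }
    ]
  }
}
```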

Why Are There So Many Default Routes In Azure?

Have you wondered why an Azure subnet with no route table has so many default routes? What the heck is 25.176.0.0/13? Or what is 198.18.0.0/15? And why are they routing to None?

The Scenario

You have deployed a virtual machine. The virtual machine is connected to a subnet with no Route Table. You open the NIC of the VM and view Effective Routes. You expect to see a few routes for the RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, etc) and “quad zero” (0.0.0.0/0) but instead you find this:

What in the nelly is all that? I know I was pretty freaked out when I first saw it some time ago. Here are the weird addresses in text, excluding quad zero and the virtual network prefix:

10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
100.64.0.0/10
104.146.0.0/17
104.147.0.0/16
127.0.0.0/8
157.59.0.0/16
198.18.0.0/15
20.35.252.0/22
23.103.0.0/18
25.148.0.0/15
25.150.0.0/16
25.152.0.0/14
25.156.0.0/16
25.159.0.0/16
25.176.0.0/13
25.184.0.0/14
25.4.0.0/14
40.108.0.0/17
40.109.0.0/16

Next Hop = None

The first thing that you might notice is the next hop, which is set to None.

Remember that there is no “router” by default in Azure. The network is software-defined so routing is enacted by the Azure NIC/the fabric. When a packet is leaving the VM (and everything, including “serverless”, is a VM in the end unless it is physical) the Azure NIC figures out the next hop/route.

When traffic leaves a NIC, the best route is selected. If that route has a next hop set to None, then the traffic is dropped as if it disappeared into a black hole. We can use this feature as a form of “firewall” – we don’t want the traffic, so “Abracadabra – make it go away”. A quick sketch follows.
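Here’s a minimal Bicep sketch of creating your own black-hole route – the prefix is just an example:

```bicep
// A user-defined black-hole route: traffic to a prefix we never want
// reachable is dropped at the source NIC. The prefix is an arbitrary example.
resource rtBlackhole 'Microsoft.Network/routeTables@2023-04-01' = {
  name: 'rt-blackhole-example'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'DropCarrierGradeNat'
        properties: {
          addressPrefix: '100.64.0.0/10'
          nextHopType: 'None' // the packet disappears into a black hole
        }
      }
    ]
  }
}
```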

A Microsoft page (and some googling) gives us some more clues.

RFC-1918 Private Addresses

We know these well-known addresses, even if we don’t necessarily know the RFC number:

  • 10.0.0.0/8
  • 172.16.0.0/12
  • 192.168.0.0/16

These addresses are intended to be used privately. But why is traffic to them dropped? If your network doesn’t have a deliberate route to other address spaces then there is no reason to enable routing to them. So Azure takes a “secure by default” stance and drops the traffic.

Remember that if you do use a subset of one of those spaces in your VNet or peered VNets, then the more specific routes for your prefixes will be selected ahead of the more general routes that drop the traffic.

RFC-6598 Carrier Grade NAT

The prefix, 100.64.0.0/10, is defined as being used for carrier-grade NAT. This block of addresses is specifically meant to be used by Internet service providers (ISPs) that implement carrier-grade NAT, to connect their customer-premises equipment (CPE) to their core routers. Therefore, we want nothing to do with it – so traffic to it is dropped.

Microsoft Prefixes

20.35.252.0/22 is registered in Redmond, Washington, the location of Microsoft HQ. Other nearby Microsoft prefixes are used by Exchange Online for the US Government. That might give us a clue … maybe Microsoft is firewalling sensitive online prefixes from Azure? It’s possible someone could hack a tenant, fire up lots of machines to act as bots, and then attack sensitive online services that Microsoft operates. This kind of “route to None” approach would protect those prefixes unless someone took the time to override the routes.

104.146.0.0/17 is a block that is owned by Microsoft with a location registered as Boydton, Virginia, the home of the East US region. I do not know why it is dropped by default. The zone that resolves names is hosted on Azure Public DNS. It appears to be used by Office 365, maybe with sharepoint.com.

104.147.0.0/16 is also owned by Microsoft and is also registered in Boydton, Virginia. This prefix is even more mysterious.

Doing a Google search for 157.59.0.0/16 on the Microsoft.com domain results in the fabled “Googlewhack”: a single result with no adverts. That links to a whitepaper on Microsoft.com which is written in Russian. The single mention translates to “Redirecting MPI messages of the MyApp.exe application to the cluster subnet with addresses 157.59.x.x/255.255.0.0.” This address is also in Redmond.

23.103.0.0/18 has more clues in the public domain. This prefix appears to be split and used by different parts of Exchange Online, both public and US Government.

The following block is odd:

  • 25.148.0.0/15
  • 25.150.0.0/16
  • 25.152.0.0/14
  • 25.156.0.0/16
  • 25.159.0.0/16
  • 25.176.0.0/13
  • 25.184.0.0/14
  • 25.4.0.0/14

They are all registered to Microsoft in London and I can find nothing about them. But … I have a sneaky tin (aluminum) foil suspicion that I know what they are for.

40.108.0.0/17 and 40.109.0.0/16 both appear to be used by SharePoint Online and OneDrive.

Other Special Purpose Subnets

RFC-5735 specifies some prefixes so they are pretty well documented.

127.0.0.0/8 is the loopback address. The RFC says “addresses within the entire 127.0.0.0/8 block do not legitimately appear on any network anywhere” so it makes sense to drop this traffic.

198.18.0.0/15 “has been allocated for use in benchmark tests of network interconnect devices … Packets with source addresses from this range are not meant to be forwarded across the Internet”.

Adding User-Defined Routes (UDRs)

Something interesting happens if you start to play with User-Defined Routes. Add a Route Table to the subnet. Now add a UDR:

  • Prefix: 0.0.0.0/0
  • Next Hop: Internet

When you check Effective Routes, the default route to 0.0.0.0/0 is deactivated (as expected) and the UDR takes over. All the other routes are still in place.

If you modify that UDR just a little, something different happens:

  • Prefix: 0.0.0.0/0
  • Next Hop: Virtual Appliance
  • Next Hop IP Address: {Firewall private IP address}

All the mysterious default routes disappear from Effective Routes. My guess is that the Microsoft logic is “This is a managed network – the customer put in a firewall and that will block the bad stuff”.

The magic appears only to happen if you use the prefix 0.0.0.0/0 – try a different prefix and all the default routes re-appear.
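If you want to reproduce the experiment, here’s a minimal Bicep sketch – the firewall IP is an assumption; swap the next hop type to Internet (and remove the IP address) to see the default routes survive instead:

```bicep
// A 0.0.0.0/0 UDR with a Virtual Appliance next hop – this is the variant
// that makes the mysterious default routes disappear.
resource rtSpoke 'Microsoft.Network/routeTables@2023-04-01' = {
  name: 'rt-spoke-workload'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'EverywhereViaFirewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4' // hub firewall private IP (assumption)
        }
      }
    ]
  }
}
```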

Designing Network Security To Combat Modern Threats

In this post, I want to discuss how one should design network security in Microsoft Azure, dispensing with past patterns and combatting threats that are crippling businesses today.

The Past

Network security did not change much for a very long time. The classic network design is focused on an edge firewall: “All the bad guys are trying to penetrate our network from the Internet, so we’ll put up a very strong wall at the edge.” With that approach, you’ll commonly find the “DMZ” network; a place where things like web proxies and DNS proxies isolate interior users and services from the Internet.

The internal network might be made up of two or more VLANs. For example, one or more client device VLANs and a server VLAN. While the route between those VLANs might have passed through the firewall, it probably didn’t; traffic really “routed” through a smart core switch stack and there was limited to no firewall isolation between those VLANs.

This network design is fertile soil for malware. Ports are not usually left open to attack on the edge firewall. Hackers aren’t normally going to brute force their way through a firewall. There are easier ways in, such as:

  • Send an “invoice” PDF to the accounting department that delivers a trojan horse.
  • Impersonate someone, ideally someone that travels and shouts a lot, to convince a helpful IT person to reset a password.
  • Target users via phishing or spear phishing.
  • Compromise some upstream dependency that developers use and use it to attack from the servers.
  • Use a SQL injection attack to open a command prompt on an internal server.
  • And on and on and …

In each of those cases, the attack comes from within. The spread of the attack is unfettered. The blast area (a term used to describe the spread of an attack) is the entire network.

Secure Zones To The Rescue!

Government agencies love a nice secure zone architecture. This is a design where sensitive systems, such as those storing GDPR data or secrets, are placed on an isolated network.

Some agencies will even create a whole duplicate network that is isolated, forcing users to have two PCs – one “regular” PC on the Internet-connected network and a “secure” PC that is wired onto an isolated network with a limited set of sensitive services.

Realistically, that isolated network is of little value to most, but if you have that extreme a need – then good luck. By the way, that won’t work in The Cloud 🙂 Back to the more regular secure zone …

A special VLAN will be deployed and firewall rules will block all traffic into and out of that secure zone. The user experience might be to use Citrix desktops, hosted in the secure zone, to access services and data in that secure zone. But then reality starts cracking holes in the firewall’s deny all rules. No line of business app lives alone. They all require data from somewhere. Or there are integrations. Printers must be used. Scanners need to scan and share data. And legacy apps often use:

  • Domain (ADDS) credentials (how many ports do you need for that!!!)
  • SMB (TCP 445) for data transfer and integration

Over time, “deny all” becomes a long list of allow * from X to *, and so on, with absolutely no help from the app vendors.

The theory is that if an attack commences, then the blast area will be limited to the client network and, if it reaches the servers, it will be limited to the Internal network. But this design fails to understand that:

  • An attack can come from within. Consider the scenario where compromised runtimes are used or a SQL injection attack breaks out from a database server.
  • All the required integrations open up holes between the secure zone and the other networks, including those legacy protocols that things like ransomware live on.
  • If one workload in the secure zone is compromised, they all are because there is no network segmentation inside of the VLAN.

And eventually, the “secure zone” is no more secure than the Internal network.

Don’t Block The Internet!!!

I’m amazed how many organisations do not block outbound access to the Internet. It’s just such hard work to open up firewall rules for all those applications that have Internet dependencies. I can understand that for a client VLAN. But the server VLAN should be a controlled space – if it’s not known & controlled (i.e. governed) then it should not be permitted.

A modern attack, an advanced persistent threat (APT), isn’t just some dumb blast, grab, and run. It is a sneaky process of:

  1. Penetration
  2. Discovery, often manually controlled
  3. Spread, often manually controlled
  4. Steal
  5. Destroy/encrypt/etc

Once an APT gets in, it usually wants to call home to pull instructions down from a rogue IP address or compromised bot. When the APT wants to steal data, to be used as blackmail and/or to be sold on the Darknet, the malware will seek to upload data to the Internet. Both of these actions are taking advantage of the all-too-common open access to the Internet.

Azure is Different

Years of working with clients has taught me that there are three kinds of people when it comes to Azure networking:

  1. Those who managed on-premises networks: These folks struggle with Azure networking.
  2. Those who didn’t do on-premises networking, but knew what to ask for: These folks take to Azure networking quite quickly.
  3. Everyone else: Irrelevant to this topic

What makes Azure networking so difficult for the network admins? There is no cabling in the fabric – obviously there is cabling in the data centres but it’s all abstracted by the VXLAN software-defined networks. Packets are encapsulated on the source virtual machine’s host, transmitted over the physical network, decapsulated on the destination virtual machine’s host, and presented to the destination virtual machine’s NIC. In short, packets leave the source NIC and magically arrive on the destination NIC with no hops in between – this is why traceroute is pointless in Azure and why the default gateway doesn’t really exist.

“I’m not going to use virtual machines, Aidan. I’m doing PaaS and serverless computing.” In Azure, everything is based on virtual machines, unless it is explicitly hosted on physical hosts (Azure VMware Solution and some SAP stuff, for example). Even Functions run on a VM somewhere hidden in the platform. Serverless means that you don’t need to manage it.

The software-defined thing is why:

  • Partitioned subnets for a firewall appliance (front, back, VPN, and management) offer nothing from a security perspective in Azure.
  • ICMP isn’t as useful as you’d imagine in Azure.
  • The concept of partitioning workloads for security using subnets is not as useful as you might think – it’s actually counter-productive over time.

Transformation

I like to remind people during a presentation or a project kickoff that going on a cloud journey is supposed to result in transformation. You now re-evaluate everything and find better ways to do old things using cloud-native concepts. And that applies to network security designs too.

Micro-Segmentation Is The Word

Forget “Grease” – get on board with what you need to counter today’s threats: micro-segmentation. This is a concept where:

  • We protect the edge, inbound and outbound, permitting only required traffic.
  • We apply network isolation within the workload, permitting only required traffic.
  • We route traffic between workloads through the edge firewall, permitting only required traffic.

Yes, more work will be required when you migrate existing workloads to Azure. I’d suggest using Azure Migrate to map network flows. I never get to do that – I always get the “messy migration projects” where Azure Migrate isn’t an option – so assessing and understanding NSG Traffic Analytics and the Azure Firewall logs via KQL is a necessary skill.

Security Classification

Every workload should go through a security classification process. You need to weigh risk versus complexity. If you max out the security, you will increase costs and difficulty for otherwise simple operations. For example, a dev won’t be able to connect Visual Studio straight to an App Service if you deploy that App Service on a private or isolated App Service Plan. You will also have to host your own DevOps agents/GitHub runners because the Microsoft-hosted containers won’t be able to reach your SCM endpoints.

Every piece of compute is a potential attack vector: a VM, an App Service, a Function, a Container, a Logic App. The question is, if it is compromised, will the attacker be able to jump to something else? Will the data that is accessible be secret, subject to regulation, or a source of reputational damage?

This measurement process will determine if a workload should use resources that:

  • Have public endpoints (cheapest and easiest).
  • Use private endpoints (medium levels of cost, complexity, and security).
  • Use full VNet integration, such as an App Service Environment or a virtual machine (highest cost/complexity but most secure).

The Virtual Network & Subnet

Imagine you are building a 3-tier workload that will be isolated from the Internet using Azure virtual networking:

  • Web servers (the Internet-facing tier)
  • Middle tier
  • Databases

Not that long ago, we would have deployed that workload on 3 subnets, one for each tier. Then we would have built isolation using Network Security Groups (NSGs), one for each subnet. But you just learned that an SD-network routes packets directly from NIC to NIC. An NSG is a Hyper-V Port ACL that is implemented at the NIC, even if applied at the subnet level. We can create all the isolation we want using an NSG within the subnet. That means we can flatten the network design for the workload to one subnet. A subnet-associated NSG will restrict communications between the tiers – and ideally between nodes within the same tier. That level of isolation should block everything … should 🙂 A sketch follows.
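Here’s a hedged Bicep sketch of tier isolation inside one flat subnet – the ASG names and ports are my assumptions for a hypothetical 3-tier workload:

```bicep
// NICs/private endpoints for each tier are joined to the matching ASG.
resource asgWeb 'Microsoft.Network/applicationSecurityGroups@2023-04-01' = {
  name: 'asg-web'
  location: resourceGroup().location
}

resource asgMid 'Microsoft.Network/applicationSecurityGroups@2023-04-01' = {
  name: 'asg-mid'
  location: resourceGroup().location
}

resource asgDb 'Microsoft.Network/applicationSecurityGroups@2023-04-01' = {
  name: 'asg-db'
  location: resourceGroup().location
}

resource nsgWorkload 'Microsoft.Network/networkSecurityGroups@2023-04-01' = {
  name: 'nsg-workload'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        // Web tier to middle tier only, on an assumed app port
        name: 'AllowWebToMid'
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceApplicationSecurityGroups: [ { id: asgWeb.id } ]
          sourcePortRange: '*'
          destinationApplicationSecurityGroups: [ { id: asgMid.id } ]
          destinationPortRange: '8080'
        }
      }
      {
        // Middle tier to database tier only, on SQL's port
        name: 'AllowMidToDbSql'
        properties: {
          priority: 1010
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceApplicationSecurityGroups: [ { id: asgMid.id } ]
          sourcePortRange: '*'
          destinationApplicationSecurityGroups: [ { id: asgDb.id } ]
          destinationPortRange: '1433'
        }
      }
      // ... plus a custom DenyAll rule at priority 4000 (see the NSG section below)
    ]
  }
}
```

Because the rules reference group membership rather than IP addresses, they survive IP address changes within the tiers.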

Tips for virtual networks and subnets:

  • Deploy 1 virtual network per workload: Not only will this follow Azure Cloud Adoption Framework concepts, but it will help your overall security and governance design. Each workload is placed into a spoke virtual network and peered with a hub. The hub is used only for external connectivity, the firewall, and Azure Bastion (assuming this is not a vWAN hub).
  • Assign a single prefix to your hub & spoke: Firewall and NSG rules will be easier.
  • Keep the virtual networks small: Don’t waste your address space.
  • Flatten your subnets: Only deploy subnets when there is a technical need; for example, VMs and private endpoints are in one subnet, VNet integration for an App Service plan is in another, and a SQL managed instance is in a third (a sketch follows this list).
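As a rough illustration, here’s what a small, flat spoke could look like in Bicep – the prefixes and names are mine, not a prescription:

```bicep
// A small spoke: one subnet for VMs and private endpoints, one delegated
// subnet for App Service VNet integration.
resource vnetWorkload 'Microsoft.Network/virtualNetworks@2023-04-01' = {
  name: 'vnet-workload1'
  location: resourceGroup().location
  properties: {
    addressSpace: {
      addressPrefixes: [ '10.1.0.0/24' ]
    }
    subnets: [
      {
        name: 'snet-compute'
        properties: {
          addressPrefix: '10.1.0.0/26'
        }
      }
      {
        name: 'snet-appsvc-integration'
        properties: {
          addressPrefix: '10.1.0.64/26'
          delegations: [
            {
              name: 'appServiceDelegation'
              properties: {
                serviceName: 'Microsoft.Web/serverFarms'
              }
            }
          ]
        }
      }
    ]
  }
}
```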

Resource Firewalls

It’s sad to see how many people disable operating system firewalls. For example, Group Policy is used to disable Windows Firewall. Don’t you know that Microsoft and Linux added those firewalls to protect machines from internal attacks? Those firewalls should remain operational and only permit required traffic.

Many Azure resources also offer firewalls. App Services have firewalls. Azure SQL has a firewall. Use them! The one messy resource is the storage account. The endpoints for storage clusters live in a weird place – and this causes interesting situations. For example, a Logic App’s storage account with a configured firewall can prevent workflows from being created or working correctly.

Network Security Groups

Take a look at the default inbound rules in an NSG. You’ll find there is a Deny All rule with the lowest possible priority. Just up from that rule is a built-in rule to allow traffic from VirtualNetwork. VirtualNetwork includes the subnet, the virtual network, and all routed networks, including peers and site-to-site connections. So all traffic from internal networks is … permitted! This is why every NSG that I create has a custom DenyAll rule with a priority of 4000 (sketched below). Higher priority rules are created to permit required traffic and only that required traffic.
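Here’s a minimal Bicep sketch of that rule – priority 4000 sits below your allow rules but above the built-in AllowVnetInBound rule at priority 65000:

```bicep
// A minimal NSG with the custom DenyAll rule at priority 4000, so traffic
// from VirtualNetwork is no longer implicitly allowed.
resource nsgExample 'Microsoft.Network/networkSecurityGroups@2023-04-01' = {
  name: 'nsg-example'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'DenyAll'
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
      // higher-priority (lower-number) Allow rules for required traffic go here
    ]
  }
}
```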

Tips with your NSGs:

  • Use 1 NSG per subnet: Where the subnet resources will support an NSG. You will reduce your overall complexity and make troubleshooting easier. Remember, all NSG rules are actually applied at the source (outbound rules) or target (inbound rules) NIC.
  • Limit the use of “any”: Rules should be as accurate as possible. For example: Allow TCP 445 from source A to destination B.
  • Consider the use of Application Security Groups: You can abstract IP addresses with an Application Security Group (ASG) in an NSG rule. ASGs can be used with NICs – virtual machines and private endpoints.
  • Enable NSG Flow Logs & Traffic Analytics: Great for troubleshooting networking (not just firewall stuff) and for feeding data to a SIEM. VNet Flow Logs will be a superior replacement when they are ready for GA.

The Hub

As I’ve implied already, you should employ a hub & spoke design. The hub should be simple, small and free of compute. The hub:

  • Makes connections using site-to-site networking using SD-WAN, VPN, and/or ExpressRoute.
  • Hosts the firewall. The firewall blocks everything in every direction by default.
  • Hosts Azure Bastion, unless you are running Azure Virtual WAN – then deploy it to a spoke.
  • Is the “Public IP” for egress traffic for workloads trying to reach the Internet. All egress traffic is via the firewall. Azure Policy should be used to restrict Public IP Addresses to just those resources that require them – things like Azure Bastion require a public IP and you should create a policy override for each required resource ID.

My preference is to use Azure Firewall. That’s a long conversation so let’s move on to another topic; Azure Bastion.

Most folks will go into Azure thinking that they will RDP/SSH straight to their VMs. RDP and SSH are not perfect. This is something that the secure zone concept recognised. It was not unusual for admins/operators to use a bastion host – hopping via RDP or SSH from their PC, through another server, to the required server. RDP/SSH were not open directly to the protected machines.

Azure Bastion should offer the same isolation. Your NSG rules should only permit RDP/SSH from (a sketch follows the list):

  • The AzureBastionSubnet
  • Any other bastion hosts that might be employed, typically by developers who will deploy specialist tools.

Azure Bastion requires:

  • An Entra ID sign-in, ideally protected by features such as conditional access and MFA, to access the bastion service.
  • The destination machine’s credentials.

Routing

Now we get to one of my favourite topics in Azure. In the on-prem world, we can control how packets get from A to B using cables. But as you’ve learned, we cannot run cables in Azure. What we can do is control the next hop of a packet.

We want to control flows (a Bicep sketch of the first follows the list):

  • Ingress from site-to-site networking to flow through the hub firewall: A route in the GatewaySubnet to use the hub firewall as the next hop.
  • All traffic leaving a spoke (workload virtual network) to flow through the hub firewall: A route to 0.0.0.0/0 using the firewall backend/private IP as the next hop.
  • All traffic between hub & spokes to flow through the remote hub firewall: A route to the remote hub & spoke IP prefix (see above tip) with a next hop of the remote hub firewall.
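As an illustration of the first flow, here’s a hedged Bicep sketch of a GatewaySubnet Route Table – the prefixes and firewall IP are assumptions, and note that 0.0.0.0/0 isn’t used here; you route the hub & spoke prefix instead:

```bicep
// Ingress from site-to-site networking is steered to the hub firewall
// before it reaches any spoke.
resource rtGateway 'Microsoft.Network/routeTables@2023-04-01' = {
  name: 'rt-gatewaysubnet'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'HubAndSpokeViaFirewall'
        properties: {
          addressPrefix: '10.0.0.0/14' // the single prefix assigned to the hub & spoke (assumption)
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4' // hub firewall private IP (assumption)
        }
      }
    ]
  }
}
```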

If you follow my tips, especially with the simple hub, then the routing is actually quite easy to implement and maintain.

Tips:

  • Keep the hub free of compute.
  • NSG Traffic Analytics helps to troubleshoot.

Web Application Firewall

The hub firewall should not be used to present web applications to the Internet. If a web app is classified as requiring network security, then it should be reverse proxied using a Web Application Firewall (WAF). This specialised firewall inspects traffic at the application layer and can block threats.

The WAF will generate a lot of false positives. Heavy-traffic applications can produce a lot of them in your logs; in the case of Log Analytics, the ingestion charge can be huge, so get to optimising those false positives as quickly as you can.

My preference is to route the WAF through the hub firewall to the backend applications. The WAF is a form of compute, even the Azure WAF. If you do not need end-to-end TLS, then the firewall could be used to inspect the HTTP traffic from the WAF to the backend using the Intrusion Detection and Prevention System (IDPS), offering another layer of protection.

Azure offers a couple of WAF options. Front Door with WAF is architecturally interesting, but the default design is that the backend has a public endpoint that limits access, at the application layer, to your Front Door instance. What if the backend is network connected for max protection? Then you get into complexities with Private Link/Private Endpoint.

A regional WAF is network connected and offers simpler networking, but it sacrifices the performance boosts from Front Door. You can combine Front Door with a regional WAF, but there are more costs with this.

Third-party solutions are possible. Services such as Cloudflare offer performance and security features. One could argue that Cloudflare offers more features. From the performance perspective, keep in mind that Cloudflare has only a few peering locations with the Microsoft WAN, so a remote user might have to take a detour to get to your Azure resources, increasing latency.

You can seek out WAF solutions from the likes of F5 and Citrix in the Azure Marketplace. Keep in mind that NVAs can perpetuate skills challenges by siloing the skill – cloud-native skills are easier to develop and to contract/hire.

Summary

I was going to type something like “this post gives you a quick tour of the micro-segmentation approach/features that you can use in Azure” but then I realised that I’ve had keyboard diarrhea and this post is quite Sinofskian. What I’ve tried to explain is that the ways of the past:

  • Don’t do much for security anymore
  • Are actually more complex in architecture than Azure-native patterns and solutions that will work.

If you implement security at three layers, assuming that a breach will happen and could happen anywhere, then you limit the blast area of a threat:

  • The edge, using the firewall and a WAF
  • The NIC, using a Network Security Group
  • The resource, using a guest OS/resource firewall

This trust-no-one approach that denies all but the minimum required traffic will make life much harder for an attacker. Adding logging and a well-configured SIEM will create trip wires that an attacker must trip over when attempting an expansion. You will make their expansion harder & slower, and make it easier to detect them. You will also limit how much they can spread and how much damage the attack can create. Furthermore, you will be following the guidance that the likes of the FBI are recommending.

There is so much more to consider when it comes to security, but I’ve focused on micro-segmentation in a network context. People do think about Entra ID and management solutions (such as Defender for Cloud and/or SIEM), but they rarely think through the network design, assuming that what they did on-prem will still be fine. It won’t be, because on-prem isn’t fine right now! So take my advice: transform your network, and protect your assets, shareholders, and career.