Manage Existing Azure Firewall With Firewall Policy Using Bicep

In this post, I want to discuss how I recently took over the management of an existing Azure Firewall using Firewall Policy/Azure Firewall Manager and Bicep.

Background

We had a customer set up many years ago using our old templated Azure deployment based on ARM. At the centre of their network is Azure Firewall. That firewall plays a big role in the customer’s micro-segmented network, with over 40,000 lines of ARM code defining the many firewall rules.

The firewall was deployed before Azure Firewall Manager (AFM) was released. AFM is a pretty GUI that enables the management of several Azure networking resource types, including Azure Firewall. But when it comes to managing the firewall, AFM uses a resource called Firewall Policy; you don’t have to touch AFM at all – you can deploy a Firewall Policy, link the firewall to it (via Resource ID), and edit the Firewall Policy directly (Azure Portal or code) to manage the firewall settings and rules.


One of the nicest features of Azure Firewall is a result of it being an Azure PaaS resource. Like almost every other resource type (there are exceptions), Azure Firewall is completely manageable via code. Not only can you deploy the firewall, you can also operate it on a day-to-day basis using ARM/Bicep/Terraform/Pulumi if you want: the settings and the firewall rules. That means you can have complete change control and rollback using the features of Git in DevOps, GitHub, etc.


All new features in Azure Firewall have surfaced only via Firewall Policy since the general availability release of AFM. A legacy Azure Firewall that doesn’t have a Firewall Policy is missing many security and management features. The team that works regularly with this customer approached me about adding Firewall Policy to the customer’s deployment and including that in the code.

The Old Code

As I said before, the old code was written in ARM. I won’t get into it here, but we couldn’t add the required code to do the following without significant risk:

  • A module for Firewall Policy
  • Updating the module for Azure Firewall to include the link to the Firewall Policy.

I got a peer to give me a second opinion and he agreed with my original assessment. We should:

  1. Create a new set of code to manage the Azure Firewall using Bicep.
  2. Introduce Firewall Policy via Bicep.
  3. Remove the ARM module for Azure Firewall from the ARM code.
  4. Leave the rest of the hub as is (ARM) because this is a mission-critical environment.

The High-Level Plan

I decided to do the following:

  1. Set up a new repo just for the Azure Firewall and Firewall Policy.
  2. Deploy the new code in there.
  3. Create a test environment and test like crazy there.
  4. Ensure that the existing Azure Firewall public IP did not change because it was used in DNAT rules and by remote parties in their firewall rules.
  5. We agreed that there should be “no” downtime in the process, but I wanted time for a rollback just in case. I would create non-parameterised ARM exports of the entire hub, the GatewaySubnet route table (critical to routing intent and a risk point in this kind of work), and the Azure Firewall. Our primary rollback plan would be to run the unmodified ARM code to restore everything as it was.

The Build

I needed an environment to work in. I did a non-parameterised export of the hub, including the Azure Firewall. I decompiled that to Bicep and deployed it to a dedicated test subscription. This did require some clean-up:

  • The public IP of the firewall would be different so DNAT rules would need a new destination IP.
  • Every rules collection group (many hundreds of them) had a resource ID that needed to be removed – a job for regex search-and-replace in Visual Studio Code.
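
For the resource ID clean-up, a VS Code regex find-and-replace did the trick. The exact pattern depends on how your export looks; something along these lines (purely illustrative) matches the offending lines in a decompiled Bicep file and replaces them with nothing:

^\s*id: '/subscriptions/.*'\r?\n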

The deployment into the test environment was a two-stage job – I needed the public IP address to obtain the destination address value for the DNAT rules.

Now I had a clone of the production environment, including all the settings and firewall rules.

The Bicep Code

I’ve been doing a lot of Bicep since the Spring of this year (2024). I’ve been using Azure Verified Modules (AVM) since early Summer – it’s what we’ve decided should be our standard approach, emulating the styling of Azure Verified Solutions.


We don’t use Microsoft’s landing zones. I have dug into them and found a commonality. The code is too impressive. The developer has been too clever. Very often, “customer configuration” is hard-coded into the Bicep. For example, the image template for Azure Image Builder (in the AVD landing zone) is broken up across many variables which are unioned until a single variable is produced. The image template is a file that should be easy to get at and is commonly updated.

A managed service provider knows that architecture (the code) should be separated from customer configuration. This allows the customer configuration to be frequently updated separately from the architecture. And, in turn, it should be possible to update the architecture without having to re-import the customer configuration.


My code design is simple:

  • Main.bicep, which deploys the Azure Firewall (AVM) and the Firewall Policy (AVM).
  • A two-property parameter controls the true/false (bool) condition of whether or not the two resources are deployed.
  • A main.bicepparam supplies parameters to configure the SKUs/features/settings of the Azure Firewall and Firewall Policy using custom types (enabling complete Intellisense in VS Code).
  • A simple module documents the Rules Collections in a single array. This array is returned as an output to main.bicep and fed as a single value to the Firewall Policy module.
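
To make that design concrete, here is a minimal sketch of the main.bicep pattern. It is not the production code: it uses native resource types rather than the AVM modules, and the names, API versions, and parameter shapes are illustrative only.

// Custom type backing the two-property toggle parameter.
type deployTogglesType = {
  deployFirewallPolicy: bool
  deployFirewall: bool
}

param deployToggles deployTogglesType
param location string = resourceGroup().location
param firewallPolicyName string = 'fwp-hub'
param firewallName string = 'fw-hub'

// Firewall Policy – only deployed when enabled by the toggle.
resource firewallPolicy 'Microsoft.Network/firewallPolicies@2023-11-01' = if (deployToggles.deployFirewallPolicy) {
  name: firewallPolicyName
  location: location
  properties: {
    sku: {
      tier: 'Standard'
    }
  }
}

// Azure Firewall – linked to the Firewall Policy by resource ID.
resource firewall 'Microsoft.Network/azureFirewalls@2023-11-01' = if (deployToggles.deployFirewall) {
  name: firewallName
  location: location
  properties: {
    sku: {
      name: 'AZFW_VNet'
      tier: 'Standard'
    }
    firewallPolicy: {
      id: firewallPolicy.id
    }
    // ipConfigurations referencing the existing subnet and public IP go here.
  }
}

The real deployment feeds these parameters from main.bicepparam and passes the Rules Collection Group array (described below) into the Firewall Policy.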

I did attempt to document the Rules Collections as ARM and use the Bicep function to load an ARM file. This was my preference because it would simplify producing the firewall rules from the Azure Portal and inputting them into the file, both for the migration and for future operations. However, the Bicep file-loading functions are limited to far too few characters for this. The eventual Rules Collection Group module had over 40,000 lines!
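
For illustration, this is the shape of the pattern that was used instead – a Bicep module whose only job is to hold the rules as one array and return it as an output. Everything in it (names, prefixes, ports) is made up for the example:

// rulesCollectionGroups.bicep – consumed as a module by main.bicep
output ruleCollectionGroups array = [
  {
    name: 'rcg-example'
    priority: 1000
    ruleCollections: [
      {
        ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
        name: 'rc-allow-web'
        priority: 1000
        action: { type: 'Allow' }
        rules: [
          {
            ruleType: 'NetworkRule'
            name: 'allow-https-example'
            ipProtocols: [ 'TCP' ]
            sourceAddresses: [ '10.0.0.0/24' ]
            destinationAddresses: [ '10.1.0.0/24' ]
            destinationPorts: [ '443' ]
          }
        ]
      }
    ]
  }
]

main.bicep then takes the module’s output and passes it, as a single value, to the Firewall Policy.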

My test process eventually gave me a clean result from start to finish.

The Migration

The migration was scheduled for late at night. Earlier in the afternoon, a freeze was put in place on the firewall rules. That enabled me to:

  1. Use Azure Firewall Manager to start the process of producing a Firewall Policy. I chose the option to import the rules from the existing production firewall. I then clicked the link to export the rules to ARM and saved the file locally.
  2. I decompiled the ARM code to Bicep. I copied and pasted the 3 Rules Collection Groups into my Rules Collection Group module.
  3. I then ran the deployment with no resources enabled. This told me that the pipeline was functioning correctly against the production environment.
  4. When the time came, I made my “backups” of the production hub and firewall.
  5. I updated the parameters to enable the deployment of the Firewall Policy. That was a quick run – the Azure Firewall was not touched, so there was no update to the firewall. This gave me one last chance to compare the firewall settings and rules before the final steps began.
  6. I removed the DNS settings from the Azure Firewall. I found in testing that I could not attach a Firewall Policy to an Azure Firewall if both contained DNS settings, so I had to remove those settings from the production firewall. This could have caused some downtime for any clients using the firewall as their DNS server, but that feature has not been rolled out yet.
  7. I updated the parameters to enable management of the Azure Firewall. The code here included the name of the in-place Public IP Address. The parameters also included the resource IDs of the hub Virtual Network and the Log Analytics Workspace (Resource-Specific tables in the code). The pipeline ran … this was the key part because the Bicep code was updating the firewall with the resource ID of the Firewall Policy. Everything worked perfectly … almost … the old diagnostics settings were still there and had to be removed because the new code used a new naming standard. One quick deletion and a re-run and all was good.
  8. One of my colleagues ran a bunch of pre-documented and pre-verified tests to confirm that all was well.
  9. I then commented out the code for the Azure Firewall from the old ARM code for the hub. I re-ran the pipeline and cleaned up some errors until we had a repeated clean run.

The technical job was done:

  • Azure Firewall was managed using a Firewall Policy.
  • Azure Firewall had modern diagnostics settings.
  • The configuration is being done using code (Bicep).

You might say “Aidan, there’s a PowerShell script to do that job”. Yes there is, but it wasn’t going to produce the code that we needed to leave in place. This task did the work and has left the customer with extremely flexible code: every resource property is available as a mandatory/optional property through a documented type specific to the resource type. As long as no bugs are found, the code can be used as-is to configure any settings/features/rules in Azure Firewall or Azure Firewall Manager, either through the parameter files (SKUs and settings) or the Rules Collection Groups module (firewall rules).

Azure Firewall Deep Dive Training

If you thought that this post was interesting then please do check out my Azure Firewall Deep Dive course that is running on February 12th – February 13th, 2025 from 09:30-16:00 UK/Irish time/10:30-17:00 Amsterdam/Berlin time. I’ve run this course twice in the last two weeks and the feedback has been super.

Azure Firewall Deep Dive Training

I’ll tell you about my new virtual training course on Azure Firewall and share some schedule information in this post.

Background

I’ve been talking about Azure Firewall for years. I’ve done lots of sessions at user groups and conferences. I’ve done countless handovers with customers and colleagues. One of my talking points is that I reckoned that I could teach someone with a little Azure/networking knowledge everything there is to know about Azure Firewall in 2 days. And that’s what I decided to do!

I was updating one of my sessions earlier in the year when I realised that it was pretty much the structure of a training course. Instead of me just listing out features or barely discussing architecture to squeeze it into a 45-60 minute-long session, I could take the time to dive deep and share all that I know or could research.

The Course

I produced a 2-day course that could be taught in-person, but my primary vector is virtual/online – it’s hard to get a bunch of people from all over into one place and there is also a cost to me in hosting a physical event that would increase the cost of the course. I decided that virtual was best, with an option of doing it in person if a suitable opportunity arose.

The course content is delivered using a combination of presentation and demo. Presentation lets me explain the what’s, why’s and so on. Demonstration lets me show you how.

The demo lab is built from a Bicep deployment, based on Azure Verified Modules (AVM). A hub & spoke network architecture is created with an Application Gateway, a simple VM workload, and a simple App Services (Private Endpoint) workload. The demonstrations follow a “hands-on guide”; this guide is written as if this was a step-by-step hands-on course, instructing the reader exactly which button to click and what/where to type. Each exercise builds on the last, eventually resulting in a secure network architecture with all of the security, monitoring, and management bells and whistles.

Why did I opt for demo instead of hands-on? Hands-on works for in-person classes. But you cannot assist in the same way when people struggle. In addition, waiting for attendees to complete labs would add another day (and cost) to the class.

Before each class, I share all of the content that I use:

  • System requirements and setup instructions.
  • The Bicep files for the demo lab.
  • The hands-on lab instructions.
  • The PowerPoint.
  • And a few more useful bits.

I always update content – for example, my first run of this class was during Microsoft Ignite 2024 and I added a few bits from the news. Therefore I share the updated content with attendees after the course.

The First Run

I ran the class for the first time earlier this week, November 20-21 2024. Attendees from all around Europe joined me for 2 days. At first they were quiet. Online is tough for speakers like me because I look for visual feedback on how I’m doing. But then the questions started coming – people were interested in what I was saying. Interaction also makes the class more interesting for me – sometimes you get comments that cover things you didn’t originally include and everyone benefits – I updated the course with one such item at the end of day 1!

I shared a 4-question anonymous survey to learn what people thought. The feedback was awesome.

Feedback

This course was previously run in November 2024 for a European audience. The survey feedback was as follows:

How would you rate this course?

  • Excellent: 83%
  • Good: 17%

Was This Course Worth Your Time?

  • Yes: 100%

Would you recommend this course to others?

  • Yes: 100%

Some of the comments:

“I think it is a very good introduction to Azure Firewall, but it goes beyond foundational concepts so medium-experienced admins will also get value from this. I like the sections on architecture and explanations of routing and DNS. I think this course will enable people to do a good job more than for example az 700 because of the more practical approach. You are good at explaining the material”.

“Just what I wanted from a Deep dive course.”

“Perfectly delivered. Crystal clear content and very well explained”.

Future Classes

I have this class scheduled for two more runs, each timed for different parts of the world:

The classes are ultra-affordable. A few hundred Euros/dollars gets you custom content based on real-world usage. I did find a virtual 2-day course on Palo Alto firewalls that cost $1700! You’ll also find that I run early-bird registration prices and discounts for more than 1 booking. If you have a large group (5+) then we might be able to figure out a lower rate 🙂

More To Come

More classes are coming! I have an old one to reinvent based on lots of experience over the years and at least 1 new one to write from scratch. Watch out for more!

Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR]connection error: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR]connection error: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded
[ERROR] SSH connection err: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

  • The virtual machine should not be reachable on the Internet.
  • Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft, which told me to allow TCP 60000-60001 from the Azure Load Balancer to the Virtual Network.

I also added my customary DenyAll rule at priority 4000 – the built-in Deny rule doesn’t deny all that much!
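
For context, the starting point looked something like this – a sketch only, with names, priorities, and API versions that are illustrative rather than copied from the real environment:

resource stagingNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-aib-staging'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        // The rule from the Microsoft guidance: the build proxy is reached on
        // TCP 60000-60001 via an Azure Load Balancer.
        name: 'AllowAzureLoadBalancerToProxy'
        properties: {
          priority: 400
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'AzureLoadBalancer'
          sourcePortRange: '*'
          destinationAddressPrefix: 'VirtualNetwork'
          destinationPortRanges: [ '60000', '60001' ]
        }
      }
      {
        // My customary catch-all deny that overrides the default AllowVnetInBound rule.
        name: 'DenyAll'
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}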

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

  1. Deploy Log Analytics and a Storage Account for logs
  2. Enable VNet Flow Logs with Traffic Analytics on the NSG of the staging subnet (where the build happens)
  3. Recreate the problem (do a new build)
  4. Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:
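
Sketched in Bicep, the extra rule looked something like this – the prefixes are placeholders standing in for the values from my lab, not something to copy as-is:

resource stagingNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' existing = {
  name: 'nsg-aib-staging'
}

// Allow the Private Endpoint (the Azure Container Instance) to reach the proxy VM.
resource allowContainerToProxy 'Microsoft.Network/networkSecurityGroups/securityRules@2023-11-01' = {
  parent: stagingNsg
  name: 'AllowContainerToProxy'
  properties: {
    priority: 410
    direction: 'Inbound'
    access: 'Allow'
    protocol: 'Tcp'
    sourceAddressPrefix: '10.0.1.0/24' // subnet hosting the private endpoint (assumption)
    sourcePortRange: '*'
    destinationAddressPrefix: '10.0.1.0/24' // subnet hosting the proxy VM (assumption)
    destinationPortRanges: [ '60000', '60001' ]
  }
}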

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:
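
Again as a sketch with placeholder prefixes – the point is the destination port, which is TCP 5986 (WinRM over HTTPS) for a Windows build or TCP 22 for Linux:

resource stagingNsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' existing = {
  name: 'nsg-aib-staging'
}

// Allow the proxy VM to reach the staging (build) VM.
resource allowProxyToBuildVm 'Microsoft.Network/networkSecurityGroups/securityRules@2023-11-01' = {
  parent: stagingNsg
  name: 'AllowProxyToBuildVM'
  properties: {
    priority: 420
    direction: 'Inbound'
    access: 'Allow'
    protocol: 'Tcp'
    sourceAddressPrefix: '10.0.1.4/32' // the proxy VM (assumption)
    sourcePortRange: '*'
    destinationAddressPrefix: '10.1.10.0/24' // the build VM subnet (assumption)
    destinationPortRanges: [ '5986' ] // use '22' for a Linux image build
  }
}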

Now when I test I see the status switch from Running, to Distributing, to Success.

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I needed to create rules to allow that traffic.

Lots of Speaking Activity

After a quiet few pandemic years with no in-person events and the arrival of twins, my in-person presentation activity was minimal. My activity has started to increase, and there have been plenty of recent events and more are scheduled for the near future.

The Recent Past

Experts Live Netherlands 2024

It was great to return to The Netherlands to present at Experts Live Netherlands. Many European IT workers know the Experts Live brand; Isidora Maurer (MVP) has nurtured & shepherded this conference brand over the years, starting with a European-wide conference and then working with others to branch it out to localised events that give more people a chance to attend. I presented at this event a few years ago but personal plans prevented me from submitting again until this year. And I was delighted to be accepted as a speaker.

Hosted in Nieuwegein, just a short train ride from Schiphol airport in Amsterdam (Dutch public transport is amazing), the conference featured a packed expo hall and many keen attendees. I presented my “Azure Firewall: The Legacy Firewall Killer” session to a standing-room-only room.

TechMentor Microsoft HQ 2024

The first conference I attended was WinConnections 2004 in Lake Las Vegas. That conference changed my career. I knew that TechMentor had become something like that – the quality of the people I knew who were presenting at the event in the past was superb. I had the chance to submit some sessions this time around and was happy to have 3 accepted, including a pre-conference all-day seminar.

I worked my tail off on that “pre-con”. It’s an expansion of one of my favourite sessions that many events are scared of, probably because they think it’s too niche or too technical: “Routing – The Virtual Cabling of Secure Azure Networking”. Expanding a 1 hour session to a full day might seem daunting but I had to limit how much content I included! Plus I had to make this a demo session. I worked endless hours on a Bicep deployment to build a demo lab for the attendees. This was necessary because it would take too long to build by hand. I had issues with Azure randomly failing and with version stuff changing inside Microsoft. As one might expect, the demo gods were not kind on the day and I quickly had to pivot from hands-on labs to demos. While the questions were few during the class, there were lots of conversations during the breaks and even on the following days.

My second session was “Azure Firewall: The Legacy Firewall Killer” – this is a popular session. I like doing this topic because it gives me a chance to crack a few jokes – my family will groan at that thought!

My final session was the one that I was most worried about. “Your Azure Migration Project Is Doomed To FAIL” was never accepted by any event before. I think the title might seem negative but it’s meant to be funny. The content is based on my experience dealing with mid-large organisations who never quite understand the difference between cloud migration and cloud adoption. I explain this through several fictional stories. There is liberal use of images from Unsplash and opportunities to make some laughter. This might have been the session that I was least confident in, but it worked.

TechMentor puts a lot of effort into mixing the attendees and the presenters. On the first night, attendees and presenters went to a local pizza place/bar and sat in small booths. We had to talk to each other. The booth that I was at featured people from all over the USA with different backgrounds. People came and went, but we talked and were the last to leave. On the second day, lunch was an organised affair where each presenter was host to a table. Attendees could grab lunch and sit with a presenter to discuss what was on their minds. I knew that migrations were a hot topic. And I also knew that some of those attendees were either doing their first migration or first re-attempt at a migration. I was able to tune my session a tiny bit to the audience and it hit home. I think the best thing about this was the attention I saw in the room, the verbal feedback that I heard just after the session, and the folks who came up to talk to me after.

A Break

I brought my family to the airport the day before I flew to TechMentor. They were going to Spain for 4 weeks and I joined them a few days later after a l-o-n-g Seattle-Los Angeles-Dublin-Alicante journey (I really should have stayed one extra night in Seattle and taken the quicker 1-hop via Iceland).

33+ Celsius heat, sunshine, a pool, and a relaxed atmosphere in a residential town (we didn’t go to a “hotel town”) made it a great place to work for a week and then do two weeks of vacation.

I went running most mornings, doing 5-7KMs. I enjoy getting up early in places like this, picking a route to explore on a map, and hitting the streets to discover the locality and places to go with my family. It’s so different to home where I have just two routes with footpaths that I can use.

Coming home was a shock. Ireland isn’t the sunniest or the warmest place in the world, but it feels like mid-winter at the moment. I think I acclimatised to Spain as much as a pasty Irish person can. This morning I even had to put a jacket on and do a couple of KMs to wait for my legs to warm up before picking up the pace.

Upcoming Events

There are three confirmed events coming up:

Nieuwegein (Netherlands) September 11: Azure Fest 2025

I return to this Dutch city in a few days to do a new session “Azure Virtual Network Manager”. I’ve been watching this product develop since the private preview. It’s not quite ready (pricing is hopefully being fixed) but it could be a complete game changer for managing complex Azure networks for secure/compliant PaaS and IaaS deployments. I’ll discuss and demo the product, sharing what I like and don’t like.

Dublin (Ireland) October 7: Microsoft Azure Community & AI Day

Organised by Nicolas Chang (MVP) this event will feature a long list of Irish MVPs discussing Azure and AI in a rapid-fire set of short sessions. I don’t think that the event page has gone live yet so watch out for it. I will be presenting the “Azure Virtual Network Manager” again at this event.

TBA: Nordics

I’ve confirmed my speaking slots for 2 sessions at an event that has not publicly announced the agenda yet. I look forward to heading north and sharing some of my experiences.

My Sessions

If you are curious, then you can see my Sessionize public profile here, which is where you’ll see my collection of available sessions.

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge, overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are now many more networks and subnets than there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

  • Network groups: A way to identify virtual networks or subnets that we want to manage.
  • Connectivity configurations: Manage how multiple virtual networks are connected.
  • Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
  • Routing configurations: Deploy user-defined routes by policy.
  • Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

  1. Identify a collection of networks/subnets you want to configure by creating a Network Group.
  2. Build a configuration, such as connectivity, security admin rules, or routing.
  3. Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Network groups are part of the scalable configuration features of AVNM. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have a large collection of targets for one configuration deployment.

Network Groups can be:

  • Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
  • Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

  • Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
  • Full mesh: Every virtual network is connected directly to every other virtual network.
  • Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! Instead, a different mechanism called a Connected Group is used – one that can handle the scale of many interconnected virtual networks. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.
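
Pulling the last few sections together, here is a minimal Bicep sketch of a network manager, a network group, and a hub-and-spoke connectivity configuration. Treat it as indicative only: names, API versions, and property values are assumptions, static membership is used for brevity, and the configuration still has to be deployed (committed) to your target regions afterwards.

param location string = resourceGroup().location
param hubVnetId string
param spokeVnetId string

resource avnm 'Microsoft.Network/networkManagers@2023-11-01' = {
  name: 'avnm-demo'
  location: location
  properties: {
    networkManagerScopes: {
      subscriptions: [ subscription().id ]
      managementGroups: []
    }
    networkManagerScopeAccesses: [ 'Connectivity', 'SecurityAdmin' ]
  }
}

resource spokesGroup 'Microsoft.Network/networkManagers/networkGroups@2023-11-01' = {
  parent: avnm
  name: 'ng-spokes'
  properties: {
    description: 'Spoke virtual networks'
  }
}

// Static membership for the example; a dynamic group would be built by an Azure Policy query instead.
resource spokeMember 'Microsoft.Network/networkManagers/networkGroups/staticMembers@2023-11-01' = {
  parent: spokesGroup
  name: 'spoke1'
  properties: {
    resourceId: spokeVnetId
  }
}

resource hubAndSpoke 'Microsoft.Network/networkManagers/connectivityConfigurations@2023-11-01' = {
  parent: avnm
  name: 'cc-hub-and-spoke'
  properties: {
    connectivityTopology: 'HubAndSpoke'
    hubs: [
      {
        resourceId: hubVnetId
        resourceType: 'Microsoft.Network/virtualNetworks'
      }
    ]
    appliesToGroups: [
      {
        networkGroupId: spokesGroup.id
        groupConnectivity: 'None' // 'DirectlyConnected' adds the spoke-to-spoke mesh
        useHubGateway: 'False'
        isGlobal: 'False'
      }
    ]
    deleteExistingPeering: 'False' // flip to 'True' to clean up old peerings
    isGlobal: 'False'
  }
}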

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs but they are always processed first. You can create a rule or a set of rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.
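
A sketch of what such a rule might look like in Bicep, reusing the network manager and group names from the earlier example; the schema shown here is indicative and the values are made up:

resource avnm 'Microsoft.Network/networkManagers@2023-11-01' existing = {
  name: 'avnm-demo'
}

resource spokesGroup 'Microsoft.Network/networkManagers/networkGroups@2023-11-01' existing = {
  parent: avnm
  name: 'ng-spokes'
}

resource secConfig 'Microsoft.Network/networkManagers/securityAdminConfigurations@2023-11-01' = {
  parent: avnm
  name: 'sac-baseline'
  properties: {
    applyOnNetworkIntentPolicyBasedServices: [ 'None' ]
  }
}

resource secRules 'Microsoft.Network/networkManagers/securityAdminConfigurations/ruleCollections@2023-11-01' = {
  parent: secConfig
  name: 'rc-baseline'
  properties: {
    appliesToGroups: [
      {
        networkGroupId: spokesGroup.id
      }
    ]
  }
}

// Block RDP from the Internet on every NIC in the group, processed before any NSG rule.
resource denyRdp 'Microsoft.Network/networkManagers/securityAdminConfigurations/ruleCollections/rules@2023-11-01' = {
  parent: secRules
  name: 'deny-rdp-from-internet'
  kind: 'Custom'
  properties: {
    priority: 100
    direction: 'Inbound'
    access: 'Deny'
    protocol: 'Tcp'
    sources: [
      {
        addressPrefixType: 'ServiceTag'
        addressPrefix: 'Internet'
      }
    ]
    destinations: [
      {
        addressPrefixType: 'IPPrefix'
        addressPrefix: '0.0.0.0/0'
      }
    ]
    sourcePortRanges: [ '0-65535' ]
    destinationPortRanges: [ '3389' ]
  }
}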

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

  • One route table shared by all virtual networks.
  • One route table shared by all subnets in a virtual network.
  • One route table per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations become generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in the Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit of bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed between the other cloud and a VNet-based hub in Azure – always my preference because of the scalability, security/governance concepts, and the superiority over the Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN, and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer now had two clouds and an interconnect:

  • The other vendor via a leased line.
  • Azure via SD-WAN.
  • And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client was over months previously. I got a message early one morning from a colleague. The client was having a serious networking issue and could I get online. The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other Cloud were unavailable:

  • App integration and services with the on-premises data centre.
  • App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route Server is to enable branch-to-branch connectivity between “on-premises” locations connected via ExpressRoute/VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the concept makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

  • The “on-premises” location via ExpressRoute
  • The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other Cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).
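
If you manage the Route Server as code, that setting is a single property. Under the covers, Azure Route Server is a Microsoft.Network/virtualHubs resource; a sketch follows, where the name and API version are illustrative and the real resource also carries its IP configuration child resource:

resource routeServer 'Microsoft.Network/virtualHubs@2023-11-01' = {
  name: 'rs-hub'
  location: resourceGroup().location
  properties: {
    sku: 'Standard'
    allowBranchToBranchTraffic: true // the one setting that enabled the workaround
  }
}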

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.

The Results

I had not heard anything in a while. This morning I heard that the client was happy with the fix. In fact, user experience was faster.

Go back to the original design before Azure was added and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route was a “backhaul”:

  1. SD-WAN to central data centre
  2. Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

  1. SD-WAN to Azure
  2. A very short distance hop over ExpressRoute

Latency was reduced quite a bit so the user experience improved. On the other hand, latency between the on-premises data centre and the other cloud has increased because the SD-WAN is a new hop, but at least the path is available. The original leased line is still down after a few weeks – this is not the fault of the client!

Some Considerations

Ideally one would have two leased lines in place for failover. That incurs costs and it was not possible. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the leased line is repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take its place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we do keep branch-to-branch connectivity then we need to consider “what is the best path”? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

  • (Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be preferred. This would offer sub-optimal routes because on-premises prefixes will be received from the other cloud.
  • Prefer VPN: Any routes received over VPN will be preferred. This would offer sub-optimal routes because other cloud prefixes will be received from on-premises.
  • Use AS path: Let the admin/network advertise a preferred path. This would offer the desired control – “use this path unless something goes wrong”.
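
For completeness, the same virtualHubs resource that represents the Route Server exposes this choice. As I understand the ARM schema the property is hubRoutingPreference, so treat this as a sketch:

resource routeServer 'Microsoft.Network/virtualHubs@2023-11-01' = {
  name: 'rs-hub'
  location: resourceGroup().location
  properties: {
    sku: 'Standard'
    allowBranchToBranchTraffic: true
    hubRoutingPreference: 'ASPath' // or 'ExpressRoute' (the default) / 'VpnGateway'
  }
}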

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago I started asking the audience a question in some of my presentations on Azure networking. I asked who was an ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said “the ‘server admins’ are going to understand what I will teach more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

  • Appliances such as routers decide what turn to take at a junction point
  • Firewalls either block or allow packets
  • Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

  • External
  • Management
  • Site-to-site connectivity
  • DMZ
  • Internal
  • Secure zone

Each port connects to a subnet that is a certain network. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that a server in the DMZ network must pass through the firewall, via the cable to the firewall, to get to another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber and having to open a support call to create a network or a subnet. Maybe you need to wait 3 days while some operators plug in cables and run Cisco commands. Or the provider needs to order more switches because they’ve run out of capacity and you might need to wait weeks. Is this the hosting of the 2000s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.

This is because the physical network of Azure is overlaid with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because they actually have nothing to do with how packets route at the physical layer – that is handled by traditional networking, and it is a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

  • Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
  • The hosts for those virtual machines are running (drumroll please) Hyper-V, which as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.

This is why you cannot implement GRE networking in Azure.

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to the second question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it being like a Venn Diagram. Once you implement peering, the loops that define what can talk to what gain a new circle. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Now a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly (using encapsulation/decapsulation) to any destination in VNet2 and vice versa without going through some resource in the virtual networks.

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why using tools like test-netconnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor are very important.

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we take what I’ve explained above, we’ll really understand what will happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the destination NIC in one of the spokes as if they were teleported.

To get the flow that you require for security purposes you need to understand Azure routing and either implement the flow via BGP or User-Defined Routing.

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is the purpose of them? Oh I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So those NICs in the firewall do what? They don’t isolate, they complicate! They aren’t for performance, because the VM size controls overall NIC throughput and speed. They don’t add performance, they complicate!

The real reason for all those NICs is to simulate eth0, eth1, etc. that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall uses a single IP address on the virtual network (via the Standard tier load balancer, but you might notice each compute instance IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets really get from A to B then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.

Network Rules Versus Application Rules for Internal Traffic

This post is about using either Network Rules or Application Rules in Azure Firewall for internal traffic. I’m going to discuss a common scenario, a “problem” that can occur, and how you can deal with that problem.

The Rules Types

There are three kinds of rules in Azure Firewall:

  • DNAT Rules: Control traffic originating from the Internet, directed to a public IP address attached to Azure Firewall, and translated to a private IP Address/port in an Azure virtual network. This is implicitly applied as a Network Rule. I rarely use DNAT Rules – most modern applications are HTTP/S and enter the virtual network(s) via an Application Gateway/WAF.
  • Application Rules: Control traffic going to HTTP, HTTPS, or MSSQL (including Azure database services).
  • Network Rules: Control anything going anywhere.

The Scenario

We have an internal client, which could be:

  • On-premises over private networking
  • Connecting via point-to-site VPN
  • Connected to a virtual network, either the same as the Azure Firewall or to a peered virtual network.

The client needs to connect, with SSL/TLS authentication, to a server. The server is connected to another virtual network/subnet. The route to the server goes through the Azure Firewall. I’ll complicate things by saying that the server is a PaaS resource with a Private Endpoint – this doesn’t affect the core problem but it makes troubleshooting more difficult 🙂

NSG rules and firewall rules have been accounted for and created. The essential connection is either HTTPS or MSSQL and is implemented as an Application Rule.

The Problem

The client attempts a connection to the server. You end up with some type of application error stating that there was either a timeout or a problem with SSL/TLS authentication.

You begin to troubleshoot:

  • Azure Firewall shows the traffic is allowed.
  • NSG Flow Logs show nothing – you panic until you remember/read that Private Endpoints do not generate flow logs – I told you that I’d complicate things 🙂 You can consider VNet Flow Logs to try to capture this data and then you might discover the cause.

You try and discover two things:

  • If you disconnect the NSG from the subnet then the connection is allowed. Hmm – the rules are correct so the traffic should be allowed: traffic from the client prefix(es) is permitted to the server IP address/port. The rules match the firewall rules.
  • You change the Application Rule to a Network Rule (with the NSG still associated and unchanged) and the connection is allowed.

So, something is going on with the Application Rules.

The Root Cause

In this case, the Application Rule is SNATing the connection. In other words, when the connection is relayed from the Azure Firewall instance to the server, the source IP is no longer that of the client; the source IP address is a compute instance in the AzureFirewallSubnet.

That is why:

  • The connection works when you remove the NSG
  • The connection works when you use a Network Rule with the NSG – a Network Rule does not SNAT traffic to private address ranges by default, so the source IP remains the client’s.

Solutions

There are two solutions to the problem:

Using Application Rules

If you want to continue to use Application Rules then the fix is to modify the NSG rule. Change the source IP prefix(es) to be the AzureFirewallSubnet.
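
For illustration, here is a minimal Bicep sketch of such an NSG rule. The names, the AzureFirewallSubnet prefix (10.0.1.0/26), the destination address, and the API version are assumptions for this example – substitute your own values.

```bicep
// Minimal sketch: allow HTTPS from the AzureFirewallSubnet (because the
// Application Rule SNATs the traffic) to the server/Private Endpoint.
// Names, prefixes, and API version are illustrative assumptions.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-server-subnet'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'Allow-Https-From-AzureFirewallSubnet'
        properties: {
          priority: 200
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.1.0/26' // AzureFirewallSubnet prefix (assumed)
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.2.4' // server/Private Endpoint IP (assumed)
          destinationPortRange: '443'
        }
      }
    ]
  }
}
```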

The downsides to this are:

  • The NSG rules are inconsistent with the Azure Firewall rules.
  • The NSG rules are no longer restricting traffic to documented approved clients.

Using Network Rules

My preference is to use Network Rules for all inbound and east-west traffic. Yes, we lose some of the “Layer-7” features but we still have core features, including IDPS in the Premium SKU.

Contrary to using Application Rules:

  • The NSG rules are consistent with the Azure Firewall rules.
  • The NSG rules restrict traffic to the documented approved clients.
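
As a rough illustration of what the Network Rule approach can look like in Firewall Policy code, here is a minimal Bicep sketch; the policy name, collection names, prefixes, priorities, and API version are assumptions for this example.

```bicep
// Minimal sketch: an east-west Network Rule in a Firewall Policy rule
// collection group. Names, prefixes, priorities, and API version are
// illustrative assumptions.
resource firewallPolicy 'Microsoft.Network/firewallPolicies@2023-09-01' existing = {
  name: 'fwp-hub'
}

resource internalRules 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2023-09-01' = {
  parent: firewallPolicy
  name: 'rcg-internal'
  properties: {
    priority: 200
    ruleCollections: [
      {
        ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
        name: 'rc-client-to-server'
        priority: 100
        action: {
          type: 'Allow'
        }
        rules: [
          {
            ruleType: 'NetworkRule'
            name: 'Allow-Client-To-Server-Https'
            ipProtocols: [ 'TCP' ]
            sourceAddresses: [ '192.168.10.0/24' ] // client prefix (assumed)
            destinationAddresses: [ '10.10.2.4' ]  // server/Private Endpoint IP (assumed)
            destinationPorts: [ '443' ]
          }
        ]
      }
    ]
  }
}
```

Because the rule is a Network Rule, the source seen by the NSG on the server subnet remains the client prefix, so the NSG rules can mirror the firewall rules.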

When To Use Application Rules?

In my sessions/classes, I teach:

  • Use DNAT rules for the rare occasion where Internet clients will connect to Azure resources via the public IP address of Azure Firewall.
  • Use Application Rules for outbound connections to the Internet (including Azure resources via public endpoints) through the Azure Firewall.
  • Use Network Rules for everything else.

This approach limits “weird sh*t” errors like the one described above and means that NSG rules are effectively clones of the Azure Firewall rules, with some additional rules to control traffic inside of the Virtual Network/subnet.

Azure Virtual Network Manager – Routing Configuration Preview

Microsoft recently announced a public preview of User-Defined Route (UDR) management using Azure Virtual Network Manager. I’ve taken some time to play with it, and here are my thoughts.

Azure Virtual Network Manager (AVNM)

AVNM has been around for a while but I have mostly ignored it up to now because:

  • The connectivity configuration feature (centrally manage VNet connections) was pointless to me without route management – what’s the point of a hub & spoke in a business setting without a firewall?
  • I liked the Security Admin Rule configuration (same tech as NSG rules in the Hyper-V switch port, but processed before NSG rules) but the pricing of AVNM was too high – more on this later.

Connectivity was missing something – the ability to deploy UDRs or BGP routes from a central policy that would force a next hop to a routing/firewall appliance.

AVNM is deployed centrally but can operate potentially across all virtual networks in your tenant (defined by a scope at the time of deployment) and even across other tenants via mutually agreed guest access – the latter would be useful in acquisition or managed services scenarios.

Routing Configuration Preview

Routing Configuration was introduced as a preview on May 2nd. Immediately I was drawn to it. But I needed to spend some time with it – to dig a little deeper and not just start spouting off without really understanding what was happening. I spent quite a bit of time reading and playing last week and now I feel happier about it.

Network Groups

Network Groups power everything in AVNM. A Network Group is a listing of either subnets or virtual networks. It can be a static list that you define or it can be a dynamic query.

At first, dynamic query looks cool. You can build up a dynamic query using one or more parameters:

  • Name
  • Id
  • Tags
  • Location
  • Subscription Name
  • Subscription ID
  • Subscription Tags
  • Resource Group Name
  • Resource Group Id

When you add members via a query, an Azure Policy is created.
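
To make that concrete, here is a minimal sketch of the kind of policy definition that sits behind dynamic membership – the names, the tag, and the network group ID parameter are assumptions; the mode (Microsoft.Network.Data) and effect (addToNetworkGroup) are the AVNM-specific parts. A matching policy assignment is also required; the portal creates the definition and assignment for you when you define the conditional statements.

```bicep
// Minimal sketch: an Azure Policy definition that dynamically adds tagged
// VNets to an AVNM Network Group. Names, the tag, and the group ID are
// illustrative assumptions; the API version may differ in your environment.
targetScope = 'subscription'

param networkGroupId string // resource ID of the AVNM Network Group (assumed input)

resource dynamicMembership 'Microsoft.Authorization/policyDefinitions@2021-06-01' = {
  name: 'avnm-add-production-vnets'
  properties: {
    displayName: 'Add production VNets to the AVNM network group'
    mode: 'Microsoft.Network.Data' // the mode used for AVNM dynamic membership
    policyRule: {
      'if': {
        allOf: [
          {
            field: 'type'
            equals: 'Microsoft.Network/virtualNetworks'
          }
          {
            field: 'tags.environment' // hypothetical tag used for membership
            equals: 'production'
          }
        ]
      }
      'then': {
        effect: 'addToNetworkGroup'
        details: {
          networkGroupId: networkGroupId
        }
      }
    }
  }
}
```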

When that policy (re)evaluates, a notification is sent to AVNM and any configurations that target the updated network group are applied to the group members. That creates a possible negative scenario:

  • You build a workload in code, from the VNet all the way through to the resources
  • You deploy the workload IaC
  • The VNet is deployed without any peering/routing configurations because that’s the job of AVNM
  • The workload components that rely on routing/peering fail and the deployment fails
  • Azure Policy runs some time later, the group membership updates, and only then can you re-run your code.

Ick! You don’t want to code peering/routing if it’s being deployed by AVNM – you could end up with a mess when code runs and then AVNM runs and so on.

What do you do? AVNM has a very nice code structure under the covers. The AVNM resource is simple – all the configurations, rule collections, rules, and groups are defined as sub-resources. One could build the group membership using static membership and place the sub-resource with the workload code. That would mean that the app registration used by the pipeline requires rights in the central AVNM – that could be an issue because AVNM is supposed to be a governance tool.
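
For example, here is a minimal Bicep sketch of a module that could live with the workload code and be deployed (with the right permissions) to the central AVNM resource group to add the workload’s VNet as a static member. The manager/group names and the API version are assumptions.

```bicep
// Minimal sketch: add a workload VNet to an existing AVNM Network Group via
// static membership. Deploy this as a module targeting the resource group
// that hosts the central AVNM instance. Names and API version are assumed.
param workloadVnetId string // resource ID of the workload VNet, passed in by the workload pipeline

resource networkGroup 'Microsoft.Network/networkManagers/networkGroups@2023-11-01' existing = {
  name: 'avnm-central/ng-production' // hypothetical manager/group names
}

resource staticMember 'Microsoft.Network/networkManagers/networkGroups/staticMembers@2023-11-01' = {
  parent: networkGroup
  name: uniqueString(workloadVnetId) // member name only needs to be unique within the group
  properties: {
    resourceId: workloadVnetId
  }
}
```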

Ideally, Azure Policy would trigger much faster than it does (not scientific, but it was taking 15-ish minutes in my tests) and update the group membership with less latency. Once the membership is updated, configurations are deployed nearly instantly – faster than I could measure it.

Routing Configuration

I like how AVNM has structured the configurations for Security Admin Rules and Routing Configurations. It reminds me of how Azure Firewall has handled things.

Rule Collection

A Routing Configuration is deployed to a scope. The configuration is like a bucket – it has little in the way of features – all the work happens in the Routing Rule Collections. The configuration contains one or more Rule Collections. Each collection targets a specific group. So I could have three groups defined:

  • Production
  • Dev
  • Secure

Each would have a different set of routes (the rules) defined. I have only one deployment (the configuration), which automatically applies to the correct VNets/subnets based on the group memberships. If I am using dynamic group membership, I can use governance features like tags (which can be controlled from management groups, subscriptions, resource groups, or at the resource level) for large-scale automation and control.

There are 3 kinds of Local Route Setting – this configures:

  • How many Route Tables are deployed per VNet and how they are associated
  • Whether or not a route to the prefix of the target resource is created
  • None Specified: one Route Table per VNet, associated with all subnets in the VNet, and no local route is created.
  • Direct Routing Within Virtual Network: one Route Table per VNet, associated with all subnets in the VNet, with a Local_0 route to the VNet prefix.
  • Direct Routing Within Subnet: one Route Table per subnet, associated with that subnet, with a Local_0 route to the subnet prefix.

The Route Tables are created in an AVNM-managed resource group in the target subscription. If you choose one of the “Direct …” Local Route Setting options then a UDR is created for the target prefix:

  • Direct Routing Within Virtual Network: a route with address prefix = the VNet prefix and next hop type = Virtual Network.
  • Direct Routing Within Subnet: a route with address prefix = the subnet prefix and next hop type = Virtual Network.

The concept is that you can force routing to stay within the target VNet/subnet when the destination is local, while routing via a different next hop when leaving the target. For example, traffic to the rest of the local VNet can be forced via the firewall while traffic within the subnet stays local (Direct Routing Within Subnet). Note that the default route for the VNet prefix (next hop Virtual Network) is not deactivated by default – localRoute_0 is created by AVNM to implement the “Direct …” option.

You have the option to control BGP propagation – which is important when using a firewall to isolate site-to-site connections from your Azure services.

Some Notes

AVNM isn’t meant to be the “I’ll manage all the routes centrally” solution. It manages what is important to the organisation – the governance of the network security model. You have the ability to edit routes in the resulting Route Table. So if I need to create custom routes for PaaS services or for a special network design then I can do that. The resulting Route Tables are just regular Azure Route Tables so I can add/edit/remove routes as I desire.

If you manually create a route in the Route Table and AVNM then tries to create a route to the same destination, AVNM will skip creating its route – it’s a “what’s the point?” situation.

If someone manually updates an AVNM-managed route in the Route Table then AVNM will not correct it until there is a change to the Rule Collection. I do not like this. I deem this to be a failure in the application of governance.

Pricing

This is the graveyard of AVNM. If you run Azure like a small business then you lump lots of workloads into a few subscriptions. If you, like I started doing years ago, have a “1 workload per subscription” model (just like in the Azure Cloud Adoption Framework) then AVNM is going to be pricey!

AVNM costs $0.10/subscription/hour. At 730 hours per average month, AVNM for a single subscription will cost $73/month. Let’s say that I have 100 workloads. That will cost me $7300/month! Azure Firewall Premium (compute only) costs $1277.50/month so how could some policy tool cost nearly 6 times more!?!?!

Quite honestly, I would have started to use AVNM last year for a customer when we wanted to roll out “NSG rules” to every subnet in Azure. I didn’t want to do an IaC edit and a DevOps pull request for every workload. That would have taken hours or days (and it did take days). I could have rolled out the change using AVNM in minutes. But the cost/benefit wasn’t worth it – so I spent days doing code and pull requests.

I hear it again and again. AVNM is not perfect, but it’s usable (feature improvements will come). But the pricing kills it before customer evaluation can even happen.

Conclusion

If a better triggering system for dynamic member Network Groups can be created then I think the routing solution is awesome. But with the pricing structure that is there today, the product is dead to me, which makes me sad. Come on Microsoft, don’t make me sad!

Why Are There So Many Default Routes In Azure?

Have you wondered why an Azure subnet with no Route Table has so many default routes? What the heck is 25.176.0.0/13? Or what is 198.18.0.0/15? And why are they routing to None?

The Scenario

You have deployed a virtual machine. The virtual machine is connected to a subnet with no Route Table. You open the NIC of the VM and view Effective Routes. You expect to see a few routes for the RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, etc.) and “quad zero” (0.0.0.0/0), but instead you find this:

What in the nelly is all that? I know I was pretty freaked out when I first saw it some time ago. Here are the weird addresses in text, excluding quad zero and the virtual network prefix:

10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
100.64.0.0/10
104.146.0.0/17
104.147.0.0/16
127.0.0.0/8
157.59.0.0/16
198.18.0.0/15
20.35.252.0/22
23.103.0.0/18
25.148.0.0/15
25.150.0.0/16
25.152.0.0/14
25.156.0.0/16
25.159.0.0/16
25.176.0.0/13
25.184.0.0/14
25.4.0.0/14
40.108.0.0/17
40.109.0.0/16

Next Hop = None

The first thing that you might notice is the next hop which is sent to None.

Remember that there is no “router” by default in Azure. The network is software-defined so routing is enacted by the Azure NIC/the fabric. When a packet is leaving the VM (and everything, including “serverless”, is a VM in the end unless it is physical) the Azure NIC figures out the next hop/route.

When traffic hits a NIC, the best route is selected. If that route has a next hop set to None then the traffic is dropped as if it disappeared into a black hole. We can use this feature as a form of “firewall” – we don’t want the traffic, so “abracadabra – make it go away”.
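
If you want the same black-hole behaviour deliberately, it is trivial to express as a UDR in Bicep. This is a minimal sketch with an assumed route table name and an assumed prefix; the API version may differ in your environment.

```bicep
// Minimal sketch: a User-Defined Route that black-holes traffic to a prefix
// by setting the next hop to None. Names and prefix are illustrative assumptions.
resource rt 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-example'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'blackhole-carrier-grade-nat'
        properties: {
          addressPrefix: '100.64.0.0/10' // example prefix to drop (assumed)
          nextHopType: 'None' // traffic matching this prefix is dropped
        }
      }
    ]
  }
}
```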

A Microsoft page (and some googling) gives us some more clues.

RFC-1918 Private Addresses

We know these well-known addresses, even if we don’t necessarily know the RFC number:

  • 10.0.0.0/8
  • 172.16.0.0/12
  • 192.168.0.0/16

These addresses are intended to be used privately. But why is traffic to them dropped? If your network doesn’t have a deliberate route to other address spaces then there is no reason to enable routing to them. So Azure takes a “secure by default” stance and drops the traffic.

Remember that if you do use a subset of one of those spaces in your VNet or peered VNets, then the more specific routes for those prefixes will be selected ahead of the more general routes that drop the traffic.

RFC-6598 Carrier Grade NAT

The prefix 100.64.0.0/10 is defined as being used for carrier-grade NAT. This block of addresses is specifically meant to be used by Internet service providers (ISPs) that implement carrier-grade NAT, to connect their customer-premises equipment (CPE) to their core routers. Therefore we want nothing to do with it – so traffic to it is dropped.

Microsoft Prefixes

20.35.252.0/22 is registered in Redmond, Washington, the location of Microsoft HQ. Other prefixes in 20.35 are used by Exchange Online for the US Government. That might give us a clue … maybe Microsoft is firewalling sensitive online prefixes from Azure? It’s possible someone could hack a tenant, fire up lots of machines to act as bots, and then attack sensitive online services that Microsoft operates. This kind of “route to None” approach would protect those prefixes unless someone took the time to override the routes.

104.146.0.0/17 is a block that is owned by Microsoft with a location registered as Boydton, Virginia, the home of the East US region. I do not know why it is dropped by default. The zone that resolves names is hosted on Azure Public DNS. It appears to be used by Office 365, maybe with sharepoint.com.

104.147.0.0/16 is also owned by Microsoft and is also registered in Boydton, Virginia. This prefix is even more mysterious.

Doing a Google search for 157.59.0.0/16 on the Microsoft.com domain results in the fabled “Googlewhack”: a single result with no adverts. That links to a whitepaper on Microsoft.com which is written in Russian. The single mention translates to “Redirecting MPI messages of the MyApp.exe application to the cluster subnet with addresses 157.59.x.x/255.255.0.0.” This address is also registered in Redmond.

23.103.0.0/18 has more clues in the public domain. This prefix appears to be split and used by different parts of Exchange Online, both public and US Government.

The following block is odd:

  • 25.148.0.0/15
  • 25.150.0.0/16
  • 25.152.0.0/14
  • 25.156.0.0/16
  • 25.159.0.0/16
  • 25.176.0.0/13
  • 25.184.0.0/14
  • 25.4.0.0/14

They are all registered to Microsoft in London and I can find nothing about them. But … I have a sneaky tin (aluminum) foil suspicion that I know what they are for.

40.108.0.0/17 and 40.109.0.0/16 both appear to be used by SharePoint Online and OneDrive.

Other Special Purpose Subnets

RFC-5735 specifies some prefixes so they are pretty well documented.

127.0.0.0/8 is the loopback address. The RFC says “addresses within the entire 127.0.0.0/8 block do not legitimately appear on any network anywhere” so it makes sense to drop this traffic.

198.18.0.0/15 “has been allocated for use in benchmark tests of network interconnect devices … Packets with source addresses from this range are not meant to be forwarded across the Internet”.

Adding User-Defined Routes (UDRs)

Something interesting happens if you start to play with User-Defined Routes. Add a Route Table to the subnet. Now add a UDR:

  • Prefix: 0.0.0.0/0
  • Next Hop: Internet

When you check Effective Routes, the default route to 0.0.0.0/0 is deactivated (as expected) and the UDR takes over. All the other routes are still in place.

If you modify that UDR just a little, something different happens:

  • Prefix: 0.0.0.0/0
  • Next Hop: Virtual Appliance
  • Next Hop IP Address: {Firewall private IP address}

All the mysterious default routes disappear. My guess is that the Microsoft logic is “This is a managed network – the customer put in a firewall and that will block the bad stuff”.
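
For reference, this is a minimal Bicep sketch of that quad-zero UDR pointing at a virtual appliance – the route table name, the firewall private IP, and the API version are assumptions for this example.

```bicep
// Minimal sketch: the quad-zero UDR pointing at a firewall/NVA private IP,
// which is the variant that deactivates the mysterious default routes.
// Names, the IP address, and API version are illustrative assumptions.
resource rt 'Microsoft.Network/routeTables@2023-09-01' = {
  name: 'rt-workload'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4' // firewall private IP (assumed)
        }
      }
    ]
  }
}
```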

The magic appears only to happen if you use the prefix 0.0.0.0/0 – try a different prefix and all the default routes re-appear.