Azure | Aidan Finn, IT Pro

Azure Firewall Deep Dive Training

I’ll tell you about my new virtual training course on Azure Firewall and share some schedule information in this post.

Background

I’ve been talking about Azure Firewall for years. I’ve done lots of sessions at user groups and conferences. I’ve done countless handovers with customers and colleagues. One of my talking points is that I reckoned that I could teach someone with a little Azure/networking knowledge everything there is to know about Azure Firewall in 2 days. And that’s what I decided to do!

I was updating one of my sessions earlier in the year when I realised that it was pretty must the structure of a training couse. Instead of me just listing out features or barely dicusssing architecture to squeeze it into a 45-60 minute-long session, I could take the time to dive deep and share all that I know or could research.

The Course

I produced a 2-day course that could be taught in-person, but my primary vector is virtual/online – it’s hard to get a bunch of people from all over into one place and there is also a cost to me in hosting a physical event that would increse the cost of the course. I decided that virtual was best, with an option off doing it in person if a suitable opportunity arose.

The course content is delivered using a combination of presentation and demo. Presentation lets me explain the what’s, why’s and so on. Demonstration lets me show you how.

The demo lab is built from a Bicep deployment, based on Azure Verified Modules (AVM). A hub & spoke network architecture is created with an Application Gateway, a simple VM workload, and a simple App Services (Private Endpoint) workload. The demonstrations follow a “hands-on guide”; this guide is written as if this was a step-by-step hands-on course, instructing the reader exactly which button to click and what/where to type. Each exercise builds on the last, eventually resulting in a secure network architecture with all of the security, monitoring, and management bells and whistles.

Why did I opt for demo instead of hands-on? Hands-on works for in-person classes. But you cannot assist in the same way when people struggle. In addition, waiting for attendees to complete labs would add another day (and cost) to the class.

Before and class, I share all of the content that I use:

System requirements and setup instructions.
The Bicep files for the demo lab.
The hands-on lab instructions
The PowerPoint
And a few more useful bits

I always update content – for example, my first run of this class was during Microsoft Ignite 2024 and I added a few bits from the news. Therefore I share the updated content with attendees after the course.

The First Run

I ran the class for the first time earlier this week, Novemer 20-21 2024. Attendees from all around Europe joined me for 2 days. At first they were quiet. Online is tough for speakers like me because I look for visual feedback on how I’m doing. But then the questions started coming – people were interested in what I was saying. Interaction also makes the class more interesting for me – sometimes you get comments that coer things you didn’t originally include and everyone benefits – I updated the course with one such item at the end of day 1!

I shared a 4-question anonymouse survey to learn what people thought. The feedback was awesome.

Feedback

This course was previously run in November 2024 for a European audience. The survey feedback was as follows:

How would you rate this course?

Excellent: 83%
Good: 17%

Was This Course Worth Your Time?

Yes: 100%

Would you recommend this course to others?

Yes: 100%

Some of the comments:

“I think it is a very good introduction to Azure Firewall, but it goes beyond foundational concepts so medium- experienced admins will also get value from this. I like the sections on architecture and explanations of routing and DNS. I think this course will enable people to do a good job more than for example az 700 because of the more practical approach. You are good at explaining the material”.

“Just what I wanted from a Deep dive course.”

“Perfectly delivered. Crystal clear content and very well explained”.

Future Classes

I have this class scheduled for two more runs, each timed for different parts of the world:

The classes are ultra-affordable. A few hundred Euros/dollars gets you custom content based on real-world usage. I did fint a virtual 2-day course on Palo Alto firewalls that cost $1700! You’ll also find that I run early-bird registration costs and discounts for more than 1 booking. If you have a large group (5+) then we might be able to figure out a lower rate 🙂

More To Come

More classes are coming! I have an old one to reinvent based on lots of experience over the years and at least 1 new one to write from scratch. Watch out for more!

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are no many more networks and subnets that there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

Network groups: A way to identify virtual networks or subnets that we want to manage.
Connectivity configurations: Manage how multiple virtual networks are connected.
Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
Routing configurations: Deploy user-defined routes by policy.
Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

Identify a collection of networks/subnets you want to configure by creating a Network Group.
Build a configuration, such as connectivity, security admin rules, or routing.
Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Part of a scalable configuration feature of AVNM is network groups. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have ea large collection of targets for one configuration deployment.

Network Groups can be:

Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
Full mesh: Every virtual network is connected directly to every other virtual network.
Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! It’s something different that can handle the scale of many virtual networks being interconnected called a Connected Group. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs but they are always processed first. You can create a rule or a set or rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

One is shared with virtual networks.
One is shared with all subnets in a virtual network.
One per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations go generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed between a VNet-based hub in Azure – always my preference because of the scalability, security/governance concepts, and the superiority over Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer had two clouds:

The other vendor via a leased line.
Azure via SD-WAN.
And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client was over months previously. I got a message early one morning from a colleague. The client was having a serious networking issue and could I get online. The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other Cloud were unavailable:

App integration and services with the on-premises data centre.
App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route server is to enable branch to branch connectivity between “on-premises” locations between ExpressRoute/VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the above Microsoft diagram makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

The “on-premises” location via ExpressRoute
The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other Cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.

The Results

I had not heard anything in a while. This morning I heard that the client was happy with the fix. In fact, user experience was faster.

Go back to the original diagram before Azure and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route is a “backhaul”:

SD-WAN to central data centre
Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

SD-WAN to Azure
A very short distance hop over ExpressRoute

Latency was reduced quite a bit so user experience improved. On the contrary, latency between the on-premises data centre and the other cloud has increased because the SD-WAN is a new hop but at least the path is available. The original leased line is still down after a few weeks – this is not the fault of the client!

Some Considerations

Ideally one would have two leased lines in place for failover. That incurs costs and it was not possible. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the lease line in repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take the place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we do keep branch-to-branch connectivity then we need to consider “what is the best path”? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

(Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be used. This would offer sub-optimal routes because on-premises prefixes will be received from the other cloud.
Prefer VPN: Any routes received over VPN will be used. This would offer sub-optimal routes because other cloud prefixes will be received from on-premises.
Use AS path: Let the admin/network advertise a preferred path. This would offer the desired control – “use this path unless something goes wrong”.

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago I added a question to the audience to some of my presentations on Azure networking. I asked who was a ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said “the ‘server admins’ are going to understand what I will tech more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

Appliances such as routers decide what turn to take at a junction point
Firewalls either block or allow packets
Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

External
Management
Site-to-site connectivity
DMZ
Internal
Secure zone

Each port connects to a subnet that is a certain network. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that a server in the DMZ network must pass through the firewall, via the cable to the firewall, to get to another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber, and having to open a support call to create a network or a subnet. Maybe they need to wait 3 days while some operators plug in cables and run Cisco commands. Or they need to order more switches because they’ve run out of capacity and you might need to wait weeks. Is this the hosting of the 2000’s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.

This is because the physical network of Azure is overlayed with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because they actually have nothing to do with how packets route at the physical layer – that’s handled by traditional networking at the physical layer – but that’s a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
The hosts for those virtual machines are running (drumroll please) Hyper-V, which as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.

This is why you cannot implement GRE networking in Azure.

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to the second question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it being like a Venn Diagram. Once you implement peering, the loops that define what can talk to what has a new circle. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Now a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly, using encapsulation/decapsulation) to any destination in VNet2 and vice versa without going through some resource in the virtual networks.

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why using tools like test-netconnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor are very important.

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we take what I’ve explained above we’ll understand really what will happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the NIC of the destination NIC in one of the spokes as if it were teleported.

To get the flow that you require for security purposes you need to understand Azure routing and either implement the flow via BGP or User-Defined Routing.

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is the purpose of them? Oh I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So those NICs in the firewall do what? They don’t isolate, they complicate! They aren’t for performance, because the VM size controls overall NIC throughput and speed. They don’t add performance, they complicate!

The real reason for all those NICs is to simulate eth0, eth1, etc that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall using a single IP address on the virtual network (via the Standard tier load balancer, but you might notice each compute instance IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets rally get from A to B then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.

Azure Virtual Network Manager – Routing Configuration Preview

Microsoft recently announced a public preview of User-Defined Route (UDR) management using Azure Virtual Network Manager. I’ve taken some time to play with it, and here are my thoughts.

Azure Virtual Network Manager (AVNM)

AVNM has been around for a while but I have mostly ignored it up to now because:

The connectivity configuration feature (centrally manage VNet connections) was pointless to me without route management – what’s the point of a hub & spoke in a business setting without a firewall?
I liked the Security Admin Rule configuration (same tech as NSG rules in the Hyper-V switch port, but processed before NSG rules) but pricing of AVNM was too much – more on this later.

Connectivity was missing something – the ability to deploy UDRs or BGP routes from a central policy that would force a next hop to a routing/firewall appliance.

AVNM is deployed centrally but can operate potentially across all virtual networks in your tenant (defined by a scope at the time of deployment) and even across other tenants via mutually agreed guest access – the latter would be useful in acquisition or managed services scenarios.

Routing Configuration Preview

Routing Configuration was introduced as a preview on May 2nd. Immediately I was drawn to it. But I needed to spend some time with it – to dig a little deeper and not just start spouting off without really understanding what was happening. I spent quite a bit of time reading and playing last week and now I feel happier about it.

Network Groups

Network Groups power everything in AVNM. A Network Group is a listing of either subnets or virtual networks. It can be a static list that you define or it can be a dynamic query.

At first, dynamic query looks cool. You can build up a dynamic query using one or a number of parameters:

Name
Id
Tags
Location
Subscription Name
Subscription ID
Subscription Tags
Resource Group Name
Resource Group Id

When you add members via a query, an Azure Policy is created.

When that policy (re)evaluates a notification is sent to AVNM and any policies that target the updated network group are applied to the group members. That creates a possible negative scenario:

You build a workload in code from VNet all the way through to resource/code
You deploy the workload IaC
The VNet is deployed, without any peering/routing configurations because that’s the job of AVNM
The workload components that rely on routing/peering fail and the deployment fails
Azure Policy runs some time later and then you can run your code.

Ick! You don’t want to code peering/routing if it’s being deployed by AVNM – you could end up with a mess when code runs and then AVNM runs and so on.

What do you do? AVNM has a very nice code structure under the covers. The AVNM resource is simple – all the configurations, the rules collections, and rules, and groups are defined as sub-resources. One could build the group membership using static membership and place the sub-resource with the workload code. That will mean that the app registration used by the pipeline will require rights in the central AVNM – that could be an issue because AVNM is supposed to be a governance tool.

Ideally, Azure Policy would trigger much faster than it does (not scientific, but it was taking 15-ish minutes in my tests) and update the group membership with less latency. Once the membership is updated, configurations are deployed nearly instantly – faster than I could measure it.

Routing Configuration

I like how AVNM has structured the configurations for Security Admin Rules and Routing Configurations. It reminds me of how Azure Firewall has handled things.

Rule Collection

A Routing Configuration is deployed to a scope. The configuration is like a bucket – it has little in the way of features – all that happens in the Routing Rule Collections. The configuration contains one or more Rule Collections. Each collection targets a specific group. So I could have three groups defined:

Production
Dev
Secure

Each would have a different set of routes, the rules, defined. I have only one deployment (the configuration) which automatically applies to the correct VNets/subnets based on the group memberships. If I am using dynamic group membership, I can use governance features like tags (which can be controlled from management groups, subscriptions, resource groups or at the resource level) for large-scale automation and control.

There are 3 kinds of Local Route Setting – this configures:

How many Route Tables are deployed per VNet and how they are associated
Whether or not a route to the prefix of the target resource is created

	None Specified	Direct Routing Within Virtual Network	Direct Routing Within Subnet
How Many Route Tables?	One per VNet	One per VNet	One per subnet
Association	All subnets in the VNet	All subnets in the VNet	With the subnet
Local_0 Route	N/A	Yes > VNet Prefix	Yes > Subnet Prefix

Local Route Setting in a Rule Collection

The Route Tables are created an AVNM-managed resource group in the target subscription. If you choose one of the “Direct …” Local Route Setting options then a UDR is created for the target prefix:

	Direct Routing Within Virtual Network	Direct Routing Within Subnet
Address Prefix	The VNet Prefix	The subnet prefix
Next Hop Type	Virtual Network	Virtual Network

Using the Direct Routing options

The concept is that you can force routing to stay within the target VNet/Subnet if the destination is local, while routing via a different next hop when leaving the target. For example, force traffic to the local VNet via the firewall while staying in the subnet (Direct Routing Withing Subnet). Note that the Default rules for VNet via Virtual Network are not deactivated by default, which you can see below – localRoute_0 is created by AVNM to implement the “Direct …” option.

You have the option to control BGP propagation – which is important when using a firewall to isolate site-to-site connections from your Azure services.

Some Notes

AVNM isn’t meant to be the “I’ll manage all the routes centrally” solution. It manages what is important to the organisation – the governance of the network security model. You have the ability to edit routes in the resulting Route Table. So if I need to create custom routes for PaaS services or for a special network design then I can do that. The resulting Route Tables are just regular Azure Route Tables so I can add/edit/remove routes as I desire.

If you manually create a route in the Route Table and AVNM then tries to create a route to the same destination then AVNM will ignore the new rule – it’s a “what’s the point?” situation.

If someone updates an AVNM-managed rule then AVNM will not correct it until there is a change to the Rule Collection. I do not like this. I deem this to be a failure in the application of governance.

Pricing

This is the graveyard of AVNM. If you run Azure like a small business then you lump lots of workloads into a few subscriptions. If you, like I started doing years ago, have a “1 workload per subscription” model (just like in the Azure Cloud Adoption Framework) then AVNM is going to be pricey!

AVNM costs $0.10/subscription/hour. At 730 hours per average month, AVNM for a single subscription will cost $73/month. Let’s say that I have 100 workloads. That will cost me $7300/month! Azure Firewall Premium (compute only) costs $1277.50/month so how could some policy tool cost nearly 6 times more!?!?!

Quite honestly, I would have started to use AVNM last year for a customer when we wanted to roll out “NSG rules” to every subnet in Azure. I didn’t want to do an IaC edit and a DevOps pull request for every workload. That would have taken days/hours (and it did take days). I could have rolled out the change using AVNM in minutes. But the cost/benefit wasn’t worth it – so I spent days doing code and pull requests.

I hear it again and again. AVNM is not perfect, but its usable (feature improvements will come). But the pricing kills it before customer evaluation can even happen.

Conclusion

If a better triggering system for dynamic member Network Groups can be created then I think the routing solution is awesome. But with the pricing structure that is there today, the product is dead to me, which makes me sad. Come on Microsoft, don’t make me sad!

Why Are There So Many Default Routes In Azure?

Have you wondered why an Azure subnet with no route table has so many default routes? What the heck is 25.176.0.0/13? Or What is 198.18.0.0/15? And why are they routing to None?

The Scenario

You have deployed a virtual machine. The virtual machine is connected to a subnet with no Route Table. You open the NIC of the VM and view Effective Routes. You expect to see a few routes for the non-RFC1918 ranges (10.0.0.0/8, 172.16.0.0/12, etc) and “quad zero” (0.0.0.0/0) but instead you find this:

What in the nelly is all that? I know I was pretty freaked out when I first saw it some time ago. Here are the weird addresses in text, excluding quad zero and the virtual network prefix:

10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
100.64.0.0/10
104.146.0.0/17
104.147.0.0/16
127.0.0.0/8
157.59.0.0/16
198.18.0.0/15
20.35.252.0/22
23.103.0.0/18
25.148.0.0/15
25.150.0.0/16
25.152.0.0/14
25.156.0.0/16
25.159.0.0/16
25.176.0.0/13
25.184.0.0/14
25.4.0.0/14
40.108.0.0/17
40.109.0.0/16

Next Hop = None

The first thing that you might notice is the next hop which is sent to None.

Remember that there is no “router” by default in Azure. The network is software-defined so routing is enacted by the Azure NIC/the fabric. When a packet is leaving the VM (and everything, including “serverless”, is a VM in the end unless it is physical) the Azure NIC figures out the next hop/route.

When traffic hits a NIC, the best route is selected. If that route has a next hop set to None then the traffic is dropped like it disappeared into a black hole. We can use this feature as a form of “firewall – we don’t want the traffic so “Abracadabra – make it go away”.

A Microsoft page (and some googling) gives us some more clues.

RFC-1918 Private Addresses

We know these well-known addresses, even if we don’t necessarily know the RFC number:

10.0.0.0/8
172.16.0.0/12
192.168.0.0/16

These addresses are intended to be used privately. But why is traffic to them dropped? If your network doesn’t have a deliberate route to other address spaces then there is no reason to enable routing to them. So Azure takes a “secure by default” stance and drops the traffic.

Remember that if you do use a subset of one of those spaces in your VNet or peered VNets, then the default routes for those prefixes will be selected ahead of the more general routes that dropping the traffic.

RFC-6598 Carrier Grade NAT

The subnet, 100.64.0.0/10, is defined as being used for carrier-grade NAT. This block of addresses is specifically meant to be used by Internet service providers (or ISPs) that implement carrier-grade NAT, to connect their customer-premises equipment (CPE) to their core routers. Therefore we want nothing to do with it – so drop traffic to there.

Microsoft Prefixes

20.35.252.0/22 is registered in Redmond, Washington, the location of Microsoft HQ. Other prefixes in 20.235 are used by Exchange Online for the US Government. That might give us a clue … maybe Microsoft is firewalling sensitive online prefixes from Azure? It’s possible someone could hack a tenant, fire up lots of machines to act as bots and then attack sensitive online services that Microsoft operates. This kind of “route to None” approach would protect those prefixes unless someone took the time to override the routes.

104.146.0.0/17 is a block that is owned by Microsoft with a location registered as Boydton, Virginia, the home of the East US region. I do not know why it is dropped by default. The zone that resolves names is hosted on Azure Public DNS. It appears to be used by Office 365, maybe with sharepoint.com.

104.147.0.0/16 is also owned by Microsoft which is also in Boydton, Virginia. This prefix is even more mysterious.

Doing a google search for 157.59.0.0/16 on the Microsoft.com domain results in the fabled “google whack”: a single result with no adverts. That links to a whitepaper on Microsoft.com which is written in Russian. The single mention translates to “Redirecting MPI messages of the MyApp.exe application to the cluster subnet with addresses 157.59.x.x/255.255.0.0.” . This address is also in Redmond.

23.103.0.0/18 has more clues in the public domain. This prefix appears to be split and used by different parts of Exchange Online, both public and US Government.

The following block is odd:

25.148.0.0/15
25.150.0.0/16
25.152.0.0/14
25.156.0.0/16
25.159.0.0/16
25.176.0.0/13
25.184.0.0/14
25.4.0.0/14

They are all registered to Microsoft in London and I can find nothing about them. But … I have a sneaky tin (aluminum) foil suspicion that I know what they are for.

40.108.0.0/17 and 40.109.0.0/16 both appear to be used by SharePoint Online and OneDrive.

Other Special Purpose Subnets

RFC-5735 specifies some prefixes so they are pretty well documented.

127.0.0.0/8 is the loopback address. The RFC says “addresses within the entire 127.0.0.0/8 block do not legitimately appear on any network anywhere” so it makes sense to drop this traffic.

198.18.0.0/15 “has been allocated for use in benchmark tests of network interconnect devices … Packets with source addresses from this range are not
meant to be forwarded across the Internet”.

Adding User-Defined Routes (UDRs)

Something interesting happens if you start to play with User-Defined Routes. Add a table to the subnet. Now add a UDR:

Prefix: 0.0.0.0/0
Next Hop: Internet

When you check Effective Routes, the default route to 0.0.0.0/0 is deactivated (as expected) and the UDR takes over. All the other routes are still in place.

If you modify that UDR just a little, something different happens:

Prefix: 0.0.0.0/0
Next Hop: Virtual Appliance
Next Hop IP Address: {Firewall private IP address}

All the mysterious default routes are dropped. My guess is that the Microsoft logic is “This is a managed network – the customer put in a firewall and that will block the bad stuff”.

The magic appears only to happen if you use the prefix 0.0.0.0/0 – try a different prefix and all the default routes re-appear.

Script – Document All Azure Private DNS Zones

This post contains a script that will find all Azure Private DNS Zones in a tenant and export information on screen and as markdown in a file.

I found myself in a situation where I needed to document a lot of Azure Private DNS Zones. I needed the following information:

Name of the zone
Subscription name
Resource group name
Name of associated virtual networks

The list was long so a copy and paste from the Azure Portal was going to take too long. Instead, I put a few minutes into a script to do the job – it even writes the content as a Markdown table in a .md file, making it super simple to copy/paste the entire piece of text into my documentation in VS Code.

cls

$subs = Get-AzSubscription

$outString = "| Zone Name | Subscription Name | Resource Group Name | Linked Virtual Network |"
Write-Host $outString
$outString | Out-File "dnsLinks.md"

$outString = "| --------- | ----------------- | ------------------- | ---------------------- |"
Write-Host $outString
$outString | Out-File "dnsLinks.md" -Append

foreach ($sub in $subs)
{

    try
    {
        $context = Set-AzContext -subscription $sub.id
   
        $zones = Get-AzPrivateDnsZone

        foreach ($zone in $zones)
        {
            if ($sub.Name -eq "connectivity" -or $sub.Name -eq "connectivity-canary")
    {
        break
    }
            try
            {

                $links = Get-AzPrivateDnsVirtualNetworkLink -ResourceGroupName $zone.ResourceGroupName -ZoneName $zone.Name
       
                foreach ($link in $links)
                {
                    try
                    {
                        $vnetName = ($link.VirtualNetworkId.Split("/")) | Select-Object -Last 1

                        $outString  = "| " + $zone.name + " | " + $context.Subscription.Name + " | " + $zone.ResourceGroupName + " | " + $vnetName + " |"
                        Write-Host $outString
                        $outString | Out-File "dnsLinks.md" -Append
                    }
                    catch {}
                }
             }
             catch {}
        }  
    }
    catch {}
}

It probably wouldn’t take a whole lot more work to add any DNS records if you needed that information too.

Designing Network Security To Combat Modern Threats

In this post, I want to discuss how one should design network security in Microsoft Azure, dispensing with past patterns and combatting threats that are crippling businesses today.

The Past

Network security did not change much for a very long time. The classic network design is focused on an edge firewall.”All the bad guys are trying to penetrate our network from the Internet” so we’ll put up a very strong wall at the edge. With that approach, you’ll commonly find the “DMZ” network; a place where things like web proxies and DNS proxies isolate interior users and services from the Internet.

The internal network might be made up of two/more VLANs. For example, one or more client device VLANs and a server VLAN. While the route between those VLANs might pass through the firewall, it probably didn’t; they really “routed” through a smart core switch stack and there was limited to no firewall isolation between those VLANs.

This network design is fertile soil for malware. Ports usually are not let open to attack on the edge firewall. Hackers aren’t normally going to brute force their way through a firewall. There are easier ways in such as:

Send an “invoice” PDF to the accounting department that delivers a trojan horse.
Impersonate someone, ideally someone that travels and shouts a lot, to convince a helpful IT person to reset a password.
Target users via phishing or spear phishing.
Cimpromise some upstream include that developers use and use it to attack from the servers.
Use a SQL injection attack to open a command prompt on an internal server.
And on and on and …

In each of those cases, the attack comes from within. The spread of the blast (the attack) is unfettered. The blast area (a term used to describe the spread of an attack) is the entire network.

Secure Zones To The Rescue!

Government agencies love a nice secure zone architecture. This is a design where sensitive systems, such as GDRP data or secrets are stored on an isolated network.

Some agencies will even create a whol duplicate network that is isolated, forcing users to have two PCs – one “regular” one on the Internet-connected network and a “secure” PC that is wired onto an isolated network with limited secret services.

Realistically, that isolated network is of little value to most, but if you have that extreme a need – then good luck. By the way, that won’t work in The Cloud 🙂 Back to the more regular secure zone …

A special VLAN will be deployed and firewall rules will block all traffic into and out of that secure zone. The user experience might be to use Citrix desktops, hosted in the secure zone, to access services and data in that secure zone. But then reality starts cracking holes in the firewall’s deny all rules. No line of business app lives alone. They all require data from somewhere. Or there are integrations. Printers must be used. Scanners need to scan and share data. And legacy apps often use:

Domain (ADDS) credentials (how many ports do you need for that!!!)
SMB (TCP 445) for data transfer and integration

Over time, “deny all” becomes a long list of allow * from X to *, and so on, with absolutely no help from the app vendors.

The theory is that if an attack is commenced, then the blast area will be limited to the client network and, if it reaches the servers, it will be limtied to the Internal network. But this design fails to understand that:

An attack can come from within. Consider the scneario where compromised runtimes are used or a SQL injection attack breaks out from a database server.
All the required integrations open up holes between the secure zone and the other networks, including those legacy protocols that things like ransomware live on.
If one workload in the secure zone is compromised, they all are because there is no network segmentation inside of the VLAN.

And eventually, the “secure zone” is no more secure than the Internal network.

Don’t Block The Internet!!!

I’m amazed how many organisations do not block outbound access to the Internet. It’s just such hard work to open up firewall rules for all these applications that have Internet dependencies. I can understand that for a client VLAN. But the server VLAN such be a controlled space – if it’s not known & controlled (i.e. governed) then it should not be permitted.

A modern attack, an advanced persistent threat (APT), isn’t just some dumb blast, grab, and run. It is a sneaky process of:

Penetration
Discovery, often manually controlled
Spread, often manually controlled
Steal
Destroy/encrypt/etc

Once an APT gets in, it usually wants to call home to pull instructions down from a rogue IP address or compromised bot. When the APT wants to steal data, to be used as blackmail and/or to be sold on the Darknet, the malware will seek to upload data to the Internet. Both of these actions are taking advantage of the all-too-common open access to the Internet.

Azure is Different

Years of working with clients has taught me that there are three kinds of people when it comes to Azure networking:

Those who managed on-premises networks: These folks struggle with Azure networking.
Those who didn’t do on-premises networking, but knew what to ask for: These folks take to Azure networking quite quickly.
Everyone else: Irrelevant to this topic

What makes Azure networking so difficult for the network admins? There is no cabling in the fabric – obviously there is cabling in the data centres but it’s all abstracted by the VXLAN software-defined networks. Packets are encapsulated on the source virtual machine’s host, transmitted over the physical network, decapstulated on the destination virtual machine host, and presented to the destination virtual machine’s NIC. In short, packets leave the source NIC and magically arrive on the destination NIC with no hops in between – this is why traceroute is pointless in Azure and why the default gateway doesn’t really exist.

I’m not going to use virtual machines, Aidan. I’m doing PaaS and serverless computing. In Azure, everything is based on virtual machines, unless they are explcitly hosted on physical hosts (Azure VMware Services and some SAP stuff, for example). Even Functions run on a VM somewhere hidden in the platform. Serverless means that you don’t need to manage it.

The software-defined thing is why:

Partitioned subnets for a firewall appliance (front, back, VPN, and management) offer nothing from a security perspective in Azure.
ICMP isn’t as useful as you’d imagine in Azure.
The concept of partitioning workloads for security using subnets is not as useful as you might think – it’s actually counter-productive over time.

Transformation

I like to remind people during a presentation or a project kickoff that going on a cloud journey is supposed to result in transformation. You now re-evaluate everything and find better ways to do old things using cloud-native concepts. And that applies to network security designs too.

Micro-Segmentation Is The Word

Forget “Greece”, get on board with what you need to counter today’s threats: micro-segmentation. This is a concept where:

We protect the edge, inbound and outbound, permitting only required traffic.
We apply network isolation within the workload, permitting only required traffic.
We route traffic between workloads through the edge firewall, , permitting only required traffic.

Yes, more work will be required when you migrate existing workloads to Azure. I’d suggest using Azure Migrate to map network flows. I never get to do that – I always get the “messy migration projects” and I never get to use Azure Migrate – so testing and accessing and understanding NSG Traffic Analytics and the Azure Firewall/firewall logs via KQL is a necessary skill.

Security Classification

Every workload should go through a security classification process. You need to weigh risk verus complexity. If you max the security, you will increase costs and difficulty for otherwise simple operations. For example, a dev won’t be able to connect Visual Studio straight to an App Service if you deploy that App Service on a private or isolated App Service Plan. You also will have to host your own DevOps agents/GitHub runners because the Microsoft-hosted containers won’t be able to reach your SCM endpoints.

Every piece of compute is a potential attack vector: a VM, an App Service, a Function, a Container, a Logic App. The question is, if it is compromised, will the attacker be able to jump to something else? Will the data that is accessible be secret, subject to regulation, or reputational damage?

This measurement process will determine if a workload should use resources that:

Have public endpoints (cheapest and easiest).
Use private endpoints (medium levels of cost, complexity, and security).
Use full VNet integration, such as an App Service Environment or a virtual machine (highest cost/complexity but most secure).

The Virtual Network & Subnet

Imagine you are building a 3-tier workload that will be isolated from the Internet using Azure virtual networking:

Web servers on the Internet
Middle tier
Databases

Not that long ago, we would have deployed that workload on 3 subnets, one for each tier. Then we would have built isolation using Network Security Groups (NSGs), one for each subnet. But you just learned that a SD-network routes packets directly from NIC to NIC. An NSG is a Hyper-V Port ACL that is implemented at the NIC, even if applied at the subnet level. We can create all the isolation we want using an NSG within the subnet. That means we can flatten the network design for the workload to one subnet. A subnet-associated subnet will restrict communications between the tiers – and ideally between nodes within the same tier. That level of isolation should block everything … should 🙂

Tips for virtual networks and subnets:

Deploy 1 virtual network per workload: Not only will this follow Azure Cloud Adoption Framework concepts, but it will help your overall security and governance design. Each workload is placed into a spoke virtual network and peered with a hub. The hub is used only for external connectivity, the firewall, and Azure Bastion (assuming this is not a vWAN hub).
Assign a single prefix to your hub & spoke: Firewall and NSG rules will be easier.
Keep the virtual newtorks small: Don’t waste your address space.
Flatten your subnets: Only deploy subnets when there is a technical need, for example VMs and private endpoints are in one subnet, VNet integration for an App Services plan is in another, a SQL managed instance, is in a third.

Resource Firewalls

It’s sad to see how many people disable operating system firewalls. For example, Group Policy is used to diable Windows Firewall. Don’t you know that Microsoft and Linux added those firewalls to protect machines from internal attacks? Those firewalls should remain operational and only permit required traffic.

Many Azure resources also offer firewalls. App Services have firewalls. Azure SQL has a firewall. Use them! The one messy resource is the storage account. The location of the endpoints for storage clusters is in a weird place – and this causes interesting situations. For example, a Logic App’s storage account with a configured firewall will prevent workflows from being created/working correctly.

Network Security Groups

Take a look at the default inbound rules in an NSG. You’ll find there is a Deny All rule which is the lowest possible priority. Just up from that rule, is a built in rule to allow traffic from VirtualNetwork. VirtualNetwork includes the subnet, the virtual network, and all routed networks, including peers and site-to-site connections. So all traffic from internal networks is … permitted! This is why every NSG that I create has a custom DenyAll rule with a priority of 4000. Higher priority rules are created to permit required traffic and only that required traffic.

Tips with your NSGs:

Use 1 NSG per subnet: Where the subnet resources will support an NSG. You will reduce your overall complexity and make troubleshooting easier. Remember, all NSG rules are actually applied at the source (outbound rules) or target (inbound rules) NIC.
Limit the use of “any”: Rules should be as accurate as possible. For example: Allow TCP 445 from source A to destination B.
Consider the use of Application Security Groups: You can abstract IP addresses with an Application Security Group (ASG) in an NSG rule. ASGs can be used with NICs – virtual machines and private endpoints.
Enable NSG Flow Logs & Traffic Analytics: Great for troubleshooting networking (not just firewall stuff) and for feeding data to a SIEM. VNet Flow Logs will be a superior replacement when it is ready for GA.

The Hub

As I’ve implied already, you should employ a hub & spoke design. The hub should be simple, small and free of compute. The hub:

Makes connections using site-to-site networking using SD-WAN, VPN, and/or ExpressRoute.
Hosts the firewall. The firewall blocks everything in every direction by default,
Hosts Azure Bastion, unless you are running Azure Virtual WAN – then deploy it to a spoke.
Is the “Public IP” for egress traffic for workloads trying to reach the Internet. All egress traffic is via the firewall. Azure Policy should be used to restrict Public IP Addresses to just those requires that require it – things like Azure Bastion require a public IP and you should create a policy override for each required resource ID.

My preference is to use Azure Firewall. That’s a long conversation so let’s move on to another topic; Azure Bastion.

Most folks will go into Azure thinking that they will RDP/SSH straight to their VMs. RDP and SSH are not perfect. This is something that the secure zone concept recognised. It was not unusual for admins/operators to use a bastion host to hop via RDP or SSH from their PC to the required server via another server. RDP/SSH were not open directly to the protected machines.

Azure Bastion should offer the same isolation. Your NSG rules should only permit RDP/SSH from:

The AzureBastionSubnet
Any other bastion hosts that might be employed, typically by developers who will deploy specialist tools.

Azure Bastion requires:

An Entra ID sign-in, ideally protected by features such as conditional access and MFA, to access the bastion service.
The destination machine’s credentials.

Routing

Now we get to one of my favourite topics in Azure. In the on-prem world we can control how packets get from A to B using cables. But as you’ve learned, we can run cables in Azure. But we can control the next hop of a packet.

We want to control flows:

Ingress from site-to-site networking to flow through the hub firewall: A route in the GatewaySubnet to use the hub firewall as the next hop.
All traffic leaving a spoke (workload virtual network) to flow through the hub firewall: A route to 0.0.0.0/0 using the firewall backend/private IP as the next hop.
All traffic between hub & spokes to flow through the remote hub firewall: A route to the remote hub & spoke IP prefix (see above tip) with a next hop of the remote hub firewall.

If you follow my tips, especially with the simple hub, then the routing is actually quite easy to implement and maintain.

Tips:

Keep the hub free of compute.
NSG Traffic Analytics helps to troubleshoot.

Web Application Firewall

The hub firewall shold not be used to present web applications to the Internet. If a web app is classified as requireing network security, then it should be reverse proxied using a Web Application Firewall (WAF). This specialised firewall inspects traffic at the application layer and can block threats.

The WAF will have a lot of false positives. Heavy traffic applications can produce a lot of false positives in your logs; in the case of Log Analytics, the ingestion charge can be huge so get to optimising those false positives as quickly as you can.

My preference is to route the WAF through the hub firewall to the backend applications. The WAF is a form of compte, even the Azure WAF. If you do not need end-to-end TLS, then the firewall could be used to inspect the HTTP traffic from the WAF to the backend using Intrusion Detection Prevention System (IDPS), offering another layer of protection.

Azure offers a couple of WAF options. Front Door with WAF is architecturally interesting, but the default design is that the backend has a public endpoint that limits access to your Front Door instance at the application layer. What if the backend is network connected for max protection? Then you get into complexities with Private Link/Private Endpoint.

A regional WAF is network connected and offers simpler networking, but it sacrifices the performance boosts from Front Door. You can combine Front Door with a regional WAF, but there are more costs with this.

Third party solutions are posisble Services such as Cloud Flare offer performance and security features. One could argue that Cloud Flare offers more features. From the performance perspective, keep in mind that Cloud Flare has only a few peering locations with the Microsoft WAN, so a remote user might have to take a detour to get to your Azure resources, increasing latency.

You can seek out WAF solutions from the likes of F5 and Citrix in the Azure Marketplace. Keep in mind that NVAs can continue skills challenges by siloing the skill – native cloud skills are easier to develop and contract/hire.

Summary

I was going to type something like “this post gives you a quick tour of the micro-segmentation approach/features that you can use in Azure” but then I reaslised that I’ve had keyboard diarrhea and this post is quite Sinofskian. What I’ve tried to explain is that the ways of the past:

Don’t do much for security anymore
Are actually more complex in architecture than Azure-native patterns and solutions that will work.

If you implement security at three layers, assuming that a breach will happen and could happen anywhere then you limit the blast area of a threat:

The edge, using the firewall and a WAF
The NIC, using a Network Security Group
The resource, using a guest OS/resource firewall

This trust-no-one approach that denies all but the minimum required traffic will make life much harder for an attacker. Including logging and the use of a well configured SIEM will create trip wires that an attacker must trip over to attempt an expansion. You will make their expansion harder & slower, and make it easier to detect them. You will also limit how much they can spread and how much the damage that the attack can create. Furthermore, you will be following the guidance the likes of the FBI are recommending.

There is so much more to consider when it comes to security, but I’ve focused on micro-segmentation in a network context. People do think about Entra ID and management solutions (such as Defender for Cloud and/or SIEM) but they rarely think through the network design by assuming that what they did on-prem will still be fine. It won’t because on-prem isn’t fine right now! So take my advice, transform your network, and protect your assets, shareholders, and your career.

Azure & Oracle Cloud Interconnect

This post will explain how you can connect your Azure network(s) with Oracle Cloud Infrastructure (OCI) via the Oracle Cloud Interconnect.

Background

Many mid-large organisations run applications that are based on Oracle software. When these organisations move to the cloud, they may choose to use Oracle Cloud for their Oracle workloads and Azure for everything else.

But that raises some interesting questions:

How do we connect Azure workloads to Oracle workloads?
If Oracle is hosting data services, how do we minimise latency?

The answer is: The Oracle Cloud Interconnect (OCI).

Azure ExpressRoute and Oracle FastConnect

Microsoft and Oracle are inter-connected via their respective private “site-to-site” connection mechanisms:

Azure: ExpressRoute
Oracle: FastConnect

This is achieved by both service providers sharing a “meet me” location where each cloud’s edge networks allow a “cross-connection”. So, there is no need to contact an ISP to lease an ExpressRoute circuit. The circuit already exists. There is no need to sign a circuit contract. The ISP is “Oracle” and you pay for the usage of it – in the case of Azure by paying for the ExpressRoute circuit Azure resource.

Location, Location, Location

The inter-connect mechanism is obviously play a role in where you can deploy your ExpressRoute Circuit and FastConnect resource. But performance also comes into play here – latency must be kept to a minimum. As a result, there is a support restriction on which Azure/Oracle regions can be inter-connected and where the circuit must be terminated.

At the time of writing, the below list was published by Microsoft:

What does this?

Let’s imagine that we are using OCI Amsterdam. If we want to connect Azure to it then we must use Azure West Europe.

Now, what about keeping that latency low? The trick there is in selecting a Peering Location that is closeby. Note that the Oracle docs do a better job at defining the Azure peering location (see under Availability).

In my scenario, the peering location would be Amsterdam2. According to Microsoft:

Connectivity is only possible where an Azure ExpressRoute peering location is in proximity to or in the same peering location as the OCI FastConnect.

That means you must always keep the following close to be able to use this solution:

The Oracle Cloud Infrastructure region
The Azure region
The peering location of the ExpressRoute circuit & FastConnect circuit

Configuring ExpressRoute

You have few options to decide between. The first is the SKU of ExpressRoute that you will choose.

Type

Billing

Connections

Local

Unlimited

1 or 2 Azure regions in the same metro as the peering location.

Standard

Metered or Unlimited

Up to 10 connection in the same geo zone as the peering location.

You also have to choose one of the supported speeds for this solution: 1, 2, 5, or 10 Gbps.

The ISP will be Oracle Cloud FastConnect.

So do you choose Local or Standard? I think that really comes down to balancing the cost. Local has unlimited data transfer but it is billed based on bandwidth. The entry cost per month in Zone 1 is €1,111.27/month with 1 Gbps and unlimited data transfer.

The entry point for a Standard metered plan is €403.76/month. That is €707.51 cheaper than the Local SKU but that savings has to cover your outbound data transfer cost in Azure. At €0.024/GB, that leaves you with (707.51/0.024) 29,479 GB of outbound data transfer per month until the Local SKU is more affordable.

The safe tip here is choose Local, monitor data usage, and consider jumping to Standard if you are using a small enough amount of outbound data transfer to make the metered Standard SKU more affordable.

Note that you can upgrade from Local but you cannot downgrade to Local.

Getting Connected (From Azure)

I’ll talk about the Azure side of things because that’s what I know. I will cover a little bit about Oracle, from what I have learned.

You will need an ExpressRoute Gateway in the selected Azure region. Then you will create an ExpressRoute Circuit in the same region:

Your chosen SKU/billing model.
The speed from 1, 2, 5, or 10 Gbps.
The Provider is Oracle Cloud FastConnect.
The peering location from the Oracle docs.

Retrieve the service key and then continue the process in the OCI portal. There is one screen that is very confusing: configuring the BGP addresses.

You are going to need two /30 prefixes that are not used in your OCI/Azure networks. I’m going to use 192.168.0.0/30 and 192.168.0.4/32 for my example. You need two prefixes because Azure and Oracle are running highly available resources under the covers. The ExpressRoute Gateway is two active/active compute instances. Each will require an IP address to advertise/receive addresses prefixes via BGP from the OCI gateway, and vice versa.

What addresses do you need? Oracle requires you to enter:

Customer (Azure) BGP IP Address 1
Oracle BGP IP Address 1
Customer (Azure) BGP IP Address 2
Oracle BGP IP Address 2

Here’s how you calculate them:

Customer (Azure) BGP IP Address 1: Usable IP #2 from Prefix 1
Oracle BGP IP Address 1: Usable IP #1 from Prefix 1.
Customer (Azure) BGP IP Address 2: Usable IP #2 from Prefix 2
Oracle BGP IP Address 2: Usable IP #1 from Prefix 1

The below is not the final answer yet! But we’re getting there. That would lead us to caclulating:

Customer BGP IP Address 1: 192.168.0.2
Oracle BGP IP Address 1: 192.168.0.1
Customer BGP IP Address 2: 192.168.0.6
Oracle BGP IP Address 2: 192.168.0.5

But the Oracle GUI has an illogical check and will tell you that those addresses are wrong. They are correct – it’s just the Oracle GUI is broken by design! Here is what you need to enter:

Customer BGP IP Address 1: 192.168.0.2/30
Oracle BGP IP Address 1: 192.168.0.1/30
Customer BGP IP Address 2: 192.168.0.6/30
Oracle BGP IP Address 2: 192.168.0.5/30

You finish the process and wait a little bit. The ExpressRoute circuit will eventually change status to Provisioned. Now you can create a connection between the circuit and the ExpressRoute Gateway. When I did it, the Private Peering was automatically configured, using 192.168.0.0/30 and 192.168.04/30 as the peering subnets.

Check your ARP records and route tables in the circuit (under Private Peering) and you should see that Oracle has propagated its known addresses to your Azure ExpressRoute Gateway, and on to any subnets that are not blocking propagation from the gateway.

And that’s it!

Other Support Things

The following Oracle services are supported:

E-Business Suite
JD Edwards EnterpriseOne
PeopleSoft
Oracle Retail applications
Oracle Hyperion Financial Management

Naturally, your OCI and Azure networks must not have overlapping prefixes.

You can do transitive routing. For example, you can route through the interconnect to an Oracle network and then on to a peered Oracle network (a hub and spoke).

You cannot use the interconnect to route to on-premises from Azure or from OCI.

Getting Private Endpoints To WORK In The Real World

In this Festive Tech Calendar post, I am going to explain how to get Private Endpoints working in the real world.

Thank you to the team that runs Festive Tech Calendar every year for the work that they do and for raising funds for worthy causes.

Private Endpoints

When The Cloud was first envisioned, it was made a platform that didn’t really take network security seriously. The resources that developers want to use, Platform-as-a-Service (PaaS), were built to only have public endpoints. In the case of Microsoft Azure, if I deploy an App Service Plan, the compute that is provisioned for me shares a public IP address(es) with plans from other tenants. The App Service Plan is accessible directly on the Internet – that’s even true when you enable “firewall rules” in an App Service because those rules only control what HTTP/S requests will be responded to so raw TCP connections (zero day attacks) are still possible.

If I want to protect that App Service Plan I need to make it truly private by connecting it to a virtual network, using a private IP address, and maybe placing a Web Application Firewall in the flow oc the client connection.

The purpose of Private Endpoint is to alter the IP address that is used to connect to a platform resource. The public endpoint is, preferably, disabled for inbound connections and clients are redirected to a private IP address.

When we enable a Private Endpoint for a PaaS resource, a Private Endpoint resource is added and a NIC is created. The NIC is connected to a subnet in a virtual network and obtains or is supplied with an IP address for that subnet. All client connections will be via that private IP address. And this is where it all goes wrong in the real world.

If I browse myapp.azurewebsites.net my PC will resolve that name to the public endpoint IP address – even after I have implemented a Private Endpoint. That means that I have to redirect my client to the new IP address. Nothing on The Internet knows that private IP address mapping. The only way to map the FQDN of the App Service to the private endpoint is to use Private DNS.

You might remember this phrase for troubleshooting on-premises networks: “it’s always DNS”. In Azure, “it’s always routing, then it’s always DNS”, but the DNS part is what we need to figure out, not just for this App Service but for all workloads/resource types.

The Problems

There are three main issues:

Microsoft Documentation
Developers don’t do infrastructure
Who does DNS in The Cloud?

Microsoft Documentation

The documentation for Private Endpoint ranges from excellent to awful. That variance depends on the team/resource type that is covered by the documentation. Each resource team is responsible for their own implementation/documentation. And that means some documentation is good and clear, while some documentation should never have made it past a pull request.

The documentation on how to use Private Endpoint focuses on single workloads. You’ll find the same is true in the certifcation exams on Microsoft networking. In the real world, we have many workloads. Clients need to access those workloads over virtual networks. Those workloads integrate with each other, and that means that they must also resolve each others names. This name resolution must work for resources inside of individual workloads, for workload-to-workload communications, and on-premises clients-to-workload communications. You can eventually figure out how to do this from Microsoft documentation but, in my experience, many organisations give up during this journey and assume that Private Endpoint does not work.

Developers Don’t Do Infrastructure

Imagine asking a developer to figure out virtual networks and subnetting! OK, let’s assume you have reimagined IT processes and structures (like you are supposed to) and have all that figured out.

Now you are going to ask a developer to understand how DNS works. In the real world, most devs know their market verticals, language(s) and (quite complex) IDE toolset, and everything else is not important. I’ve had the pleasure of talking devs through running NSLOOKUP (something we IT pros often consider simple) and I basically ran a mini-class.

Assuming that a dev knows how DNS works and should be architected in The Cloud is a path to failure.

Who Does DNS In The Cloud?

I have lost track of how many cloud jorneys that I have been a part of, either from the start or where I joined a struggling project. A common wish for many of those customers is that they won’t run any virtual machines (some organisations even “ban” VMs) – I usually laugh and promise them some VMs later. Their DNS is usually based on Windows Server/Active Directory and with no VMs in their future, they assume that don’t need any DNS system.

If there is no DNS architecture, then how will a system, such as Private Endpoint, work?

A Working Architecture

I’m going to jump straight to a working archticture. I’ll start with a high-level design and then talk about some of the low-level design options.

This design works. It might not be exactly what you require but simple changes can be made for specific scenarios.

High-Level Design

Private DNS Zones are created for each resource type and service type in that resource that a Private Endpoint is deployed for. Those zones are deployed centrally and are associated with a virtual network/subnet that will be dedicated to a DNS service.

The DNS service of your choice will be deployed to the DNS virtual newtork/subnet. Forwarders wll be configured on that DNS Service to point to the “magic Azure virtual IP address” 168.63.129.16. That is an address that is dedicated to Azure services – if you send DNS requests to it then they will be handled by:

Azure Private DNS zones, looking for a matching zone/record
Azure DNS, which can resolve Azure Public DNS Zones or resolve Internet requests – ah you don’t need proxy DNS servers in a DMZ now because Azure becomes that proxy DNS server!

Depending on the detailed design, your DNS servers can also resolve on-premises records to enable Azure-to-on-premises connections – important for migration windows while services exist in two locations, connections to partners via private connections, and when some services will stay on-premises.

All other virtual networks in your deployment (my design assumes you have a hub & spoke for a mid/large scale deployment) will have custom DNS servers configured to point at the DNS servers in the DNS Workload.

One intersting option here is Azure Firewall in the hub. If you want to enable FQDNs in Network Rules then you will:

Enable DNS Proxy mode in the Azure Firewall.
Configure the DNS server IP addresses in the Azure Firewall.
Use the private IP address of the Azure Firewall (a HA resource type) as your DNS server in the virtual networks.

Low-Level Design

There are different options for your DNS servers:

Azure Private DNS Resolver
Active Directory Domain Services (ADDS) Domain Controllers
Simple DNS Servers

In an ideal world, you would choose Azure Private DNS Resolver. This is a pure PaaS resource that can be managed as code – remember “VMs are banned”. You can forward to Azure Private DNS Zones and forward to on-premises/remote DNS servers. Unfortunately, Azure Private DNS Resolver is a relatively expensive resource and the design and requirements are complex. I haven’t really used Azure Private DNS Resolver in the real world so I cannot comment on compatibility with complex on-premises DNS architectures, but I can imagine there being issues with organisations such as universities where every DNS technology known to mankind since the early 1990’s is probably employed.

Most of the customers that I have worked with have opted to use Domain Controllers (DCs) in Azure as their DNS servers. The DCs store all the on-premises AD-integrated zones and can resolve records independently of on-premises DNS server. The intereface is familiar to Windows admins and easily configured and managed. This increases usability and compatibility. If you choose a modest B-series SKU then the cost will be quite a bit lower than Azure Private DNS Resolver. You’ll also have an ADDS presence in Azure enabling legacy workloads to use their required authenetication/aauthorisation methods.

The third option is to just use either a simple Windows/Linux VM as the DNS server. This is a good choice where ADDS is not required or where Linux DNS is required.

The Private Endpoint

I metioned that a Private Endpoint/NIC combination would be deployed for each resource/service type that requires private connectivity. For example, a Storage Account can have blob, table, queue, web, file, dsf, afs, and disks services. We need to be able to redirect the client to the specific service – that means creating a NDS record in the correct Azure Private DNS Zone, such as privatelink.blob.core.windows.net. Some workloads, such as Cosmos DB, can require multiple DNS records – how do you know what to create?

Luckily, their is a feature in Private Endpoint that handles auto-registration for you:

All of the required DNS records are created in the correct DNS zones – you must have the Azure Private DNS Zones deployed beforehand.
If your resource changes IP address , the DNS records will be updated automatically.

Sadly, I could not find anydocumentation for this feature while writing this article. However, it’s an easy feature to configure. Open your new Private Endpoint and browse to DNS Configuration. There you can see the required DNS records for this Private Endpoint.

Click Add Configuration and supply the requested information. From now on, that Private Endpoint will handle record registration/updates for you. Nice!

With a central handler for DNS name resolution, on-premises clients have the ability to connect to your Private Endpoints – subject to network security rules. On-premises DNS servers should be configured with conditional forwarders (one for each Private Link Azure Private DNS Zone) to point at your Azure DNS servers – they can point at a Azure Firewall if the previously mentioned DNS options are used.

Some Complexities

Like everything, this design is not perfect. Centralised anything comes with authorisation/governance issues. Anyone deploying a Private Endpoint will require the rights to access the Azure Private DNS Zones/records. In the wrong hands, that could become a ticketing nightmare where simple tasks take 6 weeks – far from the agility that we dream of in The Cloud.

Conclusion

The above design is one that I have been using for years. It ahs evolved a little as new features/resources have been added to Azure but the core design has remained the same. It works and it is scalable. Importantly, once it is built, there is little for the devs to know about – just enable DNS Configuration in the Private Endpoint.

Tweaks can be made. I’ve discussed some DNS server options – some choose to dispense with DNS Servers altogether and use Azure Firewall as the DNS server, which forwards to the default Azure DNS services. On-premises DNS servers can forward to Azure Firewall or to the DNS servers. But the core design remains the same.