NSG | Aidan Finn, IT Pro

Micro-Segmentation Security In Azure Networks

In this post, I want to discuss the importance of designing and implementing micro-segmentation in Azure networks.

Repeating The Same Mistakes

In 2002-2003, the world was being hammered by malware. So much so, that Microsoft did a reset on their Windows development processes and effectively built a new version of Windows XP with Windows XP Service Pack 2. The main security feature of that release was the Windows Firewall – the purpose of this was to isolate each Windows machine in the network by default. It’s a pity that nearly every Windows admin then used Group Policy to disable the Windows Firewall!

Times have moved on and so have the bad guys. Malware isn’t just an anarchist or hobby activity. Malware is a billion-dollar business (ransomware/data theft) and a military activity. Naturally, defences have evolved .. wait .. no … most admins/consultants are still deploying networks that your Daddy/Mommy deployed 22 years ago but I’ll deal with that in another post.

Instead, I want to discuss a part of the defensive solution: micro-segmentation.

Assume Penetration

We must assume that the attacker will always find a way in. Not every attack will be by Sandra Bullock clicking some magical symbol on a website to penetrate the firewall. Most attacks have relatively simple vectors such as stealing a password, hash highjacking, or getting an accountant to open a PDF. Determined attackers aren’t just “driving by”; they will look for an entry. Maybe it’s malware in vendor software that you will deploy! Maybe, it’s a vulnerability in open-source software that your developers will deploy via GitHub? Maybe a managed service provider’s Entra ID tenant has been penetrated and they have Lighthouse access to your Azure subscriptions? Each of those examples bypasses your firewall and any advanced scanning features that it may have. How do you stop them?

Micro-Segmentation

Let me conjure an image for you. A submarine is on patrol. It has a wartime mission. The submarine is always under orders to continue that mission. The submarine is detected by the enemy and is attacked. The attack causes damage which creates a flood. If left unchecked, the flood will sink the ship. What happens? The crew is trained to isolate the flood by sealing the leaking compartment – doors are slammed, seals are locked, and the water is contained in that compartment. Sure, the sailors and ship functions in that compartment are dead, but the ship can continue its mission.

That is a way to visualise micro-segmentation.

Microsoft Zero-Trust

Microsoft has a relatively small collection of documentation on zero-trust architecture for Azure. There are 3 useful bullet points:

Be ready to handle attacks before they happen.

Minimize the extent of the damage and how fast it spreads.

Increase the difficulty of compromising your cloud footprint.

Let’s expand on that a little.

Be Ready

You will be ready for an attack because you assume that you already are under attack. You don’t wait to deploy security systems and configurations; you design them with your workloads. You deploy security with your workloads. You maintain security with your workloads.

Increase The Difficulty of Compromising Your Cloud Footprint

You should put in the defences that are appropriate to your actual risks and ability to install/manage. A bad example is a medical organisation choosing a more affordable firewall to save a few bucks – this is the sort of organisation that will be targeted.

Minimise The Extent of Damage

This can also be referred to as minimising the blast zone. You want to limit how much damage the bad guys cause, just like the submarine limited flooding to the damaged compartment. This means that we make it harder to get from any one point on the network to the next.

It’s one thing to put in the security defences, but you must also:

Enable/configure the security features: it shocks me how many organisations/consultants opt not to or don’t know how to enable essential features in their security solution.
Monitor your security systems: If we assume that the attacker will get in, then we should monitor our security features to detect and shut down the attack. Again, I’m shocked every time I see security features in Azure that have no logging or alerting enabled.

Microsoft lays out a path to zero-trust where step number one is network segmentation. The basic pattern is laid out:

Applications are partitioned to different Azure Virtual Networks (VNets) and connected using a hub-spoke model

Microsoft uses the term “application”. I prefer the term “workload”. Some, like ITIL, might use the term “service”. A workload is a collection of resources that work together to provide a service to or for the organisation. Maybe it’s a bunch of Azure resources that create a retail site. Maybe it’s a CRM system. Maybe it’s an identity management & governance workload.

The pattern that Microsoft is recommending is one that I have been promoting through my employer for the last 6 years. Each workload gets a dedicated “small” virtual network. The workload VNet is peered with a hub (and only the hub by default). The hub firewall provides isolation and deeper inspection than NSGs can offer.

Step 4 tells us:

Fully distributed ingress/egress cloud micro-perimeters and deeper micro-segmentation

NSGs micro-segment the single or small set of subnet(s) in the VNet, restriocting resource-to-resource connections to just what is required. Isolation is now done centrally and at the NIC, thanks to NSGs. You should also consider network protections on PaaS resources such as Storage Accounts or Key Vaults.

If we revisit the submarine comparison, the workload-specific virtual network is one of the compartments in the boat. If there is a leak (an attack), the NSGs limit or slow down expansion in the subnet(s). The firewall isolates the workload/compartment from other workloads/compartments and the Internet by default to prevent command and control or downloads by the attacker. Deeper firewall inspection searches for attack patterns.

Don’t Forget Monitoring

Microsoft zero-trust has more than just networking. One other step I want to highlight is monitoring/alerting because it ties into the micro-segmentation features of networking. Consider the mechanisms we can put in place:

Paas resource firewalls with logging
NSG with VNet Flow Logging
(Azure) Firewall with logging for firewall rules and deep inspection features (Azure Firewall has Threat Intelligence and IDPS).

Each of those barriers or detection systems can be thought of as a string with a bell on it. The attacker will tickle or trip over those strings. If the bell rings, we should be paying attention. When you fail to put in the barriers or configure monitoring then you don’t know that the attacker is there doing something – and we assume that the attacker will get in and do something – so aren’t we failing to do our job?

It’s Not Just Me Telling You

You can say “There goes Aidan, rattling on about micro-segmentation. Why should I listen to him?”. It would be one thing if it were just me sharing my opinion on Azure network security but what if others told you to do the same things?

Microsoft tells you to implement micro-segmentation. The US NSA tells you to do it. The Canadian Centre for Cyber Security tells you to do it. The UK NCSC tells you to do it. I could keep googling (binging, of course) national security agencies and I’d find the same recommendation with each result. If you are not implementing this security technique designed for today’s threats (not for the Blaster worm of 2003) then you are not only not doing your job but you are choosing to leave the door open for attackers; that could be viewed very poorly by employers, by shareholders, or by informed compliance auditors.

How Do Network Security Groups Work?

A Greek Phalanx, protected by a shield wall made up of many individuals working under 1 instruction as a unit – like an NSG.

Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).

Port ACLs

In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a mostly overlooked-by-then-customers feature called Port ACLs. I liked Port ACLs; it was mostly unknown, could only be managed using PowerShell and made for great demo content in some TechEd/Ignite sessions that I did back in the day.

Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.

The concept of Port ACLs was it gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.

Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.

A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.

NSG and Admin Rules Are Port ACLs

Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.

What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.

A Little About NSG Rules

This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.

A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (dey all by default) to control egress traffic.

When I create a NSG I can associate it with:

A NIC: Only that NIC is affected
A subnet: All NICs, including Vnet integrated PaaS resources and Private Endpoints, are affected

The association is simply a management scaling feature. When you associate a NSG with a subnet the rules are not processed at the subnet.

Tip: virtual networks do not exist!

Associating a NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.

This means:

Inbound rules prevent traffic from entering the virtual machine.
Outbound rules prevent traffic from leaving the virtual machine.

Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.

If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.

If you hate yourself and everyone around you, then use both options at the same time:

The subnet NSG is processed first for inbound traffic.
The NIC NSG is processed first for outbound traffic.

Keep it simple, stupid (the KISS principle).

Micro-Segmentation

As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:

We zone using NSG rules.
Virtual networks and their subnets do not exist!

The only time we need to create additional subnets is when there are compatibility issues such as NSG/Route table association or a PaaS resource requires a dedicated subnet.

Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.

Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR]connection error: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR]connection error: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded
[ERROR] SSH connection err: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

The virtual machine should not be reachable on the Internet.
Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft which told me to allow TCP 60000-60001 from Azure Load Balancer to Virtual Network:

I also added my customary DenyAll rule at priority 4000 – the built in Deny rule doesn’t deny all that much!

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

Deploy Log Analytics and a Storage Account for logs
Enable VNet Flow Logs with Traffic Analytics on the staging subnet (where the build is happening) NSG
Recreate the problem (do a new build)
Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:

Now when I test I see the status switch from Running, to Distributing, to Success.

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I need to create rules to allow that traffic.

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are no many more networks and subnets that there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

Network groups: A way to identify virtual networks or subnets that we want to manage.
Connectivity configurations: Manage how multiple virtual networks are connected.
Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
Routing configurations: Deploy user-defined routes by policy.
Verifier: Verify the networks can allow required flows.

Deployment Methodology

The approach is pretty simple:

Identify a collection of networks/subnets you want to configure by creating a Network Group.
Build a configuration, such as connectivity, security admin rules, or routing.
Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Part of a scalable configuration feature of AVNM is network groups. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have ea large collection of targets for one configuration deployment.

Network Groups can be:

Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric to say “A collection of NICs A can now talk with a collection of NICs B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
Full mesh: Every virtual network is connected directly to every other virtual network.
Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! It’s something different that can handle the scale of many virtual networks being interconnected called a Connected Group. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs but they are always processed first. You can create a rule or a set or rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

One is shared with virtual networks.
One is shared with all subnets in a virtual network.
One per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations go generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Network Rules Versus Application Rules for Internal Traffic

This post is about using either Network Rules or Application Rules in Azure Firewall for internal traffic. I’m going to discuss a common scenario, a “problem” that can occur, and how you can deal with that problem.

The Rules Types

There are three kinds of rules in Azure Firewall:

DNAT Rules: Control traffic originating from the Internet, directed to a public IP address attached to Azure Firewall, and translated to a private IP Address/port in an Azure virtual network. This is implicitly applied as a Network Rule. I rarely use DNAT Rules – most modern applications are HTTP/S and enter the virtual network(s) via an Application Gateway/WAF.
Application Rules: Control traffic going to HTTP, HTTPS, or MSSQL (including Azure database services).
Network Rules: Control anything going anywhere.

The Scenario

We have an internal client, which could be:

On-premises over private networking
Connecting via point-to-site VPN
Connected to a virtual network, either the same as the Azure Firewall or to a peered virtual network.

The client needs to connect to a server, with SSL authentication, to a server. The server is connected to another virtual network/subnet. The route to the server goes through the Azure Firewall. I’ll complicate things by saying that the server is a PaaS resource with a Private Endpoint – this doesn’t affect the core problem but it makes troubleshooting more difficult 🙂

NSG rules and firewall rules have been accounted for and created. The essential connection is either HTTPS or MSSQL and is implemented as an Application Rule.

The Problem

The client attempts a connection to the server. You end up with some type of application error stating that there was either a timeout or a problem with SSL/TLS authentication.

You begin to troubleshoot:

Azure Firewall shows the traffic is allowed.
NSG Flow Logs show nothing – you panic until you remember/read that Private Endpoints do not generate flow logs – I told you that I’d complicate things 🙂 You can consider VNet Flow Logs to try to capture this data and then you might discover the cause.

You try and discover two things:

If you disconnect the NSG from the subnet then the connection is allowed. Hmm – the rules are correct so the traffic should be allowed: traffic from the client prefix(es) is permitted to the server IP address/port. The rules match the firewall rules.
You change the Application Rule to a Network Rule (with the NSG still associated and unchanged) and the connection is allowed.

So, something is going on with the Application Rules.

The Root Cause

In this case, the Application Rule is SNATing the connection. In other words, when the connection is relayed from the Azure Firewall instance to the server, the source IP is no longer that of the client; the source IP address is a compute instance in the AzureFirewallSubnet.

That is why:

The connection works when you remove the NSG
The connection works when you use a Network Rule with the NSG – the Network Rule does not SNAT the connection.

Solutions

There are two solutions to the problem:

Using Application Rules

If you want to continue to use Application Rules then the fix is to modify the NSG rule. Change the source IP prefix(es) to be the AzureFirewallSubnet.

The downsides to this are:

The NSG rules are inconsistent with the Azure Firewall rules.
The NSG rules are no longer restricting traffic to documented approved clients.

Using Network Rules

My preference is to use Network Rules for all inbound and east-west traffic. Yes, we lose some of the “Layer-7” features but we still have core features, including IDPS in the Premium SKU.

Contrary to using Application Rules:

The NSG rules are consistent with the Azure Firewall rules.
The NSG rules restrict traffic to the documented approved clients.

When To Use Application Rules?

In my sessions/classes, I teach:

Use DNAT rules for the rare occasion where Internet clients will connect to Azure resources via the public IP address of Azure Firewall.
Use Application Rules for outbound connections to Internet, including Azure resources via public endpoints, through the Azure Firewall.
Use Network Rules for everything else.

This approach limits “weird sh*t” errors like what is described above and means that NSG rules are effectively clones of the Azure Firewall rules, with some additional stuff to control stuff inside of the Virtual Network/subnet.

Checking If Client Has Access To KeyVault With Private Endpoint

How to detect connections to a PaaS resource using Private Endpoint.

In this post, I’ll explain how to check if a client service, such as an App Service, has access to an Azure Key Vault with Private Endpoint.

Private Endpoint

In case you do not know, Private Endpoint gives us a mechanism where we can attach a PaaS service, such as a Key Vault, to a subnet with a NIC and a private IP address. Public connections to the PaaS resources are disabled, and an (Azure) Private DNS Zone is used to alter the name resolution of the PaaS resource to point to the private IP address.

Note that communications to the private endpoint are inbound (and response only). The PaaS resource cannot make outbound connections over a Private Endpoint.

My Scenario

The customer has an App Service Plan that has VNet Integration enabled – this allows the App Services to make outbound connections from “random” IPs on this subnet – NSG/Firewall rules should permit access from the subnet prefix.

The App Services on the plan have Private Endpoints on a second subnet in the VNet. There is also a Key Vault, which also has a Private Endpoint. The “Private Endpoint subnet” has an NSG to deny everything except desired traffic, including allowing HTTPS from the VNet Integration subnet prefix to the Key Vault Private Endpoint.

A developer was wondering if connections from an App Service were working and asked if we could see this in the logs.

Problem

The dev in this case wanted to verify network connectivity. So the obvious place to check was … the network! The way to do that is usually to verify that packets arrived at the destination NIC. You can do that (normally) using NSG Flow Logs. There is sometimes up to 25 minutes (or longer during pandemic compute shortages) of a wait before a flow appears in Log Analytics (data export from the host, 10 minutes collection interval [in our case], data processing [15 minutes]). We checked the logs but nothing was there.

And that is because (at this time) NSG Flow Logs cannot produce data destined to Private Endpoints.

We need a different way to trace connections.

Solution

The solution is to check the logs of the target resource. We enable a lot of logging by standard, including the logs for Key Vault. A little bit of Kql-Fu produced this query:

AzureDiagnostics
| where ResourceProvider =="MICROSOFT.KEYVAULT"
| where ResourceId contains "nameOfVault"
| project CallerIPAddress, OperationName, requestUri_s, ResultType, identity_claim_xms_mirid_s

The resulting columns were:

CallerIPAddress: The IP address of the client (the IP address used by the App Service Plan VNet integration, in our case)
OperationName: Things like SecretGet, Authentication, VaultGet, and SecretList
requestUri_s: The URI of the secret being requested
ResultType: Was it a success or not?
identity_claim_xms_mirid_s: The resource ID of the requesting client (the resource ID of the App Service, in our case)

Armed with the resulting info, the dev got what they needed to prove that the App Service was connecting to the Key Vault.

Enabling NSG Traffic Analytics Fails

This post will deal with a scenario where you get this error when attempting to enable NSG Traffic Analytics with a Log Analytics Workspace:

Failed to save flow log settings

Failed to update flow logs settings for ‘NSG-NAME’. Error: An error occurred..

NSG Traffic Analytics

I work mostly in Azure networking these days. My customers are typically larger enterprises that are focused on governance and security. When you build Azure network architecture for these kinds of organisations, the networks have many pieces to make micro-segmented security a reality. And that means you need to be able to troubleshoot NSG rules and routing. I find the troubleshooting tools in Network Watcher to be useless. Instead, I use:

My own understanding to make up a mental map of the effective routes for the subnet – because this is missing in Azure unless you have an allocated VM NIC in that subnet (often the case)
Azure Firewall’s logs
NSG Traffic Analytics logs in a Log Analytics Workspace

In my architecture, there is a single, central Log Analytics Workspace that is in a different subscription to the virtual networks/NSGs. And this is where the problem is rooted.

Symptoms

When you attempt to enable Traffic Analytics you get the above error. Interestingly, if you only attempt to enable NSG Flow Logs (data logged to storage account) there is no problem. So the issue is related to getting the Workspace configured as a part of the solution (NSG Traffic Analytics).

The Problem & Fix

The problem is that the Microsoft.Network resource provider must be enabled in the subscription that the Workspace is located in. In my case, as I said, I have a dedicated management subscription so there are no network resources to require/enable that resource provider automatically.

If you go to Subscriptions > Resource Providers in the Azure Portal, you can enable the provider there. Wait (no more than 15 minutes) and things should be OK then.

Thanks to Dalan in Azure Networking for helping fix this one!

How to Troubleshoot Azure Routing?

This post will explain how routing works in Microsoft Azure, and how to troubleshoot your routing issues with Route Tables, BGP, and User-Defined Routes in your virtual network (VNet) subnets and virtual (firewall) appliances/Azure Firewall.

Software-Defined Networking

Right now, you need to forget VLANs, and how routers, bridges, routing switches, and all that crap works in the physical network. Some theory is good, but the practice … that dies here.

Azure networking is software-defined (VXLAN). When a VM sends a packet out to the network, the Azure Fabric takes over as soon as the packet hits the virtual NIC. That same concept extends to any virtual network-capable Azure service. From your point of view, a memory copy happens from source NIC to destination NIC. Yes; under the covers there is an Azure backbone with a “more physical” implementation but that is irrelevant because you have no influence over it.

So always keep this in mind: network transport in Azure is basically a memory copy. We can, however, influence the routing of that memory copy by adding hops to it.

Understand the Basics

When you create a VNet, it will have 1 or more subnets. By default, each subnet will have system routes. The first ones are simple, and I’ll make it even more simple:

Route directly via the default gateway to the destination if it’s in the same supernet, e.g. 10.0.0.0/8
Route directly to Internet if it’s in 0.0.0.0/0

By the way, the only way to see system routes is to open a NIC in the subnet, and click Effective Routes under Support & Troubleshooting. I have asked that this is revealed in a subnet – not all VNet-connected services have NICs!

And also, by the way, you cannot ping the subnet default gateway because it is not an appliance; it is a software-defined function that is there to keep the guest OS sane … and probably for us too 😊

When you peer a VNet with another VNet, you do a few things, including:

Instructing VXLAN to extend the plumbing of between the peered VNets
Extending the “VirtualNetwork” NSG rule security tag to include the peered neighbour
Create a new system route for peering.

The result is that VMs in VNet1 will send packets directly to VMs in VNet2 as if they were in the same VNet.

When you create a VNet gateway (let’s leave BGP for later) and create a load network connection, you create another (set of) system routes for the virtual network gateway. The local address space(s) will be added as destinations that are tunnelled via the gateway. The result is that packets to/from the on-prem network will route directly through the gateway … even across a peered connection if you have set up the hub/spoke peering connections correctly.

Let’s add BGP to the mix. If I enable ExpressRoute or a BGP-VPN, then my on-prem network will advertise routes to my gateway. These routes will be added to my existing subnets in the gateway’s VNet. The result is that the VNet is told to route to those advertised destinations via the gateway (VPN or ExpressRoute).

If I have peered the gateway’s VNet with other VNets, the default behaviour is that the BGP routes will propagate out. That means that the peered VNets learn about the on-premises destinations that have been advertised to the gateway, and thus know to route to those destinations via the gateway.

And let’s stop there for a moment.

Route Priority

We now have 2 kinds of route in play – there will be a third. Let’s say there is a system route for 172.16.0.0/16 that routes to virtual network. In other words, just “find the destination in this VNet”. Now, let’s say BGP advertises a route from on-premises through the gateway that is also for 172.16.0.0/16.

We have two routes for the 172.16.0.0/16 destination:

System
BGP

Azure looks at routes that clash like above and deactivates one of them. Azure always ranks BGP above System. So, in our case, the System route for 172.16.0.0/16 will be deactivated and no longer used. The BGP route for 172.16.0.0/16 via the VNet gateway will remain active and will be used.

Specificity

Try saying that word 5 times in a row after 5 drinks!

The most specific route will be chosen. In other words, the route with the best match for your destination is selected by the Azure fabric. Let’s say that I have two active routes:

16.0.0/16 via X
16.1.0/24 via Y

Now, let’s say that I want to send a packet to 172.16.1.4. Which route will be chosen? Route A is a 16 bit match (172.16.*.*). Route B is a 24 bit match (172.16.1.*). Route B is a closer match so it is chosen.

Now add a scenario where you want to send a packet to 172.16.2.4. At this point, the only match is Route A. Route B is not a match at all.

This helps explain an interesting thing that can happen in Azure routing. If you create a generic rule for the 0.0.0.0/0 destination it will only impact routing to destinations outside of the virtual network – assuming you are using the private address spaces in your VNet. The subnets have system routes for the 3 private address spaces which will be more specific than 0.0.0.0:

168.0.0/16
16.0.0/12
0.0.0/8
0.0.0/0

If your VNet address space is 10.1.0.0/16 and you are trying to send a packet from subnet 1 (10.1.1.0/24) to subnet 2 (10.1.2.0/24), then the generic Route D will always be less specific than the system route, Route C.

Route Tables

A route table resource allows us to manage the routing of a subnet. Good practice is that if you need to manage routing then:

Create a route table for the subnet
Name the route table after the VNet/subnet
Only use a route table with 1 subnet

The first thing to know about route tables is that you can control BGP propagation with them. This is especially useful when:

You have peered virtual networks using a hub gateway
You want to control how packets get to that gateway and the destination.

The default is that BGP propagation is allowed over a peering connection to the spoke. In the route table (Settings > Configuration) you can disable this propagation so the BGP routes are never copied from the hub network (with the VNet gateway) to the peered spoke VNet’s subnets.

The second thing about route tables is that they allow us to create user-defined routes (UDRs).

User-Defined Routes

You can control the flow of packets using user-defined routes. Note that UDRs outrank BGP routes and System Routes:

UDR
BGP routes
System routes

If I have a system or BGO route to get to 192.168.1.0/24 via some unwanted path, I can add a UDR to 192.168.1.0/24 via the desired path. If the two routes are identical destination matches, then my UDR will be active and the BGP/system route will be deactivated.

Troubleshooting Tools

The traditional tool you might have used is TRACERT. I’m sorry, it has some use, but it’s really not much more than PING. In the software defined world, the default gateway isn’t a device with a hop, the peering connection doesn’t have a hop, and TRACERT is not as useful as it would have been on-premises.

The first thing you need is the above knowledge. That really helps with everything else.

Next, make sure your NSGs aren’t the problem, not your routing!

Next is the NIC, if you are dealing with virtual machines. Go to Effective Routes and look at what is listed, what is active and what is not.

Network Watcher has a couple of tools you should also look at:

Next Hop: This is a pretty simple tool that tells you the next “appliance” that will process packets on the journey to your destination, based on the actual routing discovered.
Connection Troubleshoot: You can send a packet from a source (VM NIC or Application Gateway) to a certain destination. The results will map the path taken and the result.

The tools won’t tell you why a routing plan failed, but with the above information, you can troubleshoot a (desired) network path.

Locking Down Network Access to the Azure Application Gateway/Firewall

In this post, I will explain how you can use a Network Security Group (NSG) to completely lock down network access to the subnet that contains an Azure Web Application Gateway (WAG)/Web Application Firewall (WAF).

The stops are as follows:

Deploy a WAG/WAF to a dedicated subnet.
Create a Network Security Group (NSG) for the subnet.
Associate the NSG with the subnet.
Create an inbound rule to allow TCP 65503-65534 from the Internet service tag to the CIDR address of the WAG/WAF subnet.
Create rules to allow application traffic, such as TCP 443 or TCP 80, from your sources to the CIDR address of the WAG/WAF
Create a low priority (4000) rule to allow any protocol/port from the AzureLoadBlanacer service tag to the CIDR address of the WAG/WAF
Create a rule, with the lowest priority (4096) to Deny All from Any source.

The Scenario

It is easy to stand up a WAG/WAF in Azure and get it up and running. But in the real world, you should lock down network access. In the world of Azure, all network security begins with an NSG. When you deploy WAG/WAF in the real world, you should create an NSG for the WAG/WAF subnet and restrict the traffic to that subnet to what is just required for:

Health monitoring of the WAG/WAF
Application access from the authorised sources
Load balancing of the WAG/WAF instances

Everything else inbound will be blocked.

The NSG

Good NSG practice is as follows:

Tiers of services are placed into their own subnet. Good news – the WAG/WAF requires a dedicated subnet.
You should create an NSG just for the subnet – name the NSG after the VNet-Subnet, and maybe add a prefix or suffix of NSG to the name.

Health Monitoring

Azure will need to communicate with the WAG/WAF to determine the health of the backends – I know that this sounds weird, but it is what it is.

Note: You can view the health of your backend pool by opening the WAG/WAF and browsing to Monitoring > Backend Health. Each backend pool member will be listed here. If you have configured the NSG correctly then the pool member status should be “Healthy”, assuming that they are actually healthy. Otherwise, you will get a warning saying:

Unable to retrieve health status data. Check presence of NSG/UDR blocking access to ports 65503-65534 from Internet to Application Gateway.

OK – so you need to open those ports from “Internet”. Two questions arise:

Is this secure? Yes – Microsoft states here that these ports are “are protected (locked down) by Azure certificates. Without proper certificates, external entities, including the customers of those gateways, will not be able to initiate any changes on those endpoints”.
What if my WAG/WAF is internal and does not have a public IP address? You will still do this – remember that “Internet” is everything outside the virtual network and peered virtual networks. Azure will communicate with the WAG/WAF via the Azure fabric and you need to allow this communication that comes from an external source.

In my example, my WAF subnet CIDR is 10.0.2.4/24:

Application Traffic

Next, I need to allow application traffic. Remember that the NSG operates at the TCP/UDP level and has no idea of URLs – that’s the job of the WAG/WAF. I will use the NSG to define what TCP ports I am allowing into the WAG/WAF (such as TCP 443) and from what sources.

In my example, the WAF is for internal usage. Clients will connect to applications over a VPN/ExpressRoute connection. Here is a sample rule:

If this was an Internet-facing WAG or WAF, then the source service tag would be Internet. If other services in Azure need to connect to this WAG or WAF, then I would allow traffic from either Virtual Network or specific source CIDRs/addresses.

The Azure Load Balancer

To be honest, this one caught me out until I reasoned what the cause was. My next rule will deny all other traffic to the WAG/WAF subnet. Without this load balancer rule, the client could not connect to the WAG/WAF. That puzzled me, and searches led me nowhere useful. And then I realized:

A WAG/WAF is 1+ instances (2+ in v2), each consuming IP addresses in the subnet.
They are presented to clients as a single IP.
That single IP must be a load balancer
That load balancer needs to probe the load balancer’s own backend pool – which are the instance(s) of the WAG/WAF in this case

You might ask: isn’t there a default rule to allow a load balancer probe? Yes, it has priority 65001. But we will be putting in a rule at 4096 to prevent all connections, overriding the 65000 rule that allows everything from VirtualNetwork – which includes all subnets in the virtual network and all peered virtual networks.

The rule is simple enough:

Deny Everything Else

Now we will override the default NSG rules that allow all communications to the subnet from other subnets in the same VNet or peered VNets. This rule should have the lowest possible user-defined priority, which is 4096:

Why am I using the lowest possible priority? This is classic good firewall rule practice. General rules should be low priority, and specific rules should be high priority. The more general, the lower. The more specific, the higher. The most general rule we have in firewalls is “block everything we don’t allow”; in other words, we are creating a white list of exceptions with the previously mentioned rules.

The Results

You should end up with:

The health monitoring rule will allow Azure to check your WAG/WAF over a certificate-secured channel.
Your application rules will permit specified clients to connect to the WAG/WAF, via a hidden load balancer.
The load balancer can probe the WAG/WAF and forward client connections.
The low priority deny rule will block all other communications.

Job done!

Azure-to-Azure Site Recovery Fails – Connection Cannot Be Established

In this post, I’ll explain how to fix the following errors when you attempt to replicate an Azure virtual machine from one Azure Region to another:

Error 151072: Connection cannot be established to Azure Site Recovery service endpoints.

And:

Error 539: The requested action couldn’t be performed by the ‘A2A’ Replication Provider.

The Cause

A2ASR (the abbreviation of the ASR service for Azure VMs) uses an extension (guest OS agent) called the Mobility Service to migrate disk contents from a source virtual machine to a target (secondary) region (or DR site). The Mobility Service is using the networking of the virtual machine to talk the ASR endpoints in the secondary region. That traffic is therefore going over the NIC and virtual network of the VM, and then to the target region via the Azure backbone.

if you have restricted outbound traffic for your virtual machines, then you might have blocked this traffic:

Third party firewall appliances
Using Network Security Groups (NSGs), as I documented here

The Fix

Woops! Don’t worry, you’ve already created exceptions to allow your virtual machine to boot up. You can create more exceptions to allow the virtual machines to talk to the ASR endpoints (see the below screenshot). Let’s imagine that I am replicating from North Europe to West Europe.

I’ll need at least one set of rules, enabling outbound traffic from my VNet/NICs in the source region, North Europe, to the two IP addresses of the target region, West Europe.

I will also have to enable inbound traffic from my target region, West Europe, to my destination region, North Europe. Why? Isn’t all my traffic going from North Europe to West Europe? That’s true – now. But if you failover to West Europe, you will need to reverse replication afterwards, so you might as well get things right now.

A Script

It all looks messy at first. It probably isn’t too bad. But if you’d like to deploy a canned script to update NSGs, you can. Microsoft has shared a script that you can run. You will need a few pieces of information:

NSG name
NSG resource group name
Subscription ID
Source region
Target region

Run the script (it will prompt you to log in) from source to target, and then reverse the details, treating the target as the source, and vice versa with the NSG(s) in the DR site.

Where’s the Service Tags?

Storage accounts and Azure SQL all have service accounts, but ASR does not. I believe that ASR should have service tags to avoid all of this IP messiness. If you agree, vote here, or forever stay quiet on the subject.

Was This Kind of Information Useful?

If you found this information useful, then imagine what 2 days of training might mean to you. I’m delivering a 2-day course in Amsterdam on April 19-20, teaching newbies and experienced Azure admins about Azure Infrastructure. There’ll be lots of in-depth information, covering the foundations, best practices, troubleshooting, and advanced configurations. You can learn more here.