Designing Network Security To Combat Modern Threats

In this post, I want to discuss how one should design network security in Microsoft Azure, dispensing with past patterns and combatting threats that are crippling businesses today.

The Past

Network security did not change much for a very long time. The classic network design is focused on an edge firewall. “All the bad guys are trying to penetrate our network from the Internet” so we’ll put up a very strong wall at the edge. With that approach, you’ll commonly find the “DMZ” network; a place where things like web proxies and DNS proxies isolate interior users and services from the Internet.

The internal network might be made up of two/more VLANs. For example, one or more client device VLANs and a server VLAN. While the route between those VLANs might pass through the firewall, it probably didn’t; they really “routed” through a smart core switch stack and there was limited to no firewall isolation between those VLANs.

This network design is fertile soil for malware. Ports usually are not left open to attack on the edge firewall. Hackers aren’t normally going to brute force their way through a firewall. There are easier ways in such as:

  • Send an “invoice” PDF to the accounting department that delivers a trojan horse.
  • Impersonate someone, ideally someone that travels and shouts a lot, to convince a helpful IT person to reset a password.
  • Target users via phishing or spear phishing.
  • Compromise some upstream include that developers use and use it to attack from the servers.
  • Use a SQL injection attack to open a command prompt on an internal server.
  • And on and on and …

In each of those cases, the attack comes from within. The spread of the blast (the attack) is unfettered. The blast area (a term used to describe the spread of an attack) is the entire network.

Secure Zones To The Rescue!

Government agencies love a nice secure zone architecture. This is a design where sensitive systems, such as GDPR data or secrets, are stored on an isolated network.

Some agencies will even create a whole duplicate network that is isolated, forcing users to have two PCs – one “regular” one on the Internet-connected network and a “secure” PC that is wired onto an isolated network with limited secret services.

Realistically, that isolated network is of little value to most, but if you have that extreme a need – then good luck. By the way, that won’t work in The Cloud 🙂 Back to the more regular secure zone …

A special VLAN will be deployed and firewall rules will block all traffic into and out of that secure zone. The user experience might be to use Citrix desktops, hosted in the secure zone, to access services and data in that secure zone. But then reality starts cracking holes in the firewall’s deny all rules. No line of business app lives alone. They all require data from somewhere. Or there are integrations. Printers must be used. Scanners need to scan and share data. And legacy apps often use:

  • Domain (ADDS) credentials (how many ports do you need for that!!!)
  • SMB (TCP 445) for data transfer and integration

Over time, “deny all” becomes a long list of allow * from X to *, and so on, with absolutely no help from the app vendors.

The theory is that if an attack is commenced, then the blast area will be limited to the client network and, if it reaches the servers, it will be limited to the Internal network. But this design fails to understand that:

  • An attack can come from within. Consider the scenario where compromised runtimes are used or a SQL injection attack breaks out from a database server.
  • All the required integrations open up holes between the secure zone and the other networks, including those legacy protocols that things like ransomware live on.
  • If one workload in the secure zone is compromised, they all are because there is no network segmentation inside of the VLAN.

And eventually, the “secure zone” is no more secure than the Internal network.

Don’t Block The Internet!!!

I’m amazed how many organisations do not block outbound access to the Internet. It’s just such hard work to open up firewall rules for all these applications that have Internet dependencies. I can understand that for a client VLAN. But the server VLAN should be a controlled space – if it’s not known & controlled (i.e. governed) then it should not be permitted.

A modern attack, an advanced persistent threat (APT), isn’t just some dumb blast, grab, and run. It is a sneaky process of:

  1. Penetration
  2. Discovery, often manually controlled
  3. Spread, often manually controlled
  4. Steal
  5. Destroy/encrypt/etc

Once an APT gets in, it usually wants to call home to pull instructions down from a rogue IP address or compromised bot. When the APT wants to steal data, to be used as blackmail and/or to be sold on the Darknet, the malware will seek to upload data to the Internet. Both of these actions are taking advantage of the all-too-common open access to the Internet.

Azure is Different

Years of working with clients has taught me that there are three kinds of people when it comes to Azure networking:

  1. Those who managed on-premises networks: These folks struggle with Azure networking.
  2. Those who didn’t do on-premises networking, but knew what to ask for: These folks take to Azure networking quite quickly.
  3. Everyone else: Irrelevant to this topic

What makes Azure networking so difficult for the network admins? There is no cabling in the fabric – obviously there is cabling in the data centres but it’s all abstracted by the VXLAN software-defined networks. Packets are encapsulated on the source virtual machine’s host, transmitted over the physical network, decapsulated on the destination virtual machine host, and presented to the destination virtual machine’s NIC. In short, packets leave the source NIC and magically arrive on the destination NIC with no hops in between – this is why traceroute is pointless in Azure and why the default gateway doesn’t really exist.

“I’m not going to use virtual machines, Aidan. I’m doing PaaS and serverless computing.” In Azure, everything is based on virtual machines, unless it is explicitly hosted on physical hosts (Azure VMware Solution and some SAP stuff, for example). Even Functions run on a VM somewhere hidden in the platform. Serverless means that you don’t need to manage it.

The software-defined thing is why:

  • Partitioned subnets for a firewall appliance (front, back, VPN, and management) offer nothing from a security perspective in Azure.
  • ICMP isn’t as useful as you’d imagine in Azure.
  • The concept of partitioning workloads for security using subnets is not as useful as you might think – it’s actually counter-productive over time.

Transformation

I like to remind people during a presentation or a project kickoff that going on a cloud journey is supposed to result in transformation. You now re-evaluate everything and find better ways to do old things using cloud-native concepts. And that applies to network security designs too.

Micro-Segmentation Is The Word

Forget “Grease”, get on board with what you need to counter today’s threats: micro-segmentation. This is a concept where:

  • We protect the edge, inbound and outbound, permitting only required traffic.
  • We apply network isolation within the workload, permitting only required traffic.
  • We route traffic between workloads through the edge firewall, permitting only required traffic.

Yes, more work will be required when you migrate existing workloads to Azure. I’d suggest using Azure Migrate to map network flows. I never get to do that – I always get the “messy migration projects” and I never get to use Azure Migrate – so testing and accessing and understanding NSG Traffic Analytics and the Azure Firewall/firewall logs via KQL is a necessary skill.

Security Classification

Every workload should go through a security classification process. You need to weigh risk versus complexity. If you max the security, you will increase costs and difficulty for otherwise simple operations. For example, a dev won’t be able to connect Visual Studio straight to an App Service if you deploy that App Service on a private or isolated App Service Plan. You also will have to host your own DevOps agents/GitHub runners because the Microsoft-hosted containers won’t be able to reach your SCM endpoints.

Every piece of compute is a potential attack vector: a VM, an App Service, a Function, a Container, a Logic App. The question is, if it is compromised, will the attacker be able to jump to something else? Will the data that is accessible be secret, subject to regulation, or a source of reputational damage?

This measurement process will determine if a workload should use resources that:

  • Have public endpoints (cheapest and easiest).
  • Use private endpoints (medium levels of cost, complexity, and security).
  • Use full VNet integration, such as an App Service Environment or a virtual machine (highest cost/complexity but most secure).

The Virtual Network & Subnet

Imagine you are building a 3-tier workload that will be isolated from the Internet using Azure virtual networking:

  • Web servers on the Internet
  • Middle tier
  • Databases

Not that long ago, we would have deployed that workload on 3 subnets, one for each tier. Then we would have built isolation using Network Security Groups (NSGs), one for each subnet. But you just learned that an SD-network routes packets directly from NIC to NIC. An NSG is a Hyper-V Port ACL that is implemented at the NIC, even if applied at the subnet level. We can create all the isolation we want using an NSG within the subnet. That means we can flatten the network design for the workload to one subnet. A subnet-associated NSG will restrict communications between the tiers – and ideally between nodes within the same tier. That level of isolation should block everything … should 🙂

Tips for virtual networks and subnets:

  • Deploy 1 virtual network per workload: Not only will this follow Azure Cloud Adoption Framework concepts, but it will help your overall security and governance design. Each workload is placed into a spoke virtual network and peered with a hub. The hub is used only for external connectivity, the firewall, and Azure Bastion (assuming this is not a vWAN hub).
  • Assign a single prefix to your hub & spoke: Firewall and NSG rules will be easier.
  • Keep the virtual networks small: Don’t waste your address space.
  • Flatten your subnets: Only deploy subnets when there is a technical need, for example VMs and private endpoints are in one subnet, VNet integration for an App Services plan is in another, and a SQL managed instance is in a third.

Resource Firewalls

It’s sad to see how many people disable operating system firewalls. For example, Group Policy is used to disable Windows Firewall. Don’t you know that Microsoft and Linux added those firewalls to protect machines from internal attacks? Those firewalls should remain operational and only permit required traffic.

Many Azure resources also offer firewalls. App Services have firewalls. Azure SQL has a firewall. Use them! The one messy resource is the storage account. The location of the endpoints for storage clusters is in a weird place – and this causes interesting situations. For example, a Logic App’s storage account with a configured firewall will prevent workflows from being created/working correctly.

Network Security Groups

Take a look at the default inbound rules in an NSG. You’ll find there is a Deny All rule which is the lowest possible priority. Just up from that rule, is a built in rule to allow traffic from VirtualNetwork. VirtualNetwork includes the subnet, the virtual network, and all routed networks, including peers and site-to-site connections. So all traffic from internal networks is … permitted! This is why every NSG that I create has a custom DenyAll rule with a priority of 4000. Higher priority rules are created to permit required traffic and only that required traffic.
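To make the priority logic concrete, here is a minimal sketch of first-match evaluation for a flattened 3-tier subnet. It is a toy model, not how Azure implements NSGs: the tier labels, the ports (443 and 1433), the rule names, and the priorities 1000/1100 are all assumptions for illustration. Only the custom DenyAll at 4000 and the built-in VirtualNetwork allow rule come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number = evaluated first
    name: str
    source: str            # tier label, or "*" for any source
    destination: str       # tier label, or "*" for any destination
    port: Optional[int]    # None means any port
    allow: bool


# Simplified stand-ins for the built-in inbound rules. In this model every
# source lives inside the virtual network, so AllowVnetInBound matches it all.
DEFAULT_RULES = [
    Rule(65000, "AllowVnetInBound", "*", "*", None, True),
    Rule(65500, "DenyAllInBound", "*", "*", None, False),
]

# Hypothetical custom rules: permit only the required flows, then deny.
CUSTOM_RULES = [
    Rule(1000, "AllowWebToMid", "web", "mid", 443, True),
    Rule(1100, "AllowMidToDb", "mid", "db", 1433, True),
    Rule(4000, "DenyAll", "*", "*", None, False),
]


def evaluate(rules: List[Rule], source: str, destination: str, port: int) -> Tuple[str, bool]:
    """Return (rule name, allowed) for the first rule that matches the flow."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source not in ("*", source):
            continue
        if rule.destination not in ("*", destination):
            continue
        if rule.port is not None and rule.port != port:
            continue
        return rule.name, rule.allow
    return "NoMatch", False
```

With only the defaults, a web server reaching the database directly is matched by AllowVnetInBound and permitted. With the custom rules in front, the same flow falls through to DenyAll at 4000, while the mid tier can still reach the database on the one permitted port.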

Tips with your NSGs:

  • Use 1 NSG per subnet: Where the subnet resources will support an NSG. You will reduce your overall complexity and make troubleshooting easier. Remember, all NSG rules are actually applied at the source (outbound rules) or target (inbound rules) NIC.
  • Limit the use of “any”: Rules should be as accurate as possible. For example: Allow TCP 445 from source A to destination B.
  • Consider the use of Application Security Groups: You can abstract IP addresses with an Application Security Group (ASG) in an NSG rule. ASGs can be used with NICs – virtual machines and private endpoints.
  • Enable NSG Flow Logs & Traffic Analytics: Great for troubleshooting networking (not just firewall stuff) and for feeding data to a SIEM. VNet Flow Logs will be a superior replacement when it is ready for GA.

The Hub

As I’ve implied already, you should employ a hub & spoke design. The hub should be simple, small and free of compute. The hub:

  • Makes connections using site-to-site networking using SD-WAN, VPN, and/or ExpressRoute.
  • Hosts the firewall. The firewall blocks everything in every direction by default.
  • Hosts Azure Bastion, unless you are running Azure Virtual WAN – then deploy it to a spoke.
  • Is the “Public IP” for egress traffic for workloads trying to reach the Internet. All egress traffic is via the firewall. Azure Policy should be used to restrict Public IP Addresses to just those resources that require them – things like Azure Bastion require a public IP and you should create a policy override for each required resource ID.

My preference is to use Azure Firewall. That’s a long conversation so let’s move on to another topic; Azure Bastion.

Most folks will go into Azure thinking that they will RDP/SSH straight to their VMs. RDP and SSH are not perfect. This is something that the secure zone concept recognised. It was not unusual for admins/operators to use a bastion host to hop via RDP or SSH from their PC to the required server via another server. RDP/SSH were not open directly to the protected machines.

Azure Bastion should offer the same isolation. Your NSG rules should only permit RDP/SSH from:

  • The AzureBastionSubnet
  • Any other bastion hosts that might be employed, typically by developers who will deploy specialist tools.

Azure Bastion requires:

  • An Entra ID sign-in, ideally protected by features such as conditional access and MFA, to access the bastion service.
  • The destination machine’s credentials.

Routing

Now we get to one of my favourite topics in Azure. In the on-prem world we can control how packets get from A to B using cables. But as you’ve learned, we can’t run cables in Azure. However, we can control the next hop of a packet.

We want to control flows:

  • Ingress from site-to-site networking to flow through the hub firewall: A route in the GatewaySubnet to use the hub firewall as the next hop.
  • All traffic leaving a spoke (workload virtual network) to flow through the hub firewall: A route to 0.0.0.0/0 using the firewall backend/private IP as the next hop.
  • All traffic between hub & spokes to flow through the remote hub firewall: A route to the remote hub & spoke IP prefix (see above tip) with a next hop of the remote hub firewall.

If you follow my tips, especially with the simple hub, then the routing is actually quite easy to implement and maintain.
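The three flows above can be written down as plain data before you build any route tables. The following is only a sketch: the function names and dictionary keys are my own illustrative choices (not an Azure SDK or ARM schema), the IP addresses in the usage note are made up, and I am assuming that the GatewaySubnet routes target the spoke prefixes. The next hop type of "VirtualAppliance" is the type used for a firewall next hop in a user-defined route.

```python
import ipaddress


def route(name, prefix, firewall_ip):
    """Build one user-defined route that sends a prefix to a firewall."""
    # Validate early; both calls raise ValueError on malformed input.
    network = ipaddress.ip_network(prefix)
    hop = ipaddress.ip_address(firewall_ip)
    return {
        "name": name,
        "address_prefix": str(network),
        "next_hop_type": "VirtualAppliance",
        "next_hop_ip": str(hop),
    }


def gateway_subnet_routes(spoke_prefixes, hub_firewall_ip):
    """Ingress from site-to-site networking: send spoke-bound traffic to the hub firewall."""
    return [
        route(f"spoke-{index}", prefix, hub_firewall_ip)
        for index, prefix in enumerate(spoke_prefixes, start=1)
    ]


def spoke_routes(hub_firewall_ip):
    """Everything leaving a spoke goes to the hub firewall."""
    return [route("everywhere", "0.0.0.0/0", hub_firewall_ip)]


def remote_hub_routes(remote_prefix, remote_firewall_ip):
    """Traffic for a remote hub & spoke prefix goes to the remote hub firewall."""
    return [route("remote-hub-and-spoke", remote_prefix, remote_firewall_ip)]
```

For example, spoke_routes("10.1.0.4") returns a single 0.0.0.0/0 route, which is why a small hub with one firewall IP keeps the routing so short.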

Tips:

  • Keep the hub free of compute.
  • NSG Traffic Analytics helps to troubleshoot.

Web Application Firewall

The hub firewall should not be used to present web applications to the Internet. If a web app is classified as requiring network security, then it should be reverse proxied using a Web Application Firewall (WAF). This specialised firewall inspects traffic at the application layer and can block threats.

The WAF will have a lot of false positives. Heavy traffic applications can produce a lot of false positives in your logs; in the case of Log Analytics, the ingestion charge can be huge so get to optimising those false positives as quickly as you can.

My preference is to route the WAF through the hub firewall to the backend applications. The WAF is a form of compute, even the Azure WAF. If you do not need end-to-end TLS, then the firewall could be used to inspect the HTTP traffic from the WAF to the backend using Intrusion Detection and Prevention System (IDPS), offering another layer of protection.

Azure offers a couple of WAF options. Front Door with WAF is architecturally interesting, but the default design is that the backend has a public endpoint that limits access to your Front Door instance at the application layer. What if the backend is network connected for max protection? Then you get into complexities with Private Link/Private Endpoint.

A regional WAF is network connected and offers simpler networking, but it sacrifices the performance boosts from Front Door. You can combine Front Door with a regional WAF, but there are more costs with this.

Third party solutions are possible. Services such as Cloudflare offer performance and security features. One could argue that Cloudflare offers more features. From the performance perspective, keep in mind that Cloudflare has only a few peering locations with the Microsoft WAN, so a remote user might have to take a detour to get to your Azure resources, increasing latency.

You can seek out WAF solutions from the likes of F5 and Citrix in the Azure Marketplace. Keep in mind that NVAs can perpetuate skills challenges by siloing the skills – native cloud skills are easier to develop and contract/hire.

Summary

I was going to type something like “this post gives you a quick tour of the micro-segmentation approach/features that you can use in Azure” but then I realised that I’ve had keyboard diarrhea and this post is quite Sinofskian. What I’ve tried to explain is that the ways of the past:

  • Don’t do much for security anymore
  • Are actually more complex in architecture than Azure-native patterns and solutions that will work.

If you implement security at three layers, assuming that a breach will happen and could happen anywhere then you limit the blast area of a threat:

  • The edge, using the firewall and a WAF
  • The NIC, using a Network Security Group
  • The resource, using a guest OS/resource firewall

This trust-no-one approach that denies all but the minimum required traffic will make life much harder for an attacker. Including logging and the use of a well configured SIEM will create trip wires that an attacker must trip over to attempt an expansion. You will make their expansion harder & slower, and make it easier to detect them. You will also limit how much they can spread and how much damage the attack can create. Furthermore, you will be following the guidance that the likes of the FBI are recommending.

There is so much more to consider when it comes to security, but I’ve focused on micro-segmentation in a network context. People do think about Entra ID and management solutions (such as Defender for Cloud and/or SIEM) but they rarely think through the network design by assuming that what they did on-prem will still be fine. It won’t because on-prem isn’t fine right now! So take my advice, transform your network, and protect your assets, shareholders, and your career.

Enabling NSG Traffic Analytics Fails

This post will deal with a scenario where you get this error when attempting to enable NSG Traffic Analytics with a Log Analytics Workspace:

Failed to save flow log settings
Failed to update flow logs settings for ‘NSG-NAME’. Error: An error occurred..

NSG Traffic Analytics

I work mostly in Azure networking these days. My customers are typically larger enterprises that are focused on governance and security. When you build Azure network architecture for these kinds of organisations, the networks have many pieces to make micro-segmented security a reality. And that means you need to be able to troubleshoot NSG rules and routing. I find the troubleshooting tools in Network Watcher to be useless. Instead, I use:

  • My own understanding to make up a mental map of the effective routes for the subnet – because this is missing in Azure unless you have an allocated VM NIC in that subnet (often the case)
  • Azure Firewall’s logs
  • NSG Traffic Analytics logs in a Log Analytics Workspace

In my architecture, there is a single, central Log Analytics Workspace that is in a different subscription to the virtual networks/NSGs. And this is where the problem is rooted.

Symptoms

When you attempt to enable Traffic Analytics you get the above error. Interestingly, if you only attempt to enable NSG Flow Logs (data logged to storage account) there is no problem. So the issue is related to getting the Workspace configured as a part of the solution (NSG Traffic Analytics).

The Problem & Fix

The problem is that the Microsoft.Network resource provider must be enabled in the subscription that the Workspace is located in. In my case, as I said, I have a dedicated management subscription so there are no network resources to require/enable that resource provider automatically.

If you go to Subscriptions > Resource Providers in the Azure Portal, you can enable the provider there. Wait (no more than 15 minutes) and things should be OK then.
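If you would rather script this than click through the portal, the Azure CLI can do the same registration. Below is a small sketch that wraps the CLI from Python. The function names are hypothetical, it assumes the az CLI is installed and that you are already signed in, and the subscription ID is whatever subscription hosts your Workspace.

```python
import subprocess


def provider_commands(subscription_id, namespace="Microsoft.Network"):
    """Return the az CLI invocations to register a provider and to check its state."""
    register = [
        "az", "provider", "register",
        "--namespace", namespace,
        "--subscription", subscription_id,
    ]
    check = [
        "az", "provider", "show",
        "--namespace", namespace,
        "--subscription", subscription_id,
        "--query", "registrationState",
        "--output", "tsv",
    ]
    return register, check


def register_provider(subscription_id, namespace="Microsoft.Network"):
    """Register the provider, then return the registration state that az reports."""
    register, check = provider_commands(subscription_id, namespace)
    subprocess.run(register, check=True)
    result = subprocess.run(check, check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Registration is not instant, so the state that comes back may still be in progress. Re-run the check command until it reports that the provider is registered, then try enabling Traffic Analytics again.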

Thanks to Dalan in Azure Networking for helping fix this one!

How to Troubleshoot Azure Routing?

This post will explain how routing works in Microsoft Azure, and how to troubleshoot your routing issues with Route Tables, BGP, and User-Defined Routes in your virtual network (VNet) subnets and virtual (firewall) appliances/Azure Firewall.

Software-Defined Networking

Right now, you need to forget VLANs, and how routers, bridges, routing switches, and all that crap works in the physical network. Some theory is good, but the practice … that dies here.

Azure networking is software-defined (VXLAN). When a VM sends a packet out to the network, the Azure Fabric takes over as soon as the packet hits the virtual NIC. That same concept extends to any virtual network-capable Azure service. From your point of view, a memory copy happens from source NIC to destination NIC. Yes; under the covers there is an Azure backbone with a “more physical” implementation but that is irrelevant because you have no influence over it.

So always keep this in mind: network transport in Azure is basically a memory copy. We can, however, influence the routing of that memory copy by adding hops to it.

Understand the Basics

When you create a VNet, it will have 1 or more subnets. By default, each subnet will have system routes. The first ones are simple, and I’ll make it even more simple:

  • Route directly via the default gateway to the destination if it’s in the same supernet, e.g. 10.0.0.0/8
  • Route directly to Internet if it’s in 0.0.0.0/0

By the way, the only way to see system routes is to open a NIC in the subnet, and click Effective Routes under Support & Troubleshooting. I have asked that this is revealed in a subnet – not all VNet-connected services have NICs!

And also, by the way, you cannot ping the subnet default gateway because it is not an appliance; it is a software-defined function that is there to keep the guest OS sane … and probably for us too 😊

When you peer a VNet with another VNet, you do a few things, including:

  • Instructing VXLAN to extend the plumbing between the peered VNets.
  • Extending the “VirtualNetwork” NSG rule security tag to include the peered neighbour.
  • Creating a new system route for peering.

The result is that VMs in VNet1 will send packets directly to VMs in VNet2 as if they were in the same VNet.

When you create a VNet gateway (let’s leave BGP for later) and create a local network connection, you create another (set of) system routes for the virtual network gateway. The local address space(s) will be added as destinations that are tunnelled via the gateway. The result is that packets to/from the on-prem network will route directly through the gateway … even across a peered connection if you have set up the hub/spoke peering connections correctly.

Let’s add BGP to the mix. If I enable ExpressRoute or a BGP-VPN, then my on-prem network will advertise routes to my gateway. These routes will be added to my existing subnets in the gateway’s VNet. The result is that the VNet is told to route to those advertised destinations via the gateway (VPN or ExpressRoute).

If I have peered the gateway’s VNet with other VNets, the default behaviour is that the BGP routes will propagate out. That means that the peered VNets learn about the on-premises destinations that have been advertised to the gateway, and thus know to route to those destinations via the gateway.

And let’s stop there for a moment.

Route Priority

We now have 2 kinds of route in play – there will be a third. Let’s say there is a system route for 172.16.0.0/16 that routes to virtual network. In other words, just “find the destination in this VNet”. Now, let’s say BGP advertises a route from on-premises through the gateway that is also for 172.16.0.0/16.

We have two routes for the 172.16.0.0/16 destination:

  • System
  • BGP

Azure looks at routes that clash like above and deactivates one of them. Azure always ranks BGP above System. So, in our case, the System route for 172.16.0.0/16 will be deactivated and no longer used. The BGP route for 172.16.0.0/16 via the VNet gateway will remain active and will be used.

Specificity

Try saying that word 5 times in a row after 5 drinks!

The most specific route will be chosen. In other words, the route with the best match for your destination is selected by the Azure fabric. Let’s say that I have two active routes:

  A. 172.16.0.0/16 via X
  B. 172.16.1.0/24 via Y

Now, let’s say that I want to send a packet to 172.16.1.4. Which route will be chosen? Route A is a 16 bit match (172.16.*.*). Route B is a 24 bit match (172.16.1.*). Route B is a closer match so it is chosen.

Now add a scenario where you want to send a packet to 172.16.2.4. At this point, the only match is Route A. Route B is not a match at all.

This helps explain an interesting thing that can happen in Azure routing. If you create a generic rule for the 0.0.0.0/0 destination it will only impact routing to destinations outside of the virtual network – assuming you are using the private address spaces in your VNet. The subnets have system routes for the 3 private address spaces which will be more specific than 0.0.0.0:

  A. 192.168.0.0/16
  B. 172.16.0.0/12
  C. 10.0.0.0/8
  D. 0.0.0.0/0

If your VNet address space is 10.1.0.0/16 and you are trying to send a packet from subnet 1 (10.1.1.0/24) to subnet 2 (10.1.2.0/24), then the generic Route D will always be less specific than the system route, Route C.
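Here is a minimal sketch of that "most specific route wins" selection, using Python's standard ipaddress module. The route labels A to D follow the four prefixes discussed above (the three private address spaces plus 0.0.0.0/0). The function name is my own, and this only models prefix length, not the other factors that the Azure fabric considers.

```python
import ipaddress

# The three private address spaces plus the generic default route.
ROUTES = {
    "A": "192.168.0.0/16",
    "B": "172.16.0.0/12",
    "C": "10.0.0.0/8",
    "D": "0.0.0.0/0",
}


def most_specific_route(destination, routes=ROUTES):
    """Return the label of the matching route with the longest prefix, or None."""
    address = ipaddress.ip_address(destination)
    matches = []
    for label, prefix in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            matches.append((network.prefixlen, label))
    if not matches:
        return None
    # The longest prefix length is the closest match for the destination.
    return max(matches)[1]
```

A packet to 10.1.2.4 matches both C and D, and C wins because /8 is longer than /0. A packet to a public address such as 8.8.8.8 only matches D, which is why a 0.0.0.0/0 route affects only traffic leaving the private address spaces.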

Route Tables

A route table resource allows us to manage the routing of a subnet. Good practice is that if you need to manage routing then:

  • Create a route table for the subnet
  • Name the route table after the VNet/subnet
  • Only use a route table with 1 subnet

The first thing to know about route tables is that you can control BGP propagation with them. This is especially useful when:

  • You have peered virtual networks using a hub gateway
  • You want to control how packets get to that gateway and the destination.

The default is that BGP propagation is allowed over a peering connection to the spoke. In the route table (Settings > Configuration) you can disable this propagation so the BGP routes are never copied from the hub network (with the VNet gateway) to the peered spoke VNet’s subnets.

The second thing about route tables is that they allow us to create user-defined routes (UDRs).

User-Defined Routes

You can control the flow of packets using user-defined routes. Note that UDRs outrank BGP routes and System Routes:

  1. UDR
  2. BGP routes
  3. System routes

If I have a system or BGP route to get to 192.168.1.0/24 via some unwanted path, I can add a UDR to 192.168.1.0/24 via the desired path. If the two routes are identical destination matches, then my UDR will be active and the BGP/system route will be deactivated.
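The ranking for clashing routes can be sketched in a few lines. This is a toy model of the behaviour described above, not the fabric's actual algorithm: the tuple layout, the next hop labels, and the "Active"/"Inactive" strings are illustrative assumptions.

```python
# Lower rank wins when two routes have an identical destination prefix.
SOURCE_RANK = {"UDR": 0, "BGP": 1, "System": 2}


def resolve_clashes(routes):
    """Mark each (prefix, source, next_hop) route as Active or Inactive.

    For every prefix, only the route(s) from the best-ranked source stay active.
    """
    best_source = {}
    for prefix, source, _next_hop in routes:
        current = best_source.get(prefix)
        if current is None or SOURCE_RANK[source] < SOURCE_RANK[current]:
            best_source[prefix] = source
    return [
        (prefix, source, next_hop,
         "Active" if best_source[prefix] == source else "Inactive")
        for prefix, source, next_hop in routes
    ]


# Hypothetical effective routes for one subnet.
EXAMPLE = [
    ("172.16.0.0/16", "System", "VirtualNetwork"),
    ("172.16.0.0/16", "BGP", "VirtualNetworkGateway"),
    ("192.168.1.0/24", "BGP", "VirtualNetworkGateway"),
    ("192.168.1.0/24", "UDR", "10.1.0.4"),
]
```

Running resolve_clashes(EXAMPLE) leaves the BGP route active for 172.16.0.0/16 and the UDR active for 192.168.1.0/24. This is the same thing you would look for in Effective Routes on a NIC.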

Troubleshooting Tools

The traditional tool you might have used is TRACERT. I’m sorry, it has some use, but it’s really not much more than PING. In the software defined world, the default gateway isn’t a device with a hop, the peering connection doesn’t have a hop, and TRACERT is not as useful as it would have been on-premises.

The first thing you need is the above knowledge. That really helps with everything else.

Next, make sure that the problem really is your routing and not your NSGs!

Next is the NIC, if you are dealing with virtual machines. Go to Effective Routes and look at what is listed, what is active and what is not.

Network Watcher has a couple of tools you should also look at:

  • Next Hop: This is a pretty simple tool that tells you the next “appliance” that will process packets on the journey to your destination, based on the actual routing discovered.
  • Connection Troubleshoot: You can send a packet from a source (VM NIC or Application Gateway) to a certain destination. The results will map the path taken and the result.

The tools won’t tell you why a routing plan failed, but with the above information, you can troubleshoot a (desired) network path.

Locking Down Network Access to the Azure Application Gateway/Firewall

In this post, I will explain how you can use a Network Security Group (NSG) to completely lock down network access to the subnet that contains an Azure Web Application Gateway (WAG)/Web Application Firewall (WAF).

The steps are as follows:

  1. Deploy a WAG/WAF to a dedicated subnet.
  2. Create a Network Security Group (NSG) for the subnet.
  3. Associate the NSG with the subnet.
  4. Create an inbound rule to allow TCP 65503-65534 from the Internet service tag to the CIDR address of the WAG/WAF subnet.
  5. Create rules to allow application traffic, such as TCP 443 or TCP 80, from your sources to the CIDR address of the WAG/WAF subnet.
  6. Create a low priority (4000) rule to allow any protocol/port from the AzureLoadBalancer service tag to the CIDR address of the WAG/WAF subnet.
  7. Create a rule, with the lowest priority (4096) to Deny All from Any source.

The Scenario

It is easy to stand up a WAG/WAF in Azure and get it up and running. But in the real world, you should lock down network access. In the world of Azure, all network security begins with an NSG. When you deploy WAG/WAF in the real world, you should create an NSG for the WAG/WAF subnet and restrict the traffic to that subnet to what is just required for:

  • Health monitoring of the WAG/WAF
  • Application access from the authorised sources
  • Load balancing of the WAG/WAF instances

Everything else inbound will be blocked.

The NSG

Good NSG practice is as follows:

  1. Tiers of services are placed into their own subnet. Good news – the WAG/WAF requires a dedicated subnet.
  2. You should create an NSG just for the subnet – name the NSG after the VNet-Subnet, and maybe add a prefix or suffix of NSG to the name.

Health Monitoring

Azure will need to communicate with the WAG/WAF to determine the health of the backends – I know that this sounds weird, but it is what it is.

Note: You can view the health of your backend pool by opening the WAG/WAF and browsing to Monitoring > Backend Health. Each backend pool member will be listed here. If you have configured the NSG correctly then the pool member status should be “Healthy”, assuming that they are actually healthy. Otherwise, you will get a warning saying:

Unable to retrieve health status data. Check presence of NSG/UDR blocking access to ports 65503-65534 from Internet to Application Gateway.

OK – so you need to open those ports from “Internet”. Two questions arise:

  • Is this secure? Yes – Microsoft states that these ports “are protected (locked down) by Azure certificates. Without proper certificates, external entities, including the customers of those gateways, will not be able to initiate any changes on those endpoints”.
  • What if my WAG/WAF is internal and does not have a public IP address? You will still do this – remember that “Internet” is everything outside the virtual network and peered virtual networks. Azure will communicate with the WAG/WAF via the Azure fabric and you need to allow this communication that comes from an external source.

In my example, my WAF subnet CIDR is 10.0.2.0/24, so the rule allows TCP 65503-65534 from the Internet service tag to 10.0.2.0/24.

Application Traffic

Next, I need to allow application traffic. Remember that the NSG operates at the TCP/UDP level and has no idea of URLs – that’s the job of the WAG/WAF. I will use the NSG to define what TCP ports I am allowing into the WAG/WAF (such as TCP 443) and from what sources.

In my example, the WAF is for internal usage. Clients will connect to applications over a VPN/ExpressRoute connection, so a sample rule would allow TCP 443 from the on-premises address ranges to the CIDR address of the WAG/WAF subnet.

If this was an Internet-facing WAG or WAF, then the source service tag would be Internet. If other services in Azure need to connect to this WAG or WAF, then I would allow traffic from either Virtual Network or specific source CIDRs/addresses.

The Azure Load Balancer

To be honest, this one caught me out until I reasoned what the cause was. My next rule will deny all other traffic to the WAG/WAF subnet. Without this load balancer rule, the client could not connect to the WAG/WAF. That puzzled me, and searches led me nowhere useful. And then I realized:

  • A WAG/WAF is 1+ instances (2+ in v2), each consuming IP addresses in the subnet.
  • They are presented to clients as a single IP.
  • That single IP must be a load balancer
  • That load balancer needs to probe the load balancer’s own backend pool – which are the instance(s) of the WAG/WAF in this case

You might ask: isn’t there a default rule to allow a load balancer probe? Yes, it has priority 65001. But we will be putting in a rule at 4096 to prevent all connections, overriding the 65000 rule that allows everything from VirtualNetwork – which includes all subnets in the virtual network and all peered virtual networks.

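The rule is simple enough: allow any protocol/port from the AzureLoadBalancer service tag to the CIDR address of the WAG/WAF subnet, with a low priority of 4000.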

Deny Everything Else

Now we will override the default NSG rules that allow all communications to the subnet from other subnets in the same VNet or peered VNets. This rule should have the lowest possible user-defined priority, which is 4096, and it denies any protocol/port from any source.

Why am I using the lowest possible priority? This is classic good firewall rule practice. General rules should be low priority, and specific rules should be high priority. The more general, the lower. The more specific, the higher. The most general rule we have in firewalls is “block everything we don’t allow”; in other words, we are creating a white list of exceptions with the previously mentioned rules.
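Putting the four rules together, the ordering can be checked with a small first-match sketch. This is only a model for reasoning about the rule set: sources are treated as opaque labels rather than resolved service tags, "OnPremises" is a made-up label for my VPN/ExpressRoute clients (it is not a real service tag), and the priorities 100 and 200 are assumptions. Every rule targets the WAG/WAF subnet, so the destination is left out.

```python
# (priority, name, source label, protocol, ports, action)
# ports is None for "any port"; protocol "*" means any protocol.
RULES = [
    (100, "AllowHealthMonitoring", "Internet", "TCP", range(65503, 65535), "Allow"),
    (200, "AllowHttpsFromClients", "OnPremises", "TCP", [443], "Allow"),
    (4000, "AllowAzureLoadBalancer", "AzureLoadBalancer", "*", None, "Allow"),
    (4096, "DenyAll", "*", "*", None, "Deny"),
]


def first_match(source, protocol, port, rules=RULES):
    """Return (rule name, action) for the first rule, by priority, that matches."""
    for _priority, name, rule_source, rule_protocol, ports, action in sorted(
        rules, key=lambda rule: rule[0]
    ):
        if rule_source not in ("*", source):
            continue
        if rule_protocol not in ("*", protocol):
            continue
        if ports is not None and port not in ports:
            continue
        return name, action
    # No user-defined rule matched; the built-in default rules would apply next.
    return None
```

With this ordering, the health monitoring ports, the client HTTPS traffic, and the load balancer probes each hit their own allow rule. Anything else, such as SSH from another subnet in the virtual network, lands on DenyAll at 4096 before the default rules are ever reached.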

The Results

You should end up with:

  • The health monitoring rule will allow Azure to check your WAG/WAF over a certificate-secured channel.
  • Your application rules will permit specified clients to connect to the WAG/WAF, via a hidden load balancer.
  • The load balancer can probe the WAG/WAF and forward client connections.
  • The low priority deny rule will block all other communications.

Job done!