There Is More To Azure Networking Than Connectivity & Security

This post explains how a well-designed, secured, governed, and managed network design plays a foundational role in digital transformation and cloud enablement.

Cloud Adoption Versus Cloud Migration

What? Aidan – I thought this was a post about Azure networking!

Yes, it is … but you’ll have to join me on this journey. Lately, I’ve been using the “we need to step back and think about why we’re doing any of this” line quite a bit. The context of that line changes, but the message remains consistent.

Why did we go to The Cloud (Azure in our case)? For many, the reason is something like “I was told to”, “we were leaving our old hosting company”, or “our hardware support ended”. Those reasons triggered what I call a cloud migration project. I’ve done a LOT of those projects – thanks to scope limitations in the engagement, forced either by poorly advised customers (leading to restricted tenders) or by salespeople who refused to have a larger conversation.

Many organisations with internal developers that do a cloud migration end up in the same situation 18-24 months later: developers refuse to deploy into “IT’s cloud”. This is because IT has recreated its old data centre in Azure, along with the restrictions, controls, and lack of trust. We were told that “cloud is how you work, not where you work”, but not many people heard that message. We end up with situations where businesses have paid for Azure, but developers don’t get the Cloud; they get IT-driven and IT-restricted virtualisation in Azure.

Cloud Adoption is a change journey, as documented by the Cloud Adoption Framework. We are supposed to:

  1. Understand why the business (not IT) wants to use the Cloud
  2. Create a cloud strategy for the organisation
  3. Define and enable a new way of delivering cross-functional digital services
  4. Do all the other technical stuff that we focus on, with the architecture based on the above

Steps 1 and 2 (the CAF Strategy and Plan phases) are the keys to cloud adoption success. In theory, if we do everything correctly:

  1. The developers want to adopt the new cloud environment because it enables their mission.
  2. The business sees a return on the investment with faster innovation of digital services.

Where Does Networking Come Into This?

Pretty much every customer I’ve dealt with wants to improve their security for business protection or to meet compliance requirements. That typically results in heavier use of Virtual Networks. Many customers end up recreating their data centre networks in Azure; they create 1 Virtual Network (spoke) for each VLAN:

  • DMZ
  • Regular Zone
  • Secure Zone

Or maybe they have:

  • Dev
  • Test
  • Production

Each of these networks shares various traits:

  • A big virtual network with many subnets
  • Managed by the central IT infrastructure team

I can go into all the security and complexity flaws that result from this too-common design pattern. But my focus is on cloud adoption in this post:

  • Developers are actively prevented from having network access/control. They rely on helpdesk tickets to get anything done – what happened to the essential cloud trait of “on-demand self-service”?
  • Subscriptions are filled with dozens of resource groups. Access is granted at per-resource-group granularity, which complicates and slows things down.
  • The desire for more security is gradually eroded due to operational complexity and constant delegation of rights with complicated granularity.

So, believe it or not, Azure networking is our canary in the coal mine. I have used, and continue to use, this reliable little bird to sniff out operational/security failures in customers’ Azure environments.

Now you know how I can detect adoption problems from the floor up. Next, I want to explain how I can architect the Azure network to solve these issues.

Landing Zones

Let’s bend some minds. 8-ish years ago, I started working on a new “standard design” for my employer (a consulting company) with a fellow principal consultant. We both came to the table with an alternative to the usual subscription strategy. The norm was that each of the above traditional spoke VNets would be aligned with its own subscription. That results in very few subscriptions, with demands for complicated role delegations, tagging, cost management, and so on. We switched to a 1 subscription/workload (application/service) approach; this new level of granularity:

  • Requires just 1 small Virtual Network, where networking is required at all
  • Role delegations for developers/operators are done once per subscription
  • Cost management is done per subscription (Budgets) with much less tagging for metadata
  • Operations are easier, with fewer mistakes, because subscription selection in the Azure Portal/PowerShell/CLI/etc. shows only the resource groups related to that one workload
  • The security boundary is much smaller: the access boundary is the single workload, and any VNet-based workload must route via the hub firewall to reach any other workload, subject to rules and IDPS inspection

Microsoft introduced the concept of landing zones a few years ago; it uses the same subscription/workload approach:

  • Platform landing zone: A subscription that offers shared infrastructure, such as a hub, a shared Application Gateway/WAF, Active Directory Domain Controllers, DNS, etc.
  • Application landing zone: A subscription that hosts a single application/service/workload.

As with my approach, each landing zone has a Virtual Network (if required) that is:

  • Sized according to the workload architecture with some spare capacity.
  • Peered with the hub, with the egress path from the workload being via the hub firewall.
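
Sizing that workload VNet is mostly arithmetic: each planned subnet needs its host count plus the 5 addresses Azure reserves in every subnet, rounded up to a power of two. A rough sketch of that arithmetic in Python – the tier names and NIC counts are invented for illustration:

```python
import math

# Azure reserves 5 addresses in every subnet: the network address,
# the default gateway, two Azure DNS addresses, and broadcast.
AZURE_RESERVED = 5

def smallest_subnet_prefix(hosts_needed: int) -> int:
    """Longest prefix (smallest subnet) that fits the hosts plus
    Azure's reserved addresses. Azure's smallest usable subnet is /29."""
    total = hosts_needed + AZURE_RESERVED
    prefix = 32 - math.ceil(math.log2(total))
    return min(prefix, 29)

# A hypothetical 3-tier workload; NIC counts per tier are invented.
tiers = {"web": 4, "app": 6, "data": 3}
for name, nics in tiers.items():
    print(f"{name}: /{smallest_subnet_prefix(nics)}")
```

Sum the subnet sizes, add the spare capacity you want, and round up to the next power of two to get the VNet prefix.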

Security & Governance

Let’s consider some things:

  • The business requires governance to manage IT and to ensure regulatory compliance.
  • IT security must protect the business, customers, vendors, etc.
  • We have many workloads/subscriptions.

We cannot have 1 policy for everything – sometimes we have business/operational reasons for stricter or more relaxed policies. For example, we might require more Defender for Cloud features in some workloads or allow PaaS public endpoints in others.

Microsoft gave us Enterprise Scale around 5 years ago. This reference architecture (with supplied templated deployments) offers a subscription categorisation approach using Management Groups:

  • Corporate: Workloads that can connect to other networks.
  • Online: Workloads that have an online presence and should not connect to other workloads.

Azure Policy is used to enforce the standards for each Management Group.

I don’t know about you, but I have never seen such a binary requirement in the real world. I’ve seen many people discuss/use a third Management Group called Hybrid; they wonder how to build the policies to enforce the requirements.

In the real world, just about everything is shades of grey when it comes to connectivity. I’ve had ultra-secure workloads with web interfaces. I’ve had low-end workloads with high security. And I can guarantee you that sensitive workloads have compelling business reasons to be both online and integrated with traditional private-protocol connectivity.

I thought about this last year and came up with a different approach. We can use CAF’s operational methodologies to develop a tiered, documented, and implemented policy that aligns with the organisation’s governance, security, and management requirements. I suggested that we would have three tiers (names are irrelevant):

  • Gold: The strictest policies
  • Silver: Medium-level policies, containing the most workloads
  • Bronze: The most relaxed policies

The result is 3 Management Groups (above), each with Azure Policy automatically auditing/enforcing the designed and continuously improved requirements.

The new (CAF Plan) operational model would introduce a step to categorise the workload based on security risks, governance requirements, and management needs. Each workload would be placed in the correct Management Group with policies.

The policies give us automation and guardrails. For example, where appropriate, we can:

  • Restrict regions
  • Ban public IP association with NICs
  • Disable public endpoints
  • Enable Defender for Cloud plans
  • Enable VNet Flow Logs
  • Configure diagnostic settings
  • And much more
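
To make the tiers concrete, here is a toy Python model of the idea: each tier is a set of guardrails, and a workload is checked against its assigned tier. The tier names, rule keys, and regions are invented for illustration – real enforcement is done by Azure Policy assignments at each Management Group:

```python
# Toy model of tiered guardrails. Tier names, rule keys, and regions
# are invented; real enforcement is Azure Policy per Management Group.
TIERS = {
    "gold":   {"regions": {"northeurope"}, "public_endpoints": False},
    "silver": {"regions": {"northeurope", "westeurope"}, "public_endpoints": False},
    "bronze": {"regions": {"northeurope", "westeurope"}, "public_endpoints": True},
}

def violations(workload: dict, tier: str) -> list:
    """Return the guardrails that a workload breaks in its assigned tier."""
    rules = TIERS[tier]
    found = []
    if workload["region"] not in rules["regions"]:
        found.append("region not allowed")
    if workload["public_endpoints"] and not rules["public_endpoints"]:
        found.append("public endpoints banned")
    return found

workload = {"region": "westeurope", "public_endpoints": True}
print(violations(workload, "gold"))    # breaks both guardrails
print(violations(workload, "bronze"))  # compliant in the relaxed tier
```

The categorisation step in the operational model is then just choosing which key of `TIERS` the workload lands under.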

The key to this is momentum. My approach is “minimum viable product” (MVP). For example, I had a 30-minute call with a customer last year and designed their starter policies. Now they (should) run regular reviews to assess the policies/risks/requirements and expand the policies/implementations. We didn’t freeze for 2 years to build a policy. We got some essentials in place and we carried on with getting results for the business.

Now, let’s get back to networking!

At-Scale Network Configuration And Enforcement

Developers, operators, and (rival) service providers are empowered to build in the Azure environment with a new guardrail-protected landing zone approach. How do we ensure that their Virtual Networks are built correctly?

We can use Azure Virtual Network Manager (AVNM).

Note that the horrid per-subscription pricing for AVNM was replaced a long time ago. Please go back and reassess the pricing before you run away.

AVNM gives us policy-driven:

  • Discovery and grouping of Virtual Networks for granular policy assignments
  • Peering with a hub and mesh capabilities
  • Route Table deployment/association with User-Defined Routes (UDRs)
  • Security Admin Rules that are processed before NSG rules with override capabilities
  • IP Address Management (IPAM) to provide approved, non-repeating IP prefixes for new networks and to manage their lifecycle

In short, if you deploy a VNet, I can:

  • Get an approved IP prefix for the Virtual Network
  • Use Azure Policy to automatically configure/enforce things like VNet Flow Logs and DNS settings
  • Use AVNM to correctly connect, route, and secure your VNet
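
The IPAM part of that flow is essentially bookkeeping: hand out a child prefix from an approved pool that does not overlap any previous allocation. A minimal sketch of that logic using Python’s `ipaddress` module – the pool and prefix sizes are arbitrary examples:

```python
import ipaddress

class PrefixPool:
    """Hands out non-overlapping child prefixes from an approved pool,
    mimicking the bookkeeping that AVNM IPAM does for new VNets."""

    def __init__(self, pool):
        self.pool = ipaddress.ip_network(pool)
        self.allocated = []

    def allocate(self, prefix_len):
        # First candidate of the requested size that overlaps nothing.
        for candidate in self.pool.subnets(new_prefix=prefix_len):
            if not any(candidate.overlaps(used) for used in self.allocated):
                self.allocated.append(candidate)
                return candidate
        raise ValueError("pool exhausted")

pool = PrefixPool("10.100.0.0/16")
print(pool.allocate(24))  # 10.100.0.0/24
print(pool.allocate(23))  # 10.100.2.0/23 – skips past the allocated /24
```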

To quote Van Halen: “they got you coming in, and they got you going out”. I always did prefer “Van Hagar” 🙂

Summary

A legacy, cable-oriented, on-prem network in Azure indicates that the organisation has not modernised how digital services are created, operated, and delivered to the business. In short, the business is paying for the cloud but is getting remotely hosted Hyper-V.

We can enable modern collaborative working processes by modernising our designs. Using application landing zones will create a new form of granularity for all aspects of infrastructure, security, governance, and management. We can use the governance features to create the guardrails and some of the automations. We can use Azure Virtual Network Manager (AVNM) to ensure a good Virtual Network deployment.

If You Want To Learn More

Contact me via my consulting company, Cloud Mechanix, if you would like to learn how I can help you with this design pattern.

When To Add Subnets To An Azure Virtual Network

In this post, I want to explain the real reasons to add subnets to an Azure virtual network. This post is born out of frustration. I’ve seen post after post on social media, particularly on LinkedIn, where the poster has “Azure expert” in their description and is sharing advice from the year 2002 for cable-oriented (on-prem) networks.

The BS Advice

Consider the scenario below:

The above diagram shows the commonly advised Virtual Network architecture for a 3-tier web app. The poster will say:

Each tier should have its own subnet for security reasons. Each subnet will have an NSG.

So if we have web servers, app servers, and database servers, the logic is that the subnet + NSG combination provides security. The poster is half right:

  • The NSG does micro-segmentation of the machines.
  • The subnets do nothing.

Back To The Basics … Again

I want you to do this:

  1. Build a VNet with 2 subnets.
  2. Build 2 VMs, each attached to a different subnet.
  3. Log into one of the VMs.
  4. Run tracert to the second VM.

What will you see? The next and only hop is the second VM.

Ping the default gateway. What happens? Timeouts. The default gateway does not exist.

This is easily explained: Virtual Networks do not exist. Subnets do not exist.

Think of a Virtual Network as a Venn diagram. Our two virtual machines are in the same circle. That is an instruction to the Azure fabric to say:

These machines are permitted to route to each other

That’s how Coca-Cola and PepsiCo could both have Virtual Networks with overlapping address spaces in the same Azure room and not be able to talk to each other.

Note: This is VXLAN functionality, implemented through the Hyper-V switch extension capability that was introduced in Windows Server 2012.

Simple Example

Let us fix that simple example. We will first understand that NSGs offer segmentation. No matter how I associate an NSG, the rules are always applied on the host virtual switch port (in Hyper-V, that’s on the NIC). If a rule says “no” then that packet is automatically dropped. If a rule says “yes”, then the packet is permitted.

In the diagram below, we accept that subnets play no role in security segmentation. We have flattened the network to a single subnet. There is a single web server, app server, and database server – we will add complexity later:

This network is much simpler, right? And it offers no less security than the needlessly more complicated first example. An NSG is associated with the subnet. NSG rules allow the required traffic, and a low-priority rule denies all other traffic. Only the permitted traffic can enter any specific NIC.
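
To see why the flattened design loses nothing, it helps to model how NSG rules are processed: in priority order, first match wins, with a low-priority deny catching everything else. A toy evaluator in Python – the rules, tier names, and ports are invented for this 3-tier example:

```python
# First-match-wins evaluation, the way an NSG processes rules at each
# NIC. Priorities, tier names, and ports are invented for the example.
RULES = [  # (priority, source, destination, port, action)
    (100, "Internet", "web", 443, "Allow"),
    (110, "web", "app", 8080, "Allow"),
    (120, "app", "db", 1433, "Allow"),
    (4096, "*", "*", "*", "Deny"),  # low-priority deny-all
]

def evaluate(src, dst, port):
    for _prio, r_src, r_dst, r_port, action in sorted(RULES):
        if r_src in ("*", src) and r_dst in ("*", dst) and r_port in ("*", port):
            return action  # lowest matching priority wins
    return "Deny"

print(evaluate("web", "app", 8080))      # Allow
print(evaluate("Internet", "db", 1433))  # Deny
```

Because evaluation happens at every NIC, the database tier is unreachable from the Internet even though all three tiers share one subnet.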

I’ve seen arguments that this will create complicated rules. Pah! I’ve built/migrated more apps than I care to remember. The rules for these apps are hardly ever that numerous.

Aidan, what if I am going to run a highly available application? Lucky for you if the code supports that (seriously!). Whether you’re using availability sets or availability zones (lucky you, these days), we will make a tiny design change.

We will create a (free) Application Security Group (ASG) for each tier. We will then use the ASG as the source and destination instead of the VM IP addresses.

Aidan, what if I’m going to use Virtual Machine Scale Sets (VMSS)? It’s no different: you add the ASG for the tier to the networking properties of the VMSS. Each created VMSS instance will automatically be associated with the ASG.
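
The reason ASGs work so well with autoscaling is that rules reference group membership rather than addresses. A small Python sketch of that indirection – the group names and IP addresses are invented:

```python
# NSG rules reference ASG names; membership resolves to NICs at runtime.
# Group names and IP addresses are invented for the example.
asg_members = {
    "asg-web": {"10.0.0.4", "10.0.0.5"},
    "asg-app": {"10.0.0.6"},
}

# One rule: the web tier may reach the app tier.
rules = [("asg-web", "asg-app")]

def allowed(src_ip, dst_ip):
    return any(src_ip in asg_members[src] and dst_ip in asg_members[dst]
               for src, dst in rules)

print(allowed("10.0.0.5", "10.0.0.6"))  # True

# A VMSS scale-out joins the new instance to asg-web; no rule changes.
asg_members["asg-web"].add("10.0.0.7")
print(allowed("10.0.0.7", "10.0.0.6"))  # True
```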

When Should I Add Subnets?

There are several reasons why you should add subnets. I’ll list them first before I demonstrate them:

  • Azure requires it
  • Unique routing
  • Remote network sources
  • Scaling

Azure Requires It

There are scenarios when Azure requires a dedicated subnet. Some that I can immediately think of are:

  • Virtual Network Gateway
  • Azure Route Server
  • SQL Managed Instance (MI)
  • App Service Regional Virtual Network Integration
  • App Service Environment (ASE – App Service Isolated Tier) VNet injection

Let’s PaaS-ify the developers (see what I did there :D) and move from VMs to PaaS. We will replace the web servers with App Services and the database with SQL MI:

  • The web servers ran two apps, Web and API. Each will have a Private Endpoint for ingress traffic. The Private Endpoints can remain in the GeneralSubnet.
  • The web servers must talk to the app servers (still VMs) over the VNet, so they will get Regional VNet Integration via the App Service Plan. This will require a dedicated subnet for egress only. This subnet will have no ingress.
  • SQL MI requires a dedicated subnet.

Unique Routing

Next-hop routing is always executed by the source Azure NIC. Every subnet has a collective set of routes to destination prefixes (network addresses). Those routes are propagated to the NICs in that subnet (subnets do not exist). The NICs decide the next hop for each packet, and the Azure network fabric sends the packet (by VXLAN) directly to the NIC of the next hop.

There may be a situation where you want to customise routing.

For the sake of consistency, I’m going to use our web app, but in a little wonky way. My app is expanding. Some more VMs are being added for custom processing. Those VMs are being added to the GeneralSubnet.

My wonky scenario is that the security team have decided that traffic from the App Servers to the SQL VM must go through a firewall – that implies that the return traffic must also go through the firewall. No other traffic inside the app needs to go through the firewall. The firewall is deployed in a peered hub.

That means that I must split the GeneralSubnet into two source routing domains (subnets):

  1. GeneralSubnet: Containing the Private Endpoint NICs and my new custom processing VMs.
  2. AppServerSubnet: Containing only the app server VMs.

We will implement the desired via-firewall routing using User-Defined Routes in Route Tables:

  • AppServerSubnet: Go via the hub firewall to get to SqlMiSubnet.
  • SqlMiSubnet: Go via the hub firewall to get to AppServerSubnet.
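
Route selection itself is longest-prefix match: the most specific route containing the destination wins, which is why a /24 UDR can pull SQL-bound traffic through the firewall while everything else stays direct. A sketch using Python’s `ipaddress` module – the prefixes and the firewall IP are invented, and note that in real Azure a UDR also overrides a system route of the same prefix:

```python
import ipaddress

# Effective routes for AppServerSubnet: system routes plus a UDR that
# sends SQL MI-bound traffic via the hub firewall. The prefixes and
# the firewall's IP address are invented for the example.
routes = [
    ("10.1.0.0/16", "VnetLocal"),  # system route: the whole VNet
    ("0.0.0.0/0", "Internet"),     # system default route
    ("10.1.4.0/24", "10.0.1.4"),   # UDR: SqlMiSubnet via the firewall
]

def next_hop(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    matches = [(ipaddress.ip_network(prefix), hop)
               for prefix, hop in routes
               if dst in ipaddress.ip_network(prefix)]
    # Longest prefix match: the most specific route wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("10.1.4.10"))  # 10.0.1.4 – hairpins through the firewall
print(next_hop("10.1.2.10"))  # VnetLocal – direct to the destination NIC
```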

Remote Network Sources

So far, we have been using Application Security Groups (ASGs) to abstract IP addresses in our NSGs. ASGs are great, but they have restrictions:

  • Firewalls, including Azure Firewall, have no idea what an ASG is. You will have to use IP addresses as the sources in the firewall rules – possibly abstracted as IP Groups (Azure Firewall) or similar in third-party firewalls.
  • ASGs can only be used inside their parent subscription. You’re not going to be able to use them as sources in other workloads if you follow the subscription/workload approach of application landing zones.

Using an IP address (or addresses) as a source is OK if the workload does not autoscale. But what happens if your app tier/role autoscales and its addresses are mixed with addresses from other tiers/roles that should not have access to a remote resource?

There is only one way to solve this: break the source resource(s) out into their own subnet. I recently saw this with a multi-subscription workload that was going to include an Azure-hosted DevOps agent pool. Originally, the autoscaling pool was going to share a subnet with other VMs. However, HTTPS access to the other resources had to be granted to the DevOps pool only, and I couldn’t do that while the pool remained in a shared subnet. I split the pool into its own subnet and was able to use that subnet’s prefix as the source in the various firewall/NSG rules.

Scaling

There are two scaling scenarios that I can think of at the moment. In the first, a workload component autoscales, and that role/tier requires a large number of IPs that you want to dedicate to it. In this case, yes, you may dedicate a subnet to that role/tier.

The second scenario is that you have followed the good practice of deploying a relatively small VNet for your workload, with some spare capacity for additional subnets. However, the scope of the workload has changed significantly and the spare capacity will not be enough. You expand the VNet by adding a second IP prefix, and the new IP capacity is consumed by creating additional subnets from the new prefix.
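
The expansion scenario is easy to model with Python’s `ipaddress` module: once the original prefix is fully carved into subnets, new subnets can only come from the added prefix. All prefixes here are invented for illustration:

```python
import ipaddress

# The workload VNet started as a /24, fully carved into two /25 subnets.
# All prefixes are invented for the example.
vnet_prefixes = [ipaddress.ip_network("10.2.0.0/24")]
subnets = [ipaddress.ip_network("10.2.0.0/25"),
           ipaddress.ip_network("10.2.0.128/25")]

def free_subnet(prefix_len):
    """First unused child prefix of the requested length, if any."""
    for vnet_prefix in vnet_prefixes:
        if prefix_len <= vnet_prefix.prefixlen:
            continue
        for candidate in vnet_prefix.subnets(new_prefix=prefix_len):
            if not any(candidate.overlaps(s) for s in subnets):
                return candidate
    return None

print(free_subnet(26))  # None – no spare capacity left

vnet_prefixes.append(ipaddress.ip_network("10.2.1.0/24"))  # expand the VNet
print(free_subnet(26))  # 10.2.1.0/26 – new subnets come from the new prefix
```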

In Summary

Every diagram for a new VNet in Azure should start very simply: 1 subnet. Do not follow the overly-simple advice from “Azure expert” LinkedIn posts that say “create a subnet for every tier to create a security boundary”. You absolutely do not need to do that! Most workloads, even in large enterprises, are incredibly simple, and one subnet will suffice. You should only add subnets when the need requires it, as documented above.

Designing An Azure Hub Virtual Network

In this post, I am going to share a process for designing a hub virtual network for a hub & spoke secured virtual network deployment in Microsoft Azure.

The process I lay out in this post will not work for everyone. However, based on experience, I think that very few organisations will find exceptions to it.

What Is And Is Not In This Post

This post is going to focus on the process of designing a hub virtual network. You will not find a design here … that will come in a later post.

You will also not find any mention of Azure Virtual WAN. You DO NOT need to use Azure Virtual WAN to do SD-WAN, despite the claptrap in Microsoft documentation on this topic. Virtual WAN also:

  • Restricts your options on architecture, features, and network design.
  • Is a nightmare to troubleshoot because the underlying virtual network is hidden in a Microsoft tenant.

Rules Of Engagement

The hub will be your network core in a network stamp: a hub & spoke. The hub & spoke will contain networks in a single region, following these concepts:

  • Resilience & independence: Workloads in a spoke in North Europe should not depend on a hub in West Europe.
  • Micro-segmentation: Workloads in North Europe trying to access workloads in West Europe should go through a secure route via hubs in each region.
  • Performance: Workload A in North Europe should not go through a hub in West Europe to reach Workload B in North Europe.
  • Cost Management: Minimise global VNet peering to just what is necessary. Enable costs of hubs to be split into different parts of the organisation.
  • Delegation of Duty: If there are different network teams, enable each team to manage their hubs.
  • Minimised Resources: The hub has roles only of transit, connectivity, and security. Do not place compute or other resources into the hub; this is to minimise security/networking complexity and increase predictability.

A Hub Design Process

The core of our Azure network will have very little in the way of resources. What can be (not “must be”) included in that hub can be thought of as functions:

  • Site-to-site networking: VPN, ExpressRoute, and SD-WAN.
  • Point-to-site VPN: Enabling individuals to connect to the Azure networks using a VPN client on their device.
  • Firewall: Providing security for ingress, egress, and inter-workload communications.
  • Virtual Machines: Reduce costs of secured RDP/SSH by deploying Azure Bastion in the hub.

If we are doing a high-level design, we have two questions that we will ask about each of these functions:

  • Is the function required?
  • What technology will be used?

We won’t get into tiers/SKUs, features, or configurations just yet; that’s when we get into low-level or detailed design.

One can use the following flow chart to figure out what to use – it’s a bit of an eye test so you might need to open the image in another tab:

Site-to-Site (S2S) Networking

While it is very commonly used, not every organisation requires site-to-site connectivity to Azure.

For example, I had a migration customer that was (correctly) modernising to the “top tier” of cloud computing by migrating from legacy apps to SaaS. They wanted to re-implement an SD-WAN for over 100 offices to connect their new and small Azure footprint. I was the lead designer, so I knew their connectivity requirements – they were going to use Azure Virtual Desktop (AVD) only to connect to their remaining legacy apps. AVD doesn’t need a site-to-site connection. I was able to save that organisation from entering into a costly managed SD-WAN services contract and instead focus on Internet connectivity – not long after, they shut down their Azure footprint when SaaS alternatives were found for the last legacy applications.

If we establish that site-to-site connectivity is required then we must ask the first question:

Are latency and SLA important?

If the answer to either of these questions is “yes” then there is no choice: an ExpressRoute Virtual Network Gateway is required.

If the answer is no, then we are looking at some kind of VPN connectivity. We can ask another question to determine the type of solution:

Will there be a small number of VPN connections?

If a small number of VPN connections is required, the Azure VPN Virtual Network Gateway is suitable – consider the SKUs/sizes and complexities of management to determine what “a small number” is.

If you determine that the VPN Virtual Network Gateway is unsuitable, then an SD-WAN network virtual appliance (NVA) should be used. Note that it is recommended to deploy Azure Route Server with a third-party VPN/SD-WAN appliance to enable propagation of network prefixes:

  • Azure > SD-WAN
  • SD-WAN > Azure

You may find that you need one or more of the above solutions! For example:

  • Some ExpressRoute customers may opt to deploy a parallel VPN tunnel with an identical routing configuration over a completely different ISP. This enables automatic failover from ExpressRoute to VPN in the event of a circuit failure.
  • An SD-WAN customer may also have ExpressRoute for some offices/workloads where SLA or latency are important. Another consideration may be that one workload has other technical requirements that only ExpressRoute (Direct) can service such as very high throughput.

You have one more question to ask after you have picked the site-to-site component(s):

Will you require site-to-site transit through Azure via the site-to-site network connections?

In other words, should Remote Site A be able to route to Remote Site B using your Azure site-to-site connections? If the answer is yes then you must deploy Azure Route Server to enable that routing.

Point-To-Site (P2S) VPN

I personally have not deployed very much of this solution but I do hear it being discussed quite a bit. Some organisations must enable users (or external suppliers) to create a VPN connection from their individual devices to Azure. If this is required then you must ask:

Is the scenario(s) simple?

I’ve kept that vague because the problem is vague. There are two solutions with one being overly-simplistic in capabilities and the other being more fully-featured.

The Azure VPN Gateway (also used for site-to-site VPN) offers a highly available (Azure resource) solution for P2S VPN. It offers different configurations for authentication and device support. But it is very limited. For example, it has no routing rules to restrict which users get access to which networks. This means that if you grant network (firewall/NSG) access to one user via the VPN address pool, you must grant the same access to all users, which is clearly pretty poor if you have many types/roles of remote VPN clients (IT, developer of workload X, developer of workload Y, Vendor A, Vendor B, etc.).

In such scenarios, one should consider a third-party NVA for point-to-site networking. Third-party NVAs may offer more features for P2S VPN than the VPN Virtual Network Gateway.

A P2S NVA may reside in the same hub as a VPN Virtual Network Gateway (and other S2S solutions).

It’s not in the diagram but you should also consider Entra Global Secure Access as an alternative to P2S VPN. The Private Network Connector would be deployed in a spoke(s), not the hub.

Firewall

Is a firewall required? The correct answer for anyone considering a hub & spoke architecture should be “of course it is”. But you might not like security, so we’ll ask that question anyway.

Once you determine that security is important to your employer, you must ask yourself:

Shall I use a native PaaS firewall?

The native PaaS solution in Azure is Azure Firewall. I have many technical reasons to prefer Azure Firewall over third-party alternatives. For consultants, a useful attribute of Azure Firewall is that you can skill up on one solution that you can implement/use/manage for many customers and projects (migrations) won’t face repeated delays as you wait on others to implement rules in third-party firewalls.

If you want to use a different firewall then you are free to do so.

If you are using Azure Firewall then there is a follow-up question if there will be S2S network connections:

Are the remote networks using non-RFC1918 address prefixes?

In other words, do the remote networks use address prefixes outside of:

  • 192.168.0.0/16
  • 172.16.0.0/12
  • 10.0.0.0/8

If they do, then Azure Firewall requires some configuration because traffic to non-RFC1918 prefixes is forced to the Internet by default – they are Internet addresses, after all! You can statically configure the prefixes if they do not change. Or …

  • If you are using Azure Route Server
  • The prefixes can change a lot thanks to scenarios such as acquisition or rapid growth

… you can (in preview today) configure integration between Azure Firewall and Azure Route Server so the firewall dynamically learns the address prefixes from the remote networks.
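
Checking whether remote prefixes fall inside RFC1918 space is simple to script, for example with Python’s `ipaddress` module (the example remote-site prefixes are invented):

```python
import ipaddress

# The three RFC1918 private ranges from the list above.
RFC1918 = [ipaddress.ip_network(p)
           for p in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(prefix):
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(block) for block in RFC1918)

# Invented remote-site prefixes: the second is CGNAT space, which the
# default behaviour described above would treat as Internet traffic.
print(is_rfc1918("172.20.8.0/24"))   # True
print(is_rfc1918("100.64.10.0/24"))  # False
```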

Virtual Machines

Do not put compute in the hub!

This scenario asks:

Will any of the workloads in your spoke virtual networks have virtual machines?

You will have virtual machines even if you “ban” virtual machines – I guarantee that they will eventually appear for things like security solutions, self-hosted agents, Azure Virtual Desktop, AKS, and so on.

Unfortunately, many consider secure remote access (SSH/RDP) to be opening a port in the firewall for TCP 22/3389. That is not considered secure because those protocols can be and have been attacked. In the past, those who took security seriously used a dedicated “jump box” or “bastion host” to isolate vulnerable on-premises machines from assets in the data centre. We can use the same process with Azure Bastion where there is no IaaS requirement – we leverage Entra security features to authenticate the connection request and the guest OS credentials to verify VM access.

One can deploy Bastion in a spoke – that is perfectly valid for some scenarios. However, many important features are only in the paid-for SKUs, so you might wish to deploy a shared Azure Bastion. Unfortunately, routing restrictions imposed by Bastion prevent deploying a shared Bastion in a spoke, so we have no choice but to deploy it in a hub. If you wish to share an Azure Bastion across workloads, then it will be the final component in the hub.

If/when Azure Bastion supports route tables in the AzureBastionSubnet I will recommend moving shared Bastion deployments to a spoke – yes, I know that we can do that with Azure Virtual WAN but there are many things that we cannot do with Azure Virtual WAN.

You could consider a third-party alternative or a DIY bastion solution. If so, place that into a spoke because it will be compute-based.

Wrapping Up

As you can see, the high-level design of the hub is very simple.

There are few functions in it because when you understand Azure virtual networks, routing, and NSGs, then you understand that designing a secure network should not be complex. Complexity is the natural predator of manageability and dependable security. There is a little more detail when we get into a low-level or detailed design, but that’s a topic for another day.

Micro-Segmentation Security In Azure Networks

In this post, I want to discuss the importance of designing and implementing micro-segmentation in Azure networks.

Repeating The Same Mistakes

In 2002-2003, the world was being hammered by malware. So much so, that Microsoft did a reset on their Windows development processes and effectively built a new version of Windows XP with Windows XP Service Pack 2. The main security feature of that release was the Windows Firewall – the purpose of this was to isolate each Windows machine in the network by default. It’s a pity that nearly every Windows admin then used Group Policy to disable the Windows Firewall!

Times have moved on and so have the bad guys. Malware isn’t just an anarchist or hobby activity. Malware is a billion-dollar business (ransomware/data theft) and a military activity. Naturally, defences have evolved … wait … no … most admins/consultants are still deploying networks that your Daddy/Mommy deployed 22 years ago, but I’ll deal with that in another post.

Instead, I want to discuss a part of the defensive solution: micro-segmentation.

Assume Penetration

We must assume that the attacker will always find a way in. Not every attack will be Sandra Bullock clicking some magical symbol on a website to penetrate the firewall. Most attacks have relatively simple vectors such as stealing a password, hash hijacking, or getting an accountant to open a PDF. Determined attackers aren’t just “driving by”; they will look for an entry. Maybe it’s malware in vendor software that you will deploy! Maybe it’s a vulnerability in open-source software that your developers will deploy via GitHub? Maybe a managed service provider’s Entra ID tenant has been penetrated and they have Lighthouse access to your Azure subscriptions? Each of those examples bypasses your firewall and any advanced scanning features that it may have. How do you stop them?

Micro-Segmentation

Let me conjure an image for you. A submarine is on patrol. It has a wartime mission. The submarine is always under orders to continue that mission. The submarine is detected by the enemy and is attacked. The attack causes damage which creates a flood. If left unchecked, the flood will sink the ship. What happens? The crew is trained to isolate the flood by sealing the leaking compartment – doors are slammed, seals are locked, and the water is contained in that compartment. Sure, the sailors and ship functions in that compartment are dead, but the ship can continue its mission.

That is a way to visualise micro-segmentation.

Microsoft Zero-Trust

Microsoft has a relatively small collection of documentation on zero-trust architecture for Azure. There are 3 useful bullet points:

  • Be ready to handle attacks before they happen.
  • Minimize the extent of the damage and how fast it spreads.
  • Increase the difficulty of compromising your cloud footprint.

Let’s expand on that a little.

Be Ready

You will be ready for an attack because you assume that you already are under attack. You don’t wait to deploy security systems and configurations; you design them with your workloads. You deploy security with your workloads. You maintain security with your workloads.

Increase The Difficulty of Compromising Your Cloud Footprint

You should put in the defences that are appropriate to your actual risks and your ability to install/manage them. A bad example is a medical organisation choosing a more affordable firewall to save a few bucks – this is exactly the sort of organisation that will be targeted.

Minimise The Extent of Damage

This can also be referred to as minimising the blast zone. You want to limit how much damage the bad guys cause, just like the submarine limited flooding to the damaged compartment. This means that we make it harder to get from any one point on the network to the next.

It’s one thing to put in the security defences, but you must also:

  • Enable/configure the security features: it shocks me how many organisations/consultants opt not to or don’t know how to enable essential features in their security solution.
  • Monitor your security systems: If we assume that the attacker will get in, then we should monitor our security features to detect and shut down the attack. Again, I’m shocked every time I see security features in Azure that have no logging or alerting enabled.

Microsoft lays out a path to zero-trust where step number one is network segmentation. The basic pattern is laid out:

Applications are partitioned to different Azure Virtual Networks (VNets) and connected using a hub-spoke model

Microsoft uses the term “application”. I prefer the term “workload”. Some, like ITIL, might use the term “service”. A workload is a collection of resources that work together to provide a service to or for the organisation. Maybe it’s a bunch of Azure resources that create a retail site. Maybe it’s a CRM system. Maybe it’s an identity management & governance workload.

The pattern that Microsoft is recommending is one that I have been promoting through my employer for the last 6 years. Each workload gets a dedicated “small” virtual network. The workload VNet is peered with a hub (and only the hub by default). The hub firewall provides isolation and deeper inspection than NSGs can offer.

Step 4 tells us:

Fully distributed ingress/egress cloud micro-perimeters and deeper micro-segmentation

NSGs micro-segment the single subnet or small set of subnets in the VNet, restricting resource-to-resource connections to just what is required. Isolation is now done centrally (by the firewall) and at the NIC (by NSGs). You should also consider network protections on PaaS resources such as Storage Accounts or Key Vaults.

If we revisit the submarine comparison, the workload-specific virtual network is one of the compartments in the boat. If there is a leak (an attack), the NSGs limit or slow down expansion in the subnet(s). The firewall isolates the workload/compartment from other workloads/compartments and the Internet by default to prevent command and control or downloads by the attacker. Deeper firewall inspection searches for attack patterns.

Don’t Forget Monitoring

Microsoft zero-trust has more than just networking. One other step I want to highlight is monitoring/alerting because it ties into the micro-segmentation features of networking. Consider the mechanisms we can put in place:

  • PaaS resource firewalls with logging
  • NSG with VNet Flow Logging
  • (Azure) Firewall with logging for firewall rules and deep inspection features (Azure Firewall has Threat Intelligence and IDPS).

Each of those barriers or detection systems can be thought of as a string with a bell on it. The attacker will tickle or trip over those strings. If the bell rings, we should be paying attention. When you fail to put in the barriers or configure monitoring then you don’t know that the attacker is there doing something – and we assume that the attacker will get in and do something – so aren’t we failing to do our job?

It’s Not Just Me Telling You

You can say “There goes Aidan, rattling on about micro-segmentation. Why should I listen to him?”. It would be one thing if it were just me sharing my opinion on Azure network security but what if others told you to do the same things?

Microsoft tells you to implement micro-segmentation. The US NSA tells you to do it. The Canadian Centre for Cyber Security tells you to do it. The UK NCSC tells you to do it. I could keep googling (binging, of course) national security agencies and I’d find the same recommendation with each result. If you are not implementing this security technique designed for today’s threats (not for the Blaster worm of 2003) then you are not only not doing your job but you are choosing to leave the door open for attackers; that could be viewed very poorly by employers, by shareholders, or by informed compliance auditors.

How Do Network Security Groups Work?

A Greek Phalanx, protected by a shield wall made up of many individuals working under 1 instruction as a unit – like an NSG.

Yesterday, I explained how packets travel in Azure networking while telling you Azure virtual networks do not exist. The purpose was to get readers closer to figuring out how to design good and secure Azure networks without falling into traps of myths and misbeliefs. The next topic I want to tackle is Network Security Groups – I want you to understand how NSGs work … and this will also include Admin Rules from Azure Virtual Network Manager (AVNM).

Port ACLs

In my previous post, Azure Virtual Networks Do Not Exist, I said that Azure was based on Hyper-V. Windows Server 2012 introduced loads of virtual networking features that would go on to become something bigger in Azure. One of them was a feature, mostly overlooked by customers at the time, called Port ACLs. I liked Port ACLs; the feature was mostly unknown, could only be managed using PowerShell, and made for great demo content in some TechEd/Ignite sessions that I did back in the day.

Remember: Everything in Azure is a virtual machine somewhere in Azure, even “serverless” functions.

The concept of Port ACLs was that they gave you a simple firewall feature controlled through the virtualisation platform – the virtual machine and the guest OS had no control and had to comply. You set up simple rules to allow or deny transport layer (TCP/UDP) traffic on specific ports. For example, I could block all traffic to a NIC by default with a low-priority inbound rule and introduce a high-priority inbound rule to allow TCP 443 (HTTPS). Now I had a web service that could receive HTTPS traffic only, no matter what the guest OS admin/dev/operator did.

Where are Port ACLs implemented? Obviously, it is somewhere in the virtualisation product, but the clue is in the name. Port ACLs are implemented by the virtual switch port. Remember that a virtual machine NIC connects to a virtual switch in the host. The virtual switch connects to the physical NIC in the host and the external physical network.

A virtual machine NIC connects to a virtual switch using a port. You probably know that a physical switch contains several ports with physical cables plugged into them. If a Port ACL is implemented by a switch port and a VM is moved to another host, then what happens to the Port ACL rules? The Hyper-V networking team played smart and implemented the switch port as a property of the NIC! That means that any Port ACL rules that are configured in the switch port move with the NIC and the VM from host to host.

NSG and Admin Rules Are Port ACLs

Along came Azure and the cloud needed a basic rules system. Network Security Groups (NSGs) were released and gave us a pretty interface to manage security at the transport layer; now we can allow or deny inbound or outbound traffic on TCP/UDP/ICMP/Any.

What technology did Azure use? Port ACLs of course. By the way, Azure Virtual Network Manager introduced a new form of basic allow/deny control that is processed before NSG rules called Admin Rules. I believe that this is also implemented using Port ACLs.

A Little About NSG Rules

This is a topic I want to dive deep into later, but let’s talk a little about NSG rules. We can implement inbound (allow or deny traffic coming in) or outbound (allow or deny traffic going out) rules.

A quick aside: I rarely use outbound NSG rules. I prefer using a combination of routing and a hub firewall (deny all by default) to control egress traffic.

When I create an NSG I can associate it with:

  • A NIC: Only that NIC is affected
  • A subnet: All NICs, including VNet-integrated PaaS resources and Private Endpoints, are affected

The association is simply a management scaling feature. When you associate an NSG with a subnet, the rules are not processed at the subnet.

Tip: virtual networks do not exist!

Associating an NSG resource with a subnet propagates the rules from the NSG to all NICs that are connected to that subnet. The processing is done by Port ACLs at the NIC.

This means:

  • Inbound rules prevent traffic from entering the virtual machine.
  • Outbound rules prevent traffic from leaving the virtual machine.

Which association should you choose? I advise you to use subnet association. You can see/manage the entire picture in one “interface” and have an easy-to-understand processing scenario.
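To illustrate subnet association, here is a minimal Bicep sketch (names and address prefixes are hypothetical) of an NSG whose rules propagate to every NIC in the subnet:

```bicep
param location string = resourceGroup().location

// Hypothetical NSG: allow HTTPS in, deny everything else
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-workload'
  location: location
  properties: {
    securityRules: [
      {
        name: 'Allow-HTTPS-Inbound'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'VirtualNetwork'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '443'
        }
      }
      {
        name: 'Deny-All-Inbound'
        properties: {
          priority: 4096 // lowest possible custom priority
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}

resource vnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'vnet-workload'
  location: location
  properties: {
    addressSpace: { addressPrefixes: [ '10.10.0.0/24' ] }
    subnets: [
      {
        name: 'WorkloadSubnet'
        properties: {
          addressPrefix: '10.10.0.0/25'
          // The association: rules are pushed to each NIC's Port ACLs
          networkSecurityGroup: { id: nsg.id }
        }
      }
    ]
  }
}
```

One NSG resource, one association, and every NIC in the subnet enforces the same rules – that’s the management scaling benefit.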

If you want to micro-manage and have an unpredictable future then go ahead and associate NSGs with each NIC.

If you hate yourself and everyone around you, then use both options at the same time:

  • The subnet NSG is processed first for inbound traffic.
  • The NIC NSG is processed first for outbound traffic.

Keep it simple, stupid (the KISS principle).

Micro-Segmentation

As one might grasp, we can use NSGs to micro-segment a subnet. No matter what the resources do, they cannot bypass the security intent of the NSG rules. That means we don’t need to have different subnets for security zones:

  • We zone using NSG rules.
  • Virtual networks and their subnets do not exist!

The only time we need to create additional subnets is when there are compatibility issues, such as NSG/route table association limitations, or when a PaaS resource requires a dedicated subnet.
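As a sketch of zoning within a single subnet, the following Bicep fragment (all names are hypothetical) uses application security groups (ASGs) so that NICs can be grouped into zones and rules written between zones – no extra subnets required:

```bicep
param location string = resourceGroup().location

// Hypothetical zones within one subnet
resource webAsg 'Microsoft.Network/applicationSecurityGroups@2023-09-01' = {
  name: 'asg-web'
  location: location
}

resource dbAsg 'Microsoft.Network/applicationSecurityGroups@2023-09-01' = {
  name: 'asg-db'
  location: location
}

// One NSG zones the whole subnet: web NICs may reach db NICs on 1433 only
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-zoned'
  location: location
  properties: {
    securityRules: [
      {
        name: 'Allow-Web-To-Db-SQL'
        properties: {
          priority: 200
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceApplicationSecurityGroups: [ { id: webAsg.id } ]
          sourcePortRange: '*'
          destinationApplicationSecurityGroups: [ { id: dbAsg.id } ]
          destinationPortRange: '1433'
        }
      }
      {
        name: 'Deny-All-Inbound'
        properties: {
          priority: 4096
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}
```

NICs are joined to an ASG via their IP configurations, so the zoning follows the NIC rather than an IP range.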

Watch out for more content shortly where I break some myths and hopefully simplify some of this stuff for you. And if I’m doing this right, you might start to look at some Azure networks (like I have) and wonder “Why the heck was that implemented that way?”.

Network Rules Versus Application Rules for Internal Traffic

This post is about using either Network Rules or Application Rules in Azure Firewall for internal traffic. I’m going to discuss a common scenario, a “problem” that can occur, and how you can deal with that problem.

The Rules Types

There are three kinds of rules in Azure Firewall:

  • DNAT Rules: Control traffic originating from the Internet, directed to a public IP address attached to Azure Firewall, and translated to a private IP Address/port in an Azure virtual network. This is implicitly applied as a Network Rule. I rarely use DNAT Rules – most modern applications are HTTP/S and enter the virtual network(s) via an Application Gateway/WAF.
  • Application Rules: Control traffic going to HTTP, HTTPS, or MSSQL (including Azure database services).
  • Network Rules: Control anything going anywhere.

The Scenario

We have an internal client, which could be:

  • On-premises over private networking
  • Connecting via point-to-site VPN
  • Connected to a virtual network, either the same as the Azure Firewall or to a peered virtual network.

The client needs to connect, with SSL authentication, to a server. The server is connected to another virtual network/subnet. The route to the server goes through the Azure Firewall. I’ll complicate things by saying that the server is a PaaS resource with a Private Endpoint – this doesn’t affect the core problem but it makes troubleshooting more difficult 🙂

NSG rules and firewall rules have been accounted for and created. The essential connection is either HTTPS or MSSQL and is implemented as an Application Rule.

The Problem

The client attempts a connection to the server. You end up with some type of application error stating that there was either a timeout or a problem with SSL/TLS authentication.

You begin to troubleshoot:

  • Azure Firewall shows the traffic is allowed.
  • NSG Flow Logs show nothing – you panic until you remember/read that Private Endpoints do not generate flow logs – I told you that I’d complicate things 🙂 You can consider VNet Flow Logs to try to capture this data and then you might discover the cause.

You experiment and discover two things:

  • If you disconnect the NSG from the subnet then the connection is allowed. Hmm – the rules are correct so the traffic should be allowed: traffic from the client prefix(es) is permitted to the server IP address/port. The rules match the firewall rules.
  • You change the Application Rule to a Network Rule (with the NSG still associated and unchanged) and the connection is allowed.

So, something is going on with the Application Rules.

The Root Cause

In this case, the Application Rule is SNATing the connection. In other words, when the connection is relayed from the Azure Firewall instance to the server, the source IP address is no longer that of the client; it is that of a compute instance in the AzureFirewallSubnet.

That is why:

  • The connection works when you remove the NSG
  • The connection works when you use a Network Rule with the NSG – the Network Rule does not SNAT the connection.

Solutions

There are two solutions to the problem:

Using Application Rules

If you want to continue to use Application Rules then the fix is to modify the NSG rule. Change the source IP prefix(es) to be the AzureFirewallSubnet.

The downsides to this are:

  • The NSG rules are inconsistent with the Azure Firewall rules.
  • The NSG rules are no longer restricting traffic to documented approved clients.
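If you go this route, the change is a swap of the source in the NSG rule. A hedged Bicep sketch of the reworked rule – the AzureFirewallSubnet prefix and server IP are assumed values from a hypothetical hub design:

```bicep
param location string = resourceGroup().location

// Assumed values: the hub AzureFirewallSubnet prefix and the server/Private Endpoint IP
var firewallSubnetPrefix = '10.0.1.0/26'
var serverAddress = '10.1.0.4'

resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-server'
  location: location
  properties: {
    securityRules: [
      {
        // Source is the AzureFirewallSubnet, not the documented client prefixes,
        // because the Application Rule SNATs the connection
        name: 'Allow-HTTPS-From-Firewall'
        properties: {
          priority: 150
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: firewallSubnetPrefix
          sourcePortRange: '*'
          destinationAddressPrefix: serverAddress
          destinationPortRange: '443'
        }
      }
    ]
  }
}
```

The rule now works, but it documents the firewall as the client – which is exactly the inconsistency described above.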

Using Network Rules

My preference is to use Network Rules for all inbound and east-west traffic. Yes, we lose some of the “Layer-7” features but we still have core features, including IDPS in the Premium SKU.

Contrary to using Application Rules:

  • The NSG rules are consistent with the Azure Firewall rules.
  • The NSG rules restrict traffic to the documented approved clients.

When To Use Application Rules?

In my sessions/classes, I teach:

  • Use DNAT rules for the rare occasion where Internet clients will connect to Azure resources via the public IP address of Azure Firewall.
  • Use Application Rules for outbound connections to the Internet, including Azure resources via public endpoints, through the Azure Firewall.
  • Use Network Rules for everything else.

This approach limits “weird sh*t” errors like the one described above and means that NSG rules are effectively clones of the Azure Firewall rules, with some additional rules to control traffic inside of the virtual network/subnet.
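As a sketch of the “Network Rules for east-west traffic” guidance, here is a hedged Bicep fragment that adds a rule collection group to an existing Azure Firewall Policy – names, priorities, and prefixes are all assumptions:

```bicep
// Assumed existing hub firewall policy
resource fwPolicy 'Microsoft.Network/firewallPolicies@2023-09-01' existing = {
  name: 'fwpol-hub'
}

resource workloadRcg 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2023-09-01' = {
  parent: fwPolicy
  name: 'rcg-spoke1'
  properties: {
    priority: 500
    ruleCollections: [
      {
        ruleCollectionType: 'FirewallPolicyFilterRuleCollection'
        name: 'rc-spoke1-network'
        priority: 100
        action: { type: 'Allow' }
        rules: [
          {
            // A Network Rule does not SNAT internal traffic, so the NSG
            // on the destination still sees the real client source IP
            ruleType: 'NetworkRule'
            name: 'Allow-Clients-To-Server'
            ipProtocols: [ 'TCP' ]
            sourceAddresses: [ '192.168.0.0/24' ] // on-prem client prefix (assumed)
            destinationAddresses: [ '10.1.0.4' ]  // server private IP (assumed)
            destinationPorts: [ '443' ]
          }
        ]
      }
    ]
  }
}
```

The same source/destination/port values can then be mirrored in the workload NSG, keeping both layers consistent.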

Azure WAF and False Positives

This post will explain how to override false positives in the (network) Azure Web Application Firewall (WAF), without compromising security, using one of four methods in combination with a tiered WAF Policy architecture:

  1. Managed Rulesets
  2. Custom Rules
  3. Exclusions
  4. Disabled rules

False Positives

A WAF is a rather simple solution, attempting to inspect L7 (application layer) traffic and intercept attacks such as protocol misuse, SQL injection, or cross-site scripting. Unfortunately, false positives can occur.

For example, let’s assume that an API app is securely shared using a WAF. Messages sent to the API might be formatted in JSON, with lots of special characters to format the message. SQL injection defenses count special characters, trying to find where an attacker is attempting to escape out of a web request to create a database command that will execute. If the defense counts too many special characters (it will!) then an alert will be created and the message will be blocked if Prevention mode is enabled.

One must allow that traffic through because it is expected traffic that the application (and the business) requires. But one must do this without opening up too many holes in the WAF, which would make the WAF a costly but pointless exercise.

Log Analytics Ingestion Charge

There is a side effect to false positives. False positives will vastly outnumber actual attack/probing attempts. Busy workloads can generate huge amounts of logs for false positives. If you use Log Analytics, that data has a cost:

  • Storage: Not too bad
  • Ingestion: This one is painful

The way to reduce the cost is to reduce the noise by overriding the detections that create false positives. Organizations that have a lot of web traffic could save a significant amount of money here.

WAF Policies

The WAF functionality of the Azure Application Gateway (AppGw) is managed by a resource called an Application Gateway WAF Policy (WAF Policy). The typical approach is to associate one WAF Policy with a WAF resource; customisations are made in that WAF Policy. For reasons that should become apparent later, I am going to urge you to take a slightly more granular approach to managing your WAF if it is used to securely share more than one workload or listener:

  • WAF parent policy: A WAF policy will be associated with the WAF. This policy will apply to the WAF and all listeners unless another WAF Policy overrides specific settings.
  • Per-Listener/Per-Workload policy: This is a policy that is created specifically for a listener or a workload (a set of listeners). Any customisations that apply only to a listener or a workload will be applied here, without affecting any other listener or workload.
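A minimal Bicep sketch of the tiered approach (all names are hypothetical): the parent policy is associated with the Application Gateway itself, while the per-listener policy is associated with a single listener, so its customisations stay scoped:

```bicep
param location string = resourceGroup().location

// Parent policy: associated with the WAF (Application Gateway) itself
resource parentPolicy 'Microsoft.Network/ApplicationGatewayWebApplicationFirewallPolicies@2023-09-01' = {
  name: 'wafpol-parent'
  location: location
  properties: {
    policySettings: {
      state: 'Enabled'
      mode: 'Detection' // start in Detection; switch to Prevention later
    }
    managedRules: {
      managedRuleSets: [
        {
          ruleSetType: 'OWASP'
          ruleSetVersion: '3.2'
        }
      ]
    }
  }
}

// Per-listener policy: overrides for one workload only,
// associated with that workload's listener on the Application Gateway
resource apiPolicy 'Microsoft.Network/ApplicationGatewayWebApplicationFirewallPolicies@2023-09-01' = {
  name: 'wafpol-api'
  location: location
  properties: {
    policySettings: {
      state: 'Enabled'
      mode: 'Detection'
    }
    managedRules: {
      managedRuleSets: [
        {
          ruleSetType: 'OWASP'
          ruleSetVersion: '3.2'
        }
      ]
    }
  }
}
```

Overrides for a single workload (exclusions, disabled rules, custom rules) go into the per-listener policy so that other listeners keep the full rule set.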

Methodology

You will never know what false positives you will encounter. If your WAF goes straight into Prevention mode then you will create a world of pain and be the recipient of a lot of hate-messages/emails.

Here’s the approach that I recommend:

  1. Protect your WAF with an NSG that has Traffic Analytics enabled. The NSG should only allow the necessary HTTP, HTTPS, WAF monitoring (from Azure), and load balancing traffic. Use a custom deny-all rule to block everything else.
  2. Enable monitoring for the Application Gateway, sending all logs to a queryable destination such as Log Analytics.
  3. Monitor traffic for a period of time – enough to allow expected normal usage of the full systems. Your monitoring should detect the false positives.
  4. Verify that Traffic Analytics did not record malicious IP addresses hitting your WAF.
  5. Query your monitoring data to find the false positives for each listener. Identify the hostname, request URI, ruleset, rule group, and rule ID that is causing the issue on a per-listener/workload basis.
  6. Ideally, developers fix any issues that create false positives but this is unlikely – so we’ll move on.
  7. Determine your override strategy (see below).
  8. Deploy your overrides with the policies still in Detection mode.
  9. Monitor traffic for another period of time to ensure that there are no more false positives.
  10. Switch the parent policy to Prevention Mode.
  11. Switch each per-listener/per-workload policy to Prevention Mode.
  12. Monitor
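Step 2 of the list above can be sketched in Bicep by sending all Application Gateway logs, including the firewall logs, to Log Analytics – the gateway name and workspace ID are assumptions:

```bicep
param logAnalyticsWorkspaceId string // resource ID of your workspace (assumed to exist)

// Assumed existing WAF-enabled Application Gateway
resource appGw 'Microsoft.Network/applicationGateways@2023-09-01' existing = {
  name: 'agw-waf'
}

// Send all log categories (access, performance, firewall) to Log Analytics
resource diag 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'diag-to-law'
  scope: appGw
  properties: {
    workspaceId: logAnalyticsWorkspaceId
    logs: [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ]
  }
}
```

With the firewall logs in Log Analytics, you can query for the hostname, request URI, rule group, and rule ID of every match on a per-listener basis.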

Managed Rule Sets

The WAF today has two rulesets that you can use:

  • OWASP: Used to detect attacks such as SQL Injection, Cross-site scripting, and so on.
  • Microsoft Bot Manager Rule Set: Used to prevent malicious bots from browsing/attacking your workloads.

You need the OWASP ruleset – but we will need to manage it (later). The bot ruleset, in my experience, creates a huge amount of noise with no way of creating granular overrides. One can override the bot ruleset using custom rules but, as you’ll see later, that’s a big stick that is not granular at all!

My approach to this is to disable the Microsoft Bot Manager Rule Set (or leave it disabled) in the parent and child rulesets. If I have a need to enable it somewhere, I can do it in a per-listener or per-workload ruleset.

Custom Rules

A custom rule is created in a WAF Policy to force traffic that matches certain criteria to be:

  • Always allowed
  • Always denied
  • Logged only without denying it

You can create a sequence of filters based on:

  • IP Address
  • Number
  • String
  • Geo Location

If the set of filters matches a request then your desired action will apply. For example, if I want to force traffic to be allowed to my API, I can enter the API URI as one of the filters and all traffic will be allowed.

Yes, all traffic will be allowed, including traffic that is not a false positive. If I only had a few OWASP rules that were blocking the traffic, the custom rule would disable all OWASP rules.

If you must use this approach, then implement it in the child policy so it is limited to the associated listener/workload.

Exclusions

This is the newest of the override types in WAF Policy – and I’ve found it to be the least useful.

The theory is that you can create an exclusion for one or more OWASP rules based on the values of request headers. For example, if a header called RequestHeaderKeys contains a value of X-Scanner you can instruct the affected OWASP rules to be disabled. This sounds really powerful and quite granular. But this starts to fall apart with other scenarios, such as the aforementioned SQL Injection.

Another common rule that alerts on or blocks traffic is Missing User Agent Header. Exclusions work on the value of a header, so if the header is missing, Exclusions cannot evaluate it.

Another gotcha is that you cannot combine header filters to create an exclusion. The Azure Portal experience for creating an Exclusion makes it look like you can. However, the result is two or more Exclusions that work independently.

If Exclusions will work for you, implement them in the per-listener/per-workload policy and specify only the rules that must be overridden. This approach will limit the effect of the exclusion:

  1. The scope is just the listener/workload that is associated with the WAF Policy.
  2. The scope is further limited to just requests where the header matches, allowing all other requests and all OWASP rules to be applied.

Disabled Rules

The final approach that you can use is to disable rules that are creating false positive alerts. A simple workload might only require one or two rules to be disabled. An older & larger workload might require many OWASP rules to be disabled!

If you are going to disable OWASP rules, then do it in the per-listener/per-workload policy. This will limit the effect of the changes to that listener/workload.
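For example, a per-listener policy could disable one noisy SQL injection rule while leaving the rest of the OWASP ruleset active. A hedged Bicep sketch – the policy name is hypothetical and the rule ID is chosen for illustration:

```bicep
param location string = resourceGroup().location

// Hypothetical per-listener policy that disables a single OWASP rule
resource listenerPolicy 'Microsoft.Network/ApplicationGatewayWebApplicationFirewallPolicies@2023-09-01' = {
  name: 'wafpol-api'
  location: location
  properties: {
    policySettings: {
      state: 'Enabled'
      mode: 'Detection'
    }
    managedRules: {
      managedRuleSets: [
        {
          ruleSetType: 'OWASP'
          ruleSetVersion: '3.2'
          ruleGroupOverrides: [
            {
              ruleGroupName: 'REQUEST-942-APPLICATION-ATTACK-SQLI'
              rules: [
                {
                  ruleId: '942430' // restricted SQL character anomaly detection (example)
                  state: 'Disabled'
                }
              ]
            }
          ]
        }
      ]
    }
  }
}
```

Because the override lives in the per-listener policy, every other listener on the WAF keeps that rule enabled.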

This is a fairly easy approach and it is pretty granular – though not as granular as Exclusions. The downside is that you are completely disabling certain protections for an entire listener/workload, leaving the workload vulnerable to attacks of those previously protected types.

Combinations

If you have the time and the data, you can combine different approaches. For example:

  • A webhook that comes from the same IP address all of the time can be allowed via a Custom Rule based on an IP Address filter. Any other traffic will be subject to the full defenses of the WAF.
  • If you have certain headers that must be allowed and you want to enable all other protections for all other traffic then use Exclusions.
  • If traffic can come from anywhere and you need to override OWASP rules, then disable those rules.

No Great Solution

In summary, there is no perfect solution. The best you can do is find the correct override solution for the specific false positive and deploy it to a specific listener or workload. This will limit the holes that you create in the WAF to the absolute minimum while enabling your workloads to function.

Azure Firewall Basic – For Small/Medium Business & “Branch”

Microsoft has just announced a lower cost SKU of Azure Firewall, Basic, that is aimed at small/medium business but could also play a role in “branch office” deployments in Microsoft Azure.

Standard & Premium

Azure Firewall launched with a Standard SKU several years ago. The Standard SKU offered a lot of features, but some things deemed necessary for security were missing: IDPS and TLS Inspection were top of the list. Microsoft added a Premium SKU that added those features as well as fuller web category inspection and URL filtering (not just FQDN).

However, some customers didn’t adopt Azure Firewall because of the price. A lot of those customers were small-medium businesses (SMBs). Another scenario that might be affected is a “branch office” in an Azure region – a smaller footprint that is closer to clients that isn’t a main deployment.

Launching The Basic SKU

Microsoft has been working on a lower-cost SKU for quite a while. The biggest challenge, I think, was trying to figure out how to balance features, performance, and availability with price. They know that the target market has a finite budget, but there are necessary feature requirements. Every customer is different, so I guess when faced with this conundrum, one needs to satisfy the needs of 80% of customers.

The clues for a new SKU have been publicly visible for quite a while – the ARM reference for Azure Firewall documented that a Basic SKU existed somewhere in Azure (in private preview). Tonight, Microsoft launched the Basic SKU in public preview. A longer blog post adds some details.

Introducing the Azure Firewall

The primary target market for the Basic SKU hasn’t deployed a firewall appliance of any kind in Azure – if they are in Azure then they are most likely only using NSGs for security, which operate only at the transport layer (TCP, UDP, ICMP) and in a decentralised way.

The Azure Firewall is a firewall appliance, allowing centralised control. It should be deployed with NSGs and resource firewalls for layered protection, with a zero-trust configuration (deny all by default) in all directions, even inside of a workload.

The Azure Firewall is native to Microsoft Azure – you don’t need a third-party license or support contract. It is fully deployable and configurable as code (ARM, Bicep, Terraform, Pulumi, etc), making it ideal for DevSecOps. Azure Firewall is much easier to learn than NVAs because the firewall is easily available through an Azure subscription and the training (Microsoft Learn) is publicly available – not hidden behind classic training paywalls. Thanks to the community and a platform model, I expect that more people are learning Azure Firewall than any other kind of firewall today – skills are in short supply, so using native tech that is easy to learn, and that many are already learning, just makes sense.

Comparing Azure Basic With Standard and Premium

Microsoft helpfully put together a table to compare the 3 SKUs:

Comparing Azure Firewall Basic with Standard and Premium

Another difference with the Basic SKU is that you must deploy the AzureFirewallManagementSubnet in addition to the AzureFirewallSubnet – this additional subnet is often associated with forced tunneling. The result is that the firewall will have a second public IP address that is used only for management tasks.
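A hedged Bicep sketch of a Basic SKU deployment with the extra management configuration – the subnet, public IP, and policy resource IDs are assumed parameters:

```bicep
param location string = resourceGroup().location
param firewallSubnetId string   // AzureFirewallSubnet resource ID (assumed)
param mgmtSubnetId string       // AzureFirewallManagementSubnet resource ID (assumed)
param publicIpId string         // data-path public IP (assumed)
param mgmtPublicIpId string     // second public IP, used only for management (assumed)
param firewallPolicyId string   // a Basic-tier Azure Firewall Policy (assumed)

resource firewall 'Microsoft.Network/azureFirewalls@2023-09-01' = {
  name: 'fw-basic'
  location: location
  properties: {
    sku: {
      name: 'AZFW_VNet'
      tier: 'Basic'
    }
    firewallPolicy: {
      id: firewallPolicyId
    }
    ipConfigurations: [
      {
        name: 'ipconfig'
        properties: {
          subnet: { id: firewallSubnetId }
          publicIPAddress: { id: publicIpId }
        }
      }
    ]
    // Basic SKU requires a dedicated management configuration,
    // which is why the second subnet and public IP exist
    managementIpConfiguration: {
      name: 'mgmt-ipconfig'
      properties: {
        subnet: { id: mgmtSubnetId }
        publicIPAddress: { id: mgmtPublicIpId }
      }
    }
  }
}
```

Note the split: workload traffic flows via the AzureFirewallSubnet configuration, while the management configuration handles only Microsoft’s control-plane traffic.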

Pricing

The Basic SKU follows the same price model as the higher SKUs: a base compute cost and a data processing cost. The shared pricing is for the Preview so it is subject to change.

The Basic SKU base compute (deployment) cost is €300.03 per month in West Europe. That’s less than 1/3 of the cost of the Standard SKU at €947.54 per month. The data processing cost for the Basic SKU is higher at €0.068 per GB. However, the amount of data passing through such a firewall deployment will be much lower, so it probably will not be a huge add-on.

Preview Deployment Error

At this time, the Basic SKU is in preview. You must enable the preview in your subscription. If you do not do this, your deployment will fail with this error:

“code”: “FirewallPolicyMissingRequiredFeatureAllowBasic”,

“message”: “Subscription ‘someGuid’ is missing required feature ‘Microsoft.Network/AzureFirewallBasic’ for Basic policies.”

Some Interesting Notes

I’ve not had a chance to do much work with the Basic SKU – work is pretty crazy lately. But here are a few things to note:

  • A hub & spoke deployment is still recommended, even for SMBs.
  • Availability zones are supported for higher availability.
  • You are forced to use Azure Firewall Manager/Azure Firewall Policy – this is a good thing because newer features are only in the new management plane.

Final Thoughts

The new SKU of Azure Firewall should add new customers to this service. I expect that larger enterprises will also be interested – not every deployment needs the full-blown Standard/Premium deployment, but some form of firewall is still required.

Azure Firewall DevSecOps in Azure DevOps

In this post, I will share the details for granting the least-privilege permissions to GitHub action/DevOps pipeline service principals for a DevSecOps continuous deployment of Azure Firewall.

Quick Refresh

I wrote about the design of the solution and shared the code in my post, Enabling DevSecOps with Azure Firewall. There I explained how you could break out the code for the rules of a workload and manage that code in the repo for the workload. Realistically, you would also need to break out the gateway subnet route table user-defined route (legacy VNet-based hub) and the VNet peering connection. All the code for this is shared on GitHub – I did update the repo with some structure and with working DevOps pipelines.

This Update

There were two things I wanted to add to the design:

  • Detailed permissions for the service principal used by the workload DevOps pipeline, limiting the scope of change that is possible in the hub.
  • DevOps pipelines so I could test the above.

The Code

You’ll find 3 folders in the Bicep code now:

  • hub: This deploys a (legacy) VNet-based hub with Azure Firewall.
  • customRoles: 4 Azure custom roles are defined. This should be deployed after the hub.
  • spoke1: This contains the code to deploy a skeleton VNet-based (spoke) workload with updates that are required in the hub to connect the VNet and route ingress on-prem traffic through the firewall.

DevOps Pipelines

The hub and spoke1 folders each contain a folder called .pipelines. There you will find a .yml file to create a DevOps pipeline.

The DevOps pipeline uses Azure CLI tasks to:

  • Select the correct Azure subscription & create the resource group
  • Deploy each .bicep file.
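As a rough sketch, such a pipeline might look like the following (the service connection name, variable values, and file path are illustrative assumptions, not the exact contents of the repo):

```yaml
# Hypothetical pipeline sketch – names and values are assumptions.
trigger:
  branches:
    include:
      - main

variables:
  azureServiceConnection: 'hub-arm-connection'   # name of the ARM service connection in DevOps
  subscriptionId: '00000000-0000-0000-0000-000000000000'
  resourceGroupName: 'p-hub'
  location: 'westeurope'

steps:
  - task: AzureCLI@2
    displayName: 'Select subscription and create resource group'
    inputs:
      azureSubscription: $(azureServiceConnection)
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az account set --subscription $(subscriptionId)
        az group create --name $(resourceGroupName) --location $(location)

  - task: AzureCLI@2
    displayName: 'Deploy hub Bicep'
    inputs:
      azureSubscription: $(azureServiceConnection)
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        az deployment group create \
          --resource-group $(resourceGroupName) \
          --template-file hub.bicep
```

The real pipelines in the repo follow the same pattern: one Azure CLI task per deployment step, authenticated via the service connection.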

My design uses one subscription for the hub and one for the workload. You are not glued to this, but you would need to modify how you configure the service principal permissions (below).

To use the code:

  1. Create one repo in DevOps for hub and one for spoke1, and copy in the required code.
  2. Create service principals in Azure AD.
  3. Grant the hub service principal owner rights to the hub subscription.
  4. Grant the spoke service principal owner rights to the spoke subscription.
  5. Create ARM service connections in DevOps settings that use the service principals. Note that the names for these service connections are referred to by azureServiceConnection in the pipeline files.
  6. Update the variables in the pipeline files with subscription IDs.
  7. Create the pipelines using the .yml files in the repos.

Don’t do anything just yet!

Service Principal Permissions

The hub service principal is simple – grant it owner rights to the hub subscription (or resource group).

The workload is where the magic happens with this DevSecOps design. The workload updates the hub using code in the workload repo that affects the workload:

  • Ingress route from on-prem to the workload in the hub GatewaySubnet.
  • The firewall rules for the workload in the hub Azure Firewall (policy) using a rules collection group.
  • The VNet peering connection between the hub VNet and the workload VNet.

That code is deployed by the workload DevOps pipeline, which authenticates using the workload’s service principal – so the workload service principal must have rights over the hub.
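For illustration, the hub-side half of that peering could be sketched in Bicep like this (the VNet names and API version are assumptions – the real code is in the GitHub repo):

```bicep
// Hypothetical sketch: the hub side of the VNet peering, deployed by the
// workload pipeline into the hub resource group.
param spokeVnetId string // resource ID of the workload (spoke) VNet

resource hubVnet 'Microsoft.Network/virtualNetworks@2023-04-01' existing = {
  name: 'vnet-hub' // assumed hub VNet name
}

resource hubToSpoke 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-04-01' = {
  parent: hubVnet
  name: 'peer-hub-to-spoke1'
  properties: {
    remoteVirtualNetwork: {
      id: spokeVnetId
    }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true   // let traffic routed via the firewall flow
    allowGatewayTransit: true     // spoke uses the hub's gateway
    useRemoteGateways: false
  }
}
```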

The quick solution would be to grant contributor rights over the hub and say “we’ll manage what is done through code reviews”. However, a better practice is to limit what can be done as much as possible. That’s what I have done with the customRoles folder in my GitHub share.

Those custom roles should be modified to change the possible scope to the subscription ID (or even the resource group ID) of the hub deployment. There are 4 custom roles:

  • customRole-ArmValidateActionOperator.json: Adds the CUSTOM – ARM Deployment Operator role, allowing the ARM deployment to be monitored and updated.
  • customRole-PeeringAdmin.json: Adds the CUSTOM – Virtual Network Peering Administrator role, allowing a VNet peering connection to be created from the hub VNet.
  • customRole-RoutesAdmin.json: Adds the CUSTOM – Azure Route Table Routes Administrator role, allowing a route to be added to the GatewaySubnet route table.
  • customRole-RuleCollectionGroupsAdmin.json: Adds the CUSTOM – Azure Firewall Policy Rule Collection Group Administrator role, allowing a rules collection group to be added to an Azure Firewall Policy.
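As an example of the shape of these files, a routes administrator role might look something like this (the exact actions and the subscription GUID are assumptions – check the repo for the real definitions):

```json
{
  "Name": "CUSTOM - Azure Route Table Routes Administrator",
  "Description": "Can manage routes within a route table, but not the route table itself.",
  "Actions": [
    "Microsoft.Network/routeTables/read",
    "Microsoft.Network/routeTables/routes/read",
    "Microsoft.Network/routeTables/routes/write",
    "Microsoft.Network/routeTables/routes/delete"
  ],
  "NotActions": [],
  "AssignableScopes": [
    "/subscriptions/00000000-0000-0000-0000-000000000000"
  ]
}
```

The AssignableScopes value is where you narrow the role to the hub subscription (or resource group) ID.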

Deploy The Hub

The hub is deployed first – this is required to grant the permissions that are required by the workload’s service principal.

Grant Rights To Workload Service Principals

The service principals for all workloads will be added to an Azure AD group (Workloads Pipeline Service Principals in the above diagram). That group is nested into 4 other AAD security groups:

  • Resource Group ARM Operations: This is granted the CUSTOM – ARM Deployment Operator role on the hub resource group.
  • Hub Firewall Policy: This is granted the CUSTOM – Azure Firewall Policy Rule Collection Group Administrator role on the Azure Firewall Policy that is associated with the hub Azure Firewall.
  • Hub Routes: This is granted the CUSTOM – Azure Route Table Routes Administrator role on the GatewaySubnet route table.
  • Hub Peering: This is granted the CUSTOM – Virtual Network Peering Administrator role on the hub virtual network.

Deploy The Workload

The workload now has the required permissions to deploy the workload and make modifications in the hub to connect the hub to the outside world.

Defending Against Supply Chain Attacks

In this post, I will discuss the concepts of supply chain attacks and some thoughts around defending against them.

What Is A Supply Chain Attack?

The recent ransomware attack through Kaseya made the news, but the concept of a supply chain attack isn’t new at all. Without doing any research, I can think of two other examples:

  • SolarWinds: In December 2020, attackers used compromised code in SolarWinds monitoring solutions to compromise customers of SolarWinds.
  • RSA: In 2011, the Chinese PLA (or hackers sponsored by them) compromised RSA and used that access to attack customers of RSA.

What is a supply chain attack? It’s pretty hard to break into a network, especially one that has hardened itself. Users can be educated – ok, some will never be educated! Networks can be hardened and micro-segmented. Identity protections such as MFA and threat detection can be put in place.

But there remains a weakness – or several of them. There’s always a way into a network – the third party! Even the most secure network deployments require some kind of monitoring system – something where a piece of software is deployed onto “every VM”. Or there’s some software vendor that’s deep into your network with openings all over the place. Those are your threats. If an attacker compromises the software from one of those vendors, then they will get into your network during your next update, and they will use the existing firewall holes and permissions required by that software to probe, spread, and attack.

Protection

You still need to have your first lines of defense, ideally using tools that are designed for protection against advanced persistent threats – not your regular AV package, dummy:

  1. Identity
  2. Email
  3. Firewall
  4. Backup with isolated offline storage protected by MFA

That’s a start, but a supply chain attack bypasses all of that by using existing channels to enter your network as if it came from a trusted source – because the attack is embedded in the code of a trusted source.

Micro-Segmentation

The first step should be micro-segmentation (AKA multi-segmentation). No two nodes on your network should be able to communicate unless:

  1. They have to
  2. They are restricted to the required directions, protocols, and ports.
  3. That traffic passes through a firewall – and ideally several firewalls.

In Microsoft Azure, that means using:

  • A central firewall, in the form of a network firewall and/or web application firewall (Azure or NVA). This firewall controls connections between the outside world and your workloads, between your workloads, and importantly from your workloads to the outside world (prevents malware from talking to its human controller).
  • Network Security Groups at the subnet level that protect the subnet and even isolate nodes inside the subnet (use a custom Deny All rule because the default Deny All rule is useless when you understand the logic of how it works).
  • Resource firewalls – that’s the guest OS firewall and Azure resource firewalls.
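To make the NSG point concrete, here is a minimal Bicep sketch of a subnet NSG with one allow rule and an explicit custom deny-all rule (the names, address prefixes, and ports are assumptions):

```bicep
// Hypothetical sketch: an NSG that allows only the required flow and then
// explicitly denies everything else at a low priority.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-04-01' = {
  name: 'nsg-workload'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'AllowWebInbound'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'VirtualNetwork'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.1.0/24' // assumed workload subnet
          destinationPortRange: '443'
        }
      }
      {
        // Custom deny-all: unlike the built-in DenyAllInBound default rule,
        // this also blocks VirtualNetwork-to-VirtualNetwork traffic.
        name: 'DenyAllInbound'
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}
```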

If you have a Windows ADDS domain, use Group Policy to force the use of Windows Firewall – lazy admins and those very same vendors that will be the channel of attack will be the first to attempt to disable the firewall on machines that they are working on.

For Azure resources, consider using Azure Policy to force/audit the use of the firewalls in your resources and a default route (0.0.0.0/0) via your central firewall.
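That default route can be sketched in Bicep as follows (the route table name and the firewall’s private IP are assumptions):

```bicep
// Hypothetical sketch: force all egress traffic through the hub firewall
// by associating this route table with the workload subnets.
resource routeTable 'Microsoft.Network/routeTables@2023-04-01' = {
  name: 'udr-spoke1'
  location: resourceGroup().location
  properties: {
    routes: [
      {
        name: 'DefaultViaFirewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.1.4' // hub Azure Firewall private IP (assumed)
        }
      }
    ]
  }
}
```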

An infrastructure-as-code approach to the central firewall (Azure Firewall) and NSGs brings documentation, change control, and rollback to network security.

Security Monitoring

This is where most organisations fail, and even IT security officers really don’t get it.

Syslog is not security monitoring. Your AV is not security monitoring. You need something bigger – something automated that can filter through the noise; I regularly use the phrase “be your Neo to read the Matrix”. Even in a small network there is a lot of noise, and something needs to filter through that noise and identify the threats.

For example, there’s a lot of TCP 445 connection attempts coming from one IP address. Or there are lots of failed attempts to sign in as a user from one IP address. Or there are lots of failed connections logged by NSG rules. Or even better – all of the above. These are the sorts of things that malware that is attempting to spread will do. This is the sort of work that Azure Sentinel is perfect for – Sentinel connects to many data sources, pulls that data to a central place where complex queries can be run to look for threats that a human won’t be able to do. Threats can create incidents, incidents can trigger automated flows to eliminate the noise, and the remaining incidents can create alerts that humans will act upon.
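As a hedged illustration of the kind of query Sentinel can run, the failed sign-in example might look something like this in KQL (the threshold and time window are assumptions):

```kusto
// Hypothetical KQL sketch for Azure Sentinel: flag source IPs with many
// failed Azure AD sign-ins in a short window.
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != "0"            // non-zero ResultType = failed sign-in
| summarize FailedAttempts = count(), TargetUsers = dcount(UserPrincipalName) by IPAddress
| where FailedAttempts > 20
| order by FailedAttempts desc
```

A query like this can back an analytics rule, and the resulting incidents can feed the automated triage flows described above.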

But some malware is clever and not so noisy. The malware that hit the HSE (the Irish national health service) used a lot of manual control to quietly spread over a very long time. Restricting outbound access to the Internet to just the connections required for business needs will cripple this control mechanism. But there is still an automated element to this malware.

Other things to implement in Azure will include:

  • IDPS: An intrusion detection & prevention system in the firewall, for example Azure Firewall Premium. When known malware/attack flows pass through the firewall, the firewall can log an alert or alert/deny the flows.
  • Security Center: Enabling Security Center’s “Azure Defender” (the tier previously known as Azure Security Center Standard) provides you with oodles of new features, including some endpoint protections that are very confusingly packaged and licensed by Microsoft.

Managed Services Providers

MSPs are a part of the supply chain for their customers. MSP staff typically have credentials that allow them into many customer networks/services. That makes the identities of those staff very valuable.

A managed service provider should be a leader in identity security process, tooling, and governance. In the Microsoft world, that means using Azure AD Premium with MFA enabled for all staff. In the Azure world, Lighthouse should be used to gain access to customers’ cloud implementations. And that access should be zero-trust, powered by Privileged Identity Management (PIM).

Oh Cr@p!

These attackers are not script kiddies. They are professional organisations with big budgets, very skilled programmers and operators, and a lot of time and will. They know that with some persistent effort targeting a vendor, they can enter a lot of networks with ease. Hitting a systems management company or, more scarily, a security vendor reaps BIG rewards because we invest in these products to secure our entire networks. The other big worry is those vendors that are deeply embedded in certain verticals such as finance or government. Imagine a vendor that is in every branch of a national government – one successful attack could bring down that entire government after a wave of upgrades! Or hitting a well-known payment vendor could open up every bank in the EU.