Azure Bastion For Secure SSH/RDP in Preview

Microsoft has announced a new preview of a platform-based jumpbox called Azure Bastion for providing secure RDP or SSH connections to virtual machines running or hosted in Azure.

Secure Remote Connections

Most people that are using The Cloud are using virtual machines, and one of the great challenges for them is secure remote access. You need RDP or SSH to be able to run these machines in the real world.

Remember: for 99.9% of customers, servers are not cattle, they are sacred cows.

Just opening up RDP or SSH straight through a public IP address is bad – hopefully you have an NSG in place, but even that’s bad. If you enable Standard Tier Security Center, the alerts will let you know how bad pretty quickly. And if the recent scare about the RDP vulnerability didn’t wake you up to this, then maybe you deserve to have someone else’s bot farm or a bitcoin mine running in your network.

There are ways that you can secure things, but they all have their pluses and minuses.

VPN

The real reason that we have point-to-site VPN in Azure virtual network gateway was as an admin entry point to the virtual network.

The clue is in the maximum number of simultaneous connections which is 128, way too low to consider as an end user solution for a Fortune 1000, who Microsoft really do their planning for.

If you have supported end user VPN then you know that it’s right up there with password resets for helpdesk ticket numbers, even with IT people like developers. Don’t go here – it won’t end well.

Just-in-Time VM Access

JIT VM Access is a feature of Security Center Standard Tier. It modifies your NSG rules to deny managed protocols such as RDP/SSH (the deny rules are stupidly made as low priority so they don’t override any allow rules!).

When you need to remote onto a VM, an NSG rule is added for a managed amount of time to allow remote access via the selected protocol from a specific source IP address.

So, if it’s all set up right, you deny remote access to virtual machines most of the time. But you will open direct access. And the way JIT VM Access manages the rules now is wonky, so I would not trust it.

An RDP Jumpbox

This is an old method – a single virtual machine, or maybe a few of them, are made available for direct access. They are isolated into a dedicated subnet. You remote into a jumpbox, and from there, you remote into one of your application/data virtual machines.

Unfortunately, it’s still straight RDP/SSH into a machine that is directly accessible on the Internet. So in the remoting protocol vulnerability scenario, you are still vulnerable at the application layer. You could combine JIT VM Access, but now normal daily operations are going to be a drag and I guarantee you that people will invest time to undermine network security. Also, you are limited to 2 RDS connections per jumpbox without investing in a larger RDS (machines + licensing) solution.

Guacamole

This one is relatively new to me. At first it looked awesome. It’s an HTTPS-based service that allows you to proxy into Linux or Windows virtual machines via RDP or SSH.

All looked good until you started running Windows Server 2016 or later in your virtual machines and you needed NLA for secure connections via RDP. Then it all fell apart. The solution requires you to either disable NLA in the guest OS (boo!) or to hard code a username/password with local logon rights for your guest OS’s into the Guacamole server (double-boo!).

Azure Bastion

In case you don’t know this, a bastion host is another name for a jumpbox – an isolated machine that you bounce through. In this case, Bastion is a service that is accessible via the Azure Portal. You sign into the portal, click Connect and use the Bastion service to connect to a Linux or Windows virtual machine via SSH/RDP in the Portal. The virtual machine does not require a public IP address or a “NAT rule”, but it’s still SSH/RDP.

On the downside:

  • There’s no multi-factor authentication (MFA)
  • It requires that you sign into the Azure Portal – many people running in the guest OS might not even have those rights!
  • VNet peering is not supported – so larger enterprises are ruled out here … no one in their right mind will deploy 500 bastion hosts (one per VNet) in a large enterprise.

Microsoft did say that these things will be worked on, but when? After GA, which based on the time of year I guess will be just before/after Ignite in early November?

In my opinion, Bastion is the right idea, but more of the backlog should have been included in the minimal viable product.

A Gateway to a Better Solution

If you are a Citrix or an RDS person then you’ve been screaming for the last 5 minutes. Because you’ve been using something for years that most people still don’t know is possible. Both Citrix and RDS have the concept of an SSL gateway.

In the case of RDS, we can deploy one or more (load balanced) Windows Server virtual machines with the RDS Gateway role. If we combine that with NPS and Azure AD, we can also add MFA. With a simple tweak to the Remote Desktop Connection client (MSTSC.EXE), we can RDP to a Windows machine behind the RDS Gateway. The connection from the client to the gateway is pre-authenticated, x.509 certificate protected, HTTPS traffic encapsulating the RDP stream. That connection terminates at the RDS Gateway, which then forwards the session as RDP to the desired Windows Server virtual machine behind it.

Unlike the previous jumpbox solution:

  • This can be a low-end machine, such as a B-Series.
  • It can scale out using a load balancer
  • Many people can relay through a single jumpbox machine.
  • You won’t need RDS licensing at all, not even to scale out to more than 2 users per gateway machine.

So – there’s no SSH here. So Linux is a problem.

Opinion

We don’t really have a complete solution right now. Azure Bastion probably will be the best one in the long-run, but it has so many missing features that I couldn’t consider it now. For Windows, an RDS Gateway is probably best, and for Linux, a Guacamole server might be best.

What do you think?

Webinar – Getting More Performance From Azure VMs

I will be doing a webinar later today for the European SharePoint Office 365 & Azure Community (from the like-named conference). The webinar is at 14:00 UK/Irish, 15:00 CET, and 09:00 EST. Registration is here.

Title: Getting More Performance from Azure Virtual Machines

Speaker: Aidan Finn, MVP, Ireland

Date and Time: Wed, May 1, 2019 3:00 PM – 4:00 PM CEST

Webinar Description:  You’ve deployed your shiny new application in the cloud, and all that pride crashes down when developers and users start to complain that it’s slow. How do you fix it? In this session you’ll learn to understand what Azure virtual machines can offer, how to pick the right ones for the right job, and how to design for the best possible performance, including networking, storage, processor, and GPU.

Key benefits of attending:
– Understand virtual machine design
– Optimise storage performance
– Get more from Azure networking

Do Not Enable Azure Storage Account Firewall – IaaS

If you read through the security recommendations in Azure Security Center, you do get given out to a lot. A lot of it makes no sense if you understand Azure and the recommendations. One that appeared to make sense was to enable the relatively new firewall in Azure Storage:

  • Only allow trusted subnets – nice idea to limit the attack surface on the storage account in conjunction with service endpoints.
  • Allow “trusted Microsoft services” to access the storage account (on by default).

Note: A storage account can only be connected to if you know one of the really long random access keys.

But if you do enable this firewall in an Azure deployment, things will break:

  • Boot Diagnostics: Does not know how to write to a secured storage account, even with firewall rules and service endpoints enabled.
  • Serial Console Access: Requires Boot Diagnostics to be working so that’s dead too.
  • NSG Flow Logs/Traffic Analytics: Another feature that doesn’t understand a secured storage account, even with “trusted Microsoft services” marked as enabled (default).

And there might be more!

So you have to ask yourself – do you want maximum security or a usable & manageable system? Storage account firewalls are pretty new – we didn’t need them a few months ago. So we can drop that feature, and maybe use the new Advanced Threat Protection for storage accounts feature instead?

It’s a pity that some joined-up thinking and integration testing weren’t done here.

Global ONLINE Azure Bootcamp

On one day every year, community members all across the planet get together at local events and host/attend sessions on Azure; this is the Global Azure Bootcamp. It’s been running on a Spring Saturday for years, and this year it is on April 27th.

Unfortunately, Microsoft Ireland wasn’t able to provide a venue so it looked like there would not be a local event in this part of Ireland. While I was at the recent MVP Summit, I threw out the idea of running an online version of the Global Azure Bootcamp … a Global Online Azure Bootcamp. The MVP Lead for UK & Ireland, Claire, loved the idea, ran off to the organisers of the global event, came back and said “do it!”.

So I did … I reached out to the speaker community and … was blown away by the response. So much so, that this will be a truly Global ONLINE Azure Bootcamp with content for all timezones:

  • We’re starting at 09:00 Perth/Beijing time
  • Finishing at 17:00 Seattle/Los Angeles time

The idea is that sessions will be pre-recorded and made available online on a scheduled basis on April 27th. That means anyone with Internet access anywhere on the planet can join this instance of the Global Azure Bootcamp – some of the presenters will actually be live-presenting elsewhere that day!

The content spans many tracks: dev, infrastructure, devops, data, AI, governance, security, and more. There really is something for everyone that is interested in Azure.

You can learn more here on the official event site.

This event has no sponsorship and it’s all been organized at the very last second. So here’s my ask:

Hopefully we’ll see (so to speak because we don’t have tracking) you there on the day!

Aidan.

Azure Availability Zones in the Real World

I will discuss Azure’s availability zones feature in this post, sharing what they can offer for you and some of the things to be aware of.

Uptime Versus SLA

Noobs to hosting and cloud focus on three magic letters: S, L, A or service level agreement. This is a contractual promise that something will be running for a certain percentage of time in the billing period or the hosting/cloud vendor will credit or compensate the customer.

You’ll hear phrases like “three nines”, or “four nines” to express the measure of uptime. The first is a 99.9% measure, and the second is a 99.99% measure. Either is quite a high level of uptime. Azure does have SLAs for all sorts of things. For example, a service deployed in a valid virtual machine availability set has a connectivity (uptime) SLA of 99.9%.
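To put those "nines" into perspective, here is a minimal sketch that converts an uptime percentage into the downtime it tolerates. It assumes a 30-day billing month, which is my simplification; a real SLA defines its own measurement period and exclusions, and the function name is just illustrative.

```python
# Convert an uptime percentage into the downtime that it permits.
# Assumption: a 30-day billing month; real SLA terms define their own period.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Return the minutes of downtime permitted before the SLA is breached."""
    return period_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    for label, sla in (("three nines", 99.9), ("four nines", 99.99)):
        minutes = allowed_downtime_minutes(sla)
        print(f"{label} ({sla}%): {minutes:.2f} minutes per month")
```

Under that assumption, three nines allows roughly 43 minutes of downtime in the month and four nines allows a little over 4 minutes.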

Why did I talk about noobs? Promises are easy to make. I once worked for a hosting company that offered a ridiculous 100% SLA for everything, including cheap-ass generic Pentium “servers” from eBay with single IDE disks. 100% is an unachievable target because … let’s be real here … things break. Even systems with redundant components have downtime. I prefer to see realistic SLAs and honest statements on what you must do to get that guarantee.

Azure gives us those sorts of SLAs. For virtual machines we have:

  • 99.5% for machines with just Premium SSD disks
  • 99.9% for services running in a valid availability set
  • 99.99% for services running in multiple availability zones

Ah… let’s talk about that last one!

Availability Sets

First, we must discuss availability sets and what they are before we move one step higher. An availability set is a form of anti-affinity, a feature that also exists in vSphere and in Hyper-V Failover Clustering (PowerShell or SCVMM); it is a label on a virtual machine that instructs the compute cluster to spread the virtual machines across different parts of the cluster. In Azure, virtual machines in the same availability set are placed into different:

  • Update domains: Avoiding downtime caused by (rare) host reboots for updates.
  • Fault domains: Enable services to remain operational despite hardware/software failure in a single rack.
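The spreading idea can be sketched as a toy model. Everything here is an assumption made for illustration: the round-robin placement rule, the counts of 2 fault domains and 5 update domains, and the function names. It is not a description of how the Azure fabric is implemented.

```python
# Toy model of anti-affinity placement in an availability set.
# Assumptions: round-robin placement, 2 fault domains, 5 update domains.


def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair in round-robin order."""
    placement = {}
    for index, name in enumerate(vm_names):
        placement[name] = (index % fault_domains, index % update_domains)
    return placement


def survivors(placement, failed_fault_domain):
    """Return the VMs that keep running when one fault domain (rack) fails."""
    return sorted(
        name for name, (fault_domain, _update_domain) in placement.items()
        if fault_domain != failed_fault_domain
    )


if __name__ == "__main__":
    placement = place_vms(["web1", "web2", "web3"])
    print(placement)
    print("Still running if fault domain 0 fails:", survivors(placement, 0))
```

The point of the model is simply that no single rack failure or host update takes out every member of the set, provided there are at least two members.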

The above solution spreads your machines around a single compute (Hyper-V) cluster, in a single room, in a single building. That’s amazing for on-premises, but there can still be an issue. Last summer, a faulty humidity sensor brought down one such room and affected a “small subset” of customers. “Small subset” is OK, unless you are included and some mission critical system was down for several hours. At that point, SLAs are meaningless – a refund for the lost runtime cost of a pair of Linux VMs running network appliance software won’t compensate for thousands or millions of Euros of lost business!

Availability Zones

We can go one step further by instructing Azure to deploy virtual machines into different availability zones. A single region can be made up of different physical locations with independent power and networking. These locations might be close together, as is typically the case in North Europe or West Europe. Or they might be on the other side of a city from each other, as is the case in some regions in North America. There is a low level of latency between the buildings, but this is still higher than that of a LAN connection.

A region that supports availability zones is split into 4 zones. You see three zones (round robin between customers), labeled as 1, 2, and 3. You can deploy many services across availability zones – this is improving:

  • VNet: Is software-defined so can cross all zones in a single region.
  • Virtual machines: Can connect to the same subnet/address space but be in different zones. They are not in availability sets but Azure still maintains service uptime during host patching/reboots.
  • Public IP Addresses: Standard IP supports anycast and can be used to NAT/load balance across zones in a single region.

Other network resources can work with availability zones in one of two ways:

  • Zonal: Instances are deployed to a specific zone, giving optimal latency performance within that zone, but can connect to all zones in the region.
  • Zone Redundant: Instances are spread across the zones for an active/active configuration.

Examples of the above are:

  • The zone-aware VNet gateways for VPN/ExpressRoute
  • Standard load balancer
  • WAGv2 / WAFv2

Considerations

There are some things to consider when looking at availability zones.

  • Regions: The list of regions that supports availability zones is increasing slowly but it is far from complete. Some regions will not offer this highest level of availability.
  • Catchup: Not every service in Azure is aware of availability zones, but this is changing.

Let me give you two examples. The first is VM Boot Diagnostics, a service that I consider critical for seeing the console of the VM and getting serial console access without a network connection to the virtual machine. Boot Diagnostics uses an agent in the VM to write to a storage account. That storage account can be:

  • LRS: 3 replicas reside in a single compute cluster, in a single room, in a single building (availability zone).
  • GRS: LRS plus 3 asynchronous replicas in the paired region, that are not available for write unless Microsoft declares a total disaster for the primary region.

So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics, however, the agent will not support this type of storage configuration. Gotcha!

Azure Advisor will also be a pain in the ass. Noobs are told to rely on Advisor (it is several questions in the new Azure infrastructure exams) for configuration and deployment advice. Advisor will see the above two VMs as being not highly available because they are not (and cannot be) in a common availability set, so you are advised to degrade their SLA by migrating them to a single zone for an availability set configuration – ignore that advice and be prepared to defend the decision from Azure noobs, such as management, auditors, and ill-informed consultants.

Opinion

Availability zones are important – I use them in an architecture pattern that I am working on with several customers. But you need to be aware of what they offer and how certain things do not understand them yet or do not support them yet.

I’m Speaking At IP Expo in Manchester This Week

I will be in Manchester, UK, this week. I will be presenting an updated version of my “Solving the Azure Storage Maze” talk during the Altaro-sponsored slot at 12:45 on Wednesday in the Infrastructure Modernisation Theatre of IP Expo – registration is free.

If you have ever struggled with understanding all the storage options (not including databases – because I have 30 minutes, not 1 day) in Azure then I will help you navigate through all of the options. Note that I have updated this talk with Premium Blob and Premium Files content.

Designing Solutions That You Are Migrating To The Cloud

In this post, I will discuss some trends that I have noticed when people are planning the migration of a service to The Cloud.

I am going to make this post as cloud-agnostic as I can, with my limitation being that I only work in Azure. I don’t know AWS or Google Compute, but I know their offerings are similar to many found in Azure. So, I suspect that what I talk about here will apply across the Big-3 clouds.

My Observation

Let’s say that you are migrating a 2-tier application to The Cloud. You’ve been running it on-premises or in co-lo hosting until now. There’s Checkpoint firewalls, some Windows/Linux web servers, and a database cluster (let’s go with SQL Server Always On for this example). You want to move this to The Cloud.

The common thing that I see, by customers and consultants, is that they will look to take this design and redeploy it identically in their preferred cloud. They’ll want the same firewalls, virtual machines with the same OS, and a VM-based SQL Server cluster.

What has been accomplished with that migration? All they’ve done is move the problems. Nothing has really changed other than the location – OK, that’s a bit untrue because the cloud service might offer some automation, security, bandwidth solutions that the co-lo hosting company did not. But essentially, the application is identical. All the other benefits of The Cloud, such as elasticity, flexible (maybe even lower) pricing, easier disaster recovery, lower operational costs, etc … they cannot be delivered because the solution is identical.

The Objections

Why would one struggle to figure out how to build an S2D cluster with expensive VMs (for disk and network throughput) and Premium/Ultra SSD disks (for IOPS and lower latency), and build up VM clusters in availability sets, when you can just turn on a cloud service and get an Enterprise SKU of SQL Server with always-on availability? Adopting that cloud service WILL be cheaper, easier, near instantly available, and never require a patch deployment or upgrade by you ever again. Sure, I’ll hear the usual BS arguments now:

  • Cost of the PaaS service: Seriously? You want to run Ultra/Premium disks with high-core count VMs with Enterprise licensing, and you think that will be cheaper?
  • App compatibility: I’m going Azure-specific here because it’s what I know … SQL Managed Instance is SQL Server that you know, but in the platform.
  • License mobility: Yup, Azure (sorry!) supports hybrid usage benefit for SQL Server in the platform too!

And as for the web servers: the easy solution is to use a platform service. It is simple and will run your IIS or Tomcat code. And depending on your cloud, it will support IP filtering, firewall or full-blown Layer-7 WAF and/or DDoS protection. Time to deploy? A few minutes. Future maintenance? Near none. DevOps integrations: way more than any VM could ever offer.

Cloud, Cloud, Cloud

When you are going to The Cloud, you need to leave the year 2008 behind you. I pick that year because that’s when I first attended cloud events and most people attending were there to learn how to stop The Cloud. Those people still exist, some consciously thinking like that and some unconsciously sabotaging their customer/employer.

Traditional solutions can be done in the cloud. But you have to ask: should they?

By The Way

How would I design the above scenario in Azure?

  • Database: Azure SQL, which is an always-on (triple) cluster. I might go with Managed Instance if app-compat was a concern.
  • Web services: App Services, Linux/Tomcat container from the gallery

Options:

  • Redis Cache for database performance.
  • CDN for static web content performance – large amounts of static content could live in a storage account with CDN support too.
  • ASE instead of normal App Services if I need to bring a WAF into play.
  • If using ASE, I could enable DDoS Standard Tier protection, with L4 on the VNet and L7 in the WAF.
  • Traffic Manager to abstract the deployment from the cloud, enabling mobility of the service.

Reasons To Use A Third Party Firewall In Azure

In this post, I will go through some of the reasons that one might use to choose a third-party firewall network virtual appliance (NVA) in Azure instead of the Azure Firewall.

You can read my take on choosing the Azure Firewall here.

Management

Let’s say you use Firewall X for your on-premises network(s). You have two things:

  • A skillset
  • A management tool

Maybe you want to re-use those? Let’s talk about that reasoning.

You have developed skills over the years to manage and troubleshoot Firewall X – well done! And now you want to bring those skills to Azure. At first, that seems logical. But what if I told you that there was an alternative that had the same functionality as (if not more than) Firewall X, scaled better than Firewall X, and was so easy that I could teach you to fully use it in 15 minutes? Hmm. Those years of skills don’t really make much sense now, do they?

Centralized management – I’ll give you some credit here. Azure Firewall does not have this right now. If I have 4 Azure Firewalls spread around the globe, I do not have 1 management experience. I have identical configuration experiences, but the global configurations have to be replicated – you could script that or use JSON templates. That’s not the same as using a GUI and saying “push this rule to the following 4 firewalls”. But let me ask you this: is this one feature genuinely a business reason to choose a third-party that has an unstable design and limited performance, high availability (if it even has it) or scale-out (most don’t even have this)?

Trust

“You want me to use a MICROSOFT firewall?”. Get over yourself. You’re in Azure and you’re going to be relying on Microsoft security all over the place. Grab your Sony Walkman and return back to whatever decade you came from.

Client VPN

Now we’re talking about something I can genuinely agree with – to a point. Azure sucks at end-user VPN. Azure’s approach is that you should be changing the user experience to using HTTPS (TLS) connectivity to web apps or Citrix/RDS gateways. But time and again, I do encounter customers who want/need VPN. Windows Server mysteriously does not support any of its user connectivity in Azure. And the Azure VPN Gateway has a limited and unsatisfying user VPN experience. So if you want to use a modern “SSL” VPN client with a third-party firewall, I can understand that. BUT, I would limit that appliance to that role. I just cannot stand the mess to get HA working with some of the third party NVAs (if they bother documenting) and the near-absence of scale-out for performance. I would still use Azure Firewall for the firewall 😊

Emotion

And that’s what you have left. And that’s not a valid business reason.

Brand

I’ve done a good bit of reading. So far the only brand of third-party NVA that I would consider myself for an edge/central firewall deployment is Palo Alto – but I’d rather use Azure Firewall over it anyway! All of the third-party solutions are compromised in some way:

  • Don’t do active-active clustering (scale-out)
  • Don’t even offer HA!
  • Have hack solutions (“we’ll edit your route tables for you”) for failover that you know will do more damage than an outage
  • Their documentation pure stinks

How to Troubleshoot Azure Routing?

This post will explain how routing works in Microsoft Azure, and how to troubleshoot your routing issues with Route Tables, BGP, and User-Defined Routes in your virtual network (VNet) subnets and virtual (firewall) appliances/Azure Firewall.

Software-Defined Networking

Right now, you need to forget VLANs, and how routers, bridges, routing switches, and all that crap works in the physical network. Some theory is good, but the practice … that dies here.

Azure networking is software-defined (VXLAN). When a VM sends a packet out to the network, the Azure Fabric takes over as soon as the packet hits the virtual NIC. That same concept extends to any virtual network-capable Azure service. From your point of view, a memory copy happens from source NIC to destination NIC. Yes; under the covers there is an Azure backbone with a “more physical” implementation but that is irrelevant because you have no influence over it.

So always keep this in mind: network transport in Azure is basically a memory copy. We can, however, influence the routing of that memory copy by adding hops to it.

Understand the Basics

When you create a VNet, it will have 1 or more subnets. By default, each subnet will have system routes. The first ones are simple, and I’ll make it even more simple:

  • Route directly via the default gateway to the destination if it’s in the same supernet, e.g. 10.0.0.0/8
  • Route directly to Internet if it’s in 0.0.0.0/0

By the way, the only way to see system routes is to open a NIC in the subnet, and click Effective Routes under Support & Troubleshooting. I have asked that this is revealed in a subnet – not all VNet-connected services have NICs!

And also, by the way, you cannot ping the subnet default gateway because it is not an appliance; it is a software-defined function that is there to keep the guest OS sane … and probably for us too 😊

When you peer a VNet with another VNet, you do a few things, including:

  • Instructing VXLAN to extend the plumbing between the peered VNets
  • Extending the “VirtualNetwork” NSG rule security tag to include the peered neighbour
  • Creating a new system route for peering.

The result is that VMs in VNet1 will send packets directly to VMs in VNet2 as if they were in the same VNet.

When you create a VNet gateway (let’s leave BGP for later) and create a local network connection, you create another (set of) system routes for the virtual network gateway. The local address space(s) will be added as destinations that are tunnelled via the gateway. The result is that packets to/from the on-prem network will route directly through the gateway … even across a peered connection if you have set up the hub/spoke peering connections correctly.

Let’s add BGP to the mix. If I enable ExpressRoute or a BGP-VPN, then my on-prem network will advertise routes to my gateway. These routes will be added to my existing subnets in the gateway’s VNet. The result is that the VNet is told to route to those advertised destinations via the gateway (VPN or ExpressRoute).

If I have peered the gateway’s VNet with other VNets, the default behaviour is that the BGP routes will propagate out. That means that the peered VNets learn about the on-premises destinations that have been advertised to the gateway, and thus know to route to those destinations via the gateway.

And let’s stop there for a moment.

Route Priority

We now have 2 kinds of route in play – there will be a third. Let’s say there is a system route for 172.16.0.0/16 that routes to virtual network. In other words, just “find the destination in this VNet”. Now, let’s say BGP advertises a route from on-premises through the gateway that is also for 172.16.0.0/16.

We have two routes for the 172.16.0.0/16 destination:

  • System
  • BGP

Azure looks at routes that clash like above and deactivates one of them. Azure always ranks BGP above System. So, in our case, the System route for 172.16.0.0/16 will be deactivated and no longer used. The BGP route for 172.16.0.0/16 via the VNet gateway will remain active and will be used.

Specificity

Try saying that word 5 times in a row after 5 drinks!

The most specific route will be chosen. In other words, the route with the best match for your destination is selected by the Azure fabric. Let’s say that I have two active routes:

  A. 172.16.0.0/16 via X
  B. 172.16.1.0/24 via Y

Now, let’s say that I want to send a packet to 172.16.1.4. Which route will be chosen? Route A is a 16 bit match (172.16.*.*). Route B is a 24 bit match (172.16.1.*). Route B is a closer match so it is chosen.

Now add a scenario where you want to send a packet to 172.16.2.4. At this point, the only match is Route A. Route B is not a match at all.
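If you want to reason about this outside of the portal, the "most specific route wins" rule can be sketched with Python's standard ipaddress module. This is only an illustration of longest-prefix matching using Routes A and B from above; the function name and the route list structure are my own assumptions and have nothing to do with any Azure API.

```python
import ipaddress


def most_specific_route(destination, routes):
    """Return the (prefix, next_hop) pair with the longest matching prefix, or None."""
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The longest prefix length is the closest (most specific) match.
    network, next_hop = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), next_hop


ROUTES = [
    ("172.16.0.0/16", "X"),  # Route A
    ("172.16.1.0/24", "Y"),  # Route B
]

if __name__ == "__main__":
    for destination in ("172.16.1.4", "172.16.2.4"):
        print(destination, "->", most_specific_route(destination, ROUTES))
```

The first destination matches both routes and goes via Y because the /24 is the closer match; the second only matches Route A and goes via X.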

This helps explain an interesting thing that can happen in Azure routing. If you create a generic route for the 0.0.0.0/0 destination it will only impact routing to destinations outside of the virtual network – assuming you are using the private address spaces in your VNet. The subnets have system routes for the 3 private address spaces which will be more specific than 0.0.0.0/0:

  A. 192.168.0.0/16
  B. 172.16.0.0/12
  C. 10.0.0.0/8
  D. 0.0.0.0/0

If your VNet address space is 10.1.0.0/16 and you are trying to send a packet from subnet 1 (10.1.1.0/24) to subnet 2 (10.1.2.0/24), then the generic Route D will always be less specific than the system route, Route C.

Route Tables

A route table resource allows us to manage the routing of a subnet. Good practice is that if you need to manage routing then:

  • Create a route table for the subnet
  • Name the route table after the VNet/subnet
  • Only use a route table with 1 subnet

The first thing to know about route tables is that you can control BGP propagation with them. This is especially useful when:

  • You have peered virtual networks using a hub gateway
  • You want to control how packets get to that gateway and the destination.

The default is that BGP propagation is allowed over a peering connection to the spoke. In the route table (Settings > Configuration) you can disable this propagation so the BGP routes are never copied from the hub network (with the VNet gateway) to the peered spoke VNet’s subnets.

The second thing about route tables is that they allow us to create user-defined routes (UDRs).

User-Defined Routes

You can control the flow of packets using user-defined routes. Note that UDRs outrank BGP routes and System Routes:

  1. UDR
  2. BGP routes
  3. System routes

If I have a system or BGP route to get to 192.168.1.0/24 via some unwanted path, I can add a UDR to 192.168.1.0/24 via the desired path. If the two routes are identical destination matches, then my UDR will be active and the BGP/system route will be deactivated.
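Putting specificity and route priority together, the selection logic can be sketched as follows. This is a simplified model under my own assumptions: the dictionary layout, the next hop labels, and the function name are all hypothetical, and the real fabric handles cases (such as service endpoints) that this ignores.

```python
import ipaddress

# Lower rank wins when two routes have an identical destination prefix.
SOURCE_RANK = {"UDR": 0, "BGP": 1, "System": 2}


def select_route(destination, routes):
    """Pick the longest matching prefix, breaking ties by route source rank."""
    address = ipaddress.ip_address(destination)
    candidates = [
        route for route in routes
        if address in ipaddress.ip_network(route["prefix"])
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda route: (
            -ipaddress.ip_network(route["prefix"]).prefixlen,  # most specific first
            SOURCE_RANK[route["source"]],                       # then UDR > BGP > System
        ),
    )


# Hypothetical route set for illustration only.
ROUTES = [
    {"prefix": "0.0.0.0/0", "source": "System", "next_hop": "Internet"},
    {"prefix": "192.168.1.0/24", "source": "BGP", "next_hop": "unwanted path"},
    {"prefix": "192.168.1.0/24", "source": "UDR", "next_hop": "desired path"},
]

if __name__ == "__main__":
    print(select_route("192.168.1.10", ROUTES))
    print(select_route("8.8.8.8", ROUTES))
```

In this model the UDR wins for 192.168.1.10 because it ties with the BGP route on prefix length and outranks it, while 8.8.8.8 falls through to the system default route.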

Troubleshooting Tools

The traditional tool you might have used is TRACERT. I’m sorry, it has some use, but it’s really not much more than PING. In the software defined world, the default gateway isn’t a device with a hop, the peering connection doesn’t have a hop, and TRACERT is not as useful as it would have been on-premises.

The first thing you need is the above knowledge. That really helps with everything else.

Next, make sure that the problem really is your routing and not your NSGs!

Next is the NIC, if you are dealing with virtual machines. Go to Effective Routes and look at what is listed, what is active and what is not.

Network Watcher has a couple of tools you should also look at:

  • Next Hop: This is a pretty simple tool that tells you the next “appliance” that will process packets on the journey to your destination, based on the actual routing discovered.
  • Connection Troubleshoot: You can send a packet from a source (VM NIC or Application Gateway) to a certain destination. The results will map the path taken and the result.

The tools won’t tell you why a routing plan failed, but with the above information, you can troubleshoot a (desired) network path.

Locking Down Network Access to the Azure Application Gateway/Firewall

In this post, I will explain how you can use a Network Security Group (NSG) to completely lock down network access to the subnet that contains an Azure Web Application Gateway (WAG)/Web Application Firewall (WAF).

The steps are as follows:

  1. Deploy a WAG/WAF to a dedicated subnet.
  2. Create a Network Security Group (NSG) for the subnet.
  3. Associate the NSG with the subnet.
  4. Create an inbound rule to allow TCP 65503-65534 from the Internet service tag to the CIDR address of the WAG/WAF subnet.
  5. Create rules to allow application traffic, such as TCP 443 or TCP 80, from your sources to the CIDR address of the WAG/WAF subnet.
  6. Create a low priority (4000) rule to allow any protocol/port from the AzureLoadBalancer service tag to the CIDR address of the WAG/WAF subnet.
  7. Create a rule with the lowest priority (4096) to Deny All from Any source.
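The rule set from the steps above can be written down as plain data and sanity-checked before you create anything in Azure. This is a sketch under assumptions: the subnet CIDR, the client source range, the rule names, and priorities 100 and 200 are placeholders of mine. Only priorities 4000 and 4096, the port range, and the service tags come from the steps.

```python
import ipaddress

WAF_SUBNET = "10.0.2.0/24"  # assumption: replace with your WAG/WAF subnet CIDR

# Rule names, priorities 100/200 and the 10.10.0.0/16 source are hypothetical.
rules = [
    dict(priority=100, name="AllowHealthMonitoring", source="Internet",
         destination=WAF_SUBNET, ports="65503-65534", protocol="Tcp", access="Allow"),
    dict(priority=200, name="AllowHttps", source="10.10.0.0/16",
         destination=WAF_SUBNET, ports="443", protocol="Tcp", access="Allow"),
    dict(priority=4000, name="AllowAzureLoadBalancer", source="AzureLoadBalancer",
         destination=WAF_SUBNET, ports="*", protocol="*", access="Allow"),
    dict(priority=4096, name="DenyAll", source="*",
         destination="*", ports="*", protocol="*", access="Deny"),
]


def validate(rules, subnet):
    """Return a list of problems found in the rule set (empty means OK)."""
    problems = []
    try:
        ipaddress.ip_network(subnet)  # rejects a CIDR with host bits set
    except ValueError as error:
        problems.append(f"bad subnet CIDR: {error}")
    priorities = [rule["priority"] for rule in rules]
    if len(set(priorities)) != len(priorities):
        problems.append("priorities must be unique")
    if any(not 100 <= p <= 4096 for p in priorities):
        problems.append("user-defined priorities must be between 100 and 4096")
    last = max(rules, key=lambda rule: rule["priority"])
    if (last["access"], last["source"]) != ("Deny", "*"):
        problems.append("the last rule should deny everything")
    return problems


print(validate(rules, WAF_SUBNET))  # []
```

The check is deliberately small: it only confirms that the CIDR is a real network address, that the priorities are unique and in the user-defined range, and that the final rule is the catch-all deny.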

The Scenario

It is easy to stand up a WAG/WAF in Azure and get it up and running. But in the real world, you should lock down network access. In the world of Azure, all network security begins with an NSG. When you deploy WAG/WAF in the real world, you should create an NSG for the WAG/WAF subnet and restrict the traffic to that subnet to what is just required for:

  • Health monitoring of the WAG/WAF
  • Application access from the authorised sources
  • Load balancing of the WAG/WAF instances

Everything else inbound will be blocked.

The NSG

Good NSG practice is as follows:

  1. Tiers of services are placed into their own subnet. Good news – the WAG/WAF requires a dedicated subnet.
  2. You should create an NSG just for the subnet – name the NSG after the VNet-Subnet, and maybe add a prefix or suffix of NSG to the name.

Health Monitoring

Azure will need to communicate with the WAG/WAF to determine the health of the backends – I know that this sounds weird, but it is what it is.

Note: You can view the health of your backend pool by opening the WAG/WAF and browsing to Monitoring > Backend Health. Each backend pool member will be listed here. If you have configured the NSG correctly then the pool member status should be “Healthy”, assuming that they are actually healthy. Otherwise, you will get a warning saying:

Unable to retrieve health status data. Check presence of NSG/UDR blocking access to ports 65503-65534 from Internet to Application Gateway.

OK – so you need to open those ports from “Internet”. Two questions arise:

  • Is this secure? Yes – Microsoft states here that these ports “are protected (locked down) by Azure certificates. Without proper certificates, external entities, including the customers of those gateways, will not be able to initiate any changes on those endpoints”.
  • What if my WAG/WAF is internal and does not have a public IP address? You will still do this – remember that “Internet” is everything outside the virtual network and peered virtual networks. Azure will communicate with the WAG/WAF via the Azure fabric and you need to allow this communication that comes from an external source.

In my example, my WAF subnet CIDR is 10.0.2.0/24, so the rule allows TCP 65503-65534 from the Internet service tag to 10.0.2.0/24.

Application Traffic

Next, I need to allow application traffic. Remember that the NSG operates at the TCP/UDP level and has no idea of URLs – that’s the job of the WAG/WAF. I will use the NSG to define what TCP ports I am allowing into the WAG/WAF (such as TCP 443) and from what sources.

In my example, the WAF is for internal usage. Clients will connect to applications over a VPN/ExpressRoute connection, so a sample rule would allow TCP 443 from the on-premises address ranges to the CIDR of the WAG/WAF subnet.

If this was an Internet-facing WAG or WAF, then the source service tag would be Internet. If other services in Azure need to connect to this WAG or WAF, then I would allow traffic from either Virtual Network or specific source CIDRs/addresses.

The Azure Load Balancer

To be honest, this one caught me out until I reasoned out what the cause was. My next rule will deny all other traffic to the WAG/WAF subnet. Without this load balancer rule, the client could not connect to the WAG/WAF. That puzzled me, and searches led me nowhere useful. And then I realized:

  • A WAG/WAF is 1+ instances (2+ in v2), each consuming IP addresses in the subnet.
  • They are presented to clients as a single IP.
  • That single IP must be a load balancer.
  • That load balancer needs to probe its own backend pool, which is made up of the instance(s) of the WAG/WAF in this case.

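You might ask: isn’t there a default rule to allow a load balancer probe? Yes, it has priority 65001. But we will be putting in a rule at 4096 to deny all connections. NSG rules are processed in priority order, lowest number first, so the 4096 rule is hit before any default rule. That is the point of the rule: it overrides the 65000 rule that allows everything from VirtualNetwork, which includes all subnets in the virtual network and all peered virtual networks. But it also overrides the 65001 load balancer rule, so we have to allow the load balancer ourselves with a rule that is processed before 4096.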
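To see why the default rule doesn't save us, here is a toy first-match evaluation in Python. It is a simplification that I am assuming for illustration: rules are matched on the source only (ports, protocol, and destination are ignored), and the `evaluate` function is made up. The three default inbound rules and their priorities are the standard NSG defaults.

```python
# (priority, source, access) for the default inbound NSG rules.
DEFAULT_RULES = [
    (65000, "VirtualNetwork", "Allow"),
    (65001, "AzureLoadBalancer", "Allow"),
    (65500, "*", "Deny"),
]


def evaluate(source_tag, custom_rules):
    """Return (priority, access) of the first rule that matches the source."""
    # Lowest priority number is processed first; the first match wins.
    for priority, source, access in sorted(custom_rules + DEFAULT_RULES):
        if source in ("*", source_tag):
            return priority, access


deny_only = [(4096, "*", "Deny")]
with_lb_rule = [(4000, "AzureLoadBalancer", "Allow"), (4096, "*", "Deny")]

print(evaluate("AzureLoadBalancer", deny_only))     # (4096, 'Deny')
print(evaluate("AzureLoadBalancer", with_lb_rule))  # (4000, 'Allow')
```

With only the deny rule in place, the probe is blocked at 4096 and never reaches the default rule at 65001. Adding the allow rule at 4000 lets the probe through before the deny rule is processed.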

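The rule is simple enough: allow any protocol and port from the AzureLoadBalancer service tag to the CIDR of the WAG/WAF subnet, with a priority of 4000.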

Deny Everything Else

Now we will override the default NSG rules that allow all communications to the subnet from other subnets in the same VNet or peered VNets. This rule should have the lowest possible user-defined priority, which is 4096, and it should deny any protocol and port from any source.

Why am I using the lowest possible priority? This is classic good firewall rule practice. General rules should be low priority, and specific rules should be high priority. The more general, the lower. The more specific, the higher. The most general rule we have in firewalls is “block everything we don’t allow”; in other words, we are creating a white list of exceptions with the previously mentioned rules.

The Results

You should end up with:

  • The health monitoring rule will allow Azure to check your WAG/WAF over a certificate-secured channel.
  • Your application rules will permit specified clients to connect to the WAG/WAF, via a hidden load balancer.
  • The load balancer can probe the WAG/WAF and forward client connections.
  • The low priority deny rule will block all other communications.

Job done!