Virtual WAN Is Not Required For SD-WAN

Did you know that you do not need to use Virtual WAN to implement an SD-WAN with Azure? In fact, contrary to the recommendations from Microsoft, Virtual WAN might be the worst way to add Azure networks to an SD-WAN.

My History With Virtual WAN

You might think that the introduction of this post paints me as a complete hater who has never given Virtual WAN a chance. I have. In fact, I can point out features that some of my 1:1 feedback calls probably contributed to. I’ve implemented Virtual WAN with customers.

However, I’ve seen the problems. I’ve seen that the reality doesn’t always live up to the hype. I’ve personally experienced the lack of troubleshooting capabilities, which left me relying on my deep understanding of the hidden networking. I’ve seen colleagues struggle with the complexity. I’ve seen how some customers’ routing requirements cannot be met with Virtual WAN. And many architectural features that some organisations require cannot be deployed with Virtual WAN.

I concluded that my time with Virtual WAN was over during a proof of concept that I insisted a customer do. They had previously used Virtual WAN without a firewall. I was asked to build a new multi-region Azure environment (multiple hubs) with firewalls. I was not sure that it would go well – this was before routing intent was in preview. I tested and confirmed that Virtual WAN was not going to work; the customer implemented a Meraki SD-WAN using Virtual Network-based hubs and lost no functionality. In fact, they gained functionality.

In an older case, I convinced a customer to go with Virtual WAN. I regret this one. There was a lot of hype. They used Meraki. There was a solution from Meraki to integrate with the Virtual WAN VPN Gateway. We found bugs in the script and fixed them. But the most annoying thing about that solution was that every time the customer changed anything in the SD-WAN, every VPN tunnel to Azure was torn down and recreated. I heard recently that the customer is looking to remove SD-WAN. I don’t blame them, and I regret ever recommending it to them.

The Microsoft Claims

The Azure Cloud Adoption Framework incorrectly states the following:

Use a Virtual WAN topology if any of the following requirements apply to your organization:

  • Your organization intends to deploy resources across several Azure regions and requires global connectivity between virtual networks in these Azure regions and multiple on-premises locations.
  • Your organization intends to use a software-defined WAN (SD-WAN) deployment to integrate a large-scale branch network directly into Azure, or requires more than 30 branch sites for native IPSec termination.
  • You require transitive routing between a virtual private network (VPN) and Azure ExpressRoute. For example, if you use a site-to-site VPN to connect remote branches or a point-to-site VPN to connect remote users, you might need to connect the VPN to an ExpressRoute-connected DC through Azure.

I will burst those bubbles one by one.

Several Regions & Global Connectivity

Do you want to deploy across multiple regions? Not a problem. You can very easily do that with Virtual Network-based hubs. I’ve done it again and again.

Do you want to connect the spokes in different regions? Yup, also easy:

  • Build each hub-and-spoke from a single IP prefix.
  • Your spokes already route via the hub.
  • Peer the hubs.
  • Create User-Defined Routes in each firewall subnet (you will be using firewalls in this day and age) to route to remote hub-and-spoke IP prefixes via the remote hub firewalls.

Job done! The only additional steps were:

  • Peer the hubs
  • Add UDRs to each firewall subnet for each remote hub-and-spoke IP prefix

You do that once. Once!
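
As a rough sketch of those two extra steps in ARM terms, the fragment below peers hub A to hub B and adds one user-defined route to the route table on hub A's firewall subnet. Every name, address prefix, IP address, and the API version is a made-up placeholder, not taken from a real deployment. The mirror-image peering and route would also be needed on the hub B side.

```json
"resources": [
  {
    // Hypothetical peering from hub A to hub B; hub B needs the matching peering back
    "type": "Microsoft.Network/virtualNetworks/virtualNetworkPeerings",
    "apiVersion": "2023-09-01",
    "name": "hub-a-vnet/to-hub-b",
    "properties": {
      "remoteVirtualNetwork": {
        "id": "<<<<resource ID of the hub B virtual network>>>>"
      },
      "allowVirtualNetworkAccess": true,
      // Needed because the firewalls forward traffic that originates in the spokes
      "allowForwardedTraffic": true
    }
  },
  {
    // Hypothetical route table associated with the hub A firewall subnet
    // (any other routes it already holds are omitted here)
    "type": "Microsoft.Network/routeTables",
    "apiVersion": "2023-09-01",
    "name": "hub-a-firewall-rt",
    "location": "[resourceGroup().location]",
    "properties": {
      "routes": [
        {
          "name": "to-hub-b-prefix",
          "properties": {
            // Assumed single prefix covering hub B and all of its spokes
            "addressPrefix": "10.2.0.0/16",
            "nextHopType": "VirtualAppliance",
            // Assumed private IP address of the hub B firewall
            "nextHopIpAddress": "10.2.0.4"
          }
        }
      ]
    }
  }
]
```

Because each hub-and-spoke is built from a single prefix, one route per remote hub is enough; new spokes inside that prefix need no further changes.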

How about connecting the remote sites? Simples: you connect them as usual.

There is some marketing material about how we can use the Microsoft WAN as the company WAN using vWAN. Yes, in theory. The concept is that the Microsoft Global WAN is amazing. You VPN from site A (let’s say Oslo, Norway) to a local Azure region and you VPN from site B (let’s say Houston, Texas) to a local Azure region. Then vWAN automatically enables Oslo <> Houston connectivity over the Microsoft Global Network. Yes, it does. And the performance should be amazing. I did a proof-of-concept in 2 hours with a customer. The performance of a VPN directly between Oslo <> Houston was much better. Don’t buy the hype! Question it and test. And by the way, we can build this with VNets too – I was told by an MS partner that they built this solution between two sites on different continents years before vWAN existed.

SD-WAN

Microsoft suggests that you can only add Azure networks to an SD-WAN if you use Virtual WAN.

Here’s some truth. Under the covers, a vWAN hub is built on a traditional Virtual Network. Then you can use (don’t) a VPN Gateway or a third-party SD-WAN appliance for connectivity.

The list of partners supporting vWAN was greatly increased recently – I remember looking for Meraki support a few months ago, and it was not there (it is now). But guess what, I bet you that every one of those partners offers the exact same solution for Virtual Networks via the Marketplace. And I bet:

  • There are more partner options
  • There are no trade-offs
  • The resilience is just the same

I have done Azure/Meraki SD-WAN twice since the above customer X. In both cases, we went with the Azure Marketplace and Virtual Network. And in both cases, it was:

  • Dead simple to set up.
  • It worked the first time.

Transitive Routing

Virtual WAN is powered by a feature that is hidden unless you do an ARM export. That feature is Azure Route Server. Did you know:

  • You can deploy Azure Route Server to a Virtual Network. The deployment is a next-next-next.
  • It can be easily BGP peered with a third-party networking appliance.
  • The Azure Route Server will learn remote site prefixes from the networking appliance/SD-WAN.
  • The Azure Route Server will advertise routes to the networking appliance/SD-WAN.
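
To give a feel for how small the peering configuration is, here is a hedged sketch of a BGP peer defined as a child resource of a Route Server in an ARM template. The resource names, the peer IP address, the peer ASN, and the API version are all illustrative assumptions.

```json
{
  // Hypothetical names: "hub-routeserver" is the Route Server, "sdwan-nva-1" is the peer
  "type": "Microsoft.Network/virtualHubs/bgpConnections",
  "apiVersion": "2023-09-01",
  "name": "hub-routeserver/sdwan-nva-1",
  "properties": {
    // Assumed private ASN of the appliance; it must differ from the Route Server's 65515
    "peerAsn": 65010,
    // Assumed private IP address of the appliance's interface in the hub
    "peerIp": "10.1.2.4"
  }
}
```

The appliance side still needs its own BGP configuration pointing at both Route Server instance IP addresses; that part is vendor-specific and not shown.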

Azure Route Server BGP propagation is managed using the same VNet peering settings as Virtual Network Gateway.

There is a single checkbox (true/false property) to enable transitive routing between VPN/ExpressRoute remote sites. And that setting is amazing.
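
In template form, that checkbox is one boolean on the Route Server resource. The sketch below is a minimal fragment under assumed names and an assumed API version. Note the resource type, which Route Server shares with the Virtual WAN hub. The child ipConfigurations resource that binds the Route Server to its subnet is omitted.

```json
{
  "type": "Microsoft.Network/virtualHubs",
  "apiVersion": "2023-09-01",
  // Hypothetical Route Server name
  "name": "hub-routeserver",
  "location": "[resourceGroup().location]",
  "properties": {
    "sku": "Standard",
    // The "branch-to-branch" setting: exchange routes between VPN/NVA peers and ExpressRoute
    "allowBranchToBranchTraffic": true
  }
}
```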

I signed in to work one day and was asked a question. I had built out the environment for a large customer with an HQ in Oslo:

  • Remote sites around the world with a Meraki SD-WAN.
  • Leased line to Oracle Cloud – the global sites backhauled through Oslo.
  • The VNet-based hub in Azure was added to the SD-WAN. All offices were connected directly to Azure via VPN.
  • Azure Route Server was added and peered to the Meraki SD-WAN.
  • Azure had an ExpressRoute connection (Oracle Cloud Interconnect) to Oracle Cloud.

An excavator had torn up the leased line to Oracle. The essential services in Oracle Cloud were unavailable. I was asked if the Azure connection to Oracle Cloud could be leveraged to get the business back online. I thought for 30 seconds and said, “Yes, give me 5 minutes”. Here’s what I did:

  1. I checked the box to enable transitive routing in Azure Route Server.
  2. I clicked Save/Apply and waited a few minutes for the update task.
  3. I asked the client to test.

And guess what? Contrary to the above CAF text, the client was back online. A few weeks later, I was told that not only did they get back online, but the SD-WAN connection to the VIRTUAL NETWORK-BASED hub in Azure gave the global branch offices lower latency connections than their backhaul through Oslo to Oracle Cloud. Whoda-thunk-it?

vWAN is PaaS

One of the arguments for the vWAN hub is that it pushes complexity down into the platform; it’s a PaaS sub-resource.

Yes, it’s a PaaS sub-resource. Is a well-designed hub complex? A hub should contain very few resources, based around:

  • Remote connectivity resource
  • Firewall
  • Maybe Azure Bastion

There’s not much more to a hub than that if you value security. What exactly am I saving with the more-expensive vWAN?

Limitations of vWAN

Let’s start with performance. A hub in Virtual WAN has a throughput limitation of 50 Gbps. I thought that was a theoretical limit … until I did a network review for a client a few years ago. They had a single workload that pushed 29 Gbps through the hub, 1 Gbps shy of the limit for a Standard tier Azure Firewall. I recommended an increase to the 100 Gbps Premium tier, but warned that the bottleneck was always going to be the vWAN hub.

The architectural limitations of vWAN are many – so many that I will miss some:

  • No VNet Flow Logs
  • Impossible to troubleshoot routing/connectivity in a real way
  • No support for Azure Bastion in the hub
  • No support for NAT Gateway for firewall egress traffic (SNAT port exhaustion)
  • Secured traffic between different secured (firewall) hubs requires Routing Intent
  • No Forced Tunnelling in Azure Firewall without Routing Intent
  • Routing Intent is overly simplistic – everything goes through the firewall
  • No support for IP Prefix for the firewall
  • Azure Firewall cannot use Route Server Integration (auto-configuration of non-RFC1918 usage in private networks)
  • Hub Route Tables are a complexity nightmare

Impossible Solution

Anyone who has deployed more than a couple of Azure networks has heard the following statement made regarding failing connections over site-to-site networking:

The Azure network is broken

A new site-to-site appliance or firewall has been placed in Azure, and the root cause of the issue is “never the remote network”.

Proving that the issue isn’t the firewall can be tricky. That’s because firewall appliances are black boxes. I updated my standard hub design last year to assist with this:

  • Add a subnet with identical routing configuration (BGP propagation and user-defined routes) as the (private) firewall subnet.
  • Add a low-spec B-series VM to this subnet with an autoshutdown. This VM is used only for diagnostics.

The design allows an Azure admin to log into the VM. The VM mimics the connectivity of the firewall and allows tests to be done against failing connections. If the test fails from the VM, it proves that the firewall is not at fault.

No other compute resources are placed in the hub.
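
One way to sketch this design in an ARM template is shown below. It assumes that the diagnostics subnet simply shares the route table used by the private firewall subnet (which carries both the user-defined routes and the BGP propagation setting) and that auto-shutdown is done with a DevTest Labs schedule. The names, address prefix, shutdown time, and API versions are placeholders, and the VM resource itself is omitted.

```json
"resources": [
  {
    // Hypothetical diagnostics subnet in the hub virtual network
    "type": "Microsoft.Network/virtualNetworks/subnets",
    "apiVersion": "2023-09-01",
    "name": "hub-vnet/DiagnosticsSubnet",
    "properties": {
      "addressPrefix": "10.1.3.0/28",
      "routeTable": {
        // Assumption: reuse the firewall subnet's route table so routing is identical
        "id": "<<<<resource ID of the route table used by the private firewall subnet>>>>"
      }
    }
  },
  {
    // Auto-shutdown schedule; the name must follow the shutdown-computevm-<VM name> pattern
    "type": "Microsoft.DevTestLab/schedules",
    "apiVersion": "2018-09-15",
    "name": "shutdown-computevm-hub-diag-vm",
    "location": "[resourceGroup().location]",
    "properties": {
      "status": "Enabled",
      "taskType": "ComputeVmShutdownTask",
      "dailyRecurrence": {
        "time": "1900"
      },
      "timeZoneId": "UTC",
      "targetResourceId": "[resourceId('Microsoft.Compute/virtualMachines', 'hub-diag-vm')]"
    }
  }
]
```

Sharing the route table rather than copying it means the diagnostics VM cannot drift out of step with the firewall's routing when routes are changed later.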

Here’s the gotcha. I can do this in a VNet hub. I cannot do this in a vWAN hub. The vWAN hub Virtual Network is in a Microsoft-managed tenant/subscription. You have no access, and you cannot troubleshoot it. You are entirely at the mercy of Azure support – and, sadly, we know how that process will go.

Virtual WAN In Summary

You do not need Virtual WAN for connectivity or SD-WAN. So why would one adopt it instead of VNet-based hubs, especially when you consider costs and the loss of functionality? I just do not understand (a) why Microsoft continues to push Virtual WAN and (b) why it continues to exist.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed between a VNet-based hub in Azure – always my preference because of the scalability, security/governance concepts, and the superiority over Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer had two clouds:

  • The other vendor via a leased line.
  • Azure via SD-WAN.
  • And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client was over months previously. I got a message early one morning from a colleague. The client was having a serious networking issue and could I get online. The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other Cloud were unavailable:

  • App integration and services with the on-premises data centre.
  • App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route Server is to enable branch-to-branch connectivity between “on-premises” locations across ExpressRoute and VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the above Microsoft diagram makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

  • The “on-premises” location via ExpressRoute
  • The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other Cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.

The Results

I had not heard anything in a while. This morning I heard that the client was happy with the fix. In fact, user experience was faster.

Go back to the original diagram before Azure and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route is a “backhaul”:

  1. SD-WAN to central data centre
  2. Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

  1. SD-WAN to Azure
  2. A very short distance hop over ExpressRoute

Latency was reduced quite a bit so user experience improved. On the other hand, latency between the on-premises data centre and the other cloud has increased because the SD-WAN is a new hop, but at least the path is available. The original leased line is still down after a few weeks – this is not the fault of the client!

Some Considerations

Ideally one would have two leased lines in place for failover. That incurs costs and it was not possible. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the leased line is repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take its place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we do keep branch-to-branch connectivity then we need to consider “what is the best path”? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

  • (Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be used. This would offer sub-optimal routes because on-premises prefixes will be received from the other cloud.
  • Prefer VPN: Any routes received over VPN will be used. This would offer sub-optimal routes because other cloud prefixes will be received from on-premises.
  • Use AS path: Let the admin/network advertise a preferred path. This would offer the desired control – “use this path unless something goes wrong”.
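
For illustration, the routing preference is a single property on the Route Server resource. The fragment below is a sketch with an assumed resource name and API version; it selects the AS path option and keeps branch-to-branch enabled.

```json
{
  "type": "Microsoft.Network/virtualHubs",
  "apiVersion": "2023-09-01",
  // Hypothetical Route Server name
  "name": "hub-routeserver",
  "location": "[resourceGroup().location]",
  "properties": {
    "sku": "Standard",
    "allowBranchToBranchTraffic": true,
    // Allowed values: "ExpressRoute" (default), "VpnGateway", "ASPath"
    "hubRoutingPreference": "ASPath"
  }
}
```

With AS path selected, the network team would steer traffic by AS path prepending on the less-preferred advertisement, for example making the route via Azure look longer than the leased line.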

Recording – Introducing Azure Virtual WAN

Here is a video that I recorded last week called Introducing Azure Virtual WAN.

I was scheduled to do a live presentation for the (UK) Northern Azure User Group (NAUG). All was looking good … until my wife went into labour 5 weeks early! We welcomed healthy twin girls and my wife is doing well – all are home now. But at the time, I was clocking up lots of miles to visit my wife and new daughters in the evening. The scheduled online user group meeting was going to clash with one of my visits.

I reached out to the organiser, Matthew Bradley (a really good and smart guy – and someone who should be an MVP IMO), and explained the situation. I offered to record my presentation for the user group. So that’s what I did – I deliberately did a 1-take recording and didn’t do the usual editing to clean up mistakes, coughs, actually’s and hmms. I felt that the raw recording would be more like what I would be like if I was live.

The feedback was positive and I was asked if I would share the video. So here you go:

Azure Virtual WAN ARM – The Chicken & Egg Gateway ID Discombobulation

This post will explain how to deal with the gateway ID properties in the Azure Microsoft.Network/virtualhubs resource when using ARM templates.

Background

The Azure WAN Hub is capable of having 3 gateway sub-resources:

  • Point-to-site VPN: Microsoft.Network/p2sVpnGateways
  • VPN (site-to-site): Microsoft.Network/vpnGateways
  • ExpressRoute: Microsoft.Network/expressRouteGateways, which does not support diagnostic settings in the 2020-04-01 API

As you would expect, when you create these resources, you have to supply them with the resource ID of the Microsoft.Network/virtualhubs resource:

"virtualHub": {
  "id": "<<<<resource ID of the virtual hub>>>>"
},

What is a surprise is what happens in the Microsoft.Network/virtualhubs resource. After a gateway is associated, a property (type object, presumably for future-proofing) for the associated gateway type is added to the hub:

"vpnGateway": {
  "id": "<<<< Resource ID of Microsoft.Network/vpnGateways resource>>>>"
},
"expressRouteGateway": {
  "id": "<<<< Resource ID of Microsoft.Network/expressRouteGateways resource>>>>"
},
"p2SVpnGateway": {
  "id": "<<<< Resource ID of Microsoft.Network/p2sVpnGateways resource>>>>"
},

The surprise is what these properties mean for every later deployment of the hub.

The Problem

There are 3 possible states in the hub when it comes to each gateway:

  1. The hub exists without a gateway: The above hub properties are not required.
  2. The gateways are being added: The above hub properties cannot be added because the gateway resource ID points to a resource that does not exist yet – the hub must exist and be configured before the gateway(s).
  3. The gateways exist: Any re-run of the ARM template (which might be common to update the hub route tables or configuration via DevOps) must include the above gateway properties in the hub resource with the correct resource IDs for the gateways.

States 2 and 3 are where the chicken and egg come in for an ARM template. You must supply the gateway resource ID in the hub for all updates to the hub after a gateway is deployed, and you must not include the gateway resource ID in the hub when deploying the gateway. This would be easy to deal with if ARM would (finally) give us an “ifexists()” function but there is no sign of that. So we need a hack solution.

The Hack Solution

This one comes from the Well-Architected Framework/Cloud Adoption Framework, Enterprise-Scale Architecture. This way-too-complicated beastie shows how Microsoft’s people are dealing with the issue. The JSON for the Microsoft.Network/virtualhubs template contains these properties:

"properties": {
  "virtualWan": {
    "id": "[variables('vwanresourceid')]"
  },
  "addressPrefix": "[parameters('vHUB').addressPrefix]",
  "vpnGateway": "[if(not(empty(parameters('vHUB').vpnGateway)),parameters('vHUB').vpnGateway, json('null'))]"
}

The key for dealing with vpnGateway is the vHUB parameter, an object that contains a value called vpnGateway.

When they first run the deployment, the value of vHUB.vpnGateway is set to {} or null in the parameters file, stored in GitHub. That means that when the hub is first deployed (and there is no VPN gateway), the if statement in the above snippet will pass json('null') to the vpnGateway property. That is acceptable to the resource provider and the hub will deploy cleanly. Later on in the deployment, the VPN gateway will be created.
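
As an illustration of that first-run state, the relevant part of the parameters file might look like the fragment below. Only addressPrefix and vpnGateway come from the template snippet above; the placeholder value and the absence of any other vHUB members are assumptions.

```json
"parameters": {
  "vHUB": {
    "value": {
      "addressPrefix": "<<<<address prefix of the virtual hub>>>>",
      "vpnGateway": {}
    }
  }
}
```

With an empty object, empty() evaluates to true, so the template expression falls through to json('null') and no gateway reference is sent to the hub.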

If you were to just re-run the hub template now, you would get an error about not being allowed to change the vpnGateway property in the hub resource. Behind the scenes, the property has been updated by the VPN gateway deployment. Every execution of the hub template must now include the resource ID of the VPN Gateway – that sucks, right? Now the hack really kicks in.

After the first deployment of the hub (and the VPN Gateway), you must open the resource group in the Azure Portal, enable viewing hidden items, open the VPN Gateway resource, go to properties, and document the resource ID.

Now, you need to open the parameters file for the hub. Edit the vHUB.vpnGateway property and set it to:

"vpnGateway": { 
 "id": "<<<< Resource ID of Microsoft.Network/vpnGateways resource>>>>"
},

Now you can cleanly re-run the hub template.
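
The Enterprise-Scale snippet only covers vpnGateway. If a hub also has ExpressRoute or point-to-site gateways, the same conditional could presumably be repeated for the other two properties. The following is an untested sketch that assumes matching expressRouteGateway and p2SVpnGateway members have been added to the vHUB parameter object and that the resource provider accepts null for them in the same way.

```json
"properties": {
  "virtualWan": {
    "id": "[variables('vwanresourceid')]"
  },
  "addressPrefix": "[parameters('vHUB').addressPrefix]",
  "vpnGateway": "[if(not(empty(parameters('vHUB').vpnGateway)),parameters('vHUB').vpnGateway, json('null'))]",
  // Assumption: these two members are not in the original template and must be added to vHUB
  "expressRouteGateway": "[if(not(empty(parameters('vHUB').expressRouteGateway)),parameters('vHUB').expressRouteGateway, json('null'))]",
  "p2SVpnGateway": "[if(not(empty(parameters('vHUB').p2SVpnGateway)),parameters('vHUB').p2SVpnGateway, json('null'))]"
}
```

Each member would start as {} in the parameters file and be switched to an object containing the gateway's resource ID after that gateway's first deployment, exactly as described for the VPN gateway.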

How Should It Work?

The best solution would be if the gateway ID properties were just documentation for Azure, properties that we humans cannot edit. But I suspect that the ability to configure these settings might have something to do with the newly announced NVA-in-hub preview. Otherwise, ARM needs to finally give us an ifexists() function – vote here now if you agree.