Errors When You Add A Cert To Application Gateway Listener From Key Vault

This post deals with a situation where you attempt to add a certificate from an Azure Key Vault to a v2 Azure Application Gateway/Firewall (WAG_v2/WAF_v2). The attempt fails, and any further attempt to delete or modify the certificate fails with this error:

Invalid value for the identities ‘/subscriptions/xxxxxxx/resourcegroups/myapp/providers/Microsoft.ManagedIdentity/userAssignedIdentities/myapp-waf-id’. The ‘UserAssignedIdentities’ property keys should only be empty json objects, null or the resource exisiting property.

Application Gateway v2 and Key Vault

Azure Key Vault is the best place to store secrets in Microsoft Azure – particularly SSL certificates. Key Vault abstracts the versions of a certificate, so you can add newer versions without changing references to the older one, and it can automatically renew expiring certs from certain issuers. I also like the separation of the exposed resource from organisation secrets that you get with this approach; with the legacy method you had to upload the cert into the WAG/WAF, but WAG_v2/WAF_v2 allow you to store the certs in a Key Vault, and that limited access is granted using a user-assigned managed identity (an Azure resource, not an Azure AD resource, which makes it more agile).

The Problem

I was actually going to write a blog post about how to obtain the secret ID of a certificate from the Key Vault so you could add it to the WAG_v2/WAF_v2. But as I was setting up the lab, I realised that, during the day, Microsoft had updated the Azure Portal blade so certs were instead presented in a drop-down list box, which made my planned post pointless. I continued setting things up anyway and hit the above issue.

The Cause/Fix

When you use this architecture, WAF_v2/WAG_v2 requires that soft delete is enabled on the Key Vault – and that is the only check it performs. The default retention for Key Vault soft delete is 90 days. I was in a lab, mucking around, so I set soft delete retention on my Key Vault to 7 days – a perfectly legit value for Key Vault. However, the Application Gateway (AppGW) requires the retention to be at least 90 days … even though it never validates it!
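A quick way to check a vault before you wire it up is something like this (the vault name is a placeholder, and the retention property only shows up in more recent Az.KeyVault versions):

  # "myapp-kv" is a placeholder vault name
  $kv = Get-AzKeyVault -VaultName "myapp-kv"
  $kv.EnableSoftDelete            # must be $true for WAG_v2/WAF_v2 integration
  $kv.SoftDeleteRetentionInDays   # keep this at 90 – App Gateway expects it but never validates it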

To undo the damage, you can run the following PowerShell cmdlets (a sketch follows the list):

  • Set-AzApplicationGatewayIdentity
  • Remove-AzApplicationGatewaySslCertificate
  • Remove-AzApplicationGatewayHttpListener
  • Set-AzApplicationGateway to update the WAF
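Here is a minimal sketch of that sequence using the Az PowerShell module; the gateway, resource group, identity, listener, and certificate names are all placeholders for your own resources.

  $appGw = Get-AzApplicationGateway -Name "myapp-waf" -ResourceGroupName "myapp"

  # Re-apply the user-assigned managed identity (all edits below only modify the local
  # $appGw object until the final Set-AzApplicationGateway call pushes them to Azure)
  $appGw = Set-AzApplicationGatewayIdentity -ApplicationGateway $appGw `
      -UserAssignedIdentityId "/subscriptions/xxxxxxx/resourcegroups/myapp/providers/Microsoft.ManagedIdentity/userAssignedIdentities/myapp-waf-id"

  # Remove the listener that references the broken certificate, then the certificate itself
  $appGw = Remove-AzApplicationGatewayHttpListener -ApplicationGateway $appGw -Name "broken-listener"
  $appGw = Remove-AzApplicationGatewaySslCertificate -ApplicationGateway $appGw -Name "broken-cert"

  # Push the repaired configuration back to the WAF
  Set-AzApplicationGateway -ApplicationGateway $appGw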

Thanks to Cat in the Azure network team for the help!

Microsoft Ignite 2019 – Deliver Highly Available Secure Web Application Gateway and Web Application Firewall

Speaker:

  • Amit Srivastava, Principal Program Manager, Microsoft

Mission Critical HTTP Applications

  • Always On
  • Secure
  • Scalable
  • Telemetry
  • Polyglot – variety of backends: IaaS, PaaS, on-prem

Many things to think about.

What Azure Pieces Can We Use?

  • WAG
  • AFD
  • CDN
  • WAF
  • Azure Load Balancer
  • Azure Traffic Manager

WAG

Regional ADC (application delivery controller) as a service. A full reverse proxy: it terminates the incoming connection and creates a new one to the web server.

  • Platform managed: built-in HA and scalability
  • Layer 7 load balancing: URL path, host based, round robin, session affinity, redirection
  • Security and SSL management: WAF, SSL Offload, SSL re-encryption, SSL policy
  • Public or ILB: Public internet, internal or both.
  • Flexible backends: VMs, VMSS, AKS, public IP, cloud services, ALB/ILB, On-premises
  • Rich diagnostics: Azure monitor, log analytics, network watcher, RHC, more

Standard v2 SKU in GA

  • Available in 26 regions
  • Built-in zone redundancy
  • Static VIP
  • HTTP header/cookies insertion/modification
  • Increased scale limits 20 -> 100 listeners
  • Key vault integration and autorenewal of SSL certs (GA)
  • AKS ingress controller (GA)

Autoscaling and performance improvements:

  • Grow and shrink based on app traffic requirements
  • 5x better SSL offload performance
    • 500 – 50,000 connections/sec with RSA 2048-bit certs
    • 30,000 – 3,000,000 persistent connections
    • 2,500 – 250,000 HTTP req/sec
  • 75% reduction in provisioning time ~5mins
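For context, setting the autoscaling bounds is a one-liner against an existing gateway object; this is a hedged sketch, and the capacity values are illustrative only.

  # $appGw comes from Get-AzApplicationGateway; min/max instance counts are examples
  $appGw = Set-AzApplicationGatewayAutoscaleConfiguration -ApplicationGateway $appGw -MinCapacity 2 -MaxCapacity 10
  Set-AzApplicationGateway -ApplicationGateway $appGw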

Key Vault Integration in v2 GA

  • Front end TLS cert integrated with Azure Key Vault
  • Utilizes a user-assigned managed identity for access control on the Key Vault
  • Use certificates or secrets in Key Vault
  • Polls every 4 hours to enable automatic cert renewal – you can force a poll if you need to
  • Manual override or specific certificate version retrieval
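A rough sketch of that flow in PowerShell, assuming placeholder names ("myapp-waf", "myapp", "myapp-waf-id", "myapp-kv", "mycert") and that the identity already has GET access to the vault's secrets:

  $appGw    = Get-AzApplicationGateway -Name "myapp-waf" -ResourceGroupName "myapp"
  $identity = Get-AzUserAssignedIdentity -ResourceGroupName "myapp" -Name "myapp-waf-id"

  # Assign the user-assigned managed identity used for Key Vault access
  $appGw = Set-AzApplicationGatewayIdentity -ApplicationGateway $appGw -UserAssignedIdentityId $identity.Id

  # Reference the cert by a versionless secret ID so the 4-hourly poll picks up renewals
  $secret   = Get-AzKeyVaultSecret -VaultName "myapp-kv" -Name "mycert"
  $secretId = $secret.Id.Replace($secret.Version, "")
  $appGw = Add-AzApplicationGatewaySslCertificate -ApplicationGateway $appGw -Name "mycert" -KeyVaultSecretId $secretId

  Set-AzApplicationGateway -ApplicationGateway $appGw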

WAG v2 Header Rewrites

  • Manipulate request and response headers and cookies
    • Strip port from x-forwarded-for header
    • Add security headers like HSTS and X-XSS-Protection
    • Common header manipulation ex: HOST, SERVER
  • Conditional header rewrites … something
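As an illustration of the header rewrites (not from the session – just a hedged sketch against an existing $appGw object), adding an HSTS response header looks roughly like this:

  # Build a rewrite rule set that adds an HSTS header to every response
  $hsts    = New-AzApplicationGatewayRewriteRuleHeaderConfiguration -HeaderName "Strict-Transport-Security" -HeaderValue "max-age=31536000"
  $actions = New-AzApplicationGatewayRewriteRuleActionSet -ResponseHeaderConfiguration $hsts
  $rule    = New-AzApplicationGatewayRewriteRule -Name "add-hsts" -ActionSet $actions
  $appGw   = Add-AzApplicationGatewayRewriteRuleSet -ApplicationGateway $appGw -Name "security-headers" -RewriteRule $rule
  Set-AzApplicationGateway -ApplicationGateway $appGw

The rule set then has to be associated with a request routing rule before it takes effect.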

Ingress Controller

  • Ingress controller for 1+ AKS clusters at one time
  • Deployed using Helm – newer, easier options by EOY
  • Utilizes AAD Pod Identity for ARM authentication
  • Tighter integration with AKS add-on support upcoming
  • Supports URI-path based, host based, SSL termination, SSL re-encryption, redirection, custom health probes, draining, cookie affinity.
  • Support for Let’s Encrypt provided TLS certs
  • WAF fully supported with custom listener policies
  • Support for multiple AKS as backend
  • Support for mixed mode – both AKS and other backend types on the same application gateway.

http://aka.ms/appgawks

Application Gateway Wildcard Listener

  • Managed preview
  • Support for wildcard characters in listener host name
  • Supports * and ? characters in host name
  • Associate wildcard or SAN certs to serve HTTPS

Telemetry Enhancements

  • GA
  • Diagnostics Log Enhancements
    • TLS protocol version, cipher spec selected.
    • Backend target server, response code, latency.
  • Metrics Enhancements
    • Backend response status code
    • RPS/healthy node
    • End-to-end latency
    • Backend latency
    • Backend connect, first byte, and last byte latency.
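To actually get those logs and metrics somewhere queryable, something along these lines (a hedged sketch; $appGw and $workspace are assumed to already exist) sends them to Log Analytics:

  # Assumes $appGw (Get-AzApplicationGateway) and $workspace (Get-AzOperationalInsightsWorkspace)
  Set-AzDiagnosticSetting -ResourceId $appGw.Id -WorkspaceId $workspace.ResourceId -Enabled $true `
      -Category "ApplicationGatewayAccessLog","ApplicationGatewayPerformanceLog","ApplicationGatewayFirewallLog" `
      -MetricCategory "AllMetrics"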

Azure Monitor Insights for Application Gateway

  • Public Preview
  • Single health and metric console for your entire cloud network
  • No agent/configuration required
  • Visualize the structure and functional dependencies
  • More

AKS Demo

He loads a Helm YAML config to the AKS cluster. Now the AKS cluster can configure listeners, backend pools, rules, etc. for the containers/services running on the cluster. Pretty cool.

Azure WAF

Cloud native WAF

  • Unified WAF offering
    • Protect your apps at network edge or in region uniformly
  • Public preview:
    • Microsoft threat intelligence
      • Protect apps against automated attacks
      • Manage good/bad bots with Azure BotManager RuleSet
    • Site and URI path specific WAF policies
      • Customise WAF policies at regional WAF for finer grained protection at each host/listener or URI path level
    • Geo-filtering on regional WAF

WAF

  • HA, scalable, fully platform managed
  • Auto-scaling support
  • New RuleSet CRS 3.1 added, will soon be the default
  • Integration with Azure Sentinel SIEM
  • Performance and concurrency enhancements
  • More

WAF Policy Enhancements

  • Assign different policies to different sites behind the same WAF
  • Increased configurability
  • Per-URI policy

Geo Filtering Public Preview

  • Block, allow, log countries.
  • Easily configurable in WAF policy
  • Geo data refreshed weekly

Only available via a special Portal URI at the moment – coming to the normal Azure Portal soon.
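A hedged sketch of what a geo-filtering custom rule looks like in a WAF policy (names, countries, and priority are placeholders):

  # Match on the client address; a negated GeoMatch means "not in these countries", then block
  $var  = New-AzApplicationGatewayFirewallMatchVariable -VariableName RemoteAddr
  $cond = New-AzApplicationGatewayFirewallCondition -MatchVariable $var -Operator GeoMatch -MatchValue "IE","GB" -NegationCondition $true
  $rule = New-AzApplicationGatewayFirewallCustomRule -Name "block-outside-ie-gb" -Priority 10 -RuleType MatchRule -MatchCondition $cond -Action Block
  New-AzApplicationGatewayFirewallPolicy -Name "myapp-waf-policy" -ResourceGroupName "myapp" -Location "northeurope" -CustomRule $rule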

Bot Protection (Public Preview)

  • Stuff

Azure Availability Zones in the Real World

I will discuss Azure’s availability zones feature in this post, sharing what they can offer you and some of the things to be aware of.

Uptime Versus SLA

Noobs to hosting and cloud focus on three magic letters: S, L, A – service level agreement. This is a contractual promise that something will be running for a certain percentage of time in the billing period, or the hosting/cloud vendor will credit or compensate the customer.

You’ll hear phrases like “three nines” or “four nines” to express the measure of uptime. The first is a 99.9% measure, and the second is a 99.99% measure. Either is quite a high level of uptime. Azure does have SLAs for all sorts of things. For example, a service deployed across a valid virtual machine availability set has a connectivity (uptime) SLA of 99.95%.

Why did I talk about noobs? Promises are easy to make. I once worked for a hosting company that offered a ridiculous 100% SLA for everything, including cheap-ass generic Pentium “servers” from eBay with single IDE disks. 100% is an unachievable target because … let’s be real here … things break. Even systems with redundant components have downtime. I prefer to see realistic SLAs and honest statements on what you must do to get that guarantee.

Azure gives us those sorts of SLAs. For virtual machines we have:

  • 99.9% for single machines with just Premium SSD disks
  • 99.95% for services running in a valid availability set
  • 99.99% for services running across multiple availability zones

Ah… let’s talk about that last one!

Availability Sets

First, we must discuss availability sets and what they are before we move one step higher. An availability set is a form of anti-affinity, a feature also found in vSphere and Hyper-V Failover Clustering (via PowerShell or SCVMM): a label on a virtual machine that instructs the compute cluster to spread the labeled virtual machines across different parts of the cluster. In Azure, virtual machines in the same availability set are placed into different:

  • Update domains: Avoiding downtime caused by (rare) host reboots for updates.
  • Fault domains: Enable services to remain operational despite hardware/software failure in a single rack.

The above solution spreads your machines around a single compute (Hyper-V) cluster, in a single room, in a single building. That’s amazing for on-premises, but there can still be an issue. Last summer, a faulty humidity sensor brought down one such room and affected a “small subset” of customers. “Small subset” is OK, unless you are included and some mission critical system was down for several hours. At that point, SLAs are meaningless – a refund for the lost runtime cost of a pair of Linux VMs running network appliance software won’t compensate for thousands or millions of Euros of lost business!

Availability Zones

We can go one step further by instructing Azure to deploy virtual machines into different availability zones. A single region can be made up of different physical locations with independent power and networking. These locations might be close together, as is typically the case in North Europe or West Europe. Or they might be on opposite sides of a city, as is the case with some North American regions. There is low latency between the buildings, but it is still higher than that of a LAN connection.

A region that supports availability zones is split into 4 zones. You see three zones (assigned round robin between customers), labeled 1, 2, and 3. You can deploy many services across availability zones – and the list is growing:

  • VNet: Software-defined, so it can cross all zones in a single region.
  • Virtual machines: Can connect to the same subnet/address space but be in different zones. They are not in availability sets but Azure still maintains service uptime during host patching/reboots.
  • Public IP Addresses: Standard IP supports anycast and can be used to NAT/load balance across zones in a single region.
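As a hedged example of what that looks like in practice (placeholder names; the simplified New-AzVM parameter set creates default NIC/NSG/public IP resources for you):

  $cred = Get-Credential

  # Two VMs in the same VNet/subnet, pinned to different availability zones
  New-AzVM -ResourceGroupName "myapp" -Name "myapp-vm1" -Location "northeurope" `
      -VirtualNetworkName "myapp-vnet" -SubnetName "web" -Credential $cred -Zone 1
  New-AzVM -ResourceGroupName "myapp" -Name "myapp-vm2" -Location "northeurope" `
      -VirtualNetworkName "myapp-vnet" -SubnetName "web" -Credential $cred -Zone 2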

Other network resources can work with availability zones in one of two ways:

  • Zonal: Instances are deployed to a specific zone, giving optimal latency performance within that zone, but can connect to all zones in the region.
  • Zone redundant: Instances are spread across the zones for an active/active configuration.

Examples of the above are:

  • The zone-aware VNet gateways for VPN/ExpressRoute
  • Standard load balancer
  • WAGv2 / WAFv2

Considerations

There are some things to consider when looking at availability zones.

  • Regions: The list of regions that support availability zones is growing slowly, but it is far from complete. Some regions will not offer this highest level of availability.
  • Catchup: Not every service in Azure is aware of availability zones, but this is changing.

Let me give you two examples. The first is VM Boot Diagnostics, a service that I consider critical for seeing the console of the VM and getting serial console access without a network connection to the virtual machine. Boot Diagnostics uses an agent in the VM to write to a storage account. That storage account can be:

  • LRS: 3 synchronous replicas reside in a single storage cluster, in a single room, in a single building (availability zone).
  • GRS: LRS plus 3 asynchronous replicas in the paired region, which are not available for writes unless Microsoft declares a total disaster in the primary region.

So, if I have a VM in zone 1 and a VM in zone 2, and both write to a storage account that happens to be in zone 1 (I have no control over the storage account location), and zone 1 goes down, there will be issues with the VM in zone 2. The solution would be to use ZRS GPv2 storage for Boot Diagnostics; however, the agent does not support this type of storage configuration. Gotcha!
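A quick, hedged way to see which storage account a VM’s Boot Diagnostics writes to, and how that account is replicated (names are placeholders):

  $vm = Get-AzVM -ResourceGroupName "myapp" -Name "myapp-vm1"
  $vm.DiagnosticsProfile.BootDiagnostics.StorageUri   # note the account name in the URI

  # List the accounts in the resource group with their replication SKU (LRS/GRS/ZRS)
  Get-AzStorageAccount -ResourceGroupName "myapp" |
      Select-Object StorageAccountName, @{ n = "Sku"; e = { $_.Sku.Name } }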

Azure Advisor will also be a pain in the ass. Noobs are told to rely on Advisor for configuration and deployment advice (it features in several questions in the new Azure infrastructure exams). Advisor will flag the above two VMs as not being highly available because they are not (and cannot be) in a common availability set, so it advises you to degrade their SLA by migrating them into a single zone in an availability set configuration – ignore that advice and be prepared to defend the decision to Azure noobs, such as management, auditors, and ill-informed consultants.

Opinion

Availability zones are important – I use them in an architecture pattern that I am working on with several customers. But you need to be aware of what they offer, and that certain services do not understand or support them yet.