Terrafying Azure – A Tale From The Dark Side

This post is a part of the Azure Back to School 2023 online event. In this post, I will discuss using Microsoft Azure Export for Terraform, also known as Aztfexport and previously known as Azure Terrafy (a great name!), to create Terraform code from existing Azure deployments, why you would do it, and share a few tips.

Terraform

Terraform is one of a few Infrastructure-as-Code (IaC) languages out there that support Microsoft Azure. You might wonder why I would use it when Azure has ARM and Bicep. I’ll do a quick introduction to Terraform and then explain my reasoning which you are free to disagree with 🙂

Terraform is a product of Hashicorp available as a free-to-use product that is supported with some paid-for services. Like other IaC languages, it describes and desired end result. The major feature that differs from the native Azure languages is the use of state files – a file that describes what is deployed in Azure. This state file has a few nice use cases, including:

  • The outputs of a resource are documented, enabling effortless integration between resources in the same or even different files – with some effort, outputs from different deployments can be included in another deployment.
  • A true what-if engine that (mostly) works, unlike the native what-if in Azure, greatly reducing the time required for deployments and the ability to plan (pre-review) a deployment’s expected changes.

My first encounter with Terraform was a government project where the customer wanted to use Terraform over Bicep. Their reasoning was that elected politicians come and go, and suppliers come and go. If they were going to invest in an IaC skillset, they wanted the knowledge to be transferrable across clouds.

That’s the big advantage of Terraform. While the code itself is not cloud portable, the skill is. Terraform uses providers to be able to manage different resource types. Azure is a provider, written by Microsoft. Azure AD is a provider – ARM/Bicep still do not support Azure AD! AWS and GCP have providers. VMware has a provider. GitHub has a provider – the list goes on and on. If a provider does not exist, you can (in theory) write your own.

On that project, I was meant to be hands-off as an architect. But there were staffing and scheduling issues so I stepped up. Having never written a line of Terraform before I had my first workload, with some review help from a teammate, written in under a day. By the way, the same thing in Bicep took three days! Terraform is really well documented, with lots of examples, and the language makes sense.

Unlike Bicep, which is still beholden to a lot of the complexity of ARM. Doing simple things can involve stupidly complicated functions that only a C programmer (I used to be one) could enjoy (and I didn’t). I got hooked on Terraform and convinced my colleagues that it was a better path than Bicep, which was our original plan to replace ARM/JSON.

Aztfexport

Switching Terraform creates a question – what do we do with our existing workloads which are either deploying using Click Ops (Portal), script, or ARM/Bicep?

Microsoft has created a tool called Azure Export for Terraform (Aztfexport) on GitHub. The purpose of this tool is to take an existing resource group/resource/Graph query string and export it as Terraform code.

The code that is produced is intended to be used in some other way. In other words, Microsoft is not exporting code that should be able to immediately deploy new resources. They say that the produced code should be able to pass a terraform plan where the existing resources are compared with the state file and the code and say “the code is clean and there are no changes required”.

The Terraform configurations generated by aztfexport are not meant to be comprehensive and do not ensure that the infrastructure can be fully reproduced from said generated configurations. For details, please see limitations).

Azure/aztfexport: (github.com)

Why Use Aztfexport?

If I can’t use the code to deploy resources then what value is it? Hopefully you will see what aztfexport is a central part of my toolkit. I see it being useful in the following ways:

  • Learning Terraform: If you’ve not used Terraform before then it’s useful to see how the code can be produced, especially from resources that you are already familiar with.
  • Creating TF for an existing workload: You need to “terrafy” a resource/resource group and you want a starting point.
  • Azure-to-Azure migrations: You have a set of existing resources and you want to get a dump of all the settings and configurations.
  • Learning how a resource type/solution is coded: My favourite learning method is to follow the step-by-step and then inspect the resource(s) as code.
  • Understand how a resource type/solution works: This is a logical jump from the previous example, now including more resources as a whole solution.
  • Auditing: Comparing what is there with what should be there – or not there.
  • Documentation: The best form of resource documentation is IaC – why create lengthy documentation when the code is the resource?

I did use Aztfexport to learn Terraform more. In my current project, I have used it again and again to do Azure-to-Azure migrations, taking legacy ClickOps deployments and rewriting them as new secure/governed deployments. I’ve save countless hours capturing settings and configurations and re-using them as new code.

The Bad Stuff

Nothing is perfect, and Aztfexport has some thorns too. Notice that the expected usage is that the produced code should pass a terraform plan. That is because in many situations (like with ARM exports) the code is not usable to deploy resources. That can be because:

  • ARM APIs do not expose everything, so how can Terraform get those settings?
  • The tool or the providers using used do not export everything.

One example I’ve seen includes App Services configurations that do not include the code type details. Another recent one was with WAF Policies overridden WAF rules were not documented. In both cases, the code would pass a plan. But neither would re-produce the resources. I’ve learned that I do need to double-check things with a resource type that I’ve never worked with before – then I know what to go and manually grab either from an ARM export or a visual inspection in the Portal.

Another thing is that the resources are named by a “machine” – there is no understanding of the role. Every resource is res-1, res-2, and so on, no matter the type or the role in the workload. That is a bit anonymous, but I find that useful when inspecting dependencies between resources.

A giant main.tf file is created, which I break up into many smaller files. I can find relationships based on those easy-to-track dependencies and logically group resources where it suits my coding style.

One feature of TF is the easy reuse of resource IDs. One can easy refer to resource_type.resource_name.id in a property and know that the resource ID of that resource will be used. Unfortunately, some Aztfsexport code doesn’t do that so you get static resource IDs that should be replaced – that happens with other properties of resources too, so that all should be cleaned up to make code more reusable.

Installing Aztfexport

You will need to install Terraform – I prefer to use a Package Manager for that – the online instructions for a manual installation are a mess. You will also require Azure CLI.

The full instructions for installing Aztfexport are shared on GitHub, covering Windows, MacOS and Linux. The Windows installation is easy:

winget install aztfexport

You will need to restart your terminal (Windows) to get an updated Path variable so the aztfexport binary can be found.

Before you use aztfexport, you will need to log in using Azure CLI:

Open your terminal

Login:
az login

Change subscription:
az account set -subscription <subscription ID>

Verify the correct subscription was selected by checking the resource groups:
az group list

Create an empty folder on your PC and navigate to that folder in your terminal. The aztfexport tool requires an empty folder, by default, to create an export including all the required provider files and the generated code.

If you want to create an export of a single resource then you can run:

aztfexport resource <resource ID>

If you want to create an export of a resource group, then you can run:

aztfexport resource-group -n <resource group name>

Not the -n above means “don’t bother me with manual confirmation of what resources to include in the export”. In Terraform, sub-resources that can be managed as their own Terraform resources would otherwise need to be confirmed and that gets pretty tiresome pretty fast.

Tips

I’ve got to hammer on this one again, the produced code is not intended for deployment. Take the code, copy and paste it into new files and clean it up.

If your goal is to take over an existing IaC/ClickOps deployment with Terraform then you are going to have some fun. The resources already exist and Terraform is going to be confused because there is no state file. You will have to produce a state file using Terraform export for every resource definition in your code. That means knowing the resource IDs of everything, including Azure AD objects, role assignments, and sub-resources. You’ll need to understand the format of those resource IDs – use an existing state file for that. Often the resource ID is the simple Azure resource ID, or a derivation of a parent resource ID that you can figure out from another state file. Sometimes you need to wander through Azure AD (look at assignments in scopes that you do have access to if you don’t have direct Azure AD rights), use Azure CLI to “list” resources or items, or browse around using Resource Explorer in the Azure Portal.

Do take some time to compare your code with any previous IaC code or with an ARM export. Look for things that are missing – Terraform has many defaults that won’t be included and that code is missing because it is not required. I often include that code because I know that they are settings that Devs/Ops might want to tune later.

If you have the misfortune of having to work an existing Terraform module library then you will have to translate the exported code as parameter/variable files for the new code – I do not envy you 🙂

Summary

This post is an introduction to Microsoft Azure Export for Terraform and a quick how-to-get-started guide. There is much more to learn about, such as how to use a custom backend (if resource names in Terraform are not a big deal and to eliminate the terraform import task) or even how to use a resource map to identify resources to export across many resource groups.

The tool is not perfect but it has saved me countless hours over the last year or so, dating back to when it was called Azure Terrafy. It’s one in my toolkit and I regularly break it out to speed up my work. In my opinion, anyone starting to work with Terraform should install and use this tool.

Azure Infrastructure Announcements – August 2023

This post brings you a summary of the infrastructure announcements from Azure that were made during August 2023. There are lots of announcements from Storage and a few interesting notes for VMs, networking, and ASR.

Storage

Azure Managed Lustre: not your grandparents’ parallel file system

With a few clicks of a web interface or an Azure Resource Manager template, AMLFS lets you provision an all-flash Lustre file system in minutes. What’s different is that this Lustre file system is all yours. If someone else in Azure is running a job that creates a million files, you won’t ever know it because your Lustre servers and SSDs are exclusively yours.

Massively scaled and high performance file systems for HPC workloads.

General availability | Azure NetApp Files: SMB Continuous Availability (CA) shares

To enhance resiliency during storage service maintenance operations, SMB volumes used by Citrix App Layering, FSLogix user profile containers and Microsoft SQL Server on Microsoft Windows Server can be enabled with Continuous Availability

SMB Transparent Failover means that clients should not notice maintenance operations.

Public preview: Azure Storage Mover support for SMB and Azure Files

Storage Mover is a fully managed migration service that enables you to migrate on-premises files and folders to Azure Storage while minimizing downtime for your workload. Azure Storage Mover can now migrate your SMB shares to Azure file shares.

To be honest, I’ve not encountered a “replace the file server with Azure Files” scenario yet. Third-party vendors often won’t support it for LOB apps. User data typically ends up in SharePoint/OneDrive. And wouldn’t most Citrix/RDS admins want to start with new profiles?

Generally available: Azure Blob Storage Cold Tier

Azure Blob Storage Cold Tier is now generally available. It is a new online access tier that is the most cost-effective Azure Blob offering for storing infrequently accessed data with long-term retention requirements, while providing instant access. The pricing of the cold tier storage option lies between the cool and archive tiers, and it follows a 90-day early deletion policy. You can seamlessly utilize the cold tier in the same way as the hot and cool tiers.

Cool – Cold. Tell me that isn’t confusing. The scenario is that you want to store data for a long time, but you need it immediately available. Archive requires a 15-hour restore (“rehydration”) that can be accelerated with a charge. Cold is one step up, but not as cost-effective.

Public Preview: Azure NetApp Files Cloud Backup for Virtual Machines

With Cloud Backup for Virtual Machines, you can now create VM consistent snapshot backups of VMs on Azure NetApp Files datastores. The associated virtual appliance installs in the Azure VMware Solution cluster and provides policy-based automated and consistent backup of VMs integrated with Azure NetApp Files snapshot technology for fast backups and restores of VMs, groups of VMs (organized in resource groups) or complete datastores lowering RTO, RPO, and improving total cost of ownership.

General Availability: Incremental snapshots for Premium SSD v2 Disk and Ultra Disk Storage

You can now instantly restore Premium SSD v2 and Ultra Disks from snapshots and attach them to a running VM without waiting for any background copy of data. This new capability allows you to read and write data on disks immediately after creation from snapshots, enabling you to recover your data from accidental deletes or a disaster quickly

I can see third-party backup making use of this.

Azure Elastic SAN updates: Private Endpoints & Shared Volumes

As we approach general availability of Azure Elastic SAN, we continue improving the service and adding features based on your feedback. Today, we are releasing private endpoint support and volume sharing support via SCSI (Small Computer System Interface) Persistent Reservation.

This sounds like the sort of feature maturity one will expect as the service approaches general availability. I wonder what the actual target market is for this service.

Azure Site Recovery

Private Preview – DR for Shared Disks – Azure Site Recovery

We are excited to announce the Private Preview of DR for Azure Shared Disks for workloads running Windows Server Failover Clusters (WSFC) on Azure VMs. Now you can protect, monitor, and recover your WSFC-clusters as a single unit across its DR Lifecycle, while also generating cluster-consistent recovery points – which are consistent across all the disks (including the Shared Disk) of the cluster.

This feature is long overdue for customers using shared virtual hard disks to create failover clusters.

Networking

Public preview: Support for new custom error pages in Application Gateway

In addition to the response codes 403 and 502, the Azure Application Gateway now lets you configure company-branded error pages for more response codes – 400, 405, 408, 500, 503, and 504. You can configure these error pages at a global level to apply to all the listeners on your gateway or individually for each listener. 

These pages can be shared on any publicly accessible URI.

Azure Firewall: New Monitoring and Logging Updates

Notes:

  • (Preview) With the Azure Firewall Resource Health check, you can now view the health status of your Azure Firewall and address service problems that may affect your Azure Firewall resource. Resource Health allows IT teams to receive proactive notifications regarding potential health degradations and recommended mitigation actions for each health event type
  • (Preview) The Azure Firewall Workbook presents a dynamic platform for analyzing Azure Firewall data. Within the Azure portal, you can utilize it to generate visually engaging reports.
  • (GA) The Latency Probe metric is designed to measure the overall latency of Azure Firewall and provide insight into the health of the service. IT administrators can use the metric for monitoring and alerting if there is observable latency and diagnosing if the Azure Firewall is the cause of latency in a network.

Resource health should make for a useful alert, especially when enabling DevSecOps – be aware of the dreaded “out of sync” error. I just tried the workbook in a production system – I noticed a couple of things that I might not have otherwise noticed because they didn’t trigger a human response (yet). The latency probe is interesting – I think it originated from customer network performance scenarios where it was suspected that the firewall was the root cause.

Virtual Machines

Public preview: Azure Mv3 Medium Memory (MM) Virtual Machines

Today we are announcing the public preview of the next generation Mv3 Medium Memory (MM) virtual machine series. Powered by the 4th Generation Intel® Xeon® Scalable Processor and DDR5 DRAM technology, the Mv3 medium memory (MM) virtual machines can scale for SAP workloads from 250GB to 4TB. With Azure Boost, Mv3 MM provides a ~25% improvement in network throughput and up to 1.5X improvement in remote storage throughput over the previous M-series families. 

These machines start at 12 vCPUs and 240 GB RAM, scaling up to 176 vCPUs and 2794 RAM. That should just about be enough to run Teams.

Cosmos DB Replicas With Private Endpoint

This post explains how to make Cosmos DB replicas available using Private Endpoint.

The Problem

A lot of (most) Azure documentation and community content assumes that PaaS resources will be deployed using public endpoints. Some customers have the common sense not to use public endpoints – who wants to be a zero-day target for well-armed attackers?!

Cosmos DB is a commonly considered database for born-in-the-cloud workloads. One of the cool things about Cosmos DB is the ability to use any number of globally dispersed read-only or write replicas with pretty low replication latency.

But there is a question – what happens if you use Private Endpoint? The Cosmos DB account is created in a “primary” region. That Private Endpoint connects to a virtual network in the primary region. If the primary region goes offline (it does happen!) then how will clients redirect to another replica? Or if you are building a workload that will exist in many regions, how will a remote footprint connection to the local Cosmos DB replica?

I googled and landed on a Microsoft forum post that asked such a question. The answer was (in short) “The database will be available, how you connect to it is your and Azure Network’s problem”. Very helpful!

Logically, what we want is:

What I Figured Out

I’ve deployed some Cosmos DB using Private Endpoint as code (Terraform) in the recent past. I noticed that the DNS configuration was a little more complex than you usually find – I needed to create a Private DNS Zone for:

  • The Cosmos DB service type
  • Each Azure region that the replica exists in for that service type

I fired up a lab to simulate the scenario. I created Cosmos DB account in North Europe. I replicated the Cosmos DB account to East US. I created a VNet in North Europe and connected the account to the VNet using a Private Endpoint.

Here’s what the VNet connected devices looks like:

As you can see, the clients in/peered with the North Europe VNet can access their local replica and the East US replica via the local Private Endpoint.

I created a second VNet in East US. Now the important bit: I connected the same Cosmos Account to the VNet in East US. When you check out the connected devices in the East US VNet then you can see that clients in/peered to the North America VNet can connect to the local and remote replica via the local Private Endpoint:

DNS

Let’s have a look at the DNS configurations in Private Endpoints. Here is the one in North Europe:

If we enable the DNS zone configuration feature to auto-register the Private Endpoint in Azure Private DNS, then each of the above FQDNs will be registered and they will resolve to the North Europe NIC. Sounds OK.

Here is the one in East US:

If we enable the DNS zone configuration feature to auto-register the Private Endpoint in Azure Private DNS, then each of the above FQDNs will be registered and they will resolve to the East US NIC. Hmm.

If each region has its own Private DNS Zones then all is fine. If you use Private DNS zones per workload or per region then you can stop reading now.

But what if you have more than just this workload and you want to enable full name resolution across workloads and across regions? In that case, you probably (like me) run central Private DNS Zones that all Private Endpoints register with no matter what region they are deployed into. What happens now?

Here I have set up a DNS zone configuration for the North Europe Private Endpoint:

Now we will attempt to add the East US Private Endpoint:

Uh-oh! The records are already registered and cannot be registered again.

WARNING: I am not a Cosmos DB expert!

It seems to me that using the DNS Zone configuration feature will not work for you in the globally shared Private DNS Zone scenario. You are going to have to configure DNS as follows:

  • The global account FQDN will resolve to your primary region.
  • The North Europe FQDN will resolve to the North Europe Private Endpoint. Clients in North Europe will use the North Europe FQDN.
  • The East US FQDN will resolve to the East US Private Endpoint. Clients in East US will use the East US FQDN.

This means that you must manage the DNS record registrations, either manually or as code:

  1. Register the account record with the “primary” resource/Private Endpoint IP address: 10.1.04.
  2. Register the North Europe record with the North Europe Private Endpoint IP: 10.1.0.5.
  3. Register the East US record with the East US Private Endpoint IP: 10.2.0.6.

This will mean that clients in one region that try to access another region (via failover) will require global VNet peering and NSG/firewall access to the remote Private Endpoint.

Azure WAF and False Positives

This post will explain how to override false positives in the (network) Azure Web Application Firewall (WAF), without compromising security, using one of four methods in combination with a tiered WAF Policy architecture:

  1. Managed Rulesets
  2. Custom Rules
  3. Exclusions
  4. Disabled rules

False Positives

A WAF is a rather simple solution, attempting to inspect L7 (application layer) traffic and intercept attacks such as protocol misuse, SQL injection, or cross-site scripting. Unfortunately, false positives can occur.

For example, let’s assume that an API app is securely shared using a WAF. Messages sent to the API might be formatted in JSON, with lots of special characters to format the message. SQL Inspection defenses count special characters, trying to find where an attacker is trying to escape out of a web request to create a database command that will execute. If the defense counts too many special characters (it will!) then an alert will be created and the message will be blocked if Prevention mode is enabled.

One must allow that traffic through because it is expected traffic that the application (and the business) requires. But one must do this without opening up too many holes in the WAF, making the WAF a costly, pointless existence.

Log Analytics Ingestion Charge

There is a side effect to false positives. False positives will vastly outnumber actual attack/probing attempts. Busy workloads can generate huge amounts of logs for false positives. If you use Log Analytics, that data has a cost:

  • Storage: Not too bad
  • Ingestion: This one is painful

The way to reduce the cost is to reduce the noise by overriding the detections that create false positives. Organizations that have a lot of web traffic could save a significant amount of money here.

WAF Policies

The WAF functionality of the Azure Application Gateway (AppGw) is managed by a resource called an Application Gateway WAF Policy (WAF Policy). The typical approach is to associate 1 WAF Policy with a WAF resource. The WAF policy will create customizations. For reasons that should become apparent later, I am going to urge you to take a slightly more granular approach to manage your WAF if your WAF is used to securely share more than one workload or listener:

  • WAF parent policy: A WAF policy will be associated with the WAF. This policy will apply to the WAF and all listeners unless another WAF Policy overrides specific settings.
  • Per-Listener/Per-Workload policy: This is a policy that is created specifically for a listener or a workload (a set of listeners). Any customisations that apply only to a listener or a workload will be applied here, without affecting any other listener or workload.

Methodology

You will never know what false positives you will encounter. If your WAF goes straight into Prevention mode then you will create a world of pain and be the recipient of a lot of hate-messages/emails.

Here’s the approach that I recommend:

  1. Protect your WAF with an NSG that has Traffic Analytics enabled. The NSG should only allow the necessary HTTP, HTTPS, WAF monitoring (from Azure), and load balancing traffic. Use a custom deny-all rule to block everything else.
  2. Enable monitoring for the Application Gateway, sending all logs to a queryable destination such as Log Analytics.
  3. Monitor traffic for a period of time – enough to allow expected normal usage of the full systems. Your monitoring should detect the false positives.
  4. Verify that Traffic Analytics did not record malicious IP addresses hitting your WAF.
  5. Query your monitoring data to find the false positives for each listener. Identify the hostname, request URI, ruleset, rule group, and rule ID that is causing the issue on a per-listener/workload basis.
  6. Ideally, developers fix any issues that create false positives but this is unlikely – so we’ll move on.
  7. Determine your override strategy (see below).
  8. Deploy your overrides with the policies still in Detection mode.
  9. Monitor traffic for another period of time to ensure that there are no more false positives.
  10. Switch the parent policy to Prevention Mode.
  11. Swith each per-listener/per-workload policy to Prevention Mode
  12. Monitor

Managed Rule Sets

The WAF today has two rulesets that you can use:

  • OWASP: Used to detect attacks such as SQL Injection, Cross-site scripting, and so on.
  • Microsoft Bot Manager Rule Set: Used to prevent malicious bots from browsing/attacking your workloads.

You need the OWASP ruleset – but we will need to manage it (later). The bot ruleset, in my experience, creates a huge amount of noise will no way of creating granular overrides. One can override the bot ruleset using custom rules, but as you’ll see later, that’s a big stick that is not granular at all!

My approach to this is to disable the Microsoft Bot Manager Rule Set (or leave it disabled) in the parent and child rulesets. If I have a need to enable it somewhere, I can do it in a per-listener or per-workload ruleset.

Custom Rules

A custom rule is created in a WAF Policy to force traffic that matches certain criteria to be:

  • Always allowed
  • Always denied
  • Logged only without denying it

You can create a sequence of filters based on:

  • IP Address
  • Number
  • String
  • Geo Location

If the set of filters matches a request then your desired action will apply. For example, if I want to force traffic to be allowed to my API, I can enter the API URI as one of the filters (as above) and all traffic will be allowed.

Yes, all traffic will be allowed, including traffic that is not a false positive. If I only had a few OWASP rules that were blocking the traffic, the custom rule would disable all OWASP rules.

If you must use this approach, then implement it in the child policy so it is limited to the associated listener/workload.

Exclusions

This is the newest of the override types in WAF Policy – and I’ve found it to be the least useful.

The theory is that you can create an exclusion for one or more OWASP rules based on the values of request headers. For example, if a header called RequestHeaderKeys contains a value of X-Scanner you can instruct the affected OWASP rules to be disabled. This sounds really powerful and quite granular. But this starts to fall apart with other scenarios, such as the aforementioned SQL Injection.

Another common rule that alerts on or blocks traffic is Missing User Agent Header. Exclusions work on the value of a header, so if the header is missing, Exclusions cannot evaluate it.

Another gotcha is that you cannot combine header filters to create an exclusion. The Azure Portal experience for creating an Exclusion makes it look like you can. However, the result is two or more Exclusions that work independently.

If Exclusions will work for you, implement them in the per-listener/per-workload policy and specify only the rules that must be overridden. This approach will limit the effect of the exclusion:

  1. The scope is just the listener/workload that is associated with the WAF Policy.
  2. The scope is further limited to just requests where the header matches, allowing all other requests and all OWASP rules to be applied.

Disabled Rules

The final approach that you can use is to disable rules that are creating false positive alerts. A simple workload might only require one or two rules to be disabled. An older & larger workload might require many OWASP rules to be disabled!

If you are going to disable OWASP rules, then do it in the per-listener/per-workload policy. This will limit the effect of the changes to that listener/workload.

This is a fairly each approach and it is pretty granular – not as much as Exclusions. The downside is that you are completely disabling certain protections for an entire listener/workload, leaving the workload vulnerable to attacks of those previously protected types.

Combinations

If you have the time and the data, you can combine different approaches. For example:

  • A webhook that comes from the same IP address all of the time can be allowed via a Custom Rule based on an IP Address filter. Any other traffic will be subject to the fill defenses of the WAF.
  • If you have certain headers that must be allowed and you want to enable all other protections for all other traffic then use Exclusions.
  • If traffic can come from anywhere and you need to override OWASP rules, then disable those rules.

No Great Solution

In summary, there is no perfect solution. The best you can do is find the correct override solution for the specific false positive and deploy it to a specific listener or workload. This will limit the holes that you create in the WAF to the absolute minimum while enabling your workloads to function.

Checking If Client Has Access To KeyVault With Private Endpoint

How to detect connections to a PaaS resource using Private Endpoint.

In this post, I’ll explain how to check if a client service, such as an App Service, has access to an Azure Key Vault with Private Endpoint.

Private Endpoint

In case you do not know, Private Endpoint gives us a mechanism where we can attach a PaaS service, such as a Key Vault, to a subnet with a NIC and a private IP address. Public connections to the PaaS resources are disabled, and an (Azure) Private DNS Zone is used to alter the name resolution of the PaaS resource to point to the private IP address.

Note that communications to the private endpoint are inbound (and response only). The PaaS resource cannot make outbound connections over a Private Endpoint.

My Scenario

The customer has an App Service Plan that has VNet Integration enabled – this allows the App Services to make outbound connections from “random” IPs on this subnet – NSG/Firewall rules should permit access from the subnet prefix.

The App Services on the plan have Private Endpoints on a second subnet in the VNet. There is also a Key Vault, which also has a Private Endpoint. The “Private Endpoint subnet” has an NSG to deny everything except desired traffic, including allowing HTTPS from the VNet Integration subnet prefix to the Key Vault Private Endpoint.

A developer was wondering if connections from an App Service were working and asked if we could see this in the logs.

Problem

The dev in this case wanted to verify network connectivity. So the obvious place to check was … the network! The way to do that is usually to verify that packets arrived at the destination NIC. You can do that (normally) using NSG Flow Logs. There is sometimes up to 25 minutes (or longer during pandemic compute shortages) of a wait before a flow appears in Log Analytics (data export from the host, 10 minutes collection interval [in our case], data processing [15 minutes]). We checked the logs but nothing was there.

And that is because (at this time) NSG Flow Logs cannot produce data destined to Private Endpoints.

We need a different way to trace connections.

Solution

The solution is to check the logs of the target resource. We enable a lot of logging by standard, including the logs for Key Vault. A little bit of Kql-Fu produced this query:

AzureDiagnostics
| where ResourceProvider =="MICROSOFT.KEYVAULT"
| where ResourceId contains "nameOfVault"
| project CallerIPAddress, OperationName, requestUri_s, ResultType, identity_claim_xms_mirid_s

The resulting columns were:

  • CallerIPAddress: The IP address of the client (the IP address used by the App Service Plan VNet integration, in our case)
  • OperationName: Things like SecretGet, Authentication, VaultGet, and SecretList
  • requestUri_s: The URI of the secret being requested
  • ResultType: Was it a success or not?
  • identity_claim_xms_mirid_s: The resource ID of the requesting client (the resource ID of the App Service, in our case)

Armed with the resulting info, the dev got what they needed to prove that the App Service was connecting to the Key Vault.

PowerShell – Check VMSS Instance Image/Model Versions

Here is a PowerShell script to check the Image or Model versions of each instance in an Azure Virtual Machine Scale Set (VMSS):

$ResourceGroup = "p-we1dep"
$Vmss = "p-we1dep-windows-vmss"

// Find all the instances in the VMSS
$Instances = Get-AzVmssVM -ResourceGroupName $ResourceGroup -Name $Vmss

Write-Host "Instance image versions of VMMS: $Vmss"

// For each instance in the VMSS
foreach ($Instance in $Instances) {

    // Get the exact version of the instance
    $InstanceInfo = (Get-AzVmssVM -ResourceGroupName $ResourceGroup -Name $Vmss -InstanceId $Instance.instanceId).StorageProfile.ImageReference.ExactVersion
    $Id = $Instance.instanceId

    // Echo the instance ID and Exact Version
    Write-Host "Instance $Id - $InstanceExactVersion"
}

E: Could Not Open File /var/lib/apt/lists

In this post, I’ll show how I solved a failure, that occurred during an Azure Image Builder (Packer) build with a Ubuntu 20.04 image, which resulted in a bunch of errors that contained E: Could not open file /var/lib/apt/lists/ with a bunch of different file names.

Disclaimer

I am Linux-disabled. I started my career programming on UNIX but switched to being a Microsoft infrastructure person a year later – and that was a long time ago. I am not a frequent Linux user but I do acknowledge its existence and usefulness. In other words, I figured out a fix for me, but it might not be a fix for you.

The Problem

I was using Azure Image Builder, which is based on Packer, to allow the regular creation of a Ubuntu 20.04 image with the latest updates and bits for acting as the foundation of a self-hosted DevOps agent VM Scale Set in a secure Azure network.

I had simple needs:

  1. Install Unzip
  2. Install Terraform

What makes it different is that I need the installations to be non-interactive. Windows has a great community with that kind of challenge. After a lot of searching, I realise that Linux does not.

I set up the tasks in the image template and for a month, everything was fine. Images built and rebuilt. A few days ago, a weird issue started where the first version of a template build was fine, but subsequent builds failed. When I looked at the build log, I saw a series of errors when apt (the package installed) ran that started with:

E: Could not open file /var/lib/apt/lists/

The Solution

I tried a lot of things, including:

apt-get update
apt-get upgrade -y

But guess, what – the errors just moved.

I was at the end of my tether when I decided to try something else. The apt package installation for WinZip worked some of the time. What was wrong the rest of the time? Time – that was the key word.

Something needed more time before I ran any apt commands. I decided to embed a bunch of sleep commands to let things in Ubuntu catchup with my build process.

I have two tasks that run before I install Terraform. The first prepares Linux:

            {
                "type": "Shell",
                "name": "Prepare APT",
                "inline": [
                    "echo ABCDEFG",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s",
                    "echo apt-get update",
                    "apt-get update",
                    "echo apt-get upgrade",
                    "apt-get upgrade -y",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s"
                ]
            },

The second task installs WinZip and some other tools that assist with downloading the latest Terraform zip file:

            {
                "type": "Shell",
                "name": "InstallPrereqs",
                "inline": [
                    "echo ABCDEFG",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s",
                    "echo installing unzip",
                    "sudo apt install --yes unzip",
                    "echo installing jq",
                    "sudo snap install jq"
                ]
            },

I’ve ran this code countless times yesterday and it worked perfectly. Sure, the sleeps slow things down, but this is a batch task that (outside of testing) I won’t be waiting on so I am not worried.

Referencing Private Endpoint IP Addresses In Terraform

It is possible to dynamically retrieve the resulting IP address of an Azure Private Endpoint and use it in other resources in Terraform. This post will show you how.

Scenario

You are building some PaaS resources using Private Endpoints. You have no idea what the IP addresses are going to be. But you need to use those IP addresses elsewhere in your Terraform code, for example in an NSG rule. How do you get the IP addresses?

Find The Properties

The trick for this is to use the terraform state command. In my case, I deployed a Cosmos DB resource using azurerm_private_endpoint.cosmosdb-account1. To view the state of the resource, I can run:

terraform state show azurerm_private_endpoint.cosmosdb-account1

That outputs a bunch of code:

Terraform state of a Cosmos DB resource

You can think of the exposed state as a description of the resource the moment after it was deployed. Everything in that state is addressable. A common use might be to refer to the resource ID (azurerm_private_endpoint.cosmosdb-account1.id) or resource name (azurerm_private_endpoint.cosmosdb-account1.name) properties. But you can also get other properties that you don’t know in advance.

The Solution

Take another look at the above diagram. There is an array property called private_dns_zone_configs that has one item. We can address this property as azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].

In there there is another array property, with two items, called record_sets. There is one record set per IP address created for this private endpoint. We can address these properties as azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0] and azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].

Cosmos DB creates a private endpoint with multiple different IP addresses. I deliberately chose Cosmos DB for this example because it shows a more complex probelm and solution, demonstrating a little bit more of the method.

Dig into record_sets and you’ll find an array property called ip_addresses with 1 item. If I want the two IP addresses of this private endpoint then I will use: azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0].ip_addresses[0] and azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].ip_addresses[0].

Using the Addresses

destination_address_prefixes = [
 azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0].ip_addresses[0], // Cosmos DB Private Endpoint IP 1
 azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].ip_addresses[0] // Cosmos DB Private Endpoint IP 2
 ]                       
}

And now I have code that will deploy an NSG rule with the correct destination IP address(es) of my private endpoint without knowing them. And even better, if something causes the IP address(es) to change, I can rerun my code without changing it, and the rules will automatically update.

Get The Diagnostics Logs Names For An Azure Resource

This post will show you how to get the ARM (also for Bicep, Terraform, etc) names of the diagnostics logs for an Azure resource.

Problem

When you are deploying Azure resources as code, you might need to enable diagnostics logs. This might require you to know the name of each log. Here’s the issue: the names of the logs in the Azure Portal are usually different from the names that are used in the code. Sure, they’ll remove the spaces and use camel-case, but that’s predictable. Often, the logs have completely different names.

Sometimes the names are documented – thank you App Services! Sometimes you cannot find the log names – boo Azure SQL!

Solution

The tip that I’m going to share is useful – this is the second time in a few weeks that I’ve used this approach.

If you know what you are looking for, diagnostics logs in this case, then do a search online for something like “Azure Diagnostics Settings REST API”. This will bring you to a Microsoft page that shares different methods for the API.

I wanted to see what the log names are for an Azure SQL Database. So I manually created the diagnostic setting. After that, grab the resource ID of the Azure SQL Database.

Then I did the above search. I clicked the Get method and then clicked the Try It button. Put the name of the diagnostic setting (that you created) in name. Put the resource ID of the Azure SQL Database in resourceID. And then click Run. A second later, the ARM for the diagnostic setting is presented on a screen below, including all the diagnostics log names.

Cannot Remove Subnet Because of App Service VNet Integration

This post explains how to unlock a subnet when you have deleted an App Service/Function App with Regional VNet Integration.

Here I will describe how you can deal with an issue where you cannot delete a subnet from a VNet after deleting an Azure App Service or Function App with Regional VNet Integration.

Scenario

You have an Azure App Service or Function App that has Regional VNet Integration enabled to connect the PaaS resource to a subnet. You are doing some cleanup or redeployment work and want to remove the PaaS resources and the subnet. You delete the PaaS resources and then find that you cannot:

  • Delete the subnet
  • Disable subnet integration for Microsoft.Web/serverFarms

The error looks something like this:

Failed to delete resource group workload-network: Deletion of resource group ‘workload-network’ failed as resources with identifiers ‘Microsoft.Network/virtualNetworks/workload-network-vnet’ could not be deleted. The provisioning state of the resource group will be rolled back. The tracking Id is ‘iusyfiusdyfs’. Please check audit logs for more details. (Code: ResourceGroupDeletionBlocked) Subnet IntegrationSubnet is in use by /subscriptions/sdfsdf-sdfsdfsd-sdfsdfsdfsd-sdfs/resourceGroups/workload-network/providers/Microsoft.Network/virtualNetworks/workload-network-vnet/subnets/IntegrationSubnet/serviceAssociationLinks/AppServiceLink and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet. (Code: InUseSubnetCannotBeDeleted, Target: /subscriptions/sdfsdf-sdfsdfsd-sdfsdfsdfsd-sdfs/resourceGroups/workload-network/providers/Microsoft.Network/virtualNetworks/workload-network-vnet)

It turns out that deleting the PaaS resource leaves you in a situation where you cannot disable the integration. You have lost permission to access the platform mechanism.

In my situation, Regional VNet integration was not cleanly disabling so I did the next logical thing (in a non-production environment): started to delete resources, which I could quickly redeploy using IaC … but I couldn’t because the subnet was effectively locked.

Solutions

There are 2 solutions:

  1. Call support.
  2. Recreate the PaaS resources

Option 1 is a last resort because that’s days of pain – being frankly honest. That leaves you with Option 2. Recreate the PaaS resources exactly as they were before with Regional VNet Integration Enabled. Then disable the integration (go into the PaaS resource, go into Networking, and disconnect the integration).

That process cleans things up and now you can disable the Microsoft.Web/serverFarms delegation and/or delete the subnet.