Experts Live Europe 2023

I spoke at Experts Live Europe last week and this post is a report of my experience at this independently run tech conference.

Experts Live

I cannot claim to be a historian on Experts Live Europe (I’ll call it Experts Live after this) but it’s a brand that I’ve known of for years. Many of the MVPs (Microsoft Most Valuable Professionals) and community experts that I know have attended and presented at this conference for as long as it has been running. It started off as a System Center-focused event and evolved as Microsoft has done, transitioning to a cloud-focused conference covering M365 and Azure.

Previously, I never got to speak at Experts Live. When it started, I had mostly fallen off the System Center track and didn’t feel qualified to apply to speak. Later, as the conference evolved and our interests aligned, I was always booked to be on vacation abroad when the conference was running so I didn’t apply. This was a sickener because the likes of Kevin Greene and Damian Flynn raved about how good this event was for speakers and attendees.

This year, that changed and I applied to speak. I was delighted to hear that I was accepted and was looking forward to attending.

The organisation changed a little, but the central organiser, Isidora Maurer, was still at the helm. I knew that this would be a quality event.

Experts Live is a brand that has expanded and now includes local events across Europe. I’ve been lucky to speak at a couple of those over the years.

Prague 2023

This year’s conference was hosted in Prague, a beautiful city. I’ve spoken in Prague before but it was my usual speaker experience: fly in – taxi to the hotel – speak – taxi to the airport – fly home. This time, because flights home were a little awkward, I was staying an extra night so I could experience the city a little bit.

The conference center is just outside the city centre and the hotels were just next door. Many of the speakers booked into the Corinthian Hotel, a nice place, which was a 2-minute walk across a bridge or through a train station.

Attending

I arrived at the conference center to register on the last day, about 40 minutes before I was due to speak in the second slot. I registered quickly and was told to go upstairs. I did – and the place was a ghost town. I was sure that something was wrong. Whenever you go to a tech event, there are always people in the hallways either on calls or filling time because they don’t like the current sessions. I found the speakers’ room and did my final prep. Then I went to the room I was speaking in next, and it was packed. All of the rooms were packed. Almost no one was “filling time”. I’ve never seen that and it says a lot about the schedule organisers, the sessions/speakers, and the attendees’ dedication.

Another observation – that my wife made afterward while looking at event photos on social media – there were a lot more women at this event than one will usually see at other technical events. The main organiser, Isidora, is a well-known advocate for women in IT and I suspect that her activities help to restore some levels of balance.

My Session

My session was called “Azure Firewall: The Legacy Firewall Killer”. In the session, I compare and contrast Azure Firewall with third-party NVAs, teach a little about Azure Firewall features, and demonstrate a simple DevSecOps process using infrastructure-as-code.

Credit: Carsten Rachfahl, MVP

I had a full room which was pretty cool and there was lots of engagement after the session – throughout the day!

I attended sessions in all but one slot, catching the end of Carsten Rachfahl’s hybrid session, Didier Van Hoye’s session on QUIC, Damian Flynn’s Azure Policy session, and Eric Berg’s session on Azure networking native versus third-party options. All were excellent, as I expected.

It has been a long time since I’ve had the opportunity to attend technical sessions – the pandemic suspended in-person events for years, I can’t focus on digital events (for several reasons), and Microsoft Ignite is a marketing/vanity event now 🙁

Afterwards

The after-party featured some lovely snacks and drinks with some light-hearted entertainment. It was short – understandably – because many people were leaving straight away.

Entertainment for the evening was hosted for the speakers: we gathered at 19:00 and were taken on a riverboat tour where we had a few drinks and dinner while enjoying the city views in the warm autumn evening. It was quite enjoyable. And maybe, just maybe, many of the speakers continued on in various locations afterward!

Wrap Up

Experts Live is a very well-run event with lots of content spanning multiple expertise areas. I love that the sessions are technical – in fact, some of the speakers adjusted their content to suit the observed technical levels of the audience while at the event. In 2024, if you want to learn, then make sure you check out this conference and hopefully if I’m accepted, I’ll see you there!

Terrafying Azure – A Tale From The Dark Side

This post is a part of the Azure Back to School 2023 online event. In this post, I will discuss using Microsoft Azure Export for Terraform, also known as Aztfexport and previously known as Azure Terrafy (a great name!), to create Terraform code from existing Azure deployments, why you would do it, and share a few tips.

Terraform

Terraform is one of a few Infrastructure-as-Code (IaC) languages out there that support Microsoft Azure. You might wonder why I would use it when Azure has ARM and Bicep. I’ll do a quick introduction to Terraform and then explain my reasoning which you are free to disagree with 🙂

Terraform is a HashiCorp product that is free to use and is supported with some paid-for services. Like other IaC languages, it describes a desired end result. The major feature that differs from the native Azure languages is the use of a state file – a file that describes what is deployed in Azure. This state file has a few nice use cases, including:

  • The outputs of a resource are documented, enabling effortless integration between resources in the same or even different files – and, with some effort, outputs from one deployment can be consumed in another.
  • A true what-if engine that (mostly) works, unlike the native what-if in Azure, greatly reducing deployment times and giving you the ability to plan (pre-review) a deployment’s expected changes – see the minimal workflow sketch after this list.
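If you are new to Terraform, the core workflow is just a handful of commands. A minimal sketch, assuming the current folder contains your .tf files and you have already authenticated to Azure:

terraform init                # download the providers referenced by the code
terraform plan -out=tfplan    # compare the code with the state file and the real resources, and preview the changes
terraform apply tfplan        # apply exactly the plan that was reviewed

The plan step is the what-if engine mentioned above: you see the expected changes before anything is touched.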

My first encounter with Terraform was a government project where the customer wanted to use Terraform over Bicep. Their reasoning was that elected politicians come and go, and suppliers come and go. If they were going to invest in an IaC skillset, they wanted the knowledge to be transferrable across clouds.

That’s the big advantage of Terraform. While the code itself is not cloud portable, the skill is. Terraform uses providers to be able to manage different resource types. Azure is a provider, written by Microsoft. Azure AD is a provider – ARM/Bicep still do not support Azure AD! AWS and GCP have providers. VMware has a provider. GitHub has a provider – the list goes on and on. If a provider does not exist, you can (in theory) write your own.

On that project, I was meant to be hands-off as an architect. But there were staffing and scheduling issues so I stepped up. Having never written a line of Terraform before, I had my first workload written in under a day, with some review help from a teammate. By the way, the same thing in Bicep took three days! Terraform is really well documented, with lots of examples, and the language makes sense.

Bicep, on the other hand, is still beholden to a lot of the complexity of ARM. Doing simple things can involve stupidly complicated functions that only a C programmer (I used to be one) could enjoy (and I didn’t). I got hooked on Terraform and convinced my colleagues that it was a better path than Bicep, which was our original plan to replace ARM/JSON.

Aztfexport

Switching to Terraform creates a question – what do we do with our existing workloads, which were deployed using ClickOps (Portal), script, or ARM/Bicep?

Microsoft has created a tool called Azure Export for Terraform (Aztfexport) on GitHub. The purpose of this tool is to take an existing resource group/resource/Graph query string and export it as Terraform code.

The code that is produced is intended to be used in some other way. In other words, Microsoft is not exporting code that should be able to immediately deploy new resources. They say that the produced code should be able to pass a terraform plan – that is, when the existing resources are compared with the state file and the code, the verdict should be “the code is clean and there are no changes required”.

The Terraform configurations generated by aztfexport are not meant to be comprehensive and do not ensure that the infrastructure can be fully reproduced from said generated configurations. For details, please see the limitations.

Azure/aztfexport (github.com)

Why Use Aztfexport?

If I can’t use the code to deploy resources then what value is it? Hopefully you will see why aztfexport is a central part of my toolkit. I see it being useful in the following ways:

  • Learning Terraform: If you’ve not used Terraform before then it’s useful to see how the code can be produced, especially from resources that you are already familiar with.
  • Creating TF for an existing workload: You need to “terrafy” a resource/resource group and you want a starting point.
  • Azure-to-Azure migrations: You have a set of existing resources and you want to get a dump of all the settings and configurations.
  • Learning how a resource type/solution is coded: My favourite learning method is to follow the step-by-step and then inspect the resource(s) as code.
  • Understand how a resource type/solution works: This is a logical jump from the previous example, now including more resources as a whole solution.
  • Auditing: Comparing what is there with what should be there – or not there.
  • Documentation: The best form of resource documentation is IaC – why create lengthy documentation when the code is the resource?

I did use Aztfexport to learn Terraform more. In my current project, I have used it again and again to do Azure-to-Azure migrations, taking legacy ClickOps deployments and rewriting them as new secure/governed deployments. I’ve saved countless hours capturing settings and configurations and reusing them in new code.

The Bad Stuff

Nothing is perfect, and Aztfexport has some thorns too. Notice that the expected usage is that the produced code should pass a terraform plan. That is because in many situations (like with ARM exports) the code is not usable to deploy resources. That can be because:

  • ARM APIs do not expose everything, so how can Terraform get those settings?
  • The tool or the providers being used do not export everything.

One example I’ve seen includes App Services configurations that do not include the code type details. Another recent one was with WAF Policies, where overridden WAF rules were not documented. In both cases, the code would pass a plan, but neither would reproduce the resources. I’ve learned that I do need to double-check things with a resource type that I’ve never worked with before – then I know what to go and manually grab, either from an ARM export or a visual inspection in the Portal.

Another thing is that the resources are named by a “machine” – there is no understanding of the role. Every resource is res-1, res-2, and so on, no matter the type or the role in the workload. That is a bit anonymous, but I find that useful when inspecting dependencies between resources.

A giant main.tf file is created, which I break up into many smaller files. I can find relationships based on those easy-to-track dependencies and logically group resources where it suits my coding style.

One feature of TF is the easy reuse of resource IDs. One can easily refer to resource_type.resource_name.id in a property and know that the resource ID of that resource will be used. Unfortunately, some aztfexport code doesn’t do that, so you get static resource IDs that should be replaced – that happens with other properties of resources too, so all of that should be cleaned up to make the code more reusable.
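A quick, hedged way to spot those leftovers is to search the generated code for hard-coded subscription paths (the files here are simply whatever aztfexport produced in your folder):

grep -rn "/subscriptions/" *.tf

Anything the grep finds is a candidate for replacement with a resource reference or a variable.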

Installing Aztfexport

You will need to install Terraform – I prefer to use a Package Manager for that – the online instructions for a manual installation are a mess. You will also require Azure CLI.

The full instructions for installing Aztfexport are shared on GitHub, covering Windows, MacOS and Linux. The Windows installation is easy:

winget install aztfexport

You will need to restart your terminal (Windows) to get an updated Path variable so the aztfexport binary can be found.

Before you use aztfexport, you will need to log in using Azure CLI:

Open your terminal

Login:
az login

Change subscription:
az account set --subscription <subscription ID>

Verify the correct subscription was selected by checking the resource groups:
az group list

Create an empty folder on your PC and navigate to that folder in your terminal. The aztfexport tool requires an empty folder, by default, to create an export including all the required provider files and the generated code.

If you want to create an export of a single resource then you can run:

aztfexport resource <resource ID>

If you want to create an export of a resource group, then you can run:

aztfexport resource-group -n <resource group name>

Note: the -n above means “don’t bother me with manual confirmation of what resources to include in the export”. In Terraform, sub-resources that can be managed as their own Terraform resources would otherwise need to be confirmed, and that gets pretty tiresome pretty fast.

Tips

I’ve got to hammer on this one again, the produced code is not intended for deployment. Take the code, copy and paste it into new files and clean it up.

If your goal is to take over an existing IaC/ClickOps deployment with Terraform then you are going to have some fun. The resources already exist and Terraform is going to be confused because there is no state file. You will have to produce a state file using terraform import for every resource definition in your code. That means knowing the resource IDs of everything, including Azure AD objects, role assignments, and sub-resources. You’ll need to understand the format of those resource IDs – use an existing state file for that. Often the resource ID is the simple Azure resource ID, or a derivation of a parent resource ID that you can figure out from another state file. Sometimes you need to wander through Azure AD (look at assignments in scopes that you do have access to if you don’t have direct Azure AD rights), use Azure CLI to “list” resources or items, or browse around using Resource Explorer in the Azure Portal.
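As a hedged illustration of that import work, the commands below use hypothetical resource addresses (azurerm_resource_group.workload, azurerm_virtual_network.workload) that would need to match definitions in your code, and placeholder Azure resource IDs:

terraform import azurerm_resource_group.workload /subscriptions/<subscription ID>/resourceGroups/rg-workload
terraform import azurerm_virtual_network.workload /subscriptions/<subscription ID>/resourceGroups/rg-workload/providers/Microsoft.Network/virtualNetworks/vnet-workload

Repeat for every resource definition until a terraform plan reports no changes.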

Do take some time to compare your code with any previous IaC code or with an ARM export. Look for things that are missing – Terraform has many defaults that won’t be included in the export because that code is not required. I often include that code because I know that they are settings that Devs/Ops might want to tune later.

If you have the misfortune of having to work with an existing Terraform module library then you will have to translate the exported code into parameter/variable files for the new code – I do not envy you 🙂

Summary

This post is an introduction to Microsoft Azure Export for Terraform and a quick how-to-get-started guide. There is much more to learn about, such as how to use a custom backend (if resource names in Terraform are not a big deal and to eliminate the terraform import task) or even how to use a resource map to identify resources to export across many resource groups.

The tool is not perfect but it has saved me countless hours over the last year or so, dating back to when it was called Azure Terrafy. It’s one of the regulars in my toolkit and I break it out often to speed up my work. In my opinion, anyone starting to work with Terraform should install and use this tool.

Microsoft Ignite 2023 – I Will Not Be Attending

Microsoft Ignite 2023 has been announced as a hybrid event. Let me explain why I have no interest in attending in person or taking part digitally.

Technical Education

One of the reasons that I became a pretty regular attendee of Microsoft’s technical conferences was to learn. My first time to attend TechEd Europe was a real eye-opener. I took part in hands-on labs, tried out new products, and went to sessions where I learned a lot about products/features that I worked with or was interested in.

When a past manager asked me about my training budget/plan it was quite simple: I had no interest in traditional training because I knew all that I could learn in the necessary areas – I could often rewrite the courses with better content. But attending a conference where the creators of the product/feature stood on stage and got into deep technical detail – that was unmatched.

The TechEd brand was killed off years ago and replaced with the much larger Ignite conference. The immediate noticeable change was that the main breakouts were 99% reserved for Microsoft staff and sponsors – I avoid sponsor sessions because they are 100% advertising. The Microsoft sessions slowly changed away from technical Program Managers to managers, and then to corporate vice presidents (CVPs). That meant that the level of technical content was dropping and there was a shift to marketing.

Pandemic

As we all know, COVID-19 shut the world down and brought down conferences with it. Microsoft switched to a digital format for Ignite. In theory, this should have increased the audience and potentially the breadth & depth of content. However, Ignite “online” featured 30-minute sessions (because of “feedback”) that offered only:

  • Bullet point announcements with no technical follow-up
  • Marketing by CVPs.

Sure, Ignite became a glossy, well-produced digital event but it was pointless. I don’t care how many live streams they had – how many of those people were paying attention? I don’t care how many downloads/non-live streams they had – how many of those people got more than a third of the way through the session?

I can read bullet point announcements in the blog posts on day 1 of the conference much more easily than I can from a PowerPoint – and there will be links to more detailed information.

I have no interest in some CVP trying to be the next Stephen Elop-style failed techie celebrity, burning up time that would have been better spent with a program manager sharing knowledge on the new tech that they’ve been working on for months/years.

I remember a few years ago that one group in Microsoft staged their own “Ignite” outside of the official content/site in order to get their news out – that didn’t happen again. I guess somebody squashed that.

Why Attend?

I attended the last few TechEd North America conferences and all but the very first Microsoft Ignite events. I have been in a couple of conversations about attending this year and I’ve made it clear: I have no interest – and that seems to be a common opinion.

It costs a lot of money to travel to such an event. A flight is between €600-€1200. A hotel will clock in at over €2000. The early bird ticket price this year is $1,525 (around €1,424). Don’t forget local expenses like travel and food. If you’re a consultant like me then the company has lost revenue while you are away. And then there is the priceless time away from family and the impact on the partner who has to keep things running while you are far away. Attending a conference is an investment. I always saw attending Ignite as an investment in the following year: I would have knowledge that only a few others in my market had. If the return is near zero then Microsoft Ignite is a bad investment.

OK, can’t I just watch it online? I think I have watched maybe 3 Ignite sessions from the Pandemic years. Last year there was supposed to be a deep dive in one area that I work in. I tuned in live, and it was a CVP in a digital marvel of marketing, uttering words that they had probably never used in that order before. Even the time to watch the online content is not worth the investment.

What Needs To Change?

I don’t think that any of this will happen – there are those in Microsoft who view Ignite as irrelevant (yuk! tech!), a distraction, or a cost. The switch to an online video brochure suits them. I think that sucks. I know that there is an in-person option, but check out the mostly pre-recorded content – are you going to pay to stream the same content as everyone else while sitting in a conference centre?

The presenters need to switch back to the program managers from the teams. These are people who have worked on the products/features since inception and are qualified to talk about the content at a technical level and are trained in public/customer interaction (it’s normally a part of the job description).

Session length needs to return to either 60 or 75 minutes. As a presenter, I can tell you that it is impossible to bring an audience through a progression from level 200 to level 300/400 in 30 minutes while doing all the necessary steps and delivering any meaningful amount of content. 60 minutes is the minimum. 75 minutes gives the presenter a real chance to drill deep – which a large part of the audience really wants.

Become an expert in automation and AI in 21 minutes during this breakout deepdive!

The content needs to include large amounts of technical sessions. Sure, go ahead and have those level 100-200 sessions for the C-suite or people getting into subjects for the first time. But give us techies a reason to participate, either in person or online.

Give Us TechEd!

The thing that is most missing today is knowledge. There is too much focus on introduction/bullet point announcements/blog posts, training to get a practically useless certification, and documentation that fails to explain the why’s and how’s.

We need technical content from the people who work on the product/features and really know them. I say this as a person who wants to learn but also as a person who witnesses the lack of knowledge or understanding in the market – the iPad generation is trying to use The Cloud without knowing why/how/what’s best/what’s secure because they’re limited to the next-next getting started docs that are the only technical information out there anymore.

Azure Infrastructure Announcements – August 2023

This post brings you a summary of the infrastructure announcements from Azure that were made during August 2023. There are lots of announcements from Storage and a few interesting notes for VMs, networking, and ASR.

Storage

Azure Managed Lustre: not your grandparents’ parallel file system

With a few clicks of a web interface or an Azure Resource Manager template, AMLFS lets you provision an all-flash Lustre file system in minutes. What’s different is that this Lustre file system is all yours. If someone else in Azure is running a job that creates a million files, you won’t ever know it because your Lustre servers and SSDs are exclusively yours.

Massively scaled and high performance file systems for HPC workloads.

General availability | Azure NetApp Files: SMB Continuous Availability (CA) shares

To enhance resiliency during storage service maintenance operations, SMB volumes used by Citrix App Layering, FSLogix user profile containers and Microsoft SQL Server on Microsoft Windows Server can be enabled with Continuous Availability

SMB Transparent Failover means that clients should not notice maintenance operations.

Public preview: Azure Storage Mover support for SMB and Azure Files

Storage Mover is a fully managed migration service that enables you to migrate on-premises files and folders to Azure Storage while minimizing downtime for your workload. Azure Storage Mover can now migrate your SMB shares to Azure file shares.

To be honest, I’ve not encountered a “replace the file server with Azure Files” scenario yet. Third-party vendors often won’t support it for LOB apps. User data typically ends up in SharePoint/OneDrive. And wouldn’t most Citrix/RDS admins want to start with new profiles?

Generally available: Azure Blob Storage Cold Tier

Azure Blob Storage Cold Tier is now generally available. It is a new online access tier that is the most cost-effective Azure Blob offering for storing infrequently accessed data with long-term retention requirements, while providing instant access. The pricing of the cold tier storage option lies between the cool and archive tiers, and it follows a 90-day early deletion policy. You can seamlessly utilize the cold tier in the same way as the hot and cool tiers.

Cool – Cold. Tell me that isn’t confusing. The scenario is that you want to store data for a long time, but you need it immediately available. Archive requires a 15-hour restore (“rehydration”) that can be accelerated with a charge. Cold is one step up, but not as cost-effective.

Public Preview: Azure NetApp Files Cloud Backup for Virtual Machines

With Cloud Backup for Virtual Machines, you can now create VM consistent snapshot backups of VMs on Azure NetApp Files datastores. The associated virtual appliance installs in the Azure VMware Solution cluster and provides policy-based automated and consistent backup of VMs integrated with Azure NetApp Files snapshot technology for fast backups and restores of VMs, groups of VMs (organized in resource groups) or complete datastores lowering RTO, RPO, and improving total cost of ownership.

General Availability: Incremental snapshots for Premium SSD v2 Disk and Ultra Disk Storage

You can now instantly restore Premium SSD v2 and Ultra Disks from snapshots and attach them to a running VM without waiting for any background copy of data. This new capability allows you to read and write data on disks immediately after creation from snapshots, enabling you to recover your data from accidental deletes or a disaster quickly

I can see third-party backup making use of this.

Azure Elastic SAN updates: Private Endpoints & Shared Volumes

As we approach general availability of Azure Elastic SAN, we continue improving the service and adding features based on your feedback. Today, we are releasing private endpoint support and volume sharing support via SCSI (Small Computer System Interface) Persistent Reservation.

This sounds like the sort of feature maturity one will expect as the service approaches general availability. I wonder what the actual target market is for this service.

Azure Site Recovery

Private Preview – DR for Shared Disks – Azure Site Recovery

We are excited to announce the Private Preview of DR for Azure Shared Disks for workloads running Windows Server Failover Clusters (WSFC) on Azure VMs. Now you can protect, monitor, and recover your WSFC-clusters as a single unit across its DR Lifecycle, while also generating cluster-consistent recovery points – which are consistent across all the disks (including the Shared Disk) of the cluster.

This feature is long overdue for customers using shared virtual hard disks to create failover clusters.

Networking

Public preview: Support for new custom error pages in Application Gateway

In addition to the response codes 403 and 502, the Azure Application Gateway now lets you configure company-branded error pages for more response codes – 400, 405, 408, 500, 503, and 504. You can configure these error pages at a global level to apply to all the listeners on your gateway or individually for each listener. 

These pages can be shared on any publicly accessible URI.

Azure Firewall: New Monitoring and Logging Updates

Notes:

  • (Preview) With the Azure Firewall Resource Health check, you can now view the health status of your Azure Firewall and address service problems that may affect your Azure Firewall resource. Resource Health allows IT teams to receive proactive notifications regarding potential health degradations and recommended mitigation actions for each health event type
  • (Preview) The Azure Firewall Workbook presents a dynamic platform for analyzing Azure Firewall data. Within the Azure portal, you can utilize it to generate visually engaging reports.
  • (GA) The Latency Probe metric is designed to measure the overall latency of Azure Firewall and provide insight into the health of the service. IT administrators can use the metric for monitoring and alerting if there is observable latency and diagnosing if the Azure Firewall is the cause of latency in a network.

Resource health should make for a useful alert, especially when enabling DevSecOps – be aware of the dreaded “out of sync” error. I just tried the workbook in a production system – I noticed a couple of things that I might not have otherwise noticed because they didn’t trigger a human response (yet). The latency probe is interesting – I think it originated from customer network performance scenarios where it was suspected that the firewall was the root cause.

Virtual Machines

Public preview: Azure Mv3 Medium Memory (MM) Virtual Machines

Today we are announcing the public preview of the next generation Mv3 Medium Memory (MM) virtual machine series. Powered by the 4th Generation Intel® Xeon® Scalable Processor and DDR5 DRAM technology, the Mv3 medium memory (MM) virtual machines can scale for SAP workloads from 250GB to 4TB. With Azure Boost, Mv3 MM provides a ~25% improvement in network throughput and up to 1.5X improvement in remote storage throughput over the previous M-series families. 

These machines start at 12 vCPUs and 240 GB RAM, scaling up to 176 vCPUs and 2,794 GB RAM. That should just about be enough to run Teams.

Azure Infrastructure Announcements – July 2023

Many people in Europe take the month of July off for vacation so they would have missed out on an unusually busy few weeks of announcements for Microsoft. This post summarises the infrastructure announcements from Microsoft Azure during July 2023.

Update: 01/09/2023. I’m not sure how this happened but I missed a bunch of interesting items from the second half of July. I guess that I got distracted while putting this list together (there’s a lot of task hopping during the day job) and thought that I’d completed the list. I have added some items today.

Networking

Public Preview: Default Rule Set 2.1 for Regional WAF with Application Gateway

DRS 2.1 is baselined off the Open Web Application Security Project (OWASP) Core Rule Set (CRS) 3.3.2 and extended to include additional proprietary protection rules developed by Microsoft.

Every improvement to the CRS has claimed to reduce false positive detections, and I never saw that in reality. I’m going to be skeptical about this one – a simple rules-based system will still trigger the same false positives that I continue to see daily.

Public preview: Azure Virtual Network Encryption

With Virtual Network encryption, customers can enable encryption of traffic between Virtual Machines and Virtual Machines Scale Sets within the same virtual network and between regionally and globally peered virtual networks.

This will be useful in limited scenarios going forward for customers. Too many networking features are limited to VMs. Legacy systems that are migrated to Azure or niche solutions that are best on VMs are fewer in number every day – customers that are already in the cloud normally choose PaaS first.

General availability: ExpressRoute private peering support for BGP communities

ExpressRoute private peering now supports the use of custom Border Gateway Protocol (BGP) communities with virtual networks connected to your ExpressRoute circuits. Once you configure a custom BGP community for your virtual network, you can view the regional and custom community values on outbound traffic sent over ExpressRoute when originating from that virtual network.

This one could be useful for customers where they have multiple ExpressRoute circuits with 1:M or N:M site:gateway scenarios.

General availability: Always Serve for Azure Traffic Manager

Always Serve for Azure Traffic Manager (ATM) is now generally available. You can disable endpoint health checks from an ATM profile and always serve traffic to that given endpoint. You can also now choose to use 3rd party health check tools to determine endpoint health, and ATM native health checks can be disabled, allowing flexible health check setups.

Not much to say here 🙂

Public preview: Route Server Hub Routing Preference

When branch-to-branch is enabled and Route Server learns multiple routes across site-to-site (S2S) VPN, ExpressRoute, and SD-WAN NVAs, for the same on-premises destination route prefix, users can now configure connection preferences to influence Route Server route selection.

Azure Route Server is a great resource. It’s so simple to configure. I just wish there were native solutions where you could program routes into it when using only native Azure networking resources. Using BGP instead of UDRs in a hub & spoke would be so much more reliable and agile.

Azure’s Cross-Region Load Balancer is Now Generally Available

With cross-region Load Balancer, you can distribute traffic across multiple Azure regions with ultra-low latency and high performance.

This smells like one of those Azure resource types that was developed for other Azure or Microsoft cloud services (like telephony) and they released it to the public too.

Updated default TLS policy for Azure Application Gateway

We have updated the default TLS configuration for new deployments of the Application Gateway to Predefined AppGwSslPolicy20220101 policy to improve the default security. This recently introduced, generally available, predefined policy ensures better security with minimum TLS version 1.2 (up to TLS v1.3) and stronger cipher suites.

Those of you using older deployments or modular code for new deployments should consult your application owners and start a planning process to upgrade.

Generally available: Cloud Next-Generation Firewall (NGFW) by Palo Alto Networks – an Azure Native ISV Service

Cloud NGFW by Palo Alto Networks is the first ISV next-generation firewall service natively integrated in Azure. Developed through a collaboration between Microsoft and Palo Alto Networks, this service delivers the cutting-edge security features of Palo Alto Network’s NGFW technology while also offering the simplicity and convenience of cloud-native scaling and management. 

If you really must stay with on-prem tech 😀

Azure Kubernetes Service

Public preview: Azure Application Gateway for Containers

Application Gateway for Containers is the next evolution of Application Gateway + Application Gateway Ingress Controller (AGIC), providing application (layer 7) load balancing and dynamic traffic management capabilities for workloads running in a Kubernetes cluster.

It sounds good, but AKS folks that I respect seem to prefer NGINX. That said, I know SFA about K8s.

Public preview: Network observability add-on for AKS

The new network observability add-on for AKS, now in public preview, provides complete observability into the network health and connectivity of your AKS cluster.

I’m surprised that something like this wasn’t already available. My current project might not include AKS, but monitoring network performance and health between services was critical. Doing the same between micro-services seems more important to me.

Public preview: Bring your own key on Ephemeral OS disk for AKS

BYOK support provides you the option to use your own customer managed keys (CMK) to encrypt your ephemeral OS Disks, providing you increased control over your encryption keys.

This sounds like one of those “a really big customer wanted it” features and it won’t be of interest to too many others.

Azure Virtual Desktop

Announcing Public Preview of Personal Desktop Autoscale on Azure Virtual Desktop

Personal Desktop Autoscale is Azure Virtual Desktop’s native scaling solution that automatically starts session host virtual machines according to schedule or using Start VM on Connect and then deallocates session host virtual machines based on the user session state (log off/disconnect).

This could be a real money saver for a very expensive solution – personal desktops in the cloud.

Announcing the General Availability of Private Link for Azure Virtual Desktop

Private Link for Azure Virtual Desktop is now generally available! With this feature, users can securely access their session hosts and workspaces using a private endpoint within their virtual network. Private Link enhances the security of your data by ensuring it stays within a trusted and secure private network environment.

I have encountered a customer scenario where the connection had to go over a “leased line”. Even if “Windows Virtual Desktop” had been ready at the time, the use of a public endpoint would have forced us to use Citrix instead. The use of a Private Endpoint forces the client to connect over a private network.

Azure Virtual Desktop Watermarking Support

We are announcing the general availability for Watermarking support on Azure Virtual Desktop, an optional protection feature to Screen Capture that acts as a deterrent for data leakage.

A QR code is watermarked onto the screen. The QR code can be scanned to obtain the connection ID of the session. Then admins can trace that session through Log Analytics. There are limitations.

Virtual Machines

Announcing General Availability of Confidential VMs in Azure Virtual Desktop

Azure confidential VMs (CVMs) offer VM memory encryption with integrity protection, which strengthens guest protections to deny the hypervisor and other host management components code access to the VM memory and state.

This might sound like overkill to most of us, but I have encountered one virtual desktop scenario where the nature of the data and the legal requirements might mandate the use of this technology.

Public Preview: Azure Dedicated Host – Resize

With Azure Dedicated Host’s new ‘resize’ feature, you can easily move your existing dedicated host to a new Azure Dedicated Host SKU (e.g., from Dsv3-Type1 to Dsv3-Type4). This new ‘resize’ feature minimizes the impact and effort involved in configuring VMs when you want to upgrade your underlying dedicated host system.

For you Hyper-V folks out there: yes, Live Migration will be used to keep the VMs running for all but a second or two (just like vMotion).

Dev-optimized, cloud-based workstations—Microsoft Dev Box is now generally available

Dev Box combines developer-optimized capabilities with the enterprise-ready management of Windows 365 and Microsoft Intune.

Think of this as the cousin of Windows 365 which is aimed at developers. For me, this has two use cases:

  • Supplying pay-as-you-go virtual machines to contract developers instead of purchasing hardware or trusting their hardware.
  • Providing a full development experience that is in a secured network and can be trusted to connect to Azure services.

Hotpatch is now generally available on Windows Server VMs on Azure with the Desktop Experience installation mode

Hotpatch is now available for Windows Server Azure Edition VMs with Desktop Experience installation mode using the newly released image.

Hmm, did someone say that Server Core is not widely popular? It’s about time.

Announcing public preview of new burstable VMs – Bsv2, Basv2 and Bpsv2

The new additions to the B family consist of 3 new VM series – Bsv2, Basv2, and Bpsv2, each based on the Intel® Xeon® Platinum 8370C, AMD EPYC™ 7763v, and Ampere® Altra® Arm-based processors respectively. These new burstable v2-series virtual machines offer up to 15% better price-performance, up to 5X higher network bandwidth with accelerated networking, and 10X higher remote storage throughput when compared to the original B series.

This is easily the most popular series of VMs for any customer that I have gone near. It makes sense that new hardware is being introduced to enable continued growth.

Preview: Azure Boost

Azure Boost is a new system that offloads virtualization processes traditionally performed by the hypervisor and host OS onto purpose-built hardware and software … customers participating in the preview to achieve a 200 Gbps networking throughput and a leading remote storage throughput up to 10 GBps and 400K IOPS, enabling the fastest storage workloads available today.

Back when I was a Hyper-V MVP, this was the sort of feature that would have caught my attention and led to a bunch of really detailed blog posts. If you follow the links you can read:

“Azure Boost VMs in preview can achieve up to 200 Gbps networking throughput, marking a significant improvement with a doubling in performance over other existing Azure VMs … industry leading remote storage throughput and IOPS performance of 10 GBps and 400K IOPS with our memory optimized E112ibsv5 VM using NVMe-enabled Premium SSD v2 or Ultra Disk options.”

It doesn’t appear to be just the extreme spec VMs that get improved:

“Offloading storage data plane operations from the CPU to dedicated hardware results in accelerated and consistent storage performance, as customers are already experiencing on Ev5 and Dv5 VMs.  This also enhances existing storage capabilities such as disk caching for Azure Premium SSDs.”

“Azure Boost’s isolated architecture inherently improves security by running storage and networking processes separately on Azure Boost’s purpose-built hardware instead of running on the host server.” This might only be a Linux feature based on Security Enhanced Linux (SELinux).

I wish that Ben Armstrong was still doing tech presentations for Microsoft. He did an amazing job of sharing how things worked in Hyper-V (what Azure is built upon).

The Classic VMs retirement deadline is now September 6, 2023

The deadline to migrate your IaaS VMs from Azure Service Manager to Azure Resource Manager is now September 6, 2023. To avoid service disruption, we recommend that you complete your migration as soon as possible. We will not provide any additional extensions after September 6, 2023.

There won’t be too many pre-ARM virtual machines out there. But those that are out there are probably old and mostly un-touched in years. It’s already late to get planning … so get planning!

Azure Migrate

Azure Migrate – Product & Partner Updates

A few notes:

  • New components in the financial estimates produced by the “TCO/Business case” feature allow you to analyze costs more comprehensively before moving to the cloud.
  • Tanium’s (a partner) real-time operational data can be used by Azure Migrate for assessments and to generate a business case to move to Azure.
  • Azure Migrate will now support in-place upgrade of end-of-support (EOS) Windows Server 2012 and later operating system (OS), during the move to Azure.

I have never been able to use Azure Migrate in 4+ years of migrating customers to Azure due to various reasons so I cannot comment on the above.

Using Linux VM For SNAT With ExpressRoute

This post will show how you can use an Azure Linux virtual machine to implement SNAT on an ExpressRoute circuit to a remote location.

Scenario

You must have a low-latency connection to a remote location. That remote location is a partner. That partner uses its own IP ranges. That partner has many organisations, such as yours, that connect in. All of those organisations could have address overlap, which prevents the use of site-to-site networking without using SNAT. Your solution will make outbound connections to the partner’s services over the ExpressRoute circuit. The partner will use a firewall to restrict traffic. You must also use a firewall to protect your network – you will use Azure Firewall.

The scenario requires:

  • You use a partner-assigned address space
  • All traffic leaving your site and going to the partner network must use a source IP address from the partner assigned address space (SNAT)

Normally, you would accomplish this using your firewall appliance. However, Azure Firewall does not offer SNAT for private IP connections.

You might think “I’ve read that Virtual Network Gateway can do NAT rules”. Yes, the VPN Gateway can do NAT rules but the ExpressRoute Gateway does not have that feature.

The solution in this post will use a Linux virtual machine to implement SNAT.

The Architecture

Here is an image of the architecture:

A feature of this design is that the workload that will use the partner service is separated from the NAT appliance and the ExpressRoute circuit. This is because:

  • It allows flexibility with the workload to change location, design, platform, etc.
  • The partner connection is isolated and must route through a firewall appliance in the hub, ideally with advanced security features enabled.

The Workload

Let’s start with the description of the workload. The workload, some kind of compute capable of egress traffic on a VNet, is deployed in a spoke virtual network. The virtual network is a part of a hub-and-spoke architecture – it is peered to a hub. The workload has a route table that forces all egress traffic (0.0.0.0/0) to use the Azure Firewall in the hub as the next hop.

The Hub

The hub features an AzureFirewallSubnet with the Azure Firewall. There is a route table assigned to the subnet. Route propagation is enabled – to allow routes to propagate from site-to-site networking that is used by the organization. The purpose of this route table is to add specific routes, such as this scenario where we want to force traffic to the partner address space (129.228.1.0/26) to travel via the backend interface of the NAT appliance.
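As a hedged sketch of that route, assuming a hypothetical hub resource group (rg-hub), route table name (rt-azurefirewall), and backend NIC address (10.10.2.4) for the NAT appliance:

az network route-table route create \
  --resource-group rg-hub \
  --route-table-name rt-azurefirewall \
  --name to-partner \
  --address-prefix 129.228.1.0/26 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.10.2.4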

The partner address space (129.228.1.0/26) should be added as an additional private IP address (SNAT) range on the Azure Firewall – traffic to this prefix should not be forced out to the Internet.

Ideally, this firewall is the Premium SKU and has IDPS enabled.

The NAT Solution

The NAT solution is deployed in a “NAT virtual network”, dedicated to the partner ExpressRoute circuit. The hub is peered with the NAT virtual network – “gateway sharing” and “use remote gateway” are disabled – this is to prevent route propagation and to prevent incompatibilities between the hub and the NAT virtual network because they both have Virtual Network Gateways.

The NAT virtual machine (I used Ubuntu) is deployed as a Ds3_v2 – a commonly used series in NVAs because it has good network throughput compared to price (there is no Hyperthreading). The VM has two network interfaces:

  • eth1: This is the backend NIC. This NIC is the next hop that is used by the AzureFirewallSubnet route table in the hub for traffic going to the partner subnet. In other words, traffic from the organisation workload will route through the firewall, and then through this interface to get to the partner. This subnet uses an internal address range. A route table forces all traffic to 0.0.0.0/0 to use the hub firewall as the next hop. Route propagation is disabled – we do not want this NIC to learn routes to the partner. An NSG on this subnet denies all inbound traffic – we want to reject packets from the partner network and all connections will be outbound.
  • eth0: This is the interface that will communicate with the partner over ExpressRoute. This subnet uses an address range that is assigned by the partner. All traffic going to the partner from the organisation will use the IP address of this NIC. A route table forces all traffic to 0.0.0.0/0 to use the hub firewall as the next hop. Route propagation is enabled – this NIC must learn routes to the partner from the ExpressRoute Gateway (a useful place to verify BGP routing via Effective Routes). An NSG on this subnet will only accept connections from the IP address of the workload compute (resource or subnet depending on the nature of networking) with the required protocol/port numbers.

An ExpressRoute Gateway is deployed in the NAT virtual network. The ExpressRoute Gateway is connected to a circuit that connects the organisation to the partner.

The partner has a firewall that only permits traffic from the organisation if it uses a source IP address from the address range that they assigned to the organization.

Configuring Linux

I am allergic to Penguins so this took some googling 🙂 Here are the things to note:

  • 129.228.1.0/26 is the partner network.
  • 129.228.250.4 is the address of eth0, the frontend or SNAT NIC on the Linux VM.

You will log into the VM (Azure Bastion is your friend) and elevate to root:

sudo -i

You will need to install some packages:

apt-get update
apt-get -y install net-tools
apt-get -y install iptables-persistent
apt-get -y install nc

Verify that eth0 is the (default) frontend NIC and that eth1 is the backend NIC:

ifconfig

Enable forwarding in the kernel:

echo 1 > /proc/sys/net/ipv4/ip_forward

Configure the change persistently by editing the sysctl.conf file using the vi editor:

vi /etc/sysctl.conf

Find the below line and remove the comment so that it becomes active:

net.ipv4.ip_forward = 1

Now for some vi fun: Type the following to save the changes:

:wq

Apply change:

sysctl -p

Verify the above change:

sysctl net.ipv4.ip_forward

Next, you will configure forwarding between eth1 (the backend) and eth0 (the frontend):

iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT

And then you will enable iptables Masquerading

iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE

At this point, routing from eth1 to eth0 is enabled but the source address is not being changed. The following line will change the source address of traffic leaving eth0 to use the partner assigned address.

iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 129.228.250.4

You can now test the connection from your workload to the partner. If everything is correct, a connection is possible … but your work is not done. The iptables configuration is not persistent! You will fix that with these commands:

sudo apt install iptables-persistent
sudo iptables-save > /etc/iptables/rules.v4
sudo ip6tables-save > /etc/iptables/rules.v6

Now you should reboot your virtual machine and verify that your iptables configuration is still there:

iptables -t nat -v -L POSTROUTING -n --line-number
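With the rules in place, you can run a quick connectivity check using netcat (installed earlier). The partner IP address and port below are placeholders for a real partner endpoint:

nc -vz 129.228.1.10 443

A successful TCP connection from here (and from the workload, routed via the hub firewall) confirms that forwarding and SNAT are working.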

A good tip now is to make sure that you have enabled Azure Backup and that your VM is being backed up. Also follow other good practices, such as managing patching for Linux and implementing Defender for Cloud for the subscription.

Wrapping Up

There you have it; you have created a “DMZ” that enables an ExpressRoute connection to a remote partner network. You have protected yourself against the partner. You have complied with the partner’s requirements to use an IP address that they have assigned to you. You still have the ability to use site-to-site networking for yourself and for other partners without introducing potential incompatibilities. And you have not handed over fists full of money to an NVA vendor.

Cosmos DB Replicas With Private Endpoint

This post explains how to make Cosmos DB replicas available using Private Endpoint.

The Problem

A lot of (most) Azure documentation and community content assumes that PaaS resources will be deployed using public endpoints. Some customers have the common sense not to use public endpoints – who wants to be a zero-day target for well-armed attackers?!

Cosmos DB is a commonly considered database for born-in-the-cloud workloads. One of the cool things about Cosmos DB is the ability to use any number of globally dispersed read-only or write replicas with pretty low replication latency.

But there is a question – what happens if you use Private Endpoint? The Cosmos DB account is created in a “primary” region. That Private Endpoint connects to a virtual network in the primary region. If the primary region goes offline (it does happen!) then how will clients redirect to another replica? Or if you are building a workload that will exist in many regions, how will a remote footprint connect to the local Cosmos DB replica?

I googled and landed on a Microsoft forum post that asked such a question. The answer was (in short) “The database will be available, how you connect to it is your and Azure Network’s problem”. Very helpful!

Logically, what we want is:

What I Figured Out

I’ve deployed some Cosmos DB using Private Endpoint as code (Terraform) in the recent past. I noticed that the DNS configuration was a little more complex than you usually find – I needed to create a Private DNS Zone for:

  • The Cosmos DB service type
  • Each Azure region that the replica exists in for that service type

I fired up a lab to simulate the scenario. I created a Cosmos DB account in North Europe. I replicated the Cosmos DB account to East US. I created a VNet in North Europe and connected the account to the VNet using a Private Endpoint.

Here’s what the VNet’s connected devices look like:

As you can see, the clients in/peered with the North Europe VNet can access their local replica and the East US replica via the local Private Endpoint.

I created a second VNet in East US. Now the important bit: I connected the same Cosmos DB account to the VNet in East US. When you check out the connected devices in the East US VNet then you can see that clients in/peered to the East US VNet can connect to the local and remote replica via the local Private Endpoint:

DNS

Let’s have a look at the DNS configurations in Private Endpoints. Here is the one in North Europe:

If we enable the DNS zone configuration feature to auto-register the Private Endpoint in Azure Private DNS, then each of the above FQDNs will be registered and they will resolve to the North Europe NIC. Sounds OK.

Here is the one in East US:

If we enable the DNS zone configuration feature to auto-register the Private Endpoint in Azure Private DNS, then each of the above FQDNs will be registered and they will resolve to the East US NIC. Hmm.

If each region has its own Private DNS Zones then all is fine. If you use Private DNS zones per workload or per region then you can stop reading now.

But what if you have more than just this workload and you want to enable full name resolution across workloads and across regions? In that case, you probably (like me) run central Private DNS Zones that all Private Endpoints register with no matter what region they are deployed into. What happens now?

Here I have set up a DNS zone configuration for the North Europe Private Endpoint:

Now we will attempt to add the East US Private Endpoint:

Uh-oh! The records are already registered and cannot be registered again.

WARNING: I am not a Cosmos DB expert!

It seems to me that using the DNS Zone configuration feature will not work for you in the globally shared Private DNS Zone scenario. You are going to have to configure DNS as follows:

  • The global account FQDN will resolve to your primary region.
  • The North Europe FQDN will resolve to the North Europe Private Endpoint. Clients in North Europe will use the North Europe FQDN.
  • The East US FQDN will resolve to the East US Private Endpoint. Clients in East US will use the East US FQDN.

This means that you must manage the DNS record registrations, either manually or as code (a hedged CLI example follows the list):

  1. Register the account record with the “primary” resource/Private Endpoint IP address: 10.1.0.4.
  2. Register the North Europe record with the North Europe Private Endpoint IP: 10.1.0.5.
  3. Register the East US record with the East US Private Endpoint IP: 10.2.0.6.
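A hedged sketch of those registrations with the Azure CLI, assuming the Cosmos DB (SQL API) zone name privatelink.documents.azure.com, a DNS resource group called rg-dns, and an account called myaccount – the regional record names are illustrative, so copy the exact FQDNs shown in the Private Endpoint DNS configuration:

az network private-dns record-set a add-record -g rg-dns -z privatelink.documents.azure.com -n myaccount -a 10.1.0.4
az network private-dns record-set a add-record -g rg-dns -z privatelink.documents.azure.com -n myaccount-northeurope -a 10.1.0.5
az network private-dns record-set a add-record -g rg-dns -z privatelink.documents.azure.com -n myaccount-eastus -a 10.2.0.6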

This will mean that clients in one region that try to access another region (via failover) will require global VNet peering and NSG/firewall access to the remote Private Endpoint.

Azure WAF and False Positives

This post will explain how to override false positives in the (network) Azure Web Application Firewall (WAF), without compromising security, using one of four methods in combination with a tiered WAF Policy architecture:

  1. Managed Rulesets
  2. Custom Rules
  3. Exclusions
  4. Disabled rules

False Positives

A WAF is a rather simple solution, attempting to inspect L7 (application layer) traffic and intercept attacks such as protocol misuse, SQL injection, or cross-site scripting. Unfortunately, false positives can occur.

For example, let’s assume that an API app is securely shared using a WAF. Messages sent to the API might be formatted in JSON, with lots of special characters to format the message. SQL Injection defenses count special characters, trying to find where an attacker is trying to escape out of a web request to create a database command that will execute. If the defense counts too many special characters (it will!) then an alert will be created and the message will be blocked if Prevention mode is enabled.

One must allow that traffic through because it is expected traffic that the application (and the business) requires. But one must do this without opening up too many holes in the WAF, which would make the WAF a costly but pointless exercise.

Log Analytics Ingestion Charge

There is a side effect to false positives. False positives will vastly outnumber actual attack/probing attempts. Busy workloads can generate huge amounts of logs for false positives. If you use Log Analytics, that data has a cost:

  • Storage: Not too bad
  • Ingestion: This one is painful

The way to reduce the cost is to reduce the noise by overriding the detections that create false positives. Organizations that have a lot of web traffic could save a significant amount of money here.
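
If you want to see how much of that noise you are paying to ingest, something like the following sketch can help. It assumes your Application Gateway diagnostics land in the AzureDiagnostics table, that you have the Az.OperationalInsights module, and that the _BilledSize column is available – the workspace ID is a placeholder.

# Hypothetical workspace ID - replace with your own
$WorkspaceId = "00000000-0000-0000-0000-000000000000"

# Sum the billed size of WAF firewall log records over the last 30 days
$Query = @"
AzureDiagnostics
| where TimeGenerated > ago(30d)
| where Category == "ApplicationGatewayFirewallLog"
| summarize IngestedGB = sum(_BilledSize) / 1024 / 1024 / 1024
"@

(Invoke-AzOperationalInsightsQuery -WorkspaceId $WorkspaceId -Query $Query).Results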

WAF Policies

The WAF functionality of the Azure Application Gateway (AppGw) is managed by a resource called an Application Gateway WAF Policy (WAF Policy). The typical approach is to associate 1 WAF Policy with a WAF resource; the WAF Policy is where you create your customisations. For reasons that should become apparent later, I am going to urge you to take a slightly more granular approach to managing your WAF if your WAF is used to securely share more than one workload or listener (a sketch follows the list below):

  • WAF parent policy: A WAF policy will be associated with the WAF. This policy will apply to the WAF and all listeners unless another WAF Policy overrides specific settings.
  • Per-Listener/Per-Workload policy: This is a policy that is created specifically for a listener or a workload (a set of listeners). Any customisations that apply only to a listener or a workload will be applied here, without affecting any other listener or workload.
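
As a rough sketch of that tiered layout, the two policies might be created like this with Az PowerShell. The names, resource group, and location are invented, and associating the parent policy with the Application Gateway and the child policy with a listener is done on the gateway/listener itself, so it is omitted here.

# Shared settings: start everything in Detection mode (see the methodology below)
$Settings = New-AzApplicationGatewayFirewallPolicySetting -Mode Detection -State Enabled

# Parent policy - associated with the WAF, applies to all listeners by default
New-AzApplicationGatewayFirewallPolicy -Name "waf-parent-policy" `
    -ResourceGroupName "p-waf" -Location "westeurope" -PolicySetting $Settings

# Per-listener/per-workload policy - associated with one listener or workload
New-AzApplicationGatewayFirewallPolicy -Name "waf-workload1-policy" `
    -ResourceGroupName "p-waf" -Location "westeurope" -PolicySetting $Settings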

Methodology

You will never know what false positives you will encounter. If your WAF goes straight into Prevention mode then you will create a world of pain and be the recipient of a lot of hate-messages/emails.

Here’s the approach that I recommend:

  1. Protect your WAF with an NSG that has Traffic Analytics enabled. The NSG should only allow the necessary HTTP, HTTPS, WAF monitoring (from Azure), and load balancing traffic. Use a custom deny-all rule to block everything else.
  2. Enable monitoring for the Application Gateway, sending all logs to a queryable destination such as Log Analytics.
  3. Monitor traffic for a period of time – enough to allow expected normal usage of the full systems. Your monitoring should detect the false positives.
  4. Verify that Traffic Analytics did not record malicious IP addresses hitting your WAF.
  5. Query your monitoring data to find the false positives for each listener. Identify the hostname, request URI, ruleset, rule group, and rule ID that is causing the issue on a per-listener/workload basis (a sample query follows this list).
  6. Ideally, developers fix any issues that create false positives but this is unlikely – so we’ll move on.
  7. Determine your override strategy (see below).
  8. Deploy your overrides with the policies still in Detection mode.
  9. Monitor traffic for another period of time to ensure that there are no more false positives.
  10. Switch the parent policy to Prevention Mode.
  11. Switch each per-listener/per-workload policy to Prevention Mode.
  12. Monitor
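
For step 5, the query below is the kind of thing I mean. It is only a sketch: it assumes the diagnostics go to the AzureDiagnostics table, that you have the Az.OperationalInsights module, and the column names (hostname_s, requestUri_s, ruleSetType_s, ruleId_s) may differ in your schema, so verify them first.

$WorkspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder - use your own workspace ID

# Group WAF rule matches by listener hostname, URI, and rule so the false positives stand out
$Query = @"
AzureDiagnostics
| where TimeGenerated > ago(7d)
| where Category == "ApplicationGatewayFirewallLog"
| summarize Hits = count() by hostname_s, requestUri_s, ruleSetType_s, ruleId_s, Message
| order by Hits desc
"@

(Invoke-AzOperationalInsightsQuery -WorkspaceId $WorkspaceId -Query $Query).Results | Format-Table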

Managed Rule Sets

The WAF today has two rulesets that you can use:

  • OWASP: Used to detect attacks such as SQL Injection, Cross-site scripting, and so on.
  • Microsoft Bot Manager Rule Set: Used to prevent malicious bots from browsing/attacking your workloads.

You need the OWASP ruleset – but we will need to manage it (later). The bot ruleset, in my experience, creates a huge amount of noise with no way of creating granular overrides. One can override the bot ruleset using custom rules, but as you’ll see later, that’s a big stick that is not granular at all!

My approach to this is to disable the Microsoft Bot Manager Rule Set (or leave it disabled) in the parent and child policies. If I need to enable it somewhere, I can do it in a per-listener or per-workload policy.
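
To make that concrete, here is a hedged sketch of a managed rules configuration that enables only the OWASP rule set and simply leaves the bot rule set out. The policy name, resource group, location, and CRS version are assumptions – in practice you would fold this into the policy creation shown earlier.

# Enable the OWASP (CRS) rule set only; the Microsoft Bot Manager Rule Set is simply not added
$Owasp        = New-AzApplicationGatewayFirewallPolicyManagedRuleSet -RuleSetType "OWASP" -RuleSetVersion "3.2"
$ManagedRules = New-AzApplicationGatewayFirewallPolicyManagedRule -ManagedRuleSet $Owasp

# Apply the managed rules when creating (or updating) the policy
New-AzApplicationGatewayFirewallPolicy -Name "waf-parent-policy" -ResourceGroupName "p-waf" `
    -Location "westeurope" -ManagedRule $ManagedRules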

Custom Rules

A custom rule is created in a WAF Policy to force traffic that matches certain criteria to be:

  • Always allowed
  • Always denied
  • Logged only without denying it

You can create a sequence of filters based on:

  • IP Address
  • Number
  • String
  • Geo Location

If the set of filters matches a request then your desired action will apply. For example, if I want to force traffic to be allowed to my API, I can enter the API URI as one of the filters (as above) and all traffic will be allowed.

Yes, all traffic will be allowed, including traffic that is not a false positive. If I only had a few OWASP rules that were blocking the traffic, the custom rule would disable all OWASP rules.

If you must use this approach, then implement it in the child policy so it is limited to the associated listener/workload.
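
If you do go down this road, a minimal sketch of such a custom rule – added to the child (per-listener/per-workload) policy – might look like the following. The URI string, names, and priority are invented for the example.

# Match any request whose URI contains the (hypothetical) API path
$UriVariable = New-AzApplicationGatewayFirewallMatchVariable -VariableName RequestUri
$Condition   = New-AzApplicationGatewayFirewallCondition -MatchVariable $UriVariable `
                   -Operator Contains -MatchValue "/api/orders" -NegationCondition $false

# Allow everything that matches - remember, this bypasses the OWASP rules for that traffic
$AllowApi = New-AzApplicationGatewayFirewallCustomRule -Name "AllowOrdersApi" -Priority 10 `
                -RuleType MatchRule -MatchCondition $Condition -Action Allow

# Add the custom rule to the per-listener/per-workload policy
New-AzApplicationGatewayFirewallPolicy -Name "waf-workload1-policy" -ResourceGroupName "p-waf" `
    -Location "westeurope" -CustomRule $AllowApi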

Exclusions

This is the newest of the override types in WAF Policy – and I’ve found it to be the least useful.

The theory is that you can create an exclusion for one or more OWASP rules based on the values of request headers. For example, if a header called RequestHeaderKeys contains a value of X-Scanner you can instruct the affected OWASP rules to be disabled. This sounds really powerful and quite granular. But this starts to fall apart with other scenarios, such as the aforementioned SQL Injection.

Another common rule that alerts on or blocks traffic is Missing User Agent Header. Exclusions work on the value of a header, so if the header is missing, Exclusions cannot evaluate it.

Another gotcha is that you cannot combine header filters to create an exclusion. The Azure Portal experience for creating an Exclusion makes it look like you can. However, the result is two or more Exclusions that work independently.

If Exclusions will work for you, implement them in the per-listener/per-workload policy and specify only the rules that must be overridden (see the sketch after this list). This approach will limit the effect of the exclusion:

  1. The scope is just the listener/workload that is associated with the WAF Policy.
  2. The scope is further limited to just requests where the header matches, allowing all other requests and all OWASP rules to be applied.
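
For completeness, here is a hedged sketch of an Exclusion built with Az PowerShell, reusing the X-Scanner header example from above. Scoping the exclusion to specific rules requires the newer exclusion-managed-rule-set parameters, which I have left out for brevity, and the names and CRS version are assumptions.

# Exclude the (hypothetical) X-Scanner request header from managed rule evaluation
$Exclusion = New-AzApplicationGatewayFirewallPolicyExclusion -MatchVariable RequestHeaderNames `
                 -SelectorMatchOperator Equals -Selector "X-Scanner"

# Fold the exclusion into the managed rules of the per-listener/per-workload policy
$Owasp        = New-AzApplicationGatewayFirewallPolicyManagedRuleSet -RuleSetType "OWASP" -RuleSetVersion "3.2"
$ManagedRules = New-AzApplicationGatewayFirewallPolicyManagedRule -ManagedRuleSet $Owasp -Exclusion $Exclusion

New-AzApplicationGatewayFirewallPolicy -Name "waf-workload1-policy" -ResourceGroupName "p-waf" `
    -Location "westeurope" -ManagedRule $ManagedRules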

Disabled Rules

The final approach that you can use is to disable rules that are creating false positive alerts. A simple workload might only require one or two rules to be disabled. An older & larger workload might require many OWASP rules to be disabled!

If you are going to disable OWASP rules, then do it in the per-listener/per-workload policy. This will limit the effect of the changes to that listener/workload.

This is a fairly easy approach and it is pretty granular – though not as granular as Exclusions. The downside is that you are completely disabling certain protections for an entire listener/workload, leaving the workload vulnerable to attacks of those previously protected types.
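
A hedged sketch of disabling a single OWASP rule in the per-listener/per-workload policy follows. The rule ID and rule group name are just examples (a SQL injection rule from the CRS), and the policy names are invented – substitute the rules that your logs identified.

# Disable one rule (942430, a restricted SQL character anomaly rule) in its rule group - example values only
$RuleOverride  = New-AzApplicationGatewayFirewallPolicyManagedRuleOverride -RuleId "942430"
$GroupOverride = New-AzApplicationGatewayFirewallPolicyManagedRuleGroupOverride `
                     -RuleGroupName "REQUEST-942-APPLICATION-ATTACK-SQLI" -Rule $RuleOverride

# Apply the override to the OWASP rule set in the per-listener/per-workload policy
$Owasp        = New-AzApplicationGatewayFirewallPolicyManagedRuleSet -RuleSetType "OWASP" `
                    -RuleSetVersion "3.2" -RuleGroupOverride $GroupOverride
$ManagedRules = New-AzApplicationGatewayFirewallPolicyManagedRule -ManagedRuleSet $Owasp

New-AzApplicationGatewayFirewallPolicy -Name "waf-workload1-policy" -ResourceGroupName "p-waf" `
    -Location "westeurope" -ManagedRule $ManagedRules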

Combinations

If you have the time and the data, you can combine different approaches. For example:

  • A webhook that comes from the same IP address all of the time can be allowed via a Custom Rule based on an IP Address filter. Any other traffic will be subject to the full defenses of the WAF.
  • If you have certain headers that must be allowed and you want to enable all other protections for all other traffic then use Exclusions.
  • If traffic can come from anywhere and you need to override OWASP rules, then disable those rules.

No Great Solution

In summary, there is no perfect solution. The best you can do is find the correct override solution for the specific false positive and deploy it to a specific listener or workload. This will limit the holes that you create in the WAF to the absolute minimum while enabling your workloads to function.

Checking If Client Has Access To KeyVault With Private Endpoint

How to detect connections to a PaaS resource using Private Endpoint.

In this post, I’ll explain how to check if a client service, such as an App Service, has access to an Azure Key Vault with Private Endpoint.

Private Endpoint

In case you do not know, Private Endpoint gives us a mechanism where we can attach a PaaS service, such as a Key Vault, to a subnet with a NIC and a private IP address. Public connections to the PaaS resources are disabled, and an (Azure) Private DNS Zone is used to alter the name resolution of the PaaS resource to point to the private IP address.

Note that communications to the private endpoint are inbound (and response only). The PaaS resource cannot make outbound connections over a Private Endpoint.
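
As a quick, hedged illustration of that mechanism (all of the resource names below are invented), creating a Private Endpoint for a Key Vault with Az PowerShell looks roughly like this; the Private DNS Zone wiring is left out for brevity.

# Look up the (hypothetical) Key Vault, VNet, and Private Endpoint subnet
$Vault  = Get-AzKeyVault -VaultName "p-we1-kv"
$VNet   = Get-AzVirtualNetwork -Name "p-we1-vnet" -ResourceGroupName "p-we1-network"
$Subnet = Get-AzVirtualNetworkSubnetConfig -Name "PrivateEndpointSubnet" -VirtualNetwork $VNet

# The connection targets the Key Vault resource; "vault" is the sub-resource (group ID)
$Connection = New-AzPrivateLinkServiceConnection -Name "p-we1-kv-plsc" `
                  -PrivateLinkServiceId $Vault.ResourceId -GroupId "vault"

# Create the Private Endpoint NIC in the Private Endpoint subnet
New-AzPrivateEndpoint -Name "p-we1-kv-pep" -ResourceGroupName "p-we1-network" `
    -Location "westeurope" -Subnet $Subnet -PrivateLinkServiceConnection $Connection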

My Scenario

The customer has an App Service Plan that has VNet Integration enabled – this allows the App Services to make outbound connections from “random” IPs on this subnet – NSG/Firewall rules should permit access from the subnet prefix.

The App Services on the plan have Private Endpoints on a second subnet in the VNet. There is also a Key Vault, which also has a Private Endpoint. The “Private Endpoint subnet” has an NSG to deny everything except desired traffic, including allowing HTTPS from the VNet Integration subnet prefix to the Key Vault Private Endpoint.

A developer was wondering if connections from an App Service were working and asked if we could see this in the logs.

Problem

The dev in this case wanted to verify network connectivity. So the obvious place to check was … the network! The way to do that is usually to verify that packets arrived at the destination NIC. You can do that (normally) using NSG Flow Logs. There is sometimes up to 25 minutes (or longer during pandemic compute shortages) of a wait before a flow appears in Log Analytics (data export from the host, 10 minutes collection interval [in our case], data processing [15 minutes]). We checked the logs but nothing was there.

And that is because (at this time) NSG Flow Logs cannot record flows destined to Private Endpoints.

We need a different way to trace connections.

Solution

The solution is to check the logs of the target resource. We enable a lot of logging as standard, including the logs for Key Vault. A little bit of KQL-fu produced this query:

AzureDiagnostics
| where ResourceProvider =="MICROSOFT.KEYVAULT"
| where ResourceId contains "nameOfVault"
| project CallerIPAddress, OperationName, requestUri_s, ResultType, identity_claim_xms_mirid_s

The resulting columns were:

  • CallerIPAddress: The IP address of the client (the IP address used by the App Service Plan VNet integration, in our case)
  • OperationName: Things like SecretGet, Authentication, VaultGet, and SecretList
  • requestUri_s: The URI of the secret being requested
  • ResultType: Was it a success or not?
  • identity_claim_xms_mirid_s: The resource ID of the requesting client (the resource ID of the App Service, in our case)

Armed with the resulting info, the dev got what they needed to prove that the App Service was connecting to the Key Vault.

PowerShell – Check VMSS Instance Image/Model Versions

Here is a PowerShell script to check the Image or Model versions of each instance in an Azure Virtual Machine Scale Set (VMSS):

$ResourceGroup = "p-we1dep"
$Vmss = "p-we1dep-windows-vmss"

# Find all the instances in the VMSS
$Instances = Get-AzVmssVM -ResourceGroupName $ResourceGroup -VMScaleSetName $Vmss

Write-Host "Instance image versions of VMSS: $Vmss"

# For each instance in the VMSS
foreach ($Instance in $Instances) {

    # Get the exact image version of the instance
    $InstanceExactVersion = (Get-AzVmssVM -ResourceGroupName $ResourceGroup -VMScaleSetName $Vmss -InstanceId $Instance.InstanceId).StorageProfile.ImageReference.ExactVersion
    $Id = $Instance.InstanceId

    # Echo the instance ID and exact version
    Write-Host "Instance $Id - $InstanceExactVersion"
}