Terrafying Azure – A Tale From The Dark Side

This post is a part of the Azure Back to School 2023 online event. In this post, I will discuss using Microsoft Azure Export for Terraform, also known as Aztfexport and previously known as Azure Terrafy (a great name!), to create Terraform code from existing Azure deployments, why you would do it, and share a few tips.

Terraform

Terraform is one of a few Infrastructure-as-Code (IaC) languages out there that support Microsoft Azure. You might wonder why I would use it when Azure has ARM and Bicep. I’ll do a quick introduction to Terraform and then explain my reasoning which you are free to disagree with 🙂

Terraform is a product of Hashicorp available as a free-to-use product that is supported with some paid-for services. Like other IaC languages, it describes and desired end result. The major feature that differs from the native Azure languages is the use of state files – a file that describes what is deployed in Azure. This state file has a few nice use cases, including:

  • The outputs of a resource are documented, enabling effortless integration between resources in the same or even different files – with some effort, outputs from different deployments can be included in another deployment.
  • A true what-if engine that (mostly) works, unlike the native what-if in Azure, greatly reducing the time required for deployments and the ability to plan (pre-review) a deployment’s expected changes.

My first encounter with Terraform was a government project where the customer wanted to use Terraform over Bicep. Their reasoning was that elected politicians come and go, and suppliers come and go. If they were going to invest in an IaC skillset, they wanted the knowledge to be transferrable across clouds.

That’s the big advantage of Terraform. While the code itself is not cloud portable, the skill is. Terraform uses providers to be able to manage different resource types. Azure is a provider, written by Microsoft. Azure AD is a provider – ARM/Bicep still do not support Azure AD! AWS and GCP have providers. VMware has a provider. GitHub has a provider – the list goes on and on. If a provider does not exist, you can (in theory) write your own.

On that project, I was meant to be hands-off as an architect. But there were staffing and scheduling issues so I stepped up. Having never written a line of Terraform before I had my first workload, with some review help from a teammate, written in under a day. By the way, the same thing in Bicep took three days! Terraform is really well documented, with lots of examples, and the language makes sense.

Unlike Bicep, which is still beholden to a lot of the complexity of ARM. Doing simple things can involve stupidly complicated functions that only a C programmer (I used to be one) could enjoy (and I didn’t). I got hooked on Terraform and convinced my colleagues that it was a better path than Bicep, which was our original plan to replace ARM/JSON.

Aztfexport

Switching Terraform creates a question – what do we do with our existing workloads which are either deploying using Click Ops (Portal), script, or ARM/Bicep?

Microsoft has created a tool called Azure Export for Terraform (Aztfexport) on GitHub. The purpose of this tool is to take an existing resource group/resource/Graph query string and export it as Terraform code.

The code that is produced is intended to be used in some other way. In other words, Microsoft is not exporting code that should be able to immediately deploy new resources. They say that the produced code should be able to pass a terraform plan where the existing resources are compared with the state file and the code and say “the code is clean and there are no changes required”.

The Terraform configurations generated by aztfexport are not meant to be comprehensive and do not ensure that the infrastructure can be fully reproduced from said generated configurations. For details, please see limitations).

Azure/aztfexport: (github.com)

Why Use Aztfexport?

If I can’t use the code to deploy resources then what value is it? Hopefully you will see what aztfexport is a central part of my toolkit. I see it being useful in the following ways:

  • Learning Terraform: If you’ve not used Terraform before then it’s useful to see how the code can be produced, especially from resources that you are already familiar with.
  • Creating TF for an existing workload: You need to “terrafy” a resource/resource group and you want a starting point.
  • Azure-to-Azure migrations: You have a set of existing resources and you want to get a dump of all the settings and configurations.
  • Learning how a resource type/solution is coded: My favourite learning method is to follow the step-by-step and then inspect the resource(s) as code.
  • Understand how a resource type/solution works: This is a logical jump from the previous example, now including more resources as a whole solution.
  • Auditing: Comparing what is there with what should be there – or not there.
  • Documentation: The best form of resource documentation is IaC – why create lengthy documentation when the code is the resource?

I did use Aztfexport to learn Terraform more. In my current project, I have used it again and again to do Azure-to-Azure migrations, taking legacy ClickOps deployments and rewriting them as new secure/governed deployments. I’ve save countless hours capturing settings and configurations and re-using them as new code.

The Bad Stuff

Nothing is perfect, and Aztfexport has some thorns too. Notice that the expected usage is that the produced code should pass a terraform plan. That is because in many situations (like with ARM exports) the code is not usable to deploy resources. That can be because:

  • ARM APIs do not expose everything, so how can Terraform get those settings?
  • The tool or the providers using used do not export everything.

One example I’ve seen includes App Services configurations that do not include the code type details. Another recent one was with WAF Policies overridden WAF rules were not documented. In both cases, the code would pass a plan. But neither would re-produce the resources. I’ve learned that I do need to double-check things with a resource type that I’ve never worked with before – then I know what to go and manually grab either from an ARM export or a visual inspection in the Portal.

Another thing is that the resources are named by a “machine” – there is no understanding of the role. Every resource is res-1, res-2, and so on, no matter the type or the role in the workload. That is a bit anonymous, but I find that useful when inspecting dependencies between resources.

A giant main.tf file is created, which I break up into many smaller files. I can find relationships based on those easy-to-track dependencies and logically group resources where it suits my coding style.

One feature of TF is the easy reuse of resource IDs. One can easy refer to resource_type.resource_name.id in a property and know that the resource ID of that resource will be used. Unfortunately, some Aztfsexport code doesn’t do that so you get static resource IDs that should be replaced – that happens with other properties of resources too, so that all should be cleaned up to make code more reusable.

Installing Aztfexport

You will need to install Terraform – I prefer to use a Package Manager for that – the online instructions for a manual installation are a mess. You will also require Azure CLI.

The full instructions for installing Aztfexport are shared on GitHub, covering Windows, MacOS and Linux. The Windows installation is easy:

winget install aztfexport

You will need to restart your terminal (Windows) to get an updated Path variable so the aztfexport binary can be found.

Before you use aztfexport, you will need to log in using Azure CLI:

Open your terminal

Login:
az login

Change subscription:
az account set -subscription <subscription ID>

Verify the correct subscription was selected by checking the resource groups:
az group list

Create an empty folder on your PC and navigate to that folder in your terminal. The aztfexport tool requires an empty folder, by default, to create an export including all the required provider files and the generated code.

If you want to create an export of a single resource then you can run:

aztfexport resource <resource ID>

If you want to create an export of a resource group, then you can run:

aztfexport resource-group -n <resource group name>

Not the -n above means “don’t bother me with manual confirmation of what resources to include in the export”. In Terraform, sub-resources that can be managed as their own Terraform resources would otherwise need to be confirmed and that gets pretty tiresome pretty fast.

Tips

I’ve got to hammer on this one again, the produced code is not intended for deployment. Take the code, copy and paste it into new files and clean it up.

If your goal is to take over an existing IaC/ClickOps deployment with Terraform then you are going to have some fun. The resources already exist and Terraform is going to be confused because there is no state file. You will have to produce a state file using Terraform export for every resource definition in your code. That means knowing the resource IDs of everything, including Azure AD objects, role assignments, and sub-resources. You’ll need to understand the format of those resource IDs – use an existing state file for that. Often the resource ID is the simple Azure resource ID, or a derivation of a parent resource ID that you can figure out from another state file. Sometimes you need to wander through Azure AD (look at assignments in scopes that you do have access to if you don’t have direct Azure AD rights), use Azure CLI to “list” resources or items, or browse around using Resource Explorer in the Azure Portal.

Do take some time to compare your code with any previous IaC code or with an ARM export. Look for things that are missing – Terraform has many defaults that won’t be included and that code is missing because it is not required. I often include that code because I know that they are settings that Devs/Ops might want to tune later.

If you have the misfortune of having to work an existing Terraform module library then you will have to translate the exported code as parameter/variable files for the new code – I do not envy you 🙂

Summary

This post is an introduction to Microsoft Azure Export for Terraform and a quick how-to-get-started guide. There is much more to learn about, such as how to use a custom backend (if resource names in Terraform are not a big deal and to eliminate the terraform import task) or even how to use a resource map to identify resources to export across many resource groups.

The tool is not perfect but it has saved me countless hours over the last year or so, dating back to when it was called Azure Terrafy. It’s one in my toolkit and I regularly break it out to speed up my work. In my opinion, anyone starting to work with Terraform should install and use this tool.

Referencing Private Endpoint IP Addresses In Terraform

It is possible to dynamically retrieve the resulting IP address of an Azure Private Endpoint and use it in other resources in Terraform. This post will show you how.

Scenario

You are building some PaaS resources using Private Endpoints. You have no idea what the IP addresses are going to be. But you need to use those IP addresses elsewhere in your Terraform code, for example in an NSG rule. How do you get the IP addresses?

Find The Properties

The trick for this is to use the terraform state command. In my case, I deployed a Cosmos DB resource using azurerm_private_endpoint.cosmosdb-account1. To view the state of the resource, I can run:

terraform state show azurerm_private_endpoint.cosmosdb-account1

That outputs a bunch of code:

Terraform state of a Cosmos DB resource

You can think of the exposed state as a description of the resource the moment after it was deployed. Everything in that state is addressable. A common use might be to refer to the resource ID (azurerm_private_endpoint.cosmosdb-account1.id) or resource name (azurerm_private_endpoint.cosmosdb-account1.name) properties. But you can also get other properties that you don’t know in advance.

The Solution

Take another look at the above diagram. There is an array property called private_dns_zone_configs that has one item. We can address this property as azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].

In there there is another array property, with two items, called record_sets. There is one record set per IP address created for this private endpoint. We can address these properties as azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0] and azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].

Cosmos DB creates a private endpoint with multiple different IP addresses. I deliberately chose Cosmos DB for this example because it shows a more complex probelm and solution, demonstrating a little bit more of the method.

Dig into record_sets and you’ll find an array property called ip_addresses with 1 item. If I want the two IP addresses of this private endpoint then I will use: azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0].ip_addresses[0] and azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].ip_addresses[0].

Using the Addresses

destination_address_prefixes = [
 azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[0].ip_addresses[0], // Cosmos DB Private Endpoint IP 1
 azurerm_private_endpoint.cosmosdb-account1.private_dns_zone_configs[0].record_sets[1].ip_addresses[0] // Cosmos DB Private Endpoint IP 2
 ]                       
}

And now I have code that will deploy an NSG rule with the correct destination IP address(es) of my private endpoint without knowing them. And even better, if something causes the IP address(es) to change, I can rerun my code without changing it, and the rules will automatically update.

Pros & Cons: IaC Modules

This post will discuss the pros & cons of creating & using Infrastructure-as-Code/IaC Modules – based on 2 years of experience in creating and using a modular approach.

Why Modules?

Anyone who has done just a little bit of template work knows that ARM templates can get quickly get too big. Even a simple deployment, like a hub & spoke network architecture, can quickly expand out to several hundred lines without very much being added. Heck, when Microsoft first released the Cloud Adoption Framework “Enterprise Scale” example architecture, one of the ARM/JSON files was over 20,000 lines long!

The length of a template file can cause so many issues, including but definitely not limited to:

  • It becomes hard to find anything
  • Big code becomes hard code to update – one change has many unintended repercussions
  • Collaboration becomes near impossible
  • Agility is lost

One of the pain points that really annoyed one of my colleagues is that “big code” usually becomes non-standardised code; that becomes a big issue when a “service organisation” is supporting multiple clients (consulting company, managed services, Operations, or cloud centre of excellence).

Modularisation

The idea of modularisation is that commonly written code is written once as a module. That module is then referred to by other code whenever the functions of the module are required. This is nothing new – the concept of an “include” or “DLL” is very old in the computing world.

For example, I can create a Bicep/ARM/Terraform module for an Azure App Service. My module can deploy an App Service the way that I believe is correct for my “clients” and colleagues. It might even build some governance in, such as a naming standard, by automating the naming of the new resource based on some agreed naming pattern. Any customisations for the resource will be passed in as parameters, and any required values for inter-module dependencies can be passed out as outputs.

Quickly I can build out a library of modules, each deploying different resource types – now I have a module library. All I need now is code to call the modules, model dependencies, pass in parameters, and take outputs from one module and pass them in as parameters to others.

The Benefits

Quickly, the benefits appear:

  • You write less code because the code is written once and you reuse it.
  • Code is standardised. You can go from one workload to another, or one client to another, and you know how the code works.
  • Governance is built into the code. Things like naming standards are taken out of the hands of the human and written as code.
  • You have the potential to tap into new Azure features such as Template Specs.
  • Smaller code is easier to troubleshoot.
  • Breaking your code into smaller modules makes collaboration easier.

The Issues

Most of the issues are related to the fact that you have now built a software product that must be versioned and maintained. Few of us outside the development world have the know-how to do this. And quite frankly, the work is time-consuming and detracts from the work that we should be doing.

  • No matter how well you write a module, it will always require updates. There is always a new feature or a previously unknown use case that requires new code in the module.
  • New code means new versions. No matter how well you plan, new versions will change how parameters are used and will introduce breaking changes with some or all previous usage of the module.
  • Trying to create a one-size-fits-all module is hard. Azure App Services are a perfect example because there are dozens if not hundreds of different configuration options. Your code will become long.
  • The code length is compounded by code complexity. Many values require some sort of input, such as NULL. Quickly you will have if-then-elses all over your code.
  • You will have to create a code release and versioning system that must be maintained. These are skills that Ops people typically do not have.
  • Changes to code will now be slowed down. If a project needs a previously unwritten module/feature, the new code cannot be used until it goes through the software release mechanism. Now you have lost one of the key features of The Cloud: agility.

So What Is Right?

The answer is, I do not know. I know that “big code” without some optimisation is not the way forward. I think the type of micro-modularisation (one module per resource type) that we normally think of when “IaC Modules” is mentioned doesn’t work either.

One of the reasons that I’ve been working on and writing about Bicep/Azure Firewall/DevSecOps recently is to experiment with things such as the concept of modularisation. I am starting to think that, yes, the modularisation concept is what we need, but how we have implemented the module is wrong.

My biggest concern with the micro-module approach is that it actually slowed me down. I ended up spending more time trying to get the modules to run cleanly than I would have if I’d just written the code myself.

Maybe the module should be a smaller piece of code, but it shouldn’t be a read-only piece of code. Maybe it should be an example that I can take and modify to my own requirements. That’s the approach that I have used in my DevSecOps project. My Bicep code is written into smaller files, each handling a subset of the tasks. That code could easily be shared in a reference library by a “cloud centre of excellence” and a “standard workload” repo could be made available as a starting point for new projects.

Please share below if you have any thoughts on the matter.

Azure Firewall DevSecOps in Azure DevOps

In this post, I will share the details for granting the least-privilege permissions to GitHub action/DevOps pipeline service principals for a DevSecOps continuous deployment of Azure Firewall.

Quick Refresh

I wrote about the design of the solution and shared the code in my post, Enabling DevSecOps with Azure Firewall. There I explained how you could break out the code for the rules of a workload and manage that code in the repo for the workload. Realistically, you would also need to break out the gateway subnet route table user-defined route (legacy VNet-based hub) and the VNet peering connection. All the code for this is shared on GitHub – I did update the repo with some structure and with working DevOps pipelines.

This Update

There were two things I wanted to add to the design:

  • Detailed permissions for the service principal used by the workload DevOps pipeline, limiting the scope of change that is possible in the hub.
  • DevOps pipelines so I could test the above.

The Code

You’ll find 3 folders in the Bicep code now:

  • hub: This deploys a (legacy) VNet-based hub with Azure Firewall.
  • customRoles: 4 Azure custom roles are defined. This should be deployed after the hub.
  • spoke1: This contains the code to deploy a skeleton VNet-based (spoke) workload with updates that are required in the hub to connect the VNet and route ingress on-prem traffic through the firewall.

DevOps Pipelines

The hub and spoke1 folders each contain a folder called .pipelines. There you will find a .yml file to create a DevOps pipeline.

The DevOps pipeline uses Azure CLI tasks to:

  • Select the correct Azure subscription & create the resource group
  • Deploy each .bicep file.

My design uses 1 sub for the hub and 1 sub for the workload. You are not glued to this bu you would need to make modifications to how you configure the service principal permissions (below).

To use the code:

  1. Create a repo in DevOps for (1 repo) hub and for (1 repo) spoke1 and copy in the required code.
  2. Create service principals in Azure AD.
  3. Grant the service principal for hub owner rights to the hub subscription.
  4. Grant the service principal for the spoke owner rights to the spoke subscription.
  5. Create ARM service connections in DevOps settings that use the service principals. Note that the names for these service connections are referred to by azureServiceConnection in the pipeline files.
  6. Update the variables in the pipeline files with subscription IDs.
  7. Create the pipelines using the .yml files in the repos.

Don’t do anything just yet!

Service Principal Permissions

The hub service principal is simple – grant it owner rights to the hub subscription (or resource group).

The workload is where the magic happens with this DevSecOps design. The workload updates the hub suing code in the workload repo that affects the workload:

  • Ingress route from on-prem to the workload in the hub GatewaySubnet.
  • The firewall rules for the workload in the hub Azure Firewall (policy) using a rules collection group.
  • The VNet peering connection between the hub VNet and the workload VNet.

That could be deployed by the workload DevOps pipeline that is authenticated using the workload’s service principal. So that means the workload service principal must have rights over the hub.

The quick solution would be to grant contributor rights over the hub and say “we’ll manage what is done through code reviews”. However, a better practice is to limit what can be done as much as possible. That’s what I have done with the customRoles folder in my GitHub share.

Those custom roles should be modified to change the possible scope to the subscription ID (or even the resource group ID) of the hub deployment. There are 4 custom roles:

  • customRole-ArmValidateActionOperator.json: Adds the CUSTOM – ARM Deployment Operator role, allowing the ARM deployment to be monitored and updated.
  • customRole-PeeringAdmin.json: Adds the CUSTOM – Virtual Network Peering Administrator role, allowing a VNet peering connection to be created from the hub VNet.
  • customRole-RoutesAdmin.json: Adds the CUSTOM – Azure Route Table Routes Administrator role, allowing a route to be added to the GatewaySubnet route table.
  • customRole-RuleCollectionGroupsAdmin.json: Adds the CUSTOM – Azure Firewall Policy Rule Collection Group Administrator role, allowing a rules collection group to be added to an Azure Firewall Policy.

Deploy The Hub

The hub is deployed first – this is required to grant the permissions that are required by the workload’s service principal.

Grant Rights To Workload Service Principals

The service principals for all workloads will be added to an Azure AD group (Workloads Pipeline Service Principals in the above diagram). That group is nested into 4 other AAD security groups:

  • Resource Group ARM Operations: This is granted the CUSTOM – ARM Deployment Operator role on the hub resource group.
  • Hub Firewall Policy: This is granted the CUSTOM – Azure Firewall Policy Rule Collection Group Administrator role on the Azure Firewalll Policy that is associated with the hub Azure Firewall.
  • Hub Routes: This is granted the CUSTOM – Azure Route Table Routes Administrator role on the GattewaySubnet route table.
  • Hub Peering: This is granted the CUSTOM – Virtual Network Peering Administrator role on the hub virtual network.

Deploy The Workload

The workload now has the required permissions to deploy the workload and make modifications in the hub to connect the hub to the outside world.

Enabling DevSecOps with Azure Firewall

In this post, I will share how you can implement DevSecOps with Azure Firewall, with links to a bunch of working Bicep files to deploy the infrastructure-as-code (IaC) templates.

This example uses a “legacy” hub and spoke – one where the hub is VNet-based and not based on Azure Virtual WAN Hub. I’ll try to find some time to work on the code for that one.

The Concept

Hold on, because there’s a bunch of things to understand!

DevSecOps

The DevSecOps methodology is more than just IaC. It’s a combination of people, processes, and technology to enable a fail-fast agile delivery of workloads/applications to the business. I discussed here how DevSecOps can be used to remove the friction of IT to deliver on the promises of the Cloud.

The Azure features that this design is based on are discussed in concept here. The idea is that we want to enable Devs/Ops/Security to manage firewall rules in the workload’s Git repository (repo). This breaks the traditional model where the rules are located in a central location. The important thing is not the location of the rules, but the processes that manage the rules (change control through Git repo pull request reviews) and who (the reviewers, including the architects, firewall admins, security admins, etc).

So what we are doing is taking the firewall rules for the workload and placing them in with the workload’s code. NSG rules are probably already there. Now, we’re putting the Azure Firewall rules for the workload in the workload repo too. This is all made possible thanks to changes that were made to Azure Firewall Policy (Azure Firewall Manager) Rules Collection Groups – I use one Rules Collection Group for each workload and all the rules that enable that workload are placed in that Rules Collection Group. No changes will make it to the trunk branch (deployment action/pipelines look for changes here to trigger a deployment) without approval by all the necessary parties – this means that the firewall admins are still in control, but they don’t necessarily need to write the rules themselves … and the devs/operators might even write the rules, subject to review!

This is the killer reason to choose Azure Firewall over NVAs – the ability to not only deploy the firewall resource, but to manage the entire configuration and rule sets as code, and to break that all out in a controlled way to make the enterprise more agile.

Other Bits

If you’ve read my posts on Azure routing (How to Troubleshoot Azure Routing? and BGP with Microsoft Azure Virtual Networks & Firewalls) then you’ll understand that there’s more going on than just firewall rules. Packets won’t magically flow through your firewall just because it’s in the middle of your diagram!

The spoke or workload will also need to deploy:

  • A peering connection to the hub, enabling connectivity with the hub and the firewall. All traffic leaving the spoke will route through the firewall thanks to a user-defined route in the spoke subnet route table. Peering is a two-way connection. The workload will include some bicep to deploy the spoke-hub and the hub-spoke connections.
  • A route for the GatewaySubnet route table in the hub. This is required to route traffic to the spoke address prefix(es) through the Azure Firewall so on-premises>spoke traffic is correctly inspected and filtered by the firewall.

The IaC

In this section, I’ll explain the code layout and placement.

My Code

You can find my public repo, containing all the Bicep code here. Please feel free to download and use.

The Git Repo Design

You will have two Git repos:

  1. The first repo is for the hub. This repo will contain the code for the hub, including:
    • The hub VNet.
    • The Hub VNet Gateway.
    • The GatewaySubnet Route Table.
    • The Azure Firewall.
    • The Azure Firewall Policy that manages the Azure Firewall.
  2. The second repo is for the spoke. This skeleton example workload contains:

Action/Pipeline Permissions

I have written a more detailed update on this section, which can be found here. 

Each Git repo needs to authenticate with Azure to deploy/modify resources. Each repo should have a service principal in Azure AD. That service principal will be used to authenticate the deployment, executed by a GitHub action or a DevOps pipeline. You should restrict what rights the service principal will require. I haven’t worked out the exact minimum permissions, but the high-level requirements are documented below:

 

Trunk Branch Protection &  Pull Request

Some of you might be worried now – what’s to stop a developer/operator working on Workload A from accidentally creating rules that affect Workload X?

This is exactly why you implement standard practices on the Git repos:

  • Protect the Trunk branch: This means that no one can just update the version of the code that is deployed to your firewall or hub. If you want to create an updated, you have to create a branch of the trunk, make your edits in that trunk, and submit the changes to be merged into trunk as a pull request.
  • Enable pull request reviews: Select a panel of people that will review changes that are submitted as pull requests to the trunk. In our scenario, this should include the firewall admin(s), security admin(s), network admin(s), and maybe the platform & workload architects.

Now, I can only submit a suggested set of rules (and route/peering) changes that must be approved by the necessary people. I can still create my code without delay, but a change control and rollback process has taken control. Obviously, this means that there should be SLAs on the review/approval process and guidance on pull request, approval, and rejection actions.

And There You Have It

Now you have the design and the Bicep code to enable DevSecOps with Azure Firewall.