Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.9:5986/wsman”: proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR]connection error: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded
[ERROR] WinRM connection err: unknown error Post “https://10.1.10.8:5986/wsman”: context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR]connection error: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded
[ERROR] SSH connection err: unknown error Post “https://10.1.10.8:22/ssh”: context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

  • The virtual machine should not be reachable on the Internet.
  • Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft which told me to allow TCP 60000-60001 from Azure Load Balancer to Virtual Network:

I also added my customary DenyAll rule at priority 4000 – the built in Deny rule doesn’t deny all that much!

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

  1. Deploy Log Analytics and a Storage Account for logs
  2. Enable VNet Flow Logs with Traffic Analytics on the staging subnet (where the build is happening) NSG
  3. Recreate the problem (do a new build)
  4. Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:

Now when I test I see the status switch from Running, to Distributing, to Success.

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I need to create rules to allow that traffic.

E: Could Not Open File /var/lib/apt/lists

In this post, I’ll show how I solved a failure, that occurred during an Azure Image Builder (Packer) build with a Ubuntu 20.04 image, which resulted in a bunch of errors that contained E: Could not open file /var/lib/apt/lists/ with a bunch of different file names.

Disclaimer

I am Linux-disabled. I started my career programming on UNIX but switched to being a Microsoft infrastructure person a year later – and that was a long time ago. I am not a frequent Linux user but I do acknowledge its existence and usefulness. In other words, I figured out a fix for me, but it might not be a fix for you.

The Problem

I was using Azure Image Builder, which is based on Packer, to allow the regular creation of a Ubuntu 20.04 image with the latest updates and bits for acting as the foundation of a self-hosted DevOps agent VM Scale Set in a secure Azure network.

I had simple needs:

  1. Install Unzip
  2. Install Terraform

What makes it different is that I need the installations to be non-interactive. Windows has a great community with that kind of challenge. After a lot of searching, I realise that Linux does not.

I set up the tasks in the image template and for a month, everything was fine. Images built and rebuilt. A few days ago, a weird issue started where the first version of a template build was fine, but subsequent builds failed. When I looked at the build log, I saw a series of errors when apt (the package installed) ran that started with:

E: Could not open file /var/lib/apt/lists/

The Solution

I tried a lot of things, including:

apt-get update
apt-get upgrade -y

But guess, what – the errors just moved.

I was at the end of my tether when I decided to try something else. The apt package installation for WinZip worked some of the time. What was wrong the rest of the time? Time – that was the key word.

Something needed more time before I ran any apt commands. I decided to embed a bunch of sleep commands to let things in Ubuntu catchup with my build process.

I have two tasks that run before I install Terraform. The first prepares Linux:

            {
                "type": "Shell",
                "name": "Prepare APT",
                "inline": [
                    "echo ABCDEFG",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s",
                    "echo apt-get update",
                    "apt-get update",
                    "echo apt-get upgrade",
                    "apt-get upgrade -y",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s"
                ]
            },

The second task installs WinZip and some other tools that assist with downloading the latest Terraform zip file:

            {
                "type": "Shell",
                "name": "InstallPrereqs",
                "inline": [
                    "echo ABCDEFG",
                    "echo sleep for 90 seconds",
                    "sleep 1m 30s",
                    "echo installing unzip",
                    "sudo apt install --yes unzip",
                    "echo installing jq",
                    "sudo snap install jq"
                ]
            },

I’ve ran this code countless times yesterday and it worked perfectly. Sure, the sleeps slow things down, but this is a batch task that (outside of testing) I won’t be waiting on so I am not worried.

Understanding the Azure Image Builder Resources

In this post, I will explain the roles of and links/connections between the various resources used by Azure Image Builder.

Background

I enjoy the month of July. My customers, all in the Nordics, are off for the entire month and I am working. This year has been a crazy busy one so far, so there has been almost no time in the lab – noticeable I’m sure by my lack of writing. But this month, if all goes to plan, I will have plenty of time in the lab. As I type, a pipeline is deploying a very large lab for me. While that runs, I’ve been doing some hands on lab work.

Recently I helped develop and use an image building process, based on Packer, to regularly create images for a Citrix farm hosted in Microsoft Azure. It’s a pretty sweet solution that is driven from Azure DevOps and results in a very automated deployment that requires little work to update app versions or add/remove apps. At the time, I quickly evaluated Azure Image Builder (also based on Packer but still in Preview back then) but I thought it was too complicated and would still require the same pieces as our Packer solution. But I did decide to come back to Azure Image Builder when there was time (today) and have another look.

The first mission – figure out the resource complexity (compared to Packer by itself).

The Resources

I believe that one of Microsoft’s failings when documenting these services is their inability to explain the functions of the resources and how they work together. Working primarily in ARM templates, I get to see that stuff (a little). I’ve always felt that understanding the underlying system helps with understanding the solution – it was that way with Hyper-V and that continues with Azure.

Managed Identity – Microsoft.ManagedIdentity/userAssignedIdentities

A managed identity will be used by an Image Template to authorise Packer to use the imaging process that you are building. A custom role is associated with this Managed Identity, granting Packer rights to the resource group that the Shared Image Gallery, Image Definition, and Image Template are stored in.

Shared Image Gallery – Microsoft.Compute/galleries/images

The Shared Image Gallery is the management resource for images. The only notable attribute in the deployment is the name of the resource, which sadly, is similar to things like Storage Accounts in lacking standardisation with the rest of Microsoft Azure resource naming.

Image Definition- Microsoft.Compute/galleries/images

The Image Definition documents your image as you would like to present it to your “customers”.

The Image Definition is associated with the Shared Image Gallery by naming. If your Shared Image Gallery was named “myGallery” then an image definition called “myImage” would actually be named as “myGallery/myImage”.

The properties document things including:

  • VM generation
  • OS type
  • Generalised or not
  • How you will brand the images build from the Image Definition

Image Template – Microsoft.VirtualMachineImages/imageTemplates

This is where you will end up spending most of your time while operating the imaging process over time.

The Image Template describes to Packer (hidden by Azure) how it will build your image:

  • Identity points to the resource ID of the Managed Identity, permitting Packer to sign in as that identity/receiving its rights when using this Image Template to build an Image Version.
  • Properties:
    • Source: The base image from the Azure Marketplace to start the build with.
    • Customize: The tasks that can be run, including PowerShell scripts that can be downloaded, to customise the image, including installing software, configuring the OS, patching and rebooting.
    • Distribute: Here you associate the Image Template with an Image Definition, referencing the resource ID of the desired Image Definition. Everytime you run this Image Template, a new Image Version of the Image Definition will be created.

Image Version – Microsoft.Compute/galleries/images/versions

An Image Version, a resource with a messy resource name that will break your naming standards, is created when you build from an Image Template. The name of the Image Version is based on the name of the Image Definition plus an incremental number. If my Image Definition is named “myGallery/myImage” then the Image Version will be named “myGallery/myImage/<unique number>”.

The properties of this resource include a publishing profile, documenting to what regions an image is replicated and how it is stored.

What Is Not Covered

Packer will create a resource group and virtual machine (and associated resources) to build the new image. The way that the virtual machine is networked (public IP address by default) can normally be manipulated by the Image Template when using Packer.

Summary

There is a lot more here than with a simple run of Packer. But, Azure Image Builder provides a lot more functionality for making images available to “customers” across an enterprise-scale deployment; that’s really where all the complexity comes from and I guess “releasing” is something that Microsoft knows a lot about.