The Secret Sauce That Devs Don’t Want IT Pros to Know About

Honesty time: that title is a bit click-baitish, but the dev community is using a tool that most IT pros don’t know much/anything about, and it can be a real game changer, especially if you write scripts or work with deployment solutions such as Azure Resource Manager (ARM) JSON templates.

Shot Time

As soon as I say “DevOps” you’re already reaching for that X on the top right or saying “Oh, he’s lost it … again”. But that’s what I’m going to talk about: DevOps. Or to be more precise, Azure DevOps.

Methodology

For years, when I’ve thought about DevOps, I’ve thought “buggy software with more frequent releases”. And that certainly can be the case. But DevOps is born out of the realisation that how we have engineered software (or planned almost anything, to be honest) for the past few decades has not been ideal. Typically, we have some start-middle-end waterfall approach that assumes we know the end state. If this is a big (budget) or long (time) project, then getting half way to that planned end-state and realising that the preconception was wrong is very expensive – it leads to headlines! DevOps is a way of saying:

  • We don’t know the end-state
  • We’re going to work on smaller scenarios
  • We will evolve what we create based on those scenarios
  • The more we create, the more we’ll learn about what the end state will be
  • There will be larger milestones, which will be our releases

This is where the project management gurus will come out and say “this is Scrum” or some other codswallop that I do not care about; that’s the minutia for the project management nerds to worry about.

Devs started leaning this direction years ago. It’s not something that is always the case – most devs that I encountered in the past didn’t use some of the platform tools for DevOps such as GitHub or Azure DevOps (previously Teams). But here’s an interesting thing: some businesses are adopting the concepts of “DevOps” for building a business, even if IT isn’t involved, because they realised that some business problems are very like tech problems: big, complex, potentially expensive, and with an unknown end-state.

Why I Started

I got curious about Azure DevOps last year when my friend, Damian Flynn, was talking about it at events. Like me, Damian is an Azure MVP/IT Pro but, unlike me, he stayed in touch with development after college. I tried googling and reading Microsoft Docs, but the content was written in that nasty circular way that Technet used to be – there was no entry point for a non-dev that I could find.

And then I changed jobs … to work with Damian as it happens. We’ve been working on a product together for the last 7 months. And on day 1, he introduced me to DevOps. I’ll be honest, I was lost at first, but after a few attempts and then actually working with it, I have gotten to grips with it and it gives me a structured way to work, plan, and collaborate on a product that will never have and end-state.

What I’m Working On

I’m not going to tell you exactly what I’m working on but it is Azure and it is IT Pro with no dev stuff … from me, at least. Everything I’ve written or adjusted is PowerShell or Azure JSON. I can work with Damian (who is 2+ hours away by car) on Teams on the same files:

  • Changes are planned as features or tasks in Azure DevOps Boards.
  • Code is stored in an Azure DevOps repo (repository).
  • Major versions are built as branches (changes) of a master copy.
  • Changes to the master copy are peer reviewed when you try to merge a branch.
  • Repos are synchronized to our PCs using Git.
  • VS Code is our JSON and PowerShell editor.

It might all sound complex … but it really was pretty simple to set up. Now behind the scenes, there is some crazy-mad release “pipelines” stuff that Damian built, and that is far from simple, but not mandatory – don’t tell Damian that I said that!

Confusing Terminology

Azure DevOps inherits terminology from other sources, such as Git. And that is fine for devs in that space, but some of it made me scratch my head because it sounded “the wrong way around”. Here’s some of the terms:

  • Repo: A repository is where you store code.
  • Project: A project might have 1 or more repos. Each repo might be for a different product in that project.
  • Boards: A board is where you do the planning. You can create epics, tasks and issues. Typically, an Epic is a major part of a solution, a task is what you need to do to make that work, and an issue is a bug to be fixed.
  • Sprint: In managed projects, sprints are a predefined period of time that you assign people to. Tasks are pulled into the sprint and assigned to people (or pulled by people to themselves) who have available time and suitable skills.
  • Branch: You always have one branch called the master or trunk. This is the “master copy” of the code. Branches can be made from the master. For example, if I have a task, I might create a branch from the master in VS Code to work on that task. Once I am complete, I will sync/push that branch back up to Azure DevOps.
  • Pull Request: This is the one that wrecked my head for years. A pull request is when you want to take your changes that are stored in a branch and push it back into the parent branch. From Git’s or DevOps’ point of view, this is a pull, not a push. So you create a pull request for (a) identify the tasks you did, get someone to review/approve your changes, merge the branch (changes) back into the parent branch.
  • Nested branch: You can create branches from branches. Master is typically pretty locked down. A number of people might want a more flexible space to work in so they create a branch of master, maybe for a new version – let’s call this the second level branch. Each person then creates their own third level branches of the first branch. Now each person can work away and do pull requests into the more flexible second-level branch. And when they are done with that major piece of work, they can do a pull request to merge the second-level back into the master or trunk.
  • Release: Is what it sounds like – the “code” is ready for production, in the opinion of the creators.

Getting Started

The first two tools that you need are free:

  • Git command line client – you do not need a GitHub account.
  • Visual Studio Code

And then you need Azure DevOps. That’s where the free pretty much stops and you need to acquire either per-user/plan licensing or get it via MSDN/Visual Studio licensing.

Opinion

I came into this pretty open minded. Damian’s a smart guy and I had long conversations with one of our managers about these kinds of methodologies after he attended Scrum master training.

Some of the stuff in DevOps is nasty. The terminology doesn’t help, but I hope the above helps. Pipelines is still a mystery to me. Microsoft shared a doc to show how to integrate a JSON release via Pipelines and it’s a big ol’ mess of things to be done. I’ll be honest … I don’t go near that stuff.

I don’t think that me and Damian could have collaborated the way we have without DevOps. We’ve written many thousands of lines of code, planned tasks, fought bugs. It’s been done without a project manager – we discuss/record ideas, prioritize them, and then pull (assign to ourselves) the tasks when we have time. At times, we have worked in the same spaces and been able to work as one. And importantly, when it comes to pull requests, we peer review. The methodology has allowed other colleagues to participate and we’re already looking at how we can grow that more in the organization to bring in more skills/experience into the product. Without (Azure) DevOps we could not have done that … certainly storing code on some file storage in the cloud would have been a total disaster and lacked the structure that we have had.

Backing Up Azure Firewall

In this post, I will outline how you can back up your Azure Firewall, enabling you to rebuild it in case it is accidentally/maliciously deleted or re-configured by an authorized person.

With the Azure Firewall adding new features, we should expect more customers to start using it. And if you are using it like I do with my customers, it’s the centre of everything and it can quickly contain a lot of collections/rules which took a long time to write.

Wait – what new features? Obviously, Threat Detection (using the MS security graph) is killer, but support for up to 100 public IP addresses was announced and is imminent, availability zones are there now for this mission critical service, application rule FQDN support was added for SQL databases, and HD Insight tags are in preview.

So back on topic: how do I backup Azure Firewall? It’s actually pretty simple. You will need to retrieve your firewall’s resource ID:

$AzureFirewallId = (Get-AzFirewall -Name "MyFirewall" -ResourceGroupName "MyVnetRg").id

Then you will export a JSON copy of the firewall:

$BackupFileName = ".\MyFirewallBackup.json"
Export-AzResourceGroup -ResourceGroupName "MyVnetRg" -Resource $AzureFirewallId -SkipAllParameterization -Path $BackupFileName

And that’s the guts of it! To do a restore you simply redeploy the JSON file to the resource group:

New-AzResourceGroupDeployment -name "FirewallRestoreJob" -ResourceGroupName "MyVnetRg" -TemplateFile ".\MyFirewallBackup.json"

I’ve tested a delete and restore and it works. The magic here is using -SkipAllParameterization in the resource export to make the JSON file recreate exactly what was lost at the time of the backup/export.

If you wanted to get clever you could wrap up the backup cmdlets in an Azure Automation script. Add some lines to copy the alter the backup file name (date/time), and copy the backup to blob storage in a GPv2 storage account (with Lifecycle Management for automatic blob tiering and a protection policy to prevent deletion). And then you would schedule to the automation to run every day.

Physical Disks are Missing in Disk Management

In this post, I’ll explain how I fixed a situation where most of my Storage Spaces JBOD disks were missing in Disk Management and Get-PhysicalDisk showed their OperationalStatus as being stuck on “Starting”.

I’ve had some interesting hardware/software issues with an old lab at work. All of the hardware is quite old now, but I’ve been trying to use it in what I’ll call semi-production. The WS2016 Hyper-V cluster hardware consists of a pair of Dell R420 hosts and an old DataON 6 Gbps SAS Storage Spaces JBOD.

Most of the disks disappeared in Disk Management and thus couldn’t be added to a new Storage Spaces pool. I checked Device Manager and they were listed. I removed the devices and rebooted but the disks didn’t appear in Disk Management. I then ran Get-PhysicalDisk and this came up:

image

As you can see, the disks were there, but their OperationalStatus was hung on “Starting” and their HealthStatus was “Unknown”. If this was a single disk, I could imagine that it had failed. However, this was nearly every disk in the JBOD and spanned HDD and SSD. Something else was up – probably Windows Server 2016 or some firmware had threw a wobbly and wasn’t wrapping up some task.

The solution was to run Reset-PhysicalDisk. The example on docs.microsoft.com was incorrect, but adding a foreach loop fixed things:

$phydisk = (Get-Physicaldisk | Where-Object -FilterScript {$_.HealthStatus -Eq “Unknown”})

foreach ($item in $phydisk)
{
Reset-PhysicalDisk -FriendlyName $item.FriendlyName
}

A few seconds later, things looked a lot better:

image

I was then able to create the new pool and virtual disks (witness + CSVs) in Failover Cluster Manager.

Azure: PowerShell versus ARM Templates

In this post, I’m going to make the PowerShell acolytes angry (not hard) by explaining why they are too slow, and ARM/JSON is they best way to deploy things in Azure.

The PowerShell Experience

Let’s imagine that you & your significant other go into a restaurant, and let’s say you order a steak and your other wants to order something else. How does the ordering process go? Is it something like this .. let’s start with your order:

  • Customer: Waiter!
  • <Wait 1 minute>
  • Waiter: What would you like sir?
  • Customer: Could you ask the chef to go to the fridge?
  • <Wait while the chef is asked to go to the fridge>
  • Waiter: Yes?
  • Customer: Would you ask the chef to open the fridge?
  • <Wait while the chef opens the fridge>
  • Waiter: Yes?
  • Customer: Would you ask the chef to take a steak out of the fridge?
  • <Wait while the chef takes a steak out of fridge>
  • Waiter: Yes?
  • Customer: Please ask the check to put a pan on the cooker.
  • <Wait while the chef puts a pan on the cooker>
  • Waiter: Yes?

You see what’s going on here? Meanwhile your significant other is getting no love from the restaurant. Ouch!

With PowerShell you describe the deployment process, one step at a time, connecting each and every dot. The deployment is serialized, with no parallelism unless you use PowerShell features to run parallel jobs. The result isn’t much faster than you doing all the clicking for yourself.

The ARM Experience

I like to describe the ARM as a waiter, and the Azure resource providers as the kitchen cooks. How does the order go?

  • Customer: Waiter, I would like a Salmon dish for my wife and steak for myself.
  • Waiter: Yes, sir, in the meantime, would you like a drink?

That’s a bit better, right?

ARM or JSON templates describe the result, not the process. Once you submit the deployment, ARM divides up the job and orders the deployment based on your dependencies. That means that the deployment can be parallelized. If I need 100 web servers, all 100 will be deployed at once, not in some 1..100 loop, one at a time (or 5 at a time if you are clever).

Best of Both Worlds

For some of the training that I do at work, I deploy the training lab in Azure as follows:

  • A PowerShell script that asks me how many attendees there are, and then it runs a glorified 2 line loop.
  • The loop iterates through different subscriptions, adding a resource group and then doing an ARM deployment.

In other words, PowerShell automates my very fast ARM deployments.

PowerShell Still Required

PowerShell is still very useful for some fiddly deployment things that don’t have ARM options, or are once-offs and don’t have a GUI option. To be honest, I do use GUI for most of my once-offs because it is convenient and gets the end result faster than researching/tweaking/fixing PowerShell examples. When it comes to learning about settings and troubleshooting, PowerShell can be pretty awesome.

But PowerShell is much slower than ARM for deployments. Now let’s hear the screams of outraged PowerShellers!

Was This Post Interesting?

If you found this information useful, then imagine what 2 days of training might mean to you. I’m delivering a 2-day course in Amsterdam on April 19-20, teaching newbies and experienced Azure admins about Azure Infrastructure. There’ll be lots of in-depth information, covering the foundations, best practices, troubleshooting, and advanced configurations. You can learn more here.

Adding Azure Monitor Performance Alerts Using PowerShell

Below is a sample script for adding Azure Metrics alerts using Azure Monitor. It is possible to create alerts using the Azure Portal, but that doesn’t scale well because each alert is specific to one VM. For example, if you have 4 alerts per VM, and 10 VMs, then you have to create 40 alerts! One could say: Use Log Analytics, but there’s a cost to that, and I find the OMS Workspace to be immature. Instead, one can continue to use Resource/Azure Monitor metrics, but script the creation of the metrics alerts.

Once could use JSON, but again, there’s a scale-out issue there unless you build this into every deployment. But the advantage with PowerShell is that you can automatically vary thresholds based on the VM’s spec, as you will see below – some metric thresholds vary depending on the spec of a machine, e.g. the number of cores.

The magic cmdlet for doing this work is Add-AzureRmMetricAlertRule. And the key to making that cmdlet work is to know the name of the metric. Microsoft’s docs state that you can query for available metrics using Get-AzureRmMetricDefinition, but I found that with VMs, it only returned back the Host metrics and not the Guest metrics. I had to do some experimenting, but I found that the names of the guest metrics are predictable; they’re exactly what you see in the Azure Portal, e.g. \System\Processor Queue Length.

The below script is made up of a start and 2 functions:

  1. The start is where I specify some variables to define the VM, resource group name, and query for the location of the VM. The start can then call a series of functions, one for each metric type. In this example, I call ProcessorQLength.
  2. The ProcessorQLength function takes the VM, queries for it’s size, and then gets the number of cores assigned to that VM. We need that because the alert should be triggers if the average queue length per core is over 4, e.g. 12 for a 4 core VM. The AddMetric function is called with a configuration for the \System\Processor Queue Length alert.
  3. The AddMetric function is a generic function capable of creating any Azure metrics alert. It is configured by the parameters that are fed into it, in this case by the ProcessorQLength function.

Here’s my example:

#A generic function to create an Azure Metrics alert
function AddMetric ($FunMetricName, $FuncMetric, $FuncCondition, $FuncThreshold, $FuncWindowSize, $FuncTimeOperator, $FuncDescription)
{
    $VMID = (Get-AzureRmVM -ResourceGroupName $RGName -Name $VMName).Id
    Add-AzureRmMetricAlertRule -Name $FunMetricName -Location $VMLocation -ResourceGroup $RGName -TargetResourceId $VMID -MetricName $FuncMetric -Operator $FuncCondition -Threshold $FuncThreshold -WindowSize $FuncWindowSize -TimeAggregationOperator $FuncTimeOperator -Description $FuncDescription
}

#Create an alert for Processor Queue Length being 4x the number of cores in a VM
function ProcessorQLength ()
{
    $VMSize = (Get-AzureRMVM -ResourceGroupName $RGName -Name $VMName).HardwareProfile.VmSize
    $Cores = (Get-AzureRMVMSize -Location $VMLocation | Where-Object {$_.Name -eq $VMSize}).NumberOfCores
    $QThreshold = $Cores * 4
    AddMetric "$VMname - CPU Q Length" "\System\Processor Queue Length" "GreaterThan" $QThreshold "00:05:00" "Average" "Created using PowerShell"
}

#The script starts here
#Specify a VM name/resource group
$VMName = "vm-test-01"
$RGName = "test"
$VMLocation = (Get-AzureRMVM -ResourceGroupName $RGName -Name $VMName).Location

#Start running functions to create alerts
ProcessorQLength

Was This Post Useful?

If you found this information useful, then imagine what 2 days of training might mean to you. I’m delivering a 2-day course in Amsterdam on April 19-20, teaching newbies and experienced Azure admins about Azure Infrastructure. There’ll be lots of in-depth information, covering the foundations, best practices, troubleshooting, and advanced configurations. You can learn more here.

Restore An Azure VM to an Availability Set From Azure Backup in the Azure Portal

Microsoft has shared how to restore an Azure VM to an availability set using PowerShell from Azure Backup. It’s nasty-hard looking PowerShell, and my problem with examples of VM creation using PowerShell is that they’re never feature complete.

While writing some Azure VM training recently, I stumbled across a cool option in the Azure Portal that I tried out … and it worked … and it means that I never have to figure that nasty PowerShell out Smile

The key to all this is to start using Managed Disks. Even if your existing VMs are using un-managed (storage account) disks, that’s not a problem because you can still use this restore method. The other thing you should remember is that the metadata of the VM is irrelevant – everything of value is in the disks.

Restore the Disks of the VM

Using these steps you can restore the disks of your VM, managed or un-managed, to a storage location, referred to as the staging account.. Each disk is restored as a blob VHD file, and a JSON file describes the disks so that you can identify which one is the “osDisk”.

Create Managed Disks from the Restored VHDs

In this process, you create a managed disk from each restored VHD or blob file in the staging location. You have the option to restore the disks as Standard (HDD) or Premium (SSD) disks, which offers you some flexibility in your restore (you can switch storage types!). Make sure you ID the osDisk from the JSON file and mark it as either a Windows or Linux OS disk, depending on the contents.

Create a VM From the OS Managed Disk

The third set of steps bring your VM back online. You use the previously restored/identified osDisk and create a new virtual machine using that managed disk. Make sure you select the availability set that you want to restore the VM to.

Clean Up

The last step is the clean up. If you had any data disks in the original machine then you need to re-attach them to the new virtual machine. You’ll also need to configure the network settings of the Azure NIC resource. For example, if the new VM is replacing the old one, you should enter the IP settings of the old VM into the new NIC Azure resource, change any NAT/load balancing rules, NSGs, PIPs, etc.

And that’s it! There’s no PowerShell, and it’s all pretty simple clicking in the Azure Portal that won’t take that long to do after the disks are restored from the recovery services vault.

Template For ASR Recovery Plan Runbook

I’m sharing a template PowerShell-based Azure Automation runbook that I’ve written, to enable advanced automation in the Recovery Plans that can be used in Azure Site Recovery (ASR) failover.

ASR allows us to orchestrate the failover and failback of virtual machines. This can be a basic ordering of virtual machine power-up and power-down actions, but we can also inject runbooks from Azure Automation. This allows us to do virtually anything inside of Azure during a failover or failback.

I needed to write a runbook for ASR recently and I found that I needed this runbook to be able to work differently in different scenarios:

  • Planned failover
  • Unplanned failover
  • Test failover
  • Cleanup after a test failover

That last one is tricky, but I’ll come back to it. They key to the magic is a parameter in the runbook called $RecoveryPlanContext. When an ASR executes a runbook it passes some information into the script via this parameter. The parameter has multiple attributes, as documented by Microsoft here. Some of the interesting values we can query for are:

  • FailoverType: test, unplanned, or planned.
  • FailoverDirection: primary or secondary are the destinations.
  • VmMap: An array listing the names of the VMs included in the runbook.
  • ResourceGroupName: The name of the resource group that the failover VMs are in.

My template currently only queries for FailoverType, because my goal was to automate failover to Azure. However, there is as much value in dealing with FailoverDirection, because the runbook can be used to orchestrate a failback.

Let’s assume that I use a runbook to start up some machine(s) during a failover. This machine could be a jump box, enabling remote access into that machine, and from there we can jump to failed over machines. That would be useful in all kinds of failover, including a test failover. But what happens if I run a test, am happy with everything and run a cleanup task? Guess what … our recovery plan does not execute in reverse order, and the runbook is not executed. So what I’ve come up with is a way to tell the runbook that I want to do a cleanup. The runbook will typically be executed before the cleanup (some tasks must be done before VMs are removed to make scripting easier, e.g. finding PIPs used by VMs before the VMs and their NICs are deleted). The runbook still expects a parameter; you are prompted to enter a value, so I enter “cleanup”. Once my script sees that task it runs a cleanup function.

My template has 4 functions, one for each of the above 4 scenarios. There is an if statement to look for cleanup, and if that’s not the value, a switch statement checks $RecoveryPlanContext.FailoverType to see what the scenario is. If some unexpected value is found, the runbook will exit.

param ( 
[Object]$RecoveryPlanContext 
)

function DoTestCleanup ()
{
Write-Output ("This is a cleanup after test failover")
# Do stuff here
}


function DoTestFailover ()
{
Write-Output ("This is a test failover")
# Do stuff here
}


function DoPlannedFailover ()
{
Write-Output ("This is a planned failover")
# Do stuff here
}


function DoUnplannedFailover ()
{
Write-Output ("This is an unplanned failover")
# Do stuff here
}


# The runbook starts here
Sleep 10
$connectionName = "AzureRunAsConnection"
try
{
# Get the connection "AzureRunAsConnection "
$servicePrincipalConnection=Get-AutomationConnection -Name $connectionName

"Logging in to Azure using $connectionName ..."
Add-AzureRmAccount -ServicePrincipal -TenantId $servicePrincipalConnection.TenantId -ApplicationId $servicePrincipalConnection.ApplicationId -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint 
}
catch {
if (!$servicePrincipalConnection)
{
$ErrorMessage = "Connection $connectionName not found."
throw $ErrorMessage
} else{
Write-Error -Message $_.Exception
throw $_.Exception
}
}

write-output ("The failover type parameter is $RecoveryPlanContext")


if($RecoveryPlanContext -eq 'cleanup')
{
DoTestCleanup
}
else
{
switch ($RecoveryPlanContext.FailoverType)
{
"Test" { DoTestFailover }
"Planned" { DoPlannedFailover }
"Unplanned" { DoUnplannedFailover }
default { Write-Output ("Runbook aborted because there no failover type was specified") }
}
}

 

Ignite 2016 – Discover What’s New In Windows Server 2016 Virtualization

This post is a collection of my notes from the Ben Armstrong’s (Principal Program Manager Lead in Hyper-V) session (original here) on the features of WS2016 Hyper-V. The session is an overview of the features that are new, why they’re there, and what they do. There’s no deep-dives.

A Summary of New Features

Here is a summary of what was introduced in the last 2 versions of Hyper-V. A lot of this stuff still cannot be found in vSphere.

image

And we can compare that with what’s new in WS2016 Hyper-V (in blue at the bottom). There’s as much new stuff in this 1 release as there were in the last 2!

image

Security

The first area that Ben will cover is security. The number of attack vectors is up, attacks are on the rise, and the sophistication of those attacks is increasing. Microsoft wants Windows Server to be the best platform. Cloud is a big deal for customers – some are worried about industry and government regulations preventing adoption of the cloud. Microsoft wants to fix that with WS2016.

Shielded Virtual Machines

Two basic concepts:

  • A VM can only run on a trusted & healthy host – a rogue admin/attacker cannot start the VM elsewhere. A highly secured Host Guardian Service must authorize the hosts.
  • A VM is encrypted by the customer/tenant using BitLocker – a rogue admin/attacker/government agency cannot inspect the VM’s contents by mounting the disk(s).

image

There are levels of shielding, so it’s not an all or nothing.

Key Storage Drive for Generation 1 VMs

Shielding, as above, required Generation 2 VMs. You can also offer some security for Generation 1 virtual machines: Key Storage Drive. Not as secure as shielded virtual machines or virtual TPM, but it does give us a safe way to use BitLocker inside a Generation 1 virtual machine – required for older applications that depend on older operating systems (older OSs cannot be used in Generation 2 virtual machines).

 

image

Virtual Secure Mode (VSM)

We also have Guest Virtual Secure Mode:

  • Credential Guard: protecting ID against pass-the-hash by hiding LSASS in a secured VM (called VSM) … in a VM with a Windows 10 or Windows Server 2016 guest OS! Malware running with admin rights cannot steal your credentials in a VM.
  • Device Guard: Protect the critical kernel parts of the guest OS against rogue s/w, again, by hiding them in a VSM in a Windows 10 or Windows Server 2016 guest OS.

image

Secure Boot for Linux Guests

Secure boot was already there for Windows in Generation 2 virtual machines. It’s now there for Linux guest OSs, protecting the boot loader and kernel against root kits.

image

Host Resource Protection (HRP)

Ben hopes you never see this next feature in action in the field Smile This is because Host Resource Protection is there to protect hosts/VMs from a DOS attack against a host by someone inside a VM. The scenario: you have an online application running in a VM. An attacker compromises the application (example: SQL injection) and gets into the guest OS of the VM. They’re isolated from other VMs by the hypervisor and hardware/DEP, so they attack the host using DOS, and consume resources.

A new feature, from Azure, called HRP will determine that the VM is aggressively using resources using certain patterns, and start to starve it of resources, thus slowing down the DOS attack to the point of being pointless. This feature will be of particular interest to:

  • Companies hosting external facing services on Hyper-V/Windows Azure Pack/Azure Stack
  • Hosting companies using Hyper-V/Windows Azure Pack/Azure Stack

image

This is another great example of on-prem customers getting the benefits of Azure, even if they don’t use Azure. Microsoft developed this solution to protect against the many unsuccessful DOS attacks from Azure VMs, and we get it for free for our on-prem or hosted Hyper-V hosts. If you see this happening, the status of the VM will switch to Host Resource Protection.

Security Demos

Ben starts with virtual TPM. The Windows 10 VM has a virtual TPM enabled and we see that the C: drive is encrypted. He shuts down the VM to show us the TPM settings of the VM. We can optionally encrypt the state and live migration traffic of the VM – that means a VM is encrypted at rest and in transit. There is a “performance impact” for this optional protection, which is why it’s not on by default. Ben also enables shielding – and he loses console access to the VM – the only way to connect to the machine is to remote desktop/SSH to it.

Note: if he was running the full host guardian service (HGS) infrastructure then he would have had no control over shielding as a normal admin – only the HGS admins would have had control. And even the HGS admins have no control over BitLocker.

He switches to a Generation 1 virtual machine with Key Storage Drive enabled. BitLocker is running. In the VM settings (Generation 1) we see Security > Key Storage Drive Enabled. Under the hood, an extra virtual hard disk is attached to the VM (not visible in the normal storage controller settings, but visible in Disk Management in the guest OS). It’s a small 41 MB NTFS volume. The BitLocker keys are stored there instead of a TPM – virtual TPM is only in Generation 2, but it’s using the same sorts of tech/encryption/methods to secure the contents in the Key Storage Drive, but it cannot be as secure as virtual TPM, but it is better than not having BitLocker. Microsoft can make the same promises with data at rest encryption for Generation 1 VMs, but it’s still not as good as a Generation 2 VM with vTPM or even a shielded VM (requires Generation 2).

Availability

The next section is all about keeping services up and running in Hyper-V, whether it’s caused by upgrades or infrastructure issues. Everyone has outages and Microsoft wants to reduce the impact of these. Microsoft studied the common causes, and started to tackle them in WS2016

Cluster OS Rolling Upgrades

Microsoft is planning 2-3 updates per year for Nano Server, plus there’ll be other OS upgrades in the future. You cannot upgrade a cluster node. And in the past we could only do cluster-cluster migrations to adopt new versions of Windows Server/Hyper-V. Now, we can:

  1. Remove cluster node 1
  2. Rebuild cluster node 1 with the new version of Windows Server/Hyper-V
  3. Add cluster node 1 to the old cluster – the cluster runs happily in mixed-mode for a short period of time (weeks), with failover and Live Migration between the old/new OS versions.
  4. Repeat steps 1-3 until all nodes are up to date
  5. Upgrade the cluster functional level – Update-ClusterFunctionalLevel (see below for “Emulex incident”)
  6. Upgrade the VMs’ version level

Zero VM downtime, zero new hardware – 2 node cluster, all the way to a 64 node cluster.

If you have System Center:

  1. Upgrade to SCVMM 2016.
  2. Let it orchestrate the cluster upgrade (above)

Supports starts with WS2012 R2 to WS2016. Re-read that statement: there is no support for W2008/W2008 R2/WS2012. Re-read that last statement. No need for any questions now Smile

image

To avoid an “Emulex incident” (you upgrade your hosts – and a driver/firmware fails even though it is certified, and the vendor is going to take 9 months to fix the issue) then you can actually:

  1. Do the node upgrades.
  2. Delay the upgrade to the cluster functional level for a week or two
  3. Test your hosts/cluster for driver/firmware stability
  4. Rollback the cluster nodes to the older OS if there is an issue –> only possible if the cluster functional level is on the older version.

And there’s no downtime because it’s all leveraging Live Migration.

Virtual Machine Upgrades

This was done automatically when you moved a VM from version X to version X+1. Now you control it (for the above to work). Version 8 is WS2016 host support.

image

Failover Clustering

Microsoft identified two top causes of outages in customer environments:

  • Brief storage “outages” – crashing the guest OS of a VM when an IO failed. In WS2016, when an IO fails, the VM is put in a paused-critical state (for up to 24 hours, by default). The VM will resume as soon as the storage resumes.
  • Transient network errors – clustered hosts being isolated causing unnecessary VM failover (reboot), even if the VM was still on the network. A very common 30 seconds network outage will cause a Hyper-V cluster to panic up to and including WS2012 R2 – attempted failovers on every node and/or quorum craziness! That’s fixed in WS2016 – the VMs will stay on the host (in an unmonitored state) if they are still networked (see network protection from WS2012 R2). Clustering will wait (by default) for 4 minutes before doing a failover of that VM. If a host glitches 3 times in an hour it will be automatically quarantined, after resuming from the 3rd glitch, (VMs are then live migrated to other nodes) for 2 hours, allowing operator inspection.

image

Guest Clustering with Shared VHDX

Version 1 of this in WS2012 R2 was limited – supported guest clusters but we couldn’t do Live Migration, replication, or backup of the VMs/shared VHDX files. Nice idea, but it couldn’t really be used in production (it was supported, but functionally incomplete) instead of virtual fibre channel or guest iSCSI.

WS2016 has a new abstracted form of Shared VHDX – it’s even a new file format. It supports:

  • Backup of the VMs at the host level
  • Online resizing
  • Hyper-V Replica (which should lead to ASR support) – if the workload is important enough to cluster, then it’s important enough to replicate for DR!

image

One feature that does not work (yet) is Storage Live Migration. Checkpoint can be done “if you know what you are doing” – be careful!!!

Replica Support for Hot-Add VHDX

We could hot-add a VHDX file to a VM, but we could not add that to replication if the VM was already being replicated. We had to re-replicate the VM! That changes in WS2016, thanks to the concept of replica sets. A new VHDX is added to a “not-replicated” set and we can move it to the replicated set for that VM.

image

Hot-Add Remove VM Components

We can hot-add and hot-remove vNICs to/from running VMs. Generation 2 VMs only, with any supported Windows or Linux guest OS.

We can also hot-add or hot-remove RAM to/from a VM, assuming:

  • There is free RAM on the host to add to the VM
  • There is unused RAM in the VM to remove from the VM

This is great for those VMs that cannot use Dynamic Memory:

  • No support by the workload
  • A large RAM VM that will benefit from guest-aware NUMA

A nice GUI side-effect is that guest OS memory demand is now reported in Hyper-V Manager for all VMs.

Production Checkpoints

Referring to what used to be called (Hyper-V) snapshots, but were renamed to checkpoints to stop dumb people from getting confused with SAN and VSS snapshots – yes, people really are that stupid – I’ve met them.

Checkpoints (what are now called Standard Checkpoints) were not supported by many applications in a guest OS because they lead to application inconsistency. WS2016 adds a new default checkpoint type called a Production Checkpoint. This basically uses backup technology (and IT IS STILL NOT A BACKUP!) to create an application consistent checkpoint of a VM. If you apply (restore) the checkpoint the VM:

  • The VM will not boot up automatically
  • The VM will boot up as if it was restoring from a backup (hey dumbass, checkpoints are STILL NOT A BACKUP!)

For the stupid people, if you want to backup VMs, use a backup product. Altaro goes from free to quite affordable. Veeam is excellent. And Azure Backup Server gives you OPEX based local backup plus cloud storage for the price of just the cloud component. And there are many other BACKUP solutions for Hyper-V.

Now with production checkpoints, MSFT is OK with you using checkpoints with production workloads …. BUT NOT FOR BACKUP!

image

Demos

Ben does some demos of the above. His demo rig is based on nested virtualization. He comments that:

  • The impact of CPU/RAM is negligible
  • There is around a 25% impact on storage IO

Storage

The foundation of virtualization/cloud that makes or breaks a deployment.

Storage Quality of Service (QOS)

We had a basic system in WS2012 R2:

  • Set max IOPS rules per VM
  • Set min IOPS alerts per VM that were damned hard to get info from (WMI)

And virtually no-one used the system. Now we get storage QoS that’s trickled down from Azure.

In WS2016:

  • We can set reserves (that are applied) and limits on IOPS
  • Available for Scale-Out File Server and block storage (via CSV)
  • Metrics rules for VHD, VM, host, volume
  • Rules for VHD, VM, service, or tenant
  • Distributed rule application – fair usage, managed at storage level (applied in partnership by the host)
  • PoSH management in WS2016, and SCVMM/SCOM GUI image

You can do single-instance or multi-instance policies:

  • Single-instance: IOPS are shared by a set of VMs, e.g. a service or a cluster, or this department only gets 20,000 IOPS.
  • Multi-instance: the same rule is applied to a group of VMs, the same rule for a large set of VMs, e.g. Azure guarantees at least X IOPS to each Standard storage VHD.

image

Discrete Device Assignment – NVME Storage

DDA allows a virtual machine to connect directly to a device. An example is a VM connects directly to extremely fast NVME flash storage.

Note: we lose Live Migration and checkpoints when we use DDA with a VM.

image

Evolving Hyper-V Backup

Lots of work done here. WS2016 has it’s only block change tracking (Resilient Change Tracking) so we don’t need a buggy 3rd party filter driver running in the kernel of the host to do incremental backups of Hyper-V VMs. This should speed up the support of new Hyper-V versions by the backup vendors (except for you-know-who-yellow-box-backup-to-tape-vendor-X, obviously!).

Large clusters had scalability problems with backup. VSS dependencies have been lessened to allow reliable backups of 64 node clusters.

Microsoft has also removed the need for hardware VSS snapshots (a big source of bugs), but you can still make use of hardware features that a SAN can offer.

ReFS Accelerated VHDX Operations

Re-FS is the preferred file system for storing VMs in WS2016. ReFS works using metadata which links to data blocks. This abstraction allows very fast operations:

  • Fixed VHD/X creation (seconds instead of hours)
  • Dynamic VHD/X expansion
  • Checkpoint merge, which impacts VM backup

Note, you’ll have to reformat WS2012 R2 ReFS to get the new version of ReFS.

Graphics

A lot of people use Hyper-V (directly or in Azure) for RDS/Citrix.

RemoteFX Improvements

image

The AVC444 thing is a lossless codec – lossless 3D rendering, apparently … that’s gobbledegook to me.

DDA Features and GPU Capabilities

We can also use DDA to connect VMs directly to CPUs … this is what the Azure N-Series VMs are doing with high-end NVIDIA GFX cards.

  • DirectX, OpenGL, OpenCL, CUDA
  • Guest OS: Server 2012 R2, Server 2016, Windows 10, Linux

The h/w requirements are very specific and detailed. For example, I have a laptop that I can do RemoteFX with, but I cannot use for DDA (SRIOV not supported on my machine).

Headless Virtual Machine

A VM can be booted without display devices. Reduces the memory footprint, and simulates a headless server.

Operational Efficiency

Once again, Microsoft is improving the administration experience.

PowerShell Direct

You can now to remote PowerShell into a VM via the VMbus on the host – this means you do not need any network access or domain join. You can do either:

  • Enter-PSSession for an interactive session
  • Invoke-Command for a once-off instruction

Supports:

  • Host: Windows 10/WS2016
  • Guest: Windows 10/WS2016

You do need credentials for the guest OS, and you need to do it via the host, so it is secure.

This is one of Ben’s favourite WS2016 features – I know he uses it a lot to build demo rigs and during demos. I love it too for the same reasons.

PowerShell Direct – JEA and Sessions

The following are extensions of PowerShell Direct and PowerShell remoting:

  • Just Enough Administration (JEA): An admin has no rights with their normal account to a remote server. They use a JEA config when connecting to the server that grants them just enough rights to do their work. Their elevated rights are limited to that machine via a temporary user that is deleted when their session ends. Really limits what malware/attacker can target.
  • Justin-Time Administration (JITA): An admin can request rights for a short amount of time from MIM. They must enter a justification, and company can enforce management approval in the process.

vNIC Identification

Name the vNICs and make that name visible in the guest OS. Really useful for VMs with more than 1 vNIC because Hyper-V does not have consistent device naming.

image

Hyper-V Manager Improvements

Yes, it’s the same MMC-based Hyper-V Manager that we got in W2008, but with more bells and whistles.

  • Support for alternative credentials
  • Connect to a host IP address
  • Connect via WinRM
  • Support for high-DPI monitors
  • Manage WS2012, WS2012 R2 and WS2016 from one HVM – HVM in Win10 Anniversary Update (The big Redstone 1 update in Summer 2016) has this functionality.

VM Servicing

MS found that the vast majority of customers never updated the Integration services/components (ICs) in the guest OS of VMs. It was a horrible manual process – or one that was painful to automate. So customers ran with older/buggy versions of ICs, and VMs often lacked features that the host supported!

ICs are updated in the guest OS via Windows Update on WS2016. Problem sorted, assuming proper testing and correct packaging!

MSFT plans to release IC updates via Windows Update to WS2012 R2 in a month, preparing those VMs for migration to WS2016. Nice!

Core Platform

Ben was running out of time here!

Delivering the Best Hyper-V Host Ever

This was the Nano Server push. Honestly – I’m not sold. Too difficult to troubleshoot and a nightmare to deploy without SCVMM.

I do use Nano in the lab. Later, Ben does a demo. I’d not seen VM status in the Nano console before, which Ben shows – the only time I’ve used the console is to verify network settings that I set remotely using PoSH Smile There is also an ability to delete a virtual switch on the console.

Nested Virtualization

Yay! Ben admits that nested virtualization was done for Hyper-V Containers on Azure, but we people requiring labs or training environments can now run multiple working hosts & clusters on a single machine!

VM Configuration File

Short story: it’s binary instead of XML, improving performance on dense hosts. Two files:

  • .VMCX: Configuration
  • .VMRS: Run state

Power Management

Client Hyper-V was impacted badly by Windows 8 era power management features like Connected Standby. That included Surface devices. That’s sorted now.

Development Stuff

This looks like a seed for the future (and I like the idea of what it might lead to, and I won’t say what that might be!). There is now a single WMI (Root\HyperVCluster\v2) view of the entire Hyper-V cluster – you see a cluster as one big Hyper-V server. It really doesn’t do much now.

And there’s also something new called Hyper-V sockets for Microsoft partners to develop on. An extension of the Windows Socket API for “fast, efficient communication between the host and the guest”.

Scale Limits

The numbers are “Top Gear stats” but, according to a session earlier in the week, these are driven by Azure (Hyper-V’s biggest customer). Ben says that the numbers are nuts and we normals won’t ever have this hardware, but Azure came to Hyper-V and asked for bigger numbers for “massive scale”. Apparently some customers want massive super computer scale “for a few months” and Azure wants to give them an OPEX offering so those customers don’t need to buy that h/w.

Note Ben highlights a typo in max RAM per VM: it should say 12 TB max for a VM … what’s 4 TB between friends?!?!

image

Ben wraps up with a few demos.

RunAsRadio Podcast – Hyper-V in Server 2016

I recently recorded an episode of the RunAsRadio podcast with Richard Campbell on the topic of Windows Server 2016 (WS2016) Hyper-V. We covered a number of areas, including containers, nested virtualization, networking, security, and PowerShell.

image