Ignite 2015 – Hyper-V Storage Performance with Storage Quality of Service

I am live blogging this session so hit refresh to see more.

Speakers: Senthil Rajaram and Jose Barreto.

This session is based on what’s in TPv2. There is a year of development and FEEDBACK left, so things can change. If you don’t like something … tell Microsoft.

Storage Performance

You need to measure to shape
Storage control allows shaping
Monitoring allows you to see the results – do you need to make changes?

Rules

Maximum Allowed: Easy – apply a cap.
Minimum Guaranteed: Not easy. It’s a comparative value to other flows. How do you do fair sharing? A centralized policy controller avoids the need for complex distributed solutions.

The Features in WS2012 R2

There are two views of performance:

From the VM: what the customer sees – using perfmon in the guest OS
From the host: What the admin sees – using the Hyper-V metrics

VM Metrics allow performance data to move with a VM: (Get-VM -Name VM01 | Measure-VM).HardDiskMetrics … This is Hyper-V Resource Metering, which is turned on with Enable-VMResourceMetering.

Normalized IOPS

Counted in 8K blocks – everything is a multiple of 8K.
Smaller than 8K counts as 1
More than 8K is counted in multiples, rounded up, e.g. 9K = 2.

This is just an accounting trick. Microsoft is not splitting/aggregating IOs.
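The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Microsoft's code; the function name and the round-up rule for larger sizes (inferred from the 9K = 2 example) are my assumptions.

```python
NORMALIZATION_SIZE = 8 * 1024  # the 8K accounting unit from the session


def normalized_io_count(io_size_bytes: int) -> int:
    """Return how many normalized IOs one IO of this size is counted as.

    Assumption: sizes are rounded up to the next multiple of 8K,
    which matches the "9K = 2" example. Nothing is split or merged;
    this is accounting only.
    """
    if io_size_bytes <= 0:
        raise ValueError("IO size must be positive")
    # Integer ceiling division: anything up to 8K counts as 1.
    return -(-io_size_bytes // NORMALIZATION_SIZE)


for size_kb in (4, 8, 9, 64):
    print(f"{size_kb}K -> {normalized_io_count(size_kb * 1024)} normalized IO(s)")
```

Under that assumption a 4K IO and an 8K IO both count as 1, a 9K IO counts as 2, and a 64K IO counts as 8.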

Used by:

Hyper-V Storage Performance Counters
Hyper-V VM Metrics (HardDiskMetrics)
Hyper-V Storage QoS

Storage QoS in WS2012 R2

Features:

Metrics – per VM and VHD
Maximum IOPS per VHD
Minimum IOPS per VHD – alerts only

Benefits:

Mitigate impact of noisy neighbours
Alerts when minimum IOPS are not achieved

The downside: diagnosing storage performance issues is still a long and complicated process.

Windows Server 2016 QoS Introduction

Moving from managing IOPS on the host/VM to managing IOPS on the storage system.

Simple storage QoS system that is installed in the base bits. You should be able to observe performance for the entire set of VMs. Metrics are automatically collected, and you can use them even if you are not using QoS. No need to log into every node using the storage subsystem to see performance metrics. Can create policies per VM, VHD, service or tenant. You can use PoSH or VMM to manage it.

This is a SOFS solution. One of the SOFS nodes is elected as the policy manager (PM) – an HA role. All of the nodes in the cluster share performance data, and the PM is the “thinker”.

Measure current capacity at the compute layer
Measure current capacity at the storage layer
Use an algorithm to meet policies at the policy manager
Adjust limits and enforce them at the compute layer

In TP2, this cycle is done every 4 seconds. Why? Storage and workloads are constantly changing. Disks are added and removed. Caching makes “total IOPS” impossible to calculate. The workloads change … a SQL DB gets a new index, or someone starts a backup. Continuous adjustment is required.

Monitoring

On by default. You can query the PM to get a summary of what’s going on right now.

Available data returned by a PoSH object:

VHD path
VM Name
VM Host name
VM IOPS
VM latency
Storage node name
Storage node IOPS
Storage node latency

Get-StorageQoSFlow – performance of all VMs using this file server/SOFS

Get-StorageQoSVolume – performance of each volume on this file server/SOFS

There are initiator (the VM’s perspective) metrics and storage metrics. Things like caching can cause differences between initiator and storage metrics.

Get-StorageQoSFlow | Sort InitiatorIOPS | FT InitiatorName, InitiatorIOPS, InitiatorLatency

Working not with peaks/troughs but with averages over 5 minutes. The Storage QoS metrics, averaged over the last 5 minutes, are rarely going to match the live metrics in perfmon.
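To see why a 5-minute average rarely matches a live counter, here is a small Python sketch. The once-per-second sampling and the simple mean are my assumptions for illustration; the session did not say how the average is actually computed.

```python
from collections import deque


class RollingAverage:
    """Simple mean over the most recent N samples (hypothetical model)."""

    def __init__(self, window_samples: int = 300):
        # 300 samples = 5 minutes if we assume one sample per second.
        self.samples = deque(maxlen=window_samples)

    def add(self, iops: float) -> None:
        self.samples.append(iops)

    def average(self) -> float:
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)


avg = RollingAverage()
for _ in range(270):   # 4.5 minutes of steady load
    avg.add(100)
for _ in range(30):    # 30-second burst, e.g. a backup starting
    avg.add(1000)

live_value = 1000      # what a live counter would show during the burst
print(f"live: {live_value} IOPS, 5-minute average: {avg.average()} IOPS")
```

In this made-up trace the live counter reads 1000 IOPS during the burst while the windowed average is only 190, which is the kind of gap you should expect between perfmon and the QoS metrics.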

You can use this data: export to CSV, open in Excel pivot tables

Deploying Policies

Three elements in a policy:

Max: hard cap
Min: Guaranteed allocation if required
Type: Single or Multi-instance

You create policies in one place and deploy the policies.

Single instance: An allocation of IOPS that is shared by a group of VMs. Multi-instance: a performance tier. Every VM gets the same allocation, e.g. max IOPS=100 and each VM gets that.
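The difference between the two policy types is easiest to see as arithmetic. The Python below is a rough model with a hypothetical function name, not how the policy manager is implemented.

```python
def aggregate_max_iops(policy_type: str, max_iops: int, vm_count: int) -> int:
    """Total IOPS ceiling for all VMs attached to one policy (illustrative).

    SingleInstance: the group shares one allocation, so the ceiling
    does not grow as VMs are added.
    MultiInstance: every VM gets its own copy of the allocation.
    """
    if vm_count < 0:
        raise ValueError("vm_count cannot be negative")
    if policy_type == "SingleInstance":
        return max_iops if vm_count > 0 else 0
    if policy_type == "MultiInstance":
        return max_iops * vm_count
    raise ValueError(f"unknown policy type: {policy_type}")


for policy_type in ("SingleInstance", "MultiInstance"):
    total = aggregate_max_iops(policy_type, max_iops=100, vm_count=5)
    print(f"{policy_type}: 5 VMs can use up to {total} IOPS in total")
```

With max IOPS=100 and five VMs, a single-instance policy tops out at 100 IOPS for the whole group, while a multi-instance policy allows up to 500 in total.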

Storage QoS works with Shared VHDX

Active/Active: Allocation split based on load. Active/Passive: Single VM can use full allocation.

This solution works with Live Migration.

Deployment with VMM

You can create and apply policies in VMM 2016. Create in Fabric > Storage > QoS Policies. Deploy in VM Properties > Hardware Configuration > <disk> > Advanced. You can deploy via a template.

PowerShell

$Policy = New-StorageQoSPolicy -CimSession FS1 -Name sdjfdjsf -PolicyType MultiInstance -MaximumIOPS 200

Get-VM -Name VM01 | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QosPolicy $Policy

Get-StorageQoSPolicy -Name sdfsdfds | Get-StorageQoSFlow … see data on those flows affected by this policy. Pulls data from the PM.

Demo

The way they enforce max IOPS is to inject latency in that VM’s storage. This reduces IOPS.
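Why does adding latency cap IOPS? By Little's law, throughput is roughly outstanding IOs divided by per-IO latency, so stretching the latency lowers the IOPS a VM can reach. The Python below is my own back-of-the-envelope model of that relationship, with a hypothetical function name; the speakers did not describe the actual rate limiter.

```python
def injected_delay_ms(max_iops: float,
                      outstanding_ios: int,
                      natural_latency_ms: float) -> float:
    """Extra per-IO delay needed to hold a flow at max_iops (illustrative).

    Little's law: IOPS = outstanding IOs / latency. To stay at or below
    max_iops, each IO must take at least outstanding_ios / max_iops
    seconds; whatever the storage does not already take is injected.
    """
    if max_iops <= 0 or outstanding_ios <= 0:
        raise ValueError("max_iops and outstanding_ios must be positive")
    target_latency_ms = outstanding_ios * 1000.0 / max_iops
    return max(0.0, target_latency_ms - natural_latency_ms)


# A VM keeping 4 IOs in flight against storage that answers in 2 ms
# could reach about 2000 IOPS; capping it at 200 needs ~18 ms extra per IO.
print(injected_delay_ms(max_iops=200, outstanding_ios=4, natural_latency_ms=2.0))
```

If the storage is already slower than the target latency, the model injects nothing, because the VM is under its cap anyway.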

Designing Policies

No policy: no shaping. You’re just going to observe uncontrolled performance. Each VM gets at least 1 IOPS.
Minimum Only: A machine will get at least 200 IOPS, IF it needs it. VM can burst. Not for hosters!!! Don’t set false expectations of maximum performance.
Maximum only: Price banding by hosters or limiting a noisy neighbour.
Minimum < Maximum, e.g. between 100-200: Minimum SLA and limited max.
Min = Max: VM has a set level of performance, as in Azure.

Note that VMs do not use min IOPS if they don’t have the workload for it. It’s a min SLA.

Storage Health Monitoring

If the total Min of all disks/VMs exceeds what the storage system can deliver then:

QoS does its best to do fair share based on proportion.
Raises an alert.
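A proportional fair share can be sketched like this in Python. It is a simplified model under my own assumptions (one known capacity figure, straight scaling of each minimum); the real algorithm in the policy manager was not detailed in the session.

```python
def fair_share(minimums: dict, capacity: float):
    """Scale minimum reservations down proportionally when oversubscribed.

    Returns (allocations, alert). alert is True when the sum of the
    minimums exceeds what the storage can deliver.
    """
    total = sum(minimums.values())
    if total <= capacity:
        # Everyone can have their full minimum; nothing to report.
        return dict(minimums), False
    scale = capacity / total
    return {name: iops * scale for name, iops in minimums.items()}, True


# Hypothetical example: 1000 IOPS of minimums against 500 IOPS of storage.
allocations, alert = fair_share({"VM01": 300, "VM02": 200, "VM03": 500}, capacity=500)
print(allocations, "alert raised" if alert else "no alert")
```

In this example every VM gets half of its configured minimum (150, 100 and 250 IOPS), and the alert flag is set.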

In WS2016 there is 1 place to get alerts for SOFS, called Storage Health Monitoring. It’s a new service on the SOFS cluster. You’ll get alerts on JBOD fans, disk issues, QoS, etc. The alerts are only there while the issue is there, i.e. if the problem goes away then the alert goes away. There is no history.

Get-StorageSubSystem *cluster* | Debug-StorageSubSystem

You can register triggers to automate certain actions.

Right now we spend 10x more than we need to in order to ensure VM performance. Storage QoS reduces spend by using a needle to fix issues instead of a sledgehammer. We can use intelligence to solve performance issues instead of a bank account.

In a Hyper-V converged solution, the PM and rate limiters live on the same tier. Apparently there will be support for a SAN – I’m unclear on this design.