I’m building a demo for some upcoming events, blatantly ripping off what Ben Armstrong did at TechEd (copying is the best form of flattery, Ben!). In the demo, I have 2 Dell R420 hosts with a bunch of NICs:
- 2 disabled 1 GbE NICs
- 2 enabled 1 GbE NICs teamed for Live Migration
- 2 10 GbE iWARP (RDMA) NICs, not teamed, for cluster communications, SMB Live Migration, and SMB 3.0 storage
- 2 10 GbE NICs teamed for VM networking and host management
It’s absolutely over the top for the real world, but it gives me demo flexibility, especially for what follows. In the demo, I have a PowerShell script that performs a measured Live Migration of a VM with 8 GB RAM (statically assigned). The VM runs a pretty real workload: WS2012 R2, SQL Server, and VMM 2012 R2.
The script then does the following (a rough sketch of the approach appears after the list):
- Configure the cluster to use the 1 GbE team for Live Migration with TCP/IP Live Migration
- Live migrate the VM (measured)
- Configure the cluster to use the 1 GbE team for Live Migration with Compressed Live Migration
- Live migrate the VM (measured)
- Configure the cluster to use a single 10 GbE iWARP NIC for Live Migration with SMB Live Migration (SMB Direct)
- Live migrate the VM (measured)
- Configure the cluster to use both 10 GbE iWARP NICs for Live Migration with SMB Live Migration (SMB Direct + SMB Multichannel)
- Live migrate the VM (measured)
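For reference, here’s a minimal sketch of the pattern the script follows; it is not my exact demo script. The host and VM names (Demo-Host1, Demo-Host2, Demo-VM) are placeholders, and it assumes the hosts are already clustered with the relevant networks enabled for migration:

```powershell
# Minimal sketch, not the exact demo script; host/VM names are placeholders.
# Pick the Live Migration transport on both hosts: TCPIP, Compression, or SMB.
Set-VMHost -ComputerName Demo-Host1, Demo-Host2 `
    -VirtualMachineMigrationPerformanceOption SMB

# Live migrate the clustered VM to the other node and time it
$duration = Measure-Command {
    Move-ClusterVirtualMachineRole -Name "Demo-VM" -Node Demo-Host2 -MigrationType Live
}
"Live Migration took {0:N1} seconds" -f $duration.TotalSeconds
```

Between runs the script also changes which networks are allowed to carry Live Migration traffic (the 1 GbE team, one rNIC, or both); I’ve left that plumbing out of the sketch.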
What I observed in my test runs:
- TCP/IP: About 95% of a 1 GbE NIC is utilised consistently for the duration.
- Compressed: The bandwidth utilisation has a saw-tooth pattern up to around 98%, as you’d expect given the dynamic nature of compression. CPU utilisation is higher (as expected), but remember that Live Migration will switch to TCP/IP if compression is contending with the host/VMs for resources.
- SMB Direct: Nearly 10 Gbps over a single NIC.
- SMB Direct + SMB Multichannel: Nearly 20 Gbps over the two iWARP rNICs (see the quick checks after this list).
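If you want to confirm that SMB Direct and Multichannel really are carrying the traffic, a few standard cmdlets are worth running on the hosts during the migration; nothing here is specific to my demo:

```powershell
Get-NetAdapterRdma | Where-Object Enabled    # which NICs have RDMA enabled
Get-SmbMultichannelConnection                # active SMB Multichannel connections
Get-SmbConnection                            # SMB client sessions (check while the migration runs)
```

The RDMA Activity performance counters are another good sanity check.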
And the time taken for each Live Migration?
Over 78 seconds to move a running VM over a 1 GbE network without optimisations! Imagine that scaled out to a host with 250 GB of production VM memory needing to be drained for preventative maintenance. That’s over 40 minutes, and it could be longer. That’s a long time to wait to get critical services off a host before a hardware warning becomes a host failure.
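The maths is straightforward; here’s the back-of-the-envelope extrapolation from the measured 78 seconds for 8 GB:

```powershell
# Extrapolate the measured un-optimised rate (8 GB in 78 seconds) to 250 GB
$seconds = 250 / 8 * 78                                  # 2437.5 seconds
"{0:N0} minutes to drain the host" -f ($seconds / 60)    # roughly 41 minutes
```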
As the Live Migrations get faster, they get closer to the theoretical minimum time. There are four operations:
- Build the VM on the destination host (that magic 3% point, where the VM’s dependencies are prepared)
- Copy RAM
- Sync RAM if required
- Destroy the VM on the source host
The first and last operations cannot be accelerated; each generally takes a couple of seconds. In fact, the first operation could take longer if you use Virtual Fibre Channel.
This test was with a more common VM with 8 GB RAM. Remember that I moved a VM with 56 GB RAM in 35 seconds using SMB Direct + Multichannel? Re-running that test earlier today on the same preview release took 33 seconds. Hmm, I think that hardware would take around 2.5 minutes to drain 250 GB of VM RAM, versus 42 minutes of un-optimised Live Migrations. I hope the point of this post is clear; if you need dense hosts then:
- Use 10 GbE networking; if you can’t, then upgrade to WS2012 R2 Hyper-V and use compression
- If you’re using rNICs for storage, then leverage that bandwidth and offload to optimise Live Migration, subject to QoS and SMB bandwidth constraints (a sketch of the latter follows)
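On that last point, WS2012 R2 lets you cap how much bandwidth SMB-based Live Migration can consume, so it can’t starve your SMB 3.0 storage traffic on shared rNICs. A minimal sketch; the 750 MB/sec figure is just an illustration, not a recommendation:

```powershell
# Requires the SMB Bandwidth Limit feature on WS2012 R2
Install-WindowsFeature FS-SMBBW

# Cap SMB-based Live Migration traffic (example figure only)
Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 750MB
```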