How W2008 R2 Live Migration Works

Let’s recap the different types of migration that we can get with Windows Server Hyper-V and System Center Virtual Machine Manager:

  • Quick Migration: Leveraging Windows Failover Clustering, a VM is treated as a clustered resource.  To quick migrate, the running state is saved to disk (hibernating the VM), the disk failed over to another node in the cluster, and the saved state is loaded (waking up the VM).
  • Offline Migration: This is when we use VMM to move a powered down VM from one un-clustered Hyper-V server to another or from one cluster to another.
  • Quick Storage Migration: This is a replacement for Offline Migration for Windows Server 2008 R2 Hyper-V servers when using VMM 2008 R2. A running VM can be moved from one un-clustered host to another or from one cluster to another with only around 2 minutes.
  • Live Migration: This is the process of moving a virtual machine from one cluster node to another with no perceivable downtime to network applications or users.  VMware refer to this as VMotion.  It was added in Windows Server 2008 R2 Hyper-V and is supported by VMM 2008 R2.

Live Migration was the big stick that everyone beat up Windows Server 2008 Hyper-V.  A few seconds downtime for a quick migration was often good enough for 75%-90% of VM’s but not for 100%.  But you can relax now; we have Live Migration.  I’m using it in production and it is good!  I can do host maintenance and enable completely automated PRO tips in VMM without worrying of any downtime, no matter how brief, for VM’s.  How does Live Migration Work?  Let’s look at how it works.

imageAbove, we have a virtual machine running on host 1.  It has a configuration and a “state”.

imageWhen we initiate a live migration the configuration of the VM is copied from host 1 when the VM is running to host 2, the destination host.  This builds up a new VM.  The VM is still running on host 1.

imageWhile the VM remains running on host 1, the memory of the VM is broken down and tracked using a bitmap.  Each page is initially marked as clean.  The pages are copied from the running VM on host 1 to the new VM sitting paused on host 2.  Users and network applications continue to use the VM on host 1.  If a RAM page changes in the running VM on host 1 after it has been copied to host 2 then Windows changes the state from clean to dirty.  This means that Windows needs to copy that page again during another copy cycle.  After the first RAM page copy cycle, only dirty pages are copied. As memory is copied again it is marked as clean.  As it changes again, it is marked as dirty.  This continues …

So when does all this stop?

  1. The process will cease if all pages have been copied over from host 1 to host 2 and are clean.
  2. The process cease if there is only a tiny, tiny amount of memory left to copy, i.e. the state. This is tiny.
  3. The process will cease if it has done 10 iterations of the memory copy. In this scenario the VM is totally trashing it’s RAM and it might never have a clean bitmap or tiny state remaining.  It really is a worst case scenario.

Note: The memory is being copied over a GB network.  I talked about this recently when I discussed the network requirements for Live Migration and Windows Server 2008 R2 Hyper-V clusters.

Remember, the VM is still running on host 1 right now.  No users or network applications have seen any impact on uptime.

imageStart your stop watch.  This next piece is very, very quick.  The VM is paused on host 1.  The remaining state is copied over to the VM on host 2 and the files/disk are failed over from host 1 to host 2.

imageThat stop watch is still ticking.  Once the state is copied from the VM on host 1 to host 2 Windows will un-pause it on host 2.  Stop your stop watch.  The VM is removed from host 1 and it’s running away on host 2 as it had been on host 1.

Just how long was the VM offline between being paused on host 1 and un-paused on host 2?  Microsoft claims the time is around 2 milliseconds on a correctly configured cluster.  No network application will time out and no user will notice.  I’ve done quite a bit of testing on this.  I’ve pinged, I’ve done file copies, I’ve used RDP sessions, I’ve run web servers, I’ve got OpsMgr agents running on them and not one of those applications has missed a beat.  It’s really impressive.

Now you should understand why there’s this "long" running progress bar when you initiate a live migration.  There’s a lot of leg work going on while the VM is running on the original host and then suddenly it’s running on the destination host.

VMware cluster admins might recognise the above technique described above.  I think it’s pretty much how they accomplish VMotion.

Are there any support issues?  The two applications that come to mind for me are the two most memory intensive ones.  Microsoft has a support statement to say that SQL 2005 and SQL 2008 are supported on Live Migration clusters.  But what about Exchange?  I’ve asked and I’ve searched but I do not have a definitive answer on that one.  I’ll update this post if I find out anything either way.

Edit #1

Exchange MVP’s Nathan Winters and Jetze Mellema both came back to me with a definitive answer for Exchange.  Jetze had a link (check under hardware virtualization).  The basic rule is that a DAG (Data Availability Group) does not support hardware virtualisation if the hosts are clustered, i.e. migration of an Exchange 2010 DAG member is not supported.

One thought on “How W2008 R2 Live Migration Works”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.