Virtualisation Made A Server Replacement Easy

I’ve been running a “security” server for years in different jobs.  It’s a server that runs several security roles, for example, SUS and then WSUS, antivirus, certificate services, etc.  Very often these are different servers, quite unnecessarily eating up resources and licenses.

In my current job, our security server started life as an x86 Windows Server 2003 1U rack server.  Not long after the launch of our Hyper-V based private cloud, I ran a VMM 2008 P2V job to convert that machine into a virtual machine, freeing up the hardware for other purposes.  This was quite appropriate.  These sorts of servers are usually very lightweight.
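
That P2V job can also be driven from the VMM 2008 PowerShell snap-in.  Here is a minimal sketch: the server, host and path names are placeholders, the cmdlet names are as I remember them, and the exact parameter set for New-P2V should be checked with Get-Help before you rely on it.

  # Hedged sketch only: names are placeholders and the required parameters
  # for New-P2V vary, so run Get-Help New-P2V -Full before using this.
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver.demo.local" | Out-Null
  $cred = Get-Credential   # an account with admin rights on the physical source machine

  # Gather the source machine's configuration, then convert it to a VM on a chosen host
  $machineConfig = New-MachineConfig -SourceComputerName "secsvr01.demo.local" -Credential $cred
  $vmHost = Get-VMHost -ComputerName "hyperv01.demo.local"
  New-P2V -MachineConfig $machineConfig -VMHost $vmHost -Path "D:\VMs" -Name "SecSvr01" -Credential $cred -RunAsynchronously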

Earlier this year I decided to upgrade the machine to Windows Server 2008.  That was easy and safe.  I took a snapshot (knowing I had space on the LUN) and performed the upgrade.  Now it was running W2008 x86.  The upgrade went well.  If it hadn’t, I could easily have applied the snapshot to return the machine to W2003 and then deleted it.
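
The same safety net can be scripted with the VMM PowerShell snap-in.  A minimal sketch, assuming a placeholder VM name and the checkpoint cmdlets as I recall them (verify with Get-Help).

  # Sketch only: the VM name is a placeholder and the cmdlet names should be verified.
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver.demo.local" | Out-Null
  $vm = Get-VM -Name "SecSvr01"

  # Take the checkpoint (Hyper-V calls it a snapshot) before starting the upgrade
  $checkpoint = New-VMCheckpoint -VM $vm -Name "Pre-W2008-upgrade"

  # If the upgrade goes badly, roll back:
  # Restore-VMCheckpoint -VMCheckpoint $checkpoint

  # Once you are happy with the upgrade, tidy up so the AVHD gets merged away:
  # Remove-VMCheckpoint -VMCheckpoint $checkpoint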

Now I faced a challenge.  The next upgrade would be to Windows Server 2008 R2.  W2008 R2 is a 64-bit operating system and you cannot upgrade from 32-bit to 64-bit Windows.  There was only one choice – a rebuild.  Virtualisation made this so easy – and VMM 2008 R2 made it easier.

We have a Hyper-V lab server.  I use it to prep new images, test security updates, and to try out scenarios and solutions.  I deployed a VM running W2008 R2 Enterprise edition onto the host and configured the VLAN ID for our test network.  Enterprise edition would allow me to run customised certificates for OpsMgr usage.  Here I could specify the computer name to be the same as the machine I would eventually replace and prepare it identically to the original – excepting the operating system version and architecture.  On went SQL Express 2008 SP1 and our antivirus, and I prepared those services.  Downloads, approvals, patching, etc. were all done.  Meanwhile, the production server was still operating away, with customers unaware it was about to be replaced.

Eventually the new VM was ready.  I powered it down.  I removed the OpsMgr agent from the original server and then used VMM to move that VM elsewhere.  I used VMM to move the new VM onto the desired host.  All that was required now was to change the VLAN ID, boot it up, join it to our management network domain and deploy the OpsMgr agent.  Ten minutes of service downtime in total to completely replace a server.  Not bad!  I went on to add Certificate Services after the domain join.

I’m leaving the original VM to one side just in case there’s a problem.  If so I can bring it back – but that would then require some ADSIEDIT surgery to remove the certificate services configuration.  So far, though, so good.

Can You Install Hyper-V in a VM?

The answer is sort of.  Strictly speaking it is possible.  You can indeed enable the Hyper-V role in a Server Core installation of Windows Server 2008 and Windows Server 2008 R2.  I’ve done it with both operating systems, on both VMware Workstation 6.5 and Hyper-V.  Logically this means you can deploy Hyper-V Server 2008 and Hyper-V Server 2008 R2 in a VM.

You can even create VMs on those hosts.  However, the required hardware virtualisation features are not passed through to the VMs and therefore the hypervisor never starts up.  That means you cannot start those VMs.

Why would you care?  You certainly cannot do it in a production scenario.  But you might find it handy when doing some demos, lab work or testing of clustering or VMM.

EDIT:

I have been told (but I have not tried this, so I cannot say it will work) that you can get Hyper-V to install and run in an ESXi 3.X virtual machine.  The performance is said to be awful, but it might be useful for a lab with limited hardware.

Cannot Delete Cluster Object From Operations Manager 2007

I recently decommissioned a Windows Server 2008 Hyper-V cluster.  It was monitored by OpsMgr 2007 R2.  When we shut down the last cluster node I tried to remove both its agent object and the agentless managed cluster object from OpsMgr administration.  I couldn’t.  The cluster just refused to disappear.  The server agent wouldn’t delete because there was a remaining dependency – the cluster object which relied on it as a proxy.

It had a red state (ruining my otherwise all-green status view) and, more annoyingly, many of the migrated resources (VMs) still seemed to be linked to the old cluster despite being moved to the new cluster.

I searched and found lots of similar queries.  The official line from MS is that there is no supported way to do this deletion.  There is a hack but the instructions didn’t work for me – I couldn’t find the key piece of info – plus it is unsupported.

So I uninstalled the agent manually.  No joy.  I waited.  No joy.  I rebuilt the server and added it to our Windows Server 2008 R2 Hyper-V cluster.  No joy.  I installed the OpsMgr agent and enabled the proxy setting.

That was yesterday.  This morning I logged in and the old cluster object is gone.  Vamoose!  I guess OpsMgr figured out that the server was now in a new cluster and everything was good.

VMM 2008 R2 Quick Storage Migration

Without System Center Virtual Machine Manager 2008 R2 (and on pre-Windows Server 2008 R2 Hyper-V) there is only one way to move a virtual machine between un-clustered hosts or between Hyper-V clusters.  That is to perform what is referred to as a network migration.  Think of this as an offline migration.  The VM must be powered down, exported, the files moved, the VM imported again and powered up, maybe with the integration components being manually added.  The whole process means a production VM can be offline for a significant amount of time.  Moving a 100GB VHD takes time, even over 10 GbE.

However, if you have Windows Server 2008 R2 (on both the source and destination hosts) and VMM 2008 R2 then you can avail of Quick Storage Migration.

This is a clever process where a VM can remain up and running for the bulk of the file move.  Microsoft claims that the VM only needs to be offline for maybe two minutes.  That really does depend, as you’ll see.

We need to discuss something first.  Hyper-V has several different types of virtualised storage.  One of them is a type of virtual hard disk (VHD) called a differencing disk, specifically an AVHD (automatic virtual hard disk).  It is used during a snapshot.  That’s a Hyper-V term.  VMM refers to it as a checkpoint.  The AVHD is created and the VM switches all write activity from its normal VHD to the AVHD.  All new data goes into the AVHD.  All reads for old data come from the original VHD.  That means the VHD is no longer write-locked, so it can be copied.  See where we’re going here?

Here we have two un-clustered host machines, 1 and 2.  Host 1 is running a VM which has a single VHD for all of its storage.  We want to move it from Host 1 to Host 2 with the minimum amount of downtime.  We have W2008 R2 Hyper-V on both hosts and manage them with VMM 2008 R2.

We open up the VMM 2008 R2 console, right-click on the VM and select Migrate.  In the wizard we select Host 2 as the destination and select the storage destination and the Virtual Network connection(s).  Once we finish the wizard, the migration job starts.
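
The same move can be started from the VMM 2008 R2 PowerShell snap-in instead of the wizard.  A minimal sketch, with placeholder VM, host and path names (as I understand it, VMM picks the most suitable transfer type itself when both hosts run W2008 R2).

  # Minimal sketch: the VM, host and path names are placeholders for your environment.
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver.demo.local" | Out-Null

  $vm       = Get-VM -Name "FileServer01"
  $destHost = Get-VMHost -ComputerName "host2.demo.local"

  # Start the migration as a background VMM job and watch its progress in the Jobs view
  Move-VM -VM $vm -VMHost $destHost -Path "D:\VMs\FileServer01" -RunAsynchronously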

The VMM job creates a checkpoint (AKA snapshot) of the VM to be migrated.  This means the VM will put all writes in the AVHD file.  All reads of non-changed data will be from the VHD file.  Now the VHD file is no longer prevented from being copied.

The VMM job uses BITS to copy the no-longer write-locked VHD from Host 1 to the destination storage location on Host 2.  During this time the VM is still running on Host 1.

Here’s where you have to watch out.  That AVHD file will grow substantially if the VM is writing like crazy.  Make sure you have sufficient disk space.  Anyone still doing 1-VM-per-LUN cluster deployments will need to be really careful, and maybe pick a specific storage location for snapshots that has space.  Once the physical disk fills, the VM will be paused by Hyper-V to protect its continuity.  If your VM is write-happy then pick a quiet time for this migration.
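
A quick sanity check of free space on the volume holding the VHD and AVHD before you kick off the job can save you that pause.  A sketch, with a placeholder host name and drive letter:

  # Quick free-space check: the host name and drive letter are placeholders.
  $disk = Get-WmiObject -Class Win32_LogicalDisk -ComputerName "host1.demo.local" -Filter "DeviceID='D:'"
  "{0:N1} GB free of {1:N1} GB on {2}" -f ($disk.FreeSpace / 1GB), ($disk.Size / 1GB), $disk.DeviceID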

Start your stopwatch.  Now the VM is put into a saved state (not paused) on Host 1.  We have to move that AVHD, which is otherwise write-locked.  If we don’t move it then we lose all the data written since the job started.  Again, BITS is used by VMM to move the file from Host 1 to Host 2.

When the files are moved, VMM will export the configuration of the VM from Host 1 and import it onto Host 2.

The checkpoint (AKA snapshot) is deleted.  The VM needs to be offline here; otherwise the AVHD could not be merged into the VHD, and leaving it unmerged would eventually kill the performance of the VM.  Because the machine is offline, the AVHD can be merged into the VHD and all those writes are stored away safely.

Stop your stopwatch.  The virtual network connection(s) are restored and then the very last step is to change the virtual machine’s running state, bringing it back to where it was before it went offline.

The entire process is automated from when you finish the wizard up to when you check on the machine after the job has ended.  Its storage is moved and the VM continues running on the new host.

Note that a VM with multiple VHDs will have multiple AVHDs; it’s a 1-to-1 relationship.

How long does this take?

  • The offline time depends on how much data is written to the AVHD, how fast your network can transmit that AVHD from Host 1 to Host 2 and how fast the disk on Host 2 is at merging the AVHD back into the VHD.
  • The entire process takes as long as it takes to copy the VHD, complete the AVHD move and merge, and do the tidy-up work at the end of the job.

In my tests with an idle VM, the offline time (not timed scientifically) felt like it was under a minute.

I moved a VM from a cluster to an un-clustered lab machine and back again.  Both times the highly available setting was changed appropriately, and I was able to modify the virtual network connections in the migrate wizard.

Live Migrations Are Serial, Not Concurrent

Normally when you move two VMs from one host to another using Live Migration they move one at a time.  Yes, the VMM job for the second machine pauses at 50% for a while – that’s because it hasn’t started to replicate memory yet.  The live migrations are serial, not concurrent.  The memory of a running VM is being copied across a network, so the network becomes a bottleneck.

I ran a little test across three Windows Server 2008 R2 Hyper-V cluster nodes to see what would happen.  I started moving a VM from Host A to Host C.  I also started moving a VM from Host B to Host C.  The first one ran straight through.  The second one paused at 50% until the first one had moved – just like moving two VMs from one host to another.

Adding a Node To A VMM Managed Hyper-V Cluster

I’ve just gone through this process so I thought I’d document what I did:

  • Have a test VM ready and running on the cluster.  You’ll be moving it around to/from the new node.  Don’t use a production machine in case something doesn’t work.
  • Build the new node.  Set up hardware, drivers and patches, making sure the machine is identical to the other nodes in the cluster.  I mean identical.
  • Enable the Hyper-V role and the Failover Clustering feature (see the sketch after this list).
  • Configure the virtual networks to be identical to those on the other nodes – VMM won’t do this in the “add” step and we know it messes up the configuration of External networks.
  • Use the SAN manager to present all cluster disks to the new node.
  • Put the cluster, Hyper-V cluster nodes and VMM server into maintenance mode in OpsMgr.
  • Add the new node to the cluster in Failover Cluster Manager.  Modify the cluster quorum settings to the recommended configuration.
  • Refresh the cluster in VMM 2008 R2 and wait for the new node to appear under the cluster in a pending state.
  • Right-click on the new pending node and select Add Node To Cluster.  Enter administrator credentials (good for all nodes in the cluster).  VMM runs a job to deploy the VMM agent.
  • If everything is good and matches up (watch out for virtual networks) then you won’t see the dreaded “Unsupported Cluster Configuration” error.
  • Move that test VM around from the new node to all the other nodes and back again using Live Migration.
  • Re-run the validation tests against your cluster ASAP.
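
The Windows side of that can be scripted on the new node with the built-in Windows Server 2008 R2 PowerShell modules.  A rough sketch, with placeholder cluster and node names (not a tested build script):

  # Run on the new node.  Cluster and node names are placeholders.
  Import-Module ServerManager
  Add-WindowsFeature Hyper-V, Failover-Clustering -Restart

  # After the reboot, join the existing cluster and then re-run validation
  Import-Module FailoverClusters
  Add-ClusterNode -Cluster "HVCluster1" -Name "NewNode01"
  Test-Cluster   # run from a node that is already a member of the cluster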

All should be well at this point.  If so, deploy your OpsMgr agent and take the OpsMgr agents out of maintenance mode.

How W2008 R2 Live Migration Works

Let’s recap the different types of migration that we can get with Windows Server Hyper-V and System Center Virtual Machine Manager:

  • Quick Migration: Leveraging Windows Failover Clustering, a VM is treated as a clustered resource.  To quick migrate, the running state is saved to disk (hibernating the VM), the disks are failed over to another node in the cluster, and the saved state is loaded (waking up the VM).
  • Offline Migration: This is when we use VMM to move a powered-down VM from one un-clustered Hyper-V server to another or from one cluster to another.
  • Quick Storage Migration: This is a replacement for Offline Migration for Windows Server 2008 R2 Hyper-V servers when using VMM 2008 R2.  A running VM can be moved from one un-clustered host to another or from one cluster to another with only around two minutes of downtime.
  • Live Migration: This is the process of moving a virtual machine from one cluster node to another with no perceivable downtime to network applications or users.  VMware refers to this as VMotion.  It was added in Windows Server 2008 R2 Hyper-V and is supported by VMM 2008 R2.

Live Migration was the big stick that everyone used to beat up Windows Server 2008 Hyper-V.  A few seconds of downtime for a quick migration was often good enough for 75%-90% of VMs but not for 100%.  But you can relax now; we have Live Migration.  I’m using it in production and it is good!  I can do host maintenance and enable completely automated PRO tips in VMM without worrying about any downtime, no matter how brief, for VMs.  So how does Live Migration work?  Let’s take a look.

We start with a virtual machine running on host 1.  It has a configuration and a “state”.

When we initiate a live migration, the configuration of the VM is copied from host 1 to host 2, the destination host, while the VM is running.  This builds up a new VM on host 2.  The VM is still running on host 1.

While the VM remains running on host 1, the memory of the VM is broken down and tracked using a bitmap.  Each page is initially marked as clean.  The pages are copied from the running VM on host 1 to the new VM sitting paused on host 2.  Users and network applications continue to use the VM on host 1.  If a RAM page changes in the running VM on host 1 after it has been copied to host 2 then Windows changes its state from clean to dirty.  This means that Windows needs to copy that page again during another copy cycle.  After the first RAM page copy cycle, only dirty pages are copied.  As memory is copied again it is marked as clean.  As it changes again, it is marked as dirty.  This continues …

So when does all this stop?

  1. The process will cease if all pages have been copied over from host 1 to host 2 and are clean.
  2. The process will cease if there is only a tiny, tiny amount of memory left to copy, i.e. the state.
  3. The process will cease if it has done 10 iterations of the memory copy.  In this scenario the VM is thrashing its RAM so heavily that it might never have a clean bitmap or a tiny remaining state.  It really is a worst-case scenario.

Note: The memory is being copied over a gigabit network.  I talked about this recently when I discussed the network requirements for Live Migration and Windows Server 2008 R2 Hyper-V clusters.

Remember, the VM is still running on host 1 right now.  No users or network applications have seen any impact on uptime.

Start your stopwatch.  This next piece is very, very quick.  The VM is paused on host 1.  The remaining state is copied over to the VM on host 2 and the files/disk are failed over from host 1 to host 2.

That stopwatch is still ticking.  Once the state is copied from the VM on host 1 to host 2, Windows will un-pause it on host 2.  Stop your stopwatch.  The VM is removed from host 1 and it’s running away on host 2 as it had been on host 1.

Just how long was the VM offline between being paused on host 1 and un-paused on host 2?  Microsoft claims the time is around 2 milliseconds on a correctly configured cluster.  No network application will time out and no user will notice.  I’ve done quite a bit of testing on this.  I’ve pinged, I’ve done file copies, I’ve used RDP sessions, I’ve run web servers, I’ve got OpsMgr agents running on them and not one of those applications has missed a beat.  It’s really impressive.

Now you should understand why there’s this "long" running progress bar when you initiate a live migration.  There’s a lot of leg work going on while the VM is running on the original host and then suddenly it’s running on the destination host.
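
If it helps, here is a toy simulation of that pre-copy loop.  Nothing in it is a real Hyper-V or VMM API; it just mimics the copy-dirty-pages-until-almost-done behaviour and the stop conditions described above.

  # Toy simulation only: 1000 imaginary "pages", re-dirtied at random while we copy.
  $pageCount = 1000
  $dirty = @($true) * $pageCount        # pass 1: every page still needs to be copied

  for ($pass = 1; $pass -le 10; $pass++) {
      $toCopy = @($dirty | Where-Object { $_ }).Count
      Write-Host "Pass $pass - copying $toCopy dirty pages to the destination host"

      $dirty = @($false) * $pageCount   # everything we just copied is clean again

      # ...but the VM keeps running, so a (shrinking) set of pages gets dirtied during the copy
      $redirtied = [Math]::Max(1, [int](400 / [Math]::Pow(2, $pass)))
      0..($pageCount - 1) | Get-Random -Count $redirtied | ForEach-Object { $dirty[$_] = $true }

      if (@($dirty | Where-Object { $_ }).Count -le 5) { break }   # only a tiny state left: stop
  }

  Write-Host "Pause the VM, copy the last of the state, fail the storage over, resume on the destination."

That is all the hypervisor’s job, of course; the point is only to show why most of the work happens while the VM is still running.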

VMware cluster admins might recognise the technique described above.  I think it’s pretty much how they accomplish VMotion.

Are there any support issues?  The two applications that come to mind for me are the two most memory intensive ones.  Microsoft has a support statement to say that SQL 2005 and SQL 2008 are supported on Live Migration clusters.  But what about Exchange?  I’ve asked and I’ve searched but I do not have a definitive answer on that one.  I’ll update this post if I find out anything either way.

Edit #1

Exchange MVPs Nathan Winters and Jetze Mellema both came back to me with a definitive answer for Exchange.  Jetze had a link (check under hardware virtualization).  The basic rule is that a DAG (Database Availability Group) does not support hardware virtualisation if the hosts are clustered, i.e. migration of an Exchange 2010 DAG member is not supported.

Virtualisation: The Undersold Truth

Ease of administration.

To a sys admin, those 3 words mean a lot.  To a decision maker like a CIO or a CFO (often one and the same) they mean nothing.

It’s rare enough that I find myself working with physical boxes these days.  Most everyone is looking for a virtualised service, which is cool with me.  Over the last 2 weeks I’ve been doing some physical server builds with Windows Server 2008 R2.  I know the techniques for an automated installation.  I just haven’t had time to deploy them for the few builds I needed to do.  Things like Offline Servicing for VMs and MDT/WDS (upgrade) are in my plans but things had to be prioritised.  I’ve just kicked off a reboot of a blade server.  By the time it’s finished its POST I’ll have made and half drunk a cup of coffee.  After working with VMs almost exclusively for the last 18 months, working with a physical box seems slow.  These are fine machines but the setup time required drags.  Those reboots take forever!  VM reboots: well, there’s no POST and they reboot extremely quickly.

Let’s compare the process of deploying a VM and a physical box.

Deploy a VM

  • Deploy a VM.
  • Log in and tweak.
  • Hand over the VM.

Notes on this:

  • The free Offline Servicing Tool can allow you to deploy VMs that already have all the security updates.
  • This process can be done by a delegated “end user” using the VMM self-service web interface.
  • The process probably takes just an hour or two from end to end (a scripted sketch follows these notes).
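
For completeness, here is roughly what that looks like when scripted against VMM instead of using the self-service portal.  The template, host and path names are placeholders and the New-VM parameter set should be checked with Get-Help New-VM before use.

  # Hedged sketch: names are placeholders, check Get-Help New-VM -Full for the template parameter set.
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver.demo.local" | Out-Null

  $template = Get-Template | Where-Object { $_.Name -eq "W2008R2-Standard" }
  $vmHost   = Get-VMHost -ComputerName "hyperv01.demo.local"

  New-VM -Template $template -Name "NewApp01" -VMHost $vmHost -Path "D:\VMs" -RunAsynchronously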

Deploy a Physical Server

  • Create a purchase request for a new server.
  • Wait 1-7 days for a PO number.
  • Order the server.
  • Wait for up to 7 days for the server to be delivered.
  • Rack, power and network the server.
  • We’ll assume you have all your ducks in a row here: Use MDT 2010 or ConfigMgr to deploy an operating system.
  • The OS installs and the task sequence deploys updates (reboots), then applications (reboots), then more updates (reboots) and then makes tweaks (more updates and a reboot).
  • You hand over the server.

Notes on this:

  • Most people don’t automate a server build.  Manual installs typically take 1 to 1.5 days.
  • There will probably be up to 1 day of a delay for networking.
  • The “end user” can’t do self service and must wait for IT, often getting frustrated.
  • The entire process will probably take 10.5 to 16.5 days.

Total Hardware Breakdown

Let’s assume the VM scenario used a cluster.  If a hardware failure crashes the host then the VM stops running.  The cluster moves the VM resource to another host (VMM will choose the most suitable one) and the VM starts up again.  Every VM on the cluster has hardware fault tolerance.  If the hardware failure is non-critical then you can use Live Migration to move all the VMs to another host (VMM 2008 R2 maintenance mode) and then power down the host to work on it.  There’s no manual intervention at all in keeping things running.

What if you used standalone (un-clustered) hosts?  As long as you have an identical server chassis available you can swap the disks and network cables to get back up and running in a matter of minutes.

In the absolute worst-case scenario with un-clustered hosts, you can take the data disks, slap them into another machine and do some manual work to get running again.  As long as the processor is from the same manufacturer you’re good to go in a few hours.

If a physical box dies then you can do something similar to that.  However, physical boxes tend to vary quite a lot, whereas a farm of virtualisation hosts usually doesn’t vary much at all.  If a DL380 dies, can you expect to put the disks into a DL160 and get a good result?  It might work.

Most companies don’t purchase the “within 4 hours” response contracts.  And even if they do, some manufacturers will do their very best to avoid sending anyone out by asking for one diagnostic test after another and endless collections of logs.  It could be 1 to 3 days (and some angry phone calls) before an engineer comes out to fix the server.  In that time the hosted application has been offline, negatively affecting the business and potentially your customers.  If only a physical server was a portable container like a VM – see boot from VHD.

Summary

You’ve heard all those sales lines on virtualisation: carbon footprint, reduced rack space, lower power bills, etc.  Now you can see how easier administration can not only make your life easier but also positively impact the business.

My experience has been that when you translate techie-speak into Euros, Dollars, Pounds, Rubles, Yen or Yuan, that gets the budget owner’s attention.  The CFO will sit up and listen and probably decide in your favour.  And if you can explain how these technologies will have real positive impacts on the business then the other decision makers will also give you their attention.

Finished Our W2008 R2 Hyper-V Cluster Migration

Last night we finished migrating the last of the virtual machines from our Windows Server 2008 Hyper-V cluster to the new Windows Server 2008 R2 Hyper-V cluster.  As before, all the work was done using System Center Virtual Machine Manager (VMM) 2008 R2.  The remaining host has been rebuilt and is half way to being a new member of the R2 Hyper-V cluster.

I also learned something new today.  There’s no supported way to remove a cluster from OpsMgr 2007.  Yuk!

Using VMM To Convert VHD Types

One of the most common queries I used to get on my old blog was “how do I convert Hyper-V disks?”.  Converting a VHD is easy enough.  In the Hyper-V console you shut down the VM, edit the disk and select a location for the new VHD.  Once that’s done you can rename the old disk and grant its old name to the new disk.  Start up the VM and it’s using the new disk.  Remember to remove the old disk to save disk space.

Before you even think about this you need to be sure you have enough space for the new disk of the desired type to be created.  How much space will it need?  Check how much data is on that disk (in the VM’s OS) and allow for another GB or two to be safe.  This applies to both the Hyper-V console and VMM.

VMM is a bit more elegant.  You shut down your VM, hopefully from the OS itself or via the ICs, rather than just turning it off.  Then you edit the properties of the VM and navigate to the disk in question.

Here I have opened up the properties of the VM that I wanted to work on and navigated to the disk in question.  It’s a fixed VHD and I want to replace it with a dynamic VHD without losing my data.  On the right-hand side of the dialog there is a tick box to convert the disk.  I ticked this and clicked on OK.

Notice that there is also a tick box to expand the VHD?  That’s how we can grant more space to a VM.  Remember to follow that up by running DISKPART in the VM to expand the volume.  That’s nothing to do with the convert task but I thought I’d mention it.
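
Both of those tick boxes have equivalents in the VMM 2008 R2 PowerShell snap-in.  A hedged sketch: the VM name is a placeholder and the cmdlet and parameter names are as I recall them, so verify with Get-Help before using them.

  # Hedged sketch: names are placeholders; check Get-Help Convert-VirtualDiskDrive first.
  Add-PSSnapin Microsoft.SystemCenter.VirtualMachineManager
  Get-VMMServer -ComputerName "vmmserver.demo.local" | Out-Null

  $vm  = Get-VM -Name "FileServer01"                      # the VM must be shut down first
  $vdd = Get-VirtualDiskDrive -VM $vm | Select-Object -First 1

  # The convert tick box: fixed to dynamic (use -Fixed for the reverse)
  Convert-VirtualDiskDrive -VirtualDiskDrive $vdd -Dynamic

  # The expand tick box: grow the VHD, then run DISKPART inside the VM to extend the volume
  # Expand-VirtualDiskDrive -VirtualDiskDrive $vdd -VirtualHardDiskSizeGB 60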

Once I clicked on OK the job ran.  How long this takes depends on the amount of data we’re dealing with.

VMM is pretty clever here.  It will convert the disk and then swap out the old disk with the new disk.  The old disk is removed.  This is a much less manual task than using the Hyper-V console.

Once the job is done you should check out your VM.  The disk is now a dynamic VHD.  And notice how much space I’ve saved?  I’ve gone from 20GB down to 12.5GB.  I’ve just saved my employer around 40% of the cost of storing that VM with a couple of mouse clicks while waiting for my dinner.  That goes to back up my recent blog post about simpler and more cost-effective storage.  And like I said then, I’ve lost nothing in performance because I am running Windows Server 2008 R2.