Keeping Hyper-V Storage Simplified and More Economic

I hate when people talk about disk being cheap.  I want to just smack them.  Disk for laptops and PC’s is cheap.  Disk for fault tolerant server computing is far from cheap.

If you’re running a Hyper-V cluster then you know that you can’t just go out and buy some cheap storage.  You’re looking at shared storage with cluster support.  That means either a Fibre Channel or iSCSI SAN.  And then there’s the disk.  You could go budget and use SATA or worry about performance and go with SAS.  Odds are it will be the latter which provides less storage for a higher price.  Add in fault tolerance and you’ve bought more expensive HBA’s, doubled your switch port requirements, etc.

At work we’re using a HP EVA with fibre channel connections via dual port HBA’s and 15K disks.  We had to worry about the ability to scale.  We’re a hosting company so unlike most virtualisation projects we had no end point.  I couldn’t say “we have 20 machines to virtualise”.  As a hosting company, 20 VM’s wouldn’t make much and wouldn’t justify the existence of a company.  We also had to worry about performance.

When we deployed we used Windows Server 2008 Hyper-V.  As you may know, the recommended (and VMM required) way to deploy storage for Hyper-V in a cluster is 1 LUN per VM.  It’s a storage nightmare.  I had to be paranoid about our processes.  LUN’s, volumes, failover cluster storage definitions and VM’s all had identical names based on a naming standard I’ve been using for years.  But even with all that, I had that nightmare that I’d accidentally wipe a hosted customer’s VM if I deleted the wrong LUN on the SAN or someone wasn’t careful with documentation.
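That fear of deleting the wrong LUN is the sort of thing a small pre-delete check can guard against.  Here is a minimal sketch in Python, assuming the names on the SAN, in the cluster and in Hyper-V have been exported into plain sets.  The function name, the sample names and the inventories are all made up for illustration; this is not a real SAN or cluster API.

```python
def safe_to_delete(lun_name, san_luns, cluster_disks, vm_names):
    """Return (ok, reason) for a proposed LUN deletion.

    Hypothetical rule: a LUN may only be deleted when it exists on the SAN
    and no cluster disk or VM still carries the same name.
    """
    if lun_name not in san_luns:
        return False, "no such LUN on the SAN"
    if lun_name in cluster_disks:
        return False, "still defined as cluster storage"
    if lun_name in vm_names:
        return False, "a VM with this name still exists"
    return True, "no matching cluster disk or VM"


# Made-up inventories that follow a one-name-everywhere standard.
san_luns = {"CUST01-WEB01", "CUST02-SQL01"}
cluster_disks = {"CUST01-WEB01"}
vm_names = {"CUST01-WEB01"}

print(safe_to_delete("CUST02-SQL01", san_luns, cluster_disks, vm_names))
print(safe_to_delete("CUST01-WEB01", san_luns, cluster_disks, vm_names))
```

The check only works because the naming standard keeps every layer identical; if the names drift, it tells you nothing.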

Windows Server 2008 R2 gives us the Cluster Shared Volume (CSV) for Hyper-V.  Note the last bit; CSV is custom written by MS to be only used for Hyper-V.  Instead of creating lots of LUNS in the SAN and managing lots of failover cluster storage definitions and tracking documentation, you simply carve out a single LUN and deploy lots of VM’s onto it.  I’ve deployed a single LUN on our new cluster.  It’s set up as a GPT volume and used as a CSV on our W2008 R2 cluster.  That means we can simply deploy lots of VM’s onto it without worrying.  Deleting VM’s doesn’t bring me to the EVA command view console to worry over accidentally deleting volumes.  I just don’t need to go there anymore unless adding more space to the CSV.  And sure, OpsMgr will let me know when I need to do that!

Another nice perk of the CSV is that self service deployment on a Hyper-V cluster becomes a real world possibility.  Sure, you could have allocated lots of individual LUNS in W2008.  But wouldn’t it be a waste if a person deployed a 50GB VM onto a 200GB LUN?
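To put numbers on that waste, here is a rough sketch in Python.  The VM sizes and the 800GB volume are invented for the example; only the 50GB VM on a 200GB LUN comes from the scenario above.

```python
def stranded_gb_per_lun(vm_sizes_gb, lun_size_gb):
    """Space left unusable when every VM gets its own fixed-size LUN."""
    for size in vm_sizes_gb:
        if size > lun_size_gb:
            raise ValueError(f"{size}GB VM does not fit a {lun_size_gb}GB LUN")
    return sum(lun_size_gb - size for size in vm_sizes_gb)


def free_gb_on_shared_volume(vm_sizes_gb, volume_size_gb):
    """Space still available to any VM when all VMs share one volume."""
    return volume_size_gb - sum(vm_sizes_gb)


vm_sizes = [50, 120, 200, 80]  # hypothetical self-service deployments

print(stranded_gb_per_lun(vm_sizes, 200))       # 350
print(free_gb_on_shared_volume(vm_sizes, 800))  # 350
```

Both approaches use 800GB of SAN space here.  The difference is that the 350GB left over on the shared volume can be used by the next VM, while the 350GB spread across individual LUNs is stuck where it is.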

What about performance?  You shouldn’t suffer at all really.  Our EVA uses a concept where all storage is striped across all disks in a disk group.  RAID is a secondary decision when you create the LUN.  So, when you put lots of VM’s on a single CSV, they’ll all use all physical disks in the disk group.

There’s also another thing you can do.  You can split storage for VM’s across different CSV’s.  Maybe you have RAID1, RAID5 and RAID6 CSV’s.  They all are suited for different kinds of storage, even if it is virtual.  So now your virtualised SQL could have the OS and log files on a RAID1 CSV and the data files on a RAID5 CSV.  You get the most from the physical storage while maintaining disk performance.
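As a sketch of how that split might be written down, the Python below maps each kind of SQL file to a CSV by RAID level.  The Volume1/Volume2 assignment, the VM name SQL01 and the VHD file naming are assumptions for illustration, not anything Hyper-V or VMM enforces.

```python
from pathlib import PureWindowsPath

# Hypothetical layout: which CSV mount point is backed by which RAID level.
CSV_BY_RAID = {
    "RAID1": PureWindowsPath(r"C:\ClusterStorage\Volume1"),
    "RAID5": PureWindowsPath(r"C:\ClusterStorage\Volume2"),
}

# OS and log files on RAID1, data files on RAID5, as described above.
PLACEMENT = {"os": "RAID1", "logs": "RAID1", "data": "RAID5"}


def vhd_path(vm_name, role):
    """Build the path where the VHD for a given role would be stored."""
    csv_root = CSV_BY_RAID[PLACEMENT[role]]
    return csv_root / vm_name / f"{vm_name}-{role}.vhd"


for role in ("os", "logs", "data"):
    print(vhd_path("SQL01", role))
```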

In Windows Server 2008, dynamic disks did not meet our requirements for storage because they were too slow.  So we went with fixed disks.  I think our experience is not unique in any regard.  Our customers often don’t know what their storage requirements are going to be.  We advise them to consume only what they need and take more disk later; it’s a quick operation to add space to a VHD.  But even then, physical disk gets wasted.  Consider a Windows Server C drive.  The recommended minimum is 40GB.  Our average consumption is less than 50% of that.  That means over 50% of the physical disk for the C: drive is wasted.  For example, a person with a 40GB C drive who is storing 13GB in the VM is really using 40GB of physical disk, plus whatever to allow for free space and save states.

Windows Server 2008 R2 offers greater performance for dynamic disks.  In fact, they nearly match the performance of fixed size disks.  We’ve made the switch so we can keep costs down.  That means that hypothetical consumer of 13GB of disk is really only consuming slightly more than that.  Across the VM’s I’ve observed so far, we could save around 50% of the disk allocated to C: drives.  Data drives are hard to figure.  Early on we definitely save but storage requirements only ever go up.  But we are definitely saving there too.
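The fixed versus dynamic arithmetic for that 40GB C: drive can be sketched in a few lines of Python.  The 1GB of overhead for the dynamic VHD is a guess used only to stand in for "slightly more"; it is not a measured figure.

```python
def physical_gb(allocated_gb, used_gb, dynamic, overhead_gb=0):
    """Approximate physical disk consumed by one VHD.

    A fixed VHD takes its full allocation up front.  A dynamic VHD takes
    roughly what has been written plus some overhead, capped at the allocation.
    """
    if not dynamic:
        return allocated_gb
    return min(allocated_gb, used_gb + overhead_gb)


fixed = physical_gb(40, 13, dynamic=False)
dyn = physical_gb(40, 13, dynamic=True, overhead_gb=1)  # overhead is assumed
saving = fixed - dyn

print(fixed, dyn, saving)                  # 40 14 26
print(round(100 * saving / fixed), "%")    # 65 %
```

This ignores the extra space needed for free space and save states, which applies either way.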

Oh yes, you can convert from a fixed VHD to a dynamic VHD.  You just need to bring down the VM (which is why we won’t do this to existing customers) and have space for the new VHD to be created.

So CSV simplifies storage administration.  Especially if you are in an organisation where SAN management is split from server management.  Using Dynamic Disks allows you to consume only the physical disk that you need for the data you’ve stored.  Add in other things like Live Migration, Core Parking, SLAT, improved networking, etc and you might want to do the cost benefit analysis of upgrading from Windows Server 2008 Hyper-V to Windows Server 2008 R2 Hyper-V.  The costs of those licenses could be easily negated by the savings you’ll make on storage costs (literally consume what you use) and reduced administration.

Gone Into Production: W2008 R2 Hyper-V Cluster

At 11pm GMT last night, we put our new Windows Server 2008 R2 cluster into production.  We used Virtual Machine Manager (VMM) 2008 R2 to migrate the first machines from our W2008 cluster to the W2008 R2 cluster.  We’re a hosting company so we had to do this at times that suited the customers and we had to do some other steps so their “sites” were not unresponsive.

The VMM moves ran pretty well.  One of the machines failed to install the updated IC’s in the job so I reran the IC upgrade by itself.  Once each machine moved over to the new cluster (on the CSV) I tested live migration.  These were all web servers so the tests were simple – RDP into the machines via VPN, run a continuous ping from them to their respective default gateways and refresh websites from a browser while the migration was running.  RDP didn’t have a disconnect or a hitch, ping didn’t miss a packet and none of the IE refreshes failed.  All worked well.

The real test would be what would happen over night.  As usual, the phone stayed close by.  My real dread was seeing my inbox when I would come down in the morning.  Would it be full of alerts from OpsMgr?  We use OpsMgr 2007 R2 to monitor server hardware, virtualisation, operating systems, services, applications and to do some client side perspective monitoring of websites.  One of the migrated customers is a web developer/hoster with a lot of sites.  They’ve identified a decent number of critical sites for client perspective monitoring.  Any problems at all over night and Outlook would be a scary proposition.

I might have only gotten to sleep at 02:00 but I was awake at 07:30.  I came downstairs and powered up my work laptop.  Outlook had … no new mails.  Phew, what a relief!  I was very confident after a rigorous test program but you never know when you make a big change.  I feel good now about completing this migration, hopefully next week.

Going Into Production With Windows Server 2008 R2 Hyper-V Cluster

I’m happy enough now with our W2008 R2 Hyper-V cluster that I’m putting it into production tomorrow night.  We’ll be migrating some of our production machines from the old W2008 cluster to the new cluster.  Today I deployed OpsMgr agents onto the hosts and did some more testing.

OpsMgr and VMM don’t synchronise their maintenance modes.  I submitted feedback suggesting that this would be good.  I also noticed that even if both System Center products had a node in maintenance mode, the VMM management pack would alert when that node rebooted.  Ouch.  That’s a bit painful.  I also submitted feedback on that.

So far, I haven’t had any problems with CSV or Live Migration.  Everything has worked fine.  One tip I’ve picked up on is to set a static MAC on Linux guests.  SUSE 10 SP2 binds the IP configuration to the MAC address and a change due to any sort of VMM/Hyper-V migration can screw it up – I’ve seen this with an export/import.

So 11PM tomorrow, the first production machine moves over, followed by the second at midnight.  Hopefully there won’t be any calls on Saturday morning!

Share ISO Images From VMM

Last year I blogged about this.  I had difficulties getting this working so I fired a question to MS on the subject.  For any MS person reading, the case number was case#SRX081210600013.  The PSS engineer said this was not possible.  I would have to continue the time and space consuming process of copying the ISO files over.  That sucked.

I’d since read on one of the MS blogs that sharing an ISO or DVD image over the network from Virtual Machine Manager was actually possible.  The required configuration was blogged by Jose Barreto.  What you need to do is edit the properties of the AD computer account object of every Hyper-V server managed by VMM.  Edit the delegation and configure constrained delegation.  Add the names of the VMM library server(s) and add them with the CIFS (file sharing) protocol.  To be safe I did a reboot of the hosts (live migration rocks!).

I finally had an opportunity to deploy this configuration.  I tested and I was then able to share an ISO over the network.

You’ll note that Jose didn’t actually do this for VMM.  His example was where he was using the Hyper-V console to access file server resources, e.g. VHD (not supported in production) or ISO’s.

EDIT #1

Make sure that either the computer account of the Hyper-V host or EVERYONE has at least read access to the library share(s).

Boot Hyper-V Server 2008 R2 From USB

Ben Armstrong has posted an article on this subject.  There is a complicated TechNet method and there is a simpler tool you can use.  Running Windows Server from USB is completely unsupported.  This is intended only for Hyper-V Server 2008 R2.  If using portable USB then beware that you really shouldn’t go from machine to machine with this – it isn’t supported and it messes up virtual switches. 

Windows Server 2008 to 2008 R2 Hyper-V Migration

I’ve previously talked about the process of going from a W2008 to a W2008 R2 Hyper-V cluster.  Today, I’ve tested the process out from end to end.  I set up a VM on the W2008 cluster and made sure the integration components were updated by VMM 2008 R2.  I then went through this process:

  • I shut down the VM in VMM 2008 R2.
  • I used a network migration to move the VM from the old cluster to the CSV in the new cluster.
  • The job exported the VM configuration, used BITS to transfer the files and imported the VM configuration.  It wrapped up the job by updating the IC’s and starting the VM.
  • I logged into the VM and tested everything.  All was good.

I then did some more testing to complete things:

  • I RDP’d into the VM and fired up a ping –t to the default gateway.  I started using IE to surf the net in the VM.
  • I initiated a live migration from one host to another.
  • I put a host into maintenance mode to move the VM (and another) back to the original host.
  • I re-ran live migration.

Ping stayed up and running the entire time.  RDP never timed out.  I never saw an issue while surfing the net using IE.  That’s a 100% pass on the tests.  I think I’m feeling good about pushing this into production.  I think I’ll deploy the OpsMgr agents first and then do some more tests.
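If you would rather not watch the ping window during a live migration, you could capture the output to a file and count the drops afterwards.  Here is a small sketch in Python, assuming the English output format of the Windows ping command; the sample lines and the gateway address are invented.

```python
def summarise_ping(lines):
    """Count successful and lost echo requests in captured `ping -t` output.

    A success is a "Reply from" line that carries a TTL.  A "Reply from" line
    without a TTL (for example "Destination host unreachable") or a
    "Request timed out" line counts as lost.
    """
    attempts = [l for l in lines if l.startswith(("Reply from", "Request timed out"))]
    ok = sum(1 for l in attempts if l.startswith("Reply from") and "TTL=" in l)
    return {"sent": len(attempts), "ok": ok, "lost": len(attempts) - ok}


captured = [
    "Pinging 10.0.0.1 with 32 bytes of data:",
    "Reply from 10.0.0.1: bytes=32 time<1ms TTL=255",
    "Reply from 10.0.0.1: bytes=32 time=1ms TTL=255",
    "Request timed out.",
    "Reply from 10.0.0.1: bytes=32 time<1ms TTL=255",
]

print(summarise_ping(captured))  # {'sent': 4, 'ok': 3, 'lost': 1}
```

A clean live migration like the ones in these tests would show zero under "lost".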

Virtualisation Memory Over Commitment

Working in the server hosting business I’m used to “VPS” terms like over commit, burstable, etc.  What they mean is that although your virtual machine is granted 4GB RAM (for example) it only ever is given whatever it is using.  The idea is that the server hoster might have 29GB RAM available for VM’s but could possibly sell 40GB on that host machine.  You could see how this would be attractive to anyone.  Let’s face it, we tend to spec servers based on peak requirements, not average ones.  A web server might have 2GB RAM but it probably only uses 1GB of that 95% of the time.  Wouldn’t this be appealing in testing labs, development farms and enterprise virtualisation deployments?  But what happens if the VM with 4GB of RAM can’t burst to 4GB when it needs it?  What if either too many VM’s are bursting at once or what if the hosting company abuses over commitment?  The best case scenario is that the host machine starts to page like crazy.  The worst case scenario is that VM’s start to blue screen when the RAM they believe to be available cannot be accessed.  At work, our virtualisation solution (Hyper-V) doesn’t have this and even if it did, I’d be very conservative about using it.

That’s why I read this article with interest.  Let me preface this by saying that I’ve found this blogger, in my opinion (i.e. not fact), to have a slanted viewpoint.

The blogger talks about the Burton Group and how they compare/measure virtualisation solutions for the enterprise.  They have 27 requirements and a number of preferred standards.  Yes, they measure VMware above Hyper-V.  Fair enough.  I’d agree that VMware have been in this market longer and have a more mature solution.  It might not be the right solution for me right now, but it has been around longer and has had more time to develop.  VMware do have more features.  For example, VMware has memory over commitment of sorts.  Hyper-V does not.  MS did try to add it into W2008 R2 but had to pull it very late (pre beta) for whatever reason.  I suspect they didn’t feel they had time to get it perfect before the release date.  Instead of releasing a nearly perfect solution they waited to ensure something critical like this would be right.

One of the really cool things VMware does is their power management by putting idle hosts to sleep after using VMotion.  It’s like Core Parking across host servers.

The blogger says that one of the preferred features, Memory Over Commitment, should be a requirement.  Oh really?  Let’s just analyse this for a second.  Would it save companies money?  Absolutely.  With server costs exploding in the last 12 months the less we have to buy of them, the better.  Is memory over commitment supported in production?  Oh – no it isn’t, at least not by VMware.  I guess that puts a dampener on that.

Would I like to see memory over commitment supported in production?  Yes.  I’d love it.  But it isn’t right now so I guess it shouldn’t be a requirement for any measure of virtualisation suitability for the enterprise.
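The risk I described at the start of this post comes down to simple arithmetic.  Here is a sketch in Python using the 29GB host that has sold 40GB.  The split into ten 4GB VM’s and the demand figures are invented to show the two cases; this is not how any hypervisor actually schedules memory.

```python
def overcommit_ratio(sold_gb, physical_gb):
    """How much RAM has been promised relative to what physically exists."""
    return sold_gb / physical_gb


def shortfall_gb(demand_gb_per_vm, physical_gb):
    """RAM that cannot be satisfied if every VM asks for its demand at once."""
    return max(0, sum(demand_gb_per_vm) - physical_gb)


physical = 29           # GB available for VM's on the host
granted = [4] * 10      # hypothetical: ten VM's sold 4GB each, 40GB in total

print(round(overcommit_ratio(sum(granted), physical), 2))  # 1.38

quiet_night = [2] * 10            # every VM idling at half its grant
busy_day = [4] * 8 + [2] * 2      # eight VM's bursting at the same time

print(shortfall_gb(quiet_night, physical))  # 0
print(shortfall_gb(busy_day, physical))     # 7
```

That 7GB shortfall is where the paging, or worse, starts.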

Live Migration Up and Running

I’ve added a second node to our Hyper-V cluster.  The servers are HP BL460 G5 blades.  The setup was simple:

  • Install Windows Server 2008 R2
  • Install HP’s MPIO 4.0
  • Install the HP PSP 8.30
  • Set up the NIC’s
  • Set up the computer name and computer domain membership
  • Enable Hyper-V role
  • Install the 2 fixes I’ve blogged about before for W2008 R2 and Hyper-V
  • Enable Failover Clustering feature
  • Set up/add to the cluster
  • Add the cluster to VMM 2008 R2
  • Configure the virtual networks for the hosts in VMM on one node – which replicates to the other nodes in the cluster via a job

I deployed a test VM to the cluster and ensured the IC’s were up to date.  I set up the IP configuration of the VM for the VLAN that it was located in.  I then set up a continuous ping from the VM to its default gateway (a Cisco ASA firewall cluster) and initiated a live migration.  As expected, the console window terminated as the VM left node 1 and moved to node 2.  Problem!  My ping failed.

Not with Live Migration, though.  It worked perfectly.  When I set up the virtual networks on node 1 in VMM, VMM set them up as Internal networks on the other node.  Doh!  I changed the virtual networks to External and reran the tests.  Perfect!  I set a node into maintenance mode – the VM live migrated.  Not a single ping was dropped.  Perfect!
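A mismatch like that is easy to miss by eye.  Here is a sketch in Python of a comparison you could run over the virtual network settings of each node before testing, assuming you have already collected them into a dictionary by hand or by script.  The node and network names are made up and nothing here calls VMM or Hyper-V.

```python
def mismatched_networks(config_by_node):
    """Return virtual networks whose type differs between cluster nodes.

    config_by_node maps node name -> {virtual network name: network type}.
    Networks missing from a node entirely are not flagged by this sketch.
    """
    seen = {}
    for node, networks in config_by_node.items():
        for name, kind in networks.items():
            seen.setdefault(name, {})[node] = kind
    return {
        name: per_node
        for name, per_node in seen.items()
        if len(set(per_node.values())) > 1
    }


cluster = {
    "NODE1": {"VLAN101": "External", "VLAN102": "External"},
    "NODE2": {"VLAN101": "Internal", "VLAN102": "External"},
}

print(mismatched_networks(cluster))
# {'VLAN101': {'NODE1': 'External', 'NODE2': 'Internal'}}
```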

First W2008 R2 Hyper-V Cluster Operational

OK … it is a single node cluster 🙂  But it is running!  Live Migration is great and all but to be honest, the 2 things I want out of Windows Server 2008 R2 Hyper-V are Core Parking (to reduce our power bill) and Cluster Shared Volume.  I really, really hated having to do per-LUN deployment of VM’s on the cluster.  They stressed me out when it came to alterations or deletions.  Luckily, I’d settled on a consistent naming standard for every component in the W2008 cluster.  But still, one oversight and bang – a production VM goes off the air.  With CSV, you deploy your storage once and add to it as required later.  Love that!

Setting up CSV was easy.  I set up a LUN in the SAN management console.  I linked this to the cluster node(s).  I initialised it and brought it online with the GPT disk partition system.  This is optimised for LUN’s over 2TB in size.  Our CSV will keep on growing so 2TB will be nothing.  I did a quick format and labelled the disk as CSV1.  I did not add a letter to the drive because there was no point.

Next I added the storage to the cluster.  I renamed it as CSV1.  I enabled CSV in the cluster (select the cluster, centre pane, it’s a hyperlink in there).  The MMC refreshed and now I had a Cluster Shared Volume item in the navigation pane on the left.  I selected this and added storage: I selected the disk I’d just added to the cluster.  Badda bing, a CSV was created! 

The disk is now mounted as C:\ClusterStorage\Volume1.  Additional CSV’s would be Volume2, Volume3, etc.

Now, I can add VM’s into the CSV.  Note that any VM that was on the disk before being converted to CSV will be “corrupted”, i.e. their storage location will have changed so Hyper-V no longer knows where they are.  Make sure there are no VM’s created on the disk before you convert it to a CSV.

I’ve also added 2 patches for W2008 R2 that I’ve blogged about recently.  1 is related to Nehalem processors and the other is related to power management, i.e. Core Parking.

That’s it!  Next I need to build node 2 and add it to the cluster.  Then I get to try out Live Migration!

VMM 2008 R2 Cannot Manage A Single Node Hyper-V Cluster

I have an update on this post with a workaround from Microsoft PSS.

How do you migrate from a Windows Server 2008 Hyper-V cluster to Windows Server 2008 R2?  The process is that you build a new cluster and migrate the VM’s over.  If you have a tight budget you will be evicting a cluster node from the W2008 cluster, rebuilding it with W2008 R2 and then setting up a new cluster.  OK, not perfect, but at least you get a clean new cluster. 

You then migrate the VM’s over from the old cluster to the new one.  Because you do not have W2008 R2 on the old cluster you cannot use Storage Quick Migration.  This means shutting down each VM in a maintenance window, exporting it and importing it in the new cluster.  That’s quite manual.  If you have VMM 2008 R2 you could use a cold migration.  Here, you shut down the VM and use VMM to migrate the files.  It does all the export/import and does the file transfer using BITS. 

As you clear out the VM’s from each W2008 node, you evict it from the old cluster, rebuild it with W2008 R2 and add it to the new cluster.

Problem!  What if you can only free up one machine for the new W2008 R2 cluster?  OK, you can build up a one node cluster.  Windows Server has no issue with that.  Neither does Hyper-V.  Obviously you have no server fault tolerance until you add a second node.  But you’ll do that once you free up a host in the old cluster.

Unfortunately though, VMM 2008 R2 does have a problem with one node clusters.  I’ve set one up and this is what happened when I added the cluster to the console.  The node cannot be refreshed and cannot be used by VMM:

“Warning (13926)
Host cluster <cluster FQDN> was not fully refreshed because not all of the nodes could be contacted. Highly available storage and virtual network information reported for this cluster might be inaccurate. 

Recommended Action
Ensure that all the nodes are online and do not have Not Responding status in Virtual Machine Manager. Then refresh the host cluster again.”

I’m not the only person to experience this.  Another virtual machine MVP has posted in Connect (I added a note) discussing the issue.  It does appear to be a logic bug in VMM 2008 R2, preventing us from using VMM 2008 R2 as part of the initial migration.  It looks like we’ll have to use the Hyper-V console until we can free up a second node from the old W2008 cluster and add it into the new cluster.  Of course, you then face a scenario where VMM cannot manage the last remaining node in the W2008 cluster and you’ll have to use the Hyper-V console to manually move the VM’s to the W2008 R2 cluster.

Ouch.  This is why MS should give me €30K worth of hardware and somewhere to host it 🙂  I found a similarly annoying logic bug in VMM 2008 which I got a fix written for (released as part of a rollup back around March/April 2009).

Unless we get a fix then this appears to be the scenario:

  • You don’t have unlimited h/w budget:  You will have a single node W2008 R2 cluster at the start of the migration and a single node W2008 cluster at the end of the migration.  You will need to use the Hyper-V console to manually migrate VM’s while you have single node clusters.
  • You have unlimited budget and can justify having 2 more host servers at the end of the project than you did at the start: Buy 2 new W2008 R2 host servers and set up your new cluster.  You can use VMM 2008 R2 to cold migrate the VM’s from the W2008 cluster to the W2008 R2 cluster.  At the end you will have 2 vacant W2008 cluster hosts that you will have to find a new use for.

This is a pity.  I hope MS fixes it.  It’s a shame to deprive people of the power of VMM and its PowerShell module during these critical stages of a Hyper-V W2008-W2008 R2 migration.

EDIT:

I put out a shout to my fellow MVP’s and got a response pretty quick.  One of them says he’s managing a single node cluster with no issues.  He accomplished this by editing the properties of the cluster in VMM and setting the “Cluster Reserve (Nodes)” to 0.  The effect of this is that you tell VMM that you want zero redundant nodes in the cluster.  It is set to one by default, giving you an N+1 cluster with 1 node for fault tolerance.

I did this and had no joy with the W2008 R2 cluster.  I ended up migrating a node into it later today and re-adding the cluster.  It’s working perfectly.  The setting does appear to work for a single node W2008 cluster that we have up.
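For anyone unsure what the “Cluster Reserve (Nodes)” number means, the N+1 arithmetic is sketched below in Python.  This only illustrates the idea of holding nodes back for failover.  It is not how VMM itself evaluates the setting, and it does not explain the refresh warning above.

```python
def usable_nodes(total_nodes, reserve_nodes):
    """Nodes' worth of capacity left for VM's after holding back the reserve."""
    return max(0, total_nodes - reserve_nodes)


print(usable_nodes(2, 1))  # 1: a two node cluster with the default reserve
print(usable_nodes(1, 1))  # 0: a single node with the default reserve
print(usable_nodes(1, 0))  # 1: a single node with the reserve set to zero
```

With the default reserve of one, a single node cluster has nothing left over for VM’s on paper, which is why dropping the reserve to zero is the obvious thing to try.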