Hyper-V Clusters – There Are Only 26 Letters In the Alphabet

If you’ve looked at putting Hyper-V in a cluster you might have read Jose Barreto’s blog post on clustering options, viewed Dave Northey’s videos demonstrating it in action or considered trying to recreate what ESX with Virtual Center does.  You’ll soon see that to have failover or mobility on a per-VM basis with Hyper-V on Windows Server 2008, each VM must reside on its own disk/LUN on your shared storage.  Windows Server 2008 doesn’t (yet) have a shared cluster file system like ESX’s VMFS.

You’ll now think … I can have 16 nodes in a cluster and potentially dozens of VMs in my N+1 or N+2 architecture.  Wait … how many drive letters am I going to need?  I’ve already consumed A, B, C and D … does this mean a cluster can have only 22 VMs?  This is the sort of thing some certain-product-fanatic gets to write blog FUD about without digging just a little deeper.  It’s amazing to see how prejudice is tainting the commentary and reviews that are out there right now 🙂

You have the option to use "letterless" drives in Windows Server 2008.  Instead of using a drive letter to identify the physical drive that each VM can reside on, you can use a GUID to identify the drives. 

The only question now is, how do you use these drives?  VirtuallyAware has done a post on the subject.  The hardest part of the process is getting the GUID of the LUN that you’re working with.  Who really wants to type out something nasty like "fc247e42-0a5e-11dd-94db-001b785788b0"?  PowerShell helps there, as the blog post indicates.
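For a sense of what those GUIDs turn into, Windows addresses a letterless volume via a path of the form \\?\Volume{GUID}\, which is the format tools like mountvol work with.  A minimal Python sketch of that string assembly (the function name is mine, and the GUID is just the example from above, not a real volume):

```python
def volume_guid_path(guid: str) -> str:
    """Return the GUID-based volume path that Windows tools such as
    mountvol expect when addressing a letterless volume."""
    # Literal form: \\?\Volume{<guid>}\  (backslashes doubled in source)
    return "\\\\?\\Volume{" + guid + "}\\"

# The example GUID from the post:
path = volume_guid_path("fc247e42-0a5e-11dd-94db-001b785788b0")
```

Nobody wants to type that by hand, which is exactly why scripting the lookup (as the VirtuallyAware post does with PowerShell) beats transcribing GUIDs from a console.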

You’ll now have a virtually unlimited set of drive identifiers that will allow your cluster to scale out to the limitations of your CPU, storage and RAM.

On a tangent, this is just another example of why PowerShell is a necessary skill, not just here but across all new MS technologies.  I’ve started learning it.  It’s different, that’s for sure, but it’s not optional any longer.

Auditing Your Data Centre

I have a strong dislike for auditing.  It’s a time consuming process.  But you know, if you use the right systems management tools it doesn’t need to be.  Microsoft’s Optimised Infrastructure model and Dynamic Systems Initiative preach automation and expertise built into the network.  The latest generation of System Centre allows for this.  Microsoft released a short white paper that looks at data centre auditing.  It’s not something I’d really considered until the last few months.

Network and some *NIX administrators have long used SYSLOG tools.  The idea is that all events are forwarded to a central store.  It gives a synchronised view of what is happening across a multitude of devices.  It allows for diagnostics.  But from an auditor’s point of view, it gives an audit trail of who did what and when.  You can get this sort of functionality going with Windows as well.  I’m not a network or *NIX admin but I’m guessing their security logs are not that different to those on a Windows box, i.e. lots of noise requiring significant time to filter through to figure out what was really going on.

System Centre Operations Manager (SCOM or OpsMgr) 2007 includes Audit Collection Services.  I first heard of ACS at TechEd Europe in Amsterdam in 2004.  It was going to be a standalone tool but after a lengthy delay it finally saw the light as a part of OpsMgr.  You can turn on ACS on your OpsMgr agents to enable centralised security logging for Windows platforms.  What makes it different to SYSLOG is that Microsoft’s developers have identified the important events that illustrate what is going on and they only forward those events to the ACS database.  The ACS database is separate to the rest of the OpsMgr databases so you can permission it differently, i.e. only your auditors or security staff would have access to it if required.

I don’t know if the new Cross Platform Extensions for OpsMgr will allow for ACS on Linux platforms.  I suspect that they won’t.  Anyway, you’re going to still need SYSLOG for your network devices.  From what I’m seeing recently, network monitoring tools (which are often freeware) seem to run and be supported best when running on Linux.  Yes, you read that on my blog … something running best on Linux.  I am open to non-MS products!

That’s great for monitoring your security activities, but that’s only half of the story.  You need to build a secure and regulatory-compliant infrastructure and maintain that integrity.  I knew a security consultant in Germany who spent a huge amount of time building an automated auditing tool set that dumped data into a central store and allowed for reporting.  It covered all sorts of platforms.  It was a really great idea.  But this guy was an alpha geek.  Owning and running that toolset required his level of ability, I’m guessing.

System Center Configuration Manager (SCCM or ConfigMgr) 2007 features Desired Configuration Management (DCM).  DCM allows you to use either a set of pre-built or custom-made templates to audit your Microsoft network on a recurring and automated basis.  That means there’s no more logging into each box to check out its configuration.  Everything is automated.  You’re also building that expertise into the network by using templates.  Heck, Microsoft even gives away a set of DCM packs for the products to cover regulations like SOX, FISMA, EUDPD, HIPAA and more!  Now you can just tell your auditors to run a report to see the configuration health of your network.  No more wasted admin or auditor time or complexity, e.g. delegated admin rights on servers and applications.  The DCM tool is easy enough to get your head around in order to build your own templates for auditing 3rd party or internal applications. 

If you’re in a regulated market, e.g. finance, health, pharmaceuticals, etc, then you’re probably required to have these sorts of controls.  If you’re using System Centre then it makes sense to look into and enable these functions to make your job easier.  Sure, you may require another server and some storage but when you compare time savings vs. capital costs, there’s really only one logical way forward: build that expertise into the network and leverage the available automation.

IBM Support Sucks Too

We have a support contract at work for our IBM servers and storage.  The contract defines it as 24*7 with 4 hours response time.  I logged a call 24 hours ago for a failed disk.  24 hours later I get a phone call from "Droopy" who can’t get me an engineer.  What?  Breach of contract (by 20 hours) is what IBM offers as an enterprise service.  I asked to speak to his manager.  "He’s busy".  OK, I’ll speak to his manager’s manager.  "He’s busy too".  Friggin muppets.  Imagine how much worse it’ll be when IBM hands over their server and storage brands to Lenovo?

Anyone looking at IBM hardware – forget it.  Do yourself a favour and talk to Dell or HP. 

Beware Anti-Virus and Hyper-V

I released the July updates onto our network this past weekend.  I’d also deployed our new AV the previous week.  Let’s just say that AV mixed with Hyper-V and followed by a reboot made for a nice mess.

I logged into the Hyper-V lab this morning to find half of my VMs were missing.  They’re sitting fine (but idle) on the storage.  It’s just that Hyper-V has "forgotten" that they ever existed.

I trawled through the Windows Event logs (Application and Service logs – Microsoft – Windows – Hyper-V-Config – Admin) and found a series of these:

Source: Hyper-V-Config

Event ID: 4096

Level: Error

The Virtual Machines configuration <big long GUID> at <path to VM> is no longer accessible: The requested operation cannot be performed on a file with a user-mapped section open. (0x800704C8)

OK.  A bit of googling found an entry on the TechNet forums that says you need to disable scanning for the VHDs and the XML files of your VMs.  Ouch!

OK, so I did that and rebooted my lab server.  Still no dice.  Actually, Hyper-V doesn’t even bother attempting to load these VMs now.  OK, I’ll do what I would in any other virtualisation product; I’ll open them.  Ick … no open command.  Import?  Nope; because MS in their wisdom (!) decided that the import/export format should be different to that of a normal VM. 

So I’ve got a plethora of VMs that are sitting on my disk in a saved state that I cannot load up.  My only way forward is to re-add the virtual hard disks as new VMs.  This is a pain:

  • I lose my saved states.
  • I have to reconfigure every single VM that is missing.
  • Each VM has to do the PNP dance with a "new" NIC and I have to reconfigure IPv4 addressing.
  • It’s just lots of work I shouldn’t have to do.

I’ve logged a bug report with MS.  I’m open to any constructive suggestions.

Why I Dislike IBM Director

I inherited a number of IBM servers with this job.  They perform a critical business service for our customers.  Luckily, the architecture we use is very fault tolerant.

Over the weekend we deployed updates in a staged manner to our production network – after testing of course.  On Sunday morning, I woke up to an email from System Center Operations Manager 2007 (gotta love it!) saying that one of the servers we patched on Saturday night was not responding to agent heartbeat requests.  Uh oh!  This was one of those IBM boxes.  We have triplicate redundancy so I knew I could let it wait until Monday morning.  To be safe, I suspended updates for the remaining production boxes.  I didn’t suspect an update but I wasn’t taking any chances.

I came into the data centre this morning and found the server sitting on a BIOS prompt.  Hmm.  That’s not good.  It had detected a problem with the external disk storage and was waiting for administrator approval to boot up.  What?  Hello?  Note: the failure was nothing to do with the server-internal boot disks.

I checked the Direct Attached Storage (DAS) and it was all green.  I booted up the server and saw the DAS was not being connected.  I shut down the server and powered down the DAS.  I powered up the DAS and was greeted with beeping … non-stop beeping.  The front panel now showed a chassis alert on the DAS and one of the disks in the RAID5 array was alerting as well.  Huh!?!  Why didn’t it tell me this when the server already knew there was a problem?

I powered up the server.  Now it didn’t prompt me.  But it did tell me the external disk was degraded.  Fine, the hardware knows there’s a problem.

I logged in and found there were no hardware logs or any sort of interface into the IBM Director agent.  Nothing.  Sweet F.A.  The consultants (before my time) who installed the hardware had set up an IBM Director console on another box for centralised monitoring.  I logged into it and sure enough, there were no alerts.  Hold on a *beep*ing minute; the hardware knows there’s a problem but the monitoring agent from the hardware vendor doesn’t have a clue?

OK, maybe it was the central console at fault?  I’ve never trusted it.  I went on to the SCOM console but found no alerts or health degradation on the IBM Director monitors.  That made it certain in my mind, the IBM Director agent was clueless.

So here’s my summary of why I’d recommend people steer clear of IBM hardware in an enterprise deployment, based on this little story:

  1. The DAS failed to show an alert on the front panel or disk despite the server not being able to boot up because it detected a failure.
  2. The IBM Director agent failed to report an incident of any kind.
  3. There’s no user interface to the IBM director agent on the server.
  4. A failure of a single disk in a RAID5 array in a DAS caused a server not to boot up.  That’s just stupid.
  5. We’ve all heard that Lenovo are taking over the server and storage business.  My experience of their support was awful – a call open for around 4 months, 2 months of that with the regional director taking a personal interest.

I’m now left wondering how long I’ve had a failed disk on this server considering it didn’t give any monitoring alert or visible notification until I reset the DAS chassis.

How would HP handle this?

  1. The SIM agent would have alerted on this and shown it in the HP SIM log and in the SIM web page on the server.
  2. The HP SCOM management pack for SIM would have alerted and sent all of the required/responsible administrators/operators/"business owners" a notification of the failure.
  3. The disk would have shown an alert light immediately.
  4. It’s unlikely that the server would have been prevented from booting up unless there was a complete failure of the boot disk.
  5. I would have had the storage back to a healthy state within 4 hours of opening a call with HP.

That’s a very different experience and one you expect to have from enterprise class servers and storage.

EDIT

As you can guess, I was concerned by the lack of h/w monitoring that the IBM Director agent gave me.  The horrid response from the MD was that we’d have to manually check that the logical disks in question were present on a daily basis.  Yuk!  I’d a better idea: let SCOM do the work for me.  I’ve created a distributed application that models all of the dependencies I can think of for this service, including the presence and health of the logical disk in question.

It was funny to see that the HP management pack allowed me to include discovered HP hardware objects but there were no classes for IBM hardware.  Come on IBM; you gotta play better with others!  Not everyone wants to buy consultancy-ware like Tivoli.

This Sucks: CoreConfigurator Is Discontinued

The author of CoreConfigurator has had to pull the plug on the project.  Like many of us, he had that awful clause in his employment contract that gives the employer ownership rights over all intellectual property he created while employed by that firm, even if he did it on his own time and at home. 

That one brought up some interesting discussions when I took my current job because some of the things I’ve been writing are already the property of a publisher.  My employer was able to confirm that legally the contract could not extend to my independent work at home.

The author, Guy Teverovsky, has had to hand over ownership to his now former employers.  What they’ll do with the code is uncertain.  This sucks because it was a great little tool for those new to the ways of command prompt.

But, not that I condone piracy in any way, you will find this tool out and about in the wilds of the Internet if you Google hard enough.

Hyper-V RAM Calculator

This download has been superseded by a newer Hyper-V Calculator spreadsheet.

I’ve previously discussed how RAM is used by Hyper-V in terms of:

  • The parent partition
  • Hyper-V services
  • Drivers
  • Guest RAM allocation overhead.

I’ve put together an Excel spreadsheet that calculates how much RAM is consumed by a VM as you load it onto a host.  Using it is easy:

  1. Specify how much RAM is in the physical host machine.
  2. Add each guest VM and enter how much RAM (in GB) you want to allocate to the guest.
  3. The RAM utilised by the guest is calculated and the amount remaining on the host is presented.

The numbers you need to enter are highlighted in yellow.

The formula used assumes maximum RAM overhead, i.e. the worst case scenario of 32MB for the first GB and 8MB for each GB after that on a per VM basis.  I’m also allowing 300MB in addition to the 2GB recommended as the reserve for the parent partition.  Often, this can be considered a part of the 2GB.  You can recalculate things by adding in another line item to specify driver requirements for the parent OS if you want.
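The worst-case formula is simple enough to sanity-check without the spreadsheet.  Here’s a minimal Python sketch of the same calculation (function names are mine; the 32MB/8MB overhead figures and the 2GB + 300MB parent reserve are as described above):

```python
def vm_overhead_mb(vm_ram_gb: int) -> int:
    """Worst-case Hyper-V RAM overhead for one VM: 32 MB for the
    first GB of guest RAM, plus 8 MB for each GB after that."""
    return 32 + 8 * (vm_ram_gb - 1)

def remaining_host_ram_mb(host_ram_gb, vm_ram_gbs, parent_reserve_mb=2048 + 300):
    """RAM left on the host after the parent partition reserve and
    each guest's allocation plus its per-VM overhead."""
    consumed = parent_reserve_mb + sum(
        gb * 1024 + vm_overhead_mb(gb) for gb in vm_ram_gbs
    )
    return host_ram_gb * 1024 - consumed

# A 32 GB host running two 4 GB guests: each guest consumes
# 4096 + 56 = 4152 MB, the parent reserves 2348 MB.
left = remaining_host_ram_mb(32, [4, 4])  # 22116 MB remaining
```

As with the spreadsheet, you could add an extra line item for driver requirements in the parent OS by bumping the reserve figure.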

EDIT:

I’ve done some testing on hosts with 32GB RAM and the theory seems to match the practice.

Hyper-V Controllers: IDE or SCSI?

There have been plenty of blog posts out there saying that there is no support for SCSI in Hyper-V.  That’s not true.  What is true is this: you can use SCSI controllers for disks, but not for your boot disk.  Your boot disk must be on an IDE controller.

When using emulated storage controllers, i.e. no integration components, IDE is slower than SCSI.  However, there is no discernible difference between SCSI and IDE when using synthetic drivers, i.e. integration components or VM additions.

Setting Up VMs

How do you set up your VMs?  You have no choice about your boot disk.  You must use a disk connected to the IDE controller.  You can’t move that to the SCSI controller because you cannot boot from a Hyper-V SCSI controller.  Lightweight VMs can probably put everything on one virtual disk and run on the IDE controller.

However, best practice is to separate your data/workload from your operating system.  Consider a virtual application server where the operating system is on C: and the workload is on D:.  C: will be a virtual disk on the IDE controller.  D: should be a virtual disk on a SCSI controller if you don’t have integration components.  This makes the most of the underlying Hyper-V architecture and optimises CPU utilisation on the host server.  However, if you have integration components then it makes no difference whether you use SCSI or IDE for the workload disk.

What really makes a difference is the underlying physical storage and the types of VHD that you use.  Passthrough disks run at physical speed.  Fixed-size VHDs currently get to within 6% of the speed of the underlying physical LUN, assuming you have 1 VHD per LUN.  Dynamic and differencing VHDs have a significant negative impact on performance.