We had a hardware issue recently that required servicing one of our Hyper-V cluster hosts. The HP engineer originally believed there was a problem with the server’s motherboard, so this needed to be replaced. So it was. This didn’t fix the problem (it turned out to be something else, but that’s a tangent).
When the server was fixed we brought it back online. Because we use diskless blades and HP Virtual Connect, the OS and the data are stored on our fibre channel SAN, so nothing was damaged. The server started up and I tested VM failover using VMM. I was getting this failure:
"Error (2915)
The WS-Management service cannot process the request. Object not found on the myserver.domain.internal server.
(Unknown error (0x80338000))
Recommended Action
Ensure that the agent is installed and running. If the error persists, reboot myserver.domain.internal and then try the operation again."
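The recommended action points at WinRM (WS-Management) and the VMM agent on the host, so before digging into the cluster it is worth ruling out basic WS-Management connectivity. Here is a minimal sketch using Python and the third-party pywinrm package; the host name and credentials are placeholders for your own:

```python
# Quick WS-Management reachability check using the third-party "pywinrm" package.
# The host name and credentials below are placeholders - substitute your own.
import winrm

HOST = "myserver.domain.internal"

try:
    session = winrm.Session(HOST, auth=("DOMAIN\\someadmin", "password"))
    result = session.run_cmd("hostname")  # any trivial command run over WinRM
    print("WS-Management responded:", result.std_out.decode().strip())
except Exception as exc:
    print("WS-Management request failed:", exc)
```

If that check succeeds, the error is less likely to be a simple connectivity problem between VMM and the agent, so the next stop is the cluster itself.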
To debug, I went into the Failover Cluster MMC, expanded the service (VM) I was testing, and opened its properties. I failed it over. I could see the state being saved and the disk going offline; however, the VM configuration resource would fail and the cluster would then move the VM back to a working host. I tried a few more VMs and then I had some luck: a VM stayed on the "bad" host after failing.
I went into the Hyper-V MMC and tried to start the VM there. I was told that:
"The virtual machine could not start because the hypervisor is not running".
Ah! The penny had dropped. There are two BIOS requirements for Hyper-V to work: CPU virtualisation assistance and DEP must both be enabled. On an HP ProLiant they are under something like Advanced – Processor Options. My motherboard had been replaced and flashed with a firmware upgrade, so the old settings had gone with the original board. I enabled the settings (requiring a reboot) and fired up the box.
All was now well. I could move VMs with no issue.
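As an aside, you can verify both of those settings from inside Windows rather than rebooting into the BIOS. This is a rough sketch using Python and the third-party wmi package, run on the host itself; the WMI properties it reads are not populated on every hardware and OS combination, so treat missing values as "check the BIOS anyway":

```python
# Query the BIOS-dependent Hyper-V prerequisites via WMI.
# Requires the third-party "wmi" package (which in turn needs pywin32); run on the host.
import wmi

c = wmi.WMI()

# Hardware virtualisation assistance (Intel VT / AMD-V) as reported by the firmware.
for cpu in c.Win32_Processor():
    print(cpu.Name)
    try:
        # Only present on newer Windows builds; absent on older ones.
        print("  VirtualizationFirmwareEnabled:", cpu.VirtualizationFirmwareEnabled)
    except Exception:
        print("  VirtualizationFirmwareEnabled: not reported by this OS/firmware")

# Data Execution Prevention as seen by the operating system.
for os_info in c.Win32_OperatingSystem():
    print("DEP available:", os_info.DataExecutionPrevention_Available)
    print("DEP support policy:", os_info.DataExecutionPrevention_SupportPolicy)
```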
BTW, if you will have a node offline in a cluster for some time then you need to be aware of your cluster quorum settings. Clusters with an even number of nodes should use a witness (quorum) disk to get a majority, while clusters with an odd number of nodes should run with node majority alone. I keep a quorum disk in place just in case a node is going to be added to or removed from the cluster. In our scenario, we had N+2 fault tolerance so I could afford to safely remove this node and alter the quorum settings without losing fault tolerance.
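If the quorum maths ever seems abstract, it helps to remember that the cluster stays up for as long as a majority of votes is online, where each node is one vote and a witness (quorum) disk, if configured, is one more. A toy calculation in Python (nothing vendor-specific, just the arithmetic) shows why the witness only buys you anything on even node counts:

```python
# Toy quorum arithmetic: the cluster runs while a majority of votes is online.
# Each node contributes one vote; a witness (quorum) disk adds one more if configured.

def node_failures_tolerated(nodes: int, witness: bool) -> int:
    votes = nodes + (1 if witness else 0)
    majority = votes // 2 + 1   # votes needed to keep the cluster running
    return votes - majority     # node losses survivable (witness assumed online)

for n in (2, 3, 4, 5):
    print(f"{n} nodes: no witness -> {node_failures_tolerated(n, False)}, "
          f"with witness -> {node_failures_tolerated(n, True)} node failures tolerated")
```

Run it and you can see that adding a witness to a 4-node cluster takes you from tolerating one node failure to two, while adding one to a 3- or 5-node cluster changes nothing.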