Imagine a scenario:
- You have a cluster of Hyper-V hosts
- Some operator pulls the wrong network cables
- A host becomes network-isolated and the cluster heartbeat times out before the mistake is noticed
- Virtual machines fail over
Great, right? HA kicked in? That’s good … right!?!?!
Ummm maybe not. Let me ask you a question. Which is worse:
- Option A: A virtual machine is offline for a minute or so because its host is network-isolated? OR …
- Option B: Every virtual machine on that host stops executing, fails over to other hosts in the cluster, and takes several minutes to boot and get its services responding on the network?
For most people, option A is preferable, and this is why Microsoft is giving us Cluster Compute Resiliency.
With this new feature, a cluster becomes more tolerant of transient network errors, and the degree of tolerance is configurable. In the event of a heartbeat timeout, the host goes into isolation. The VMs on that host continue executing, and the cluster stops placing additional VMs onto that host. If the host becomes responsive again within the configured time frame, it comes out of isolation. If it does not, its VMs are failed over to other hosts.
Note that if a host is determined to be “flapping” (repeatedly dropping in and out of isolation), it will be put into Cluster Quarantine.
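If it helps to see the behaviour as a state machine, here is a rough Python sketch of the logic described above. It is a toy model and not how the cluster service is implemented: the class name, the method names, and the three numeric defaults are all made up for illustration, and "flapping" is simplified to "too many isolations inside a time window".

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    UP = "up"
    ISOLATED = "isolated"        # VMs keep running, no new VMs are placed
    DOWN = "down"                # VMs have been failed over to other hosts
    QUARANTINED = "quarantined"  # host was flapping


@dataclass
class NodeModel:
    # All three defaults are illustrative placeholders, not product defaults.
    resiliency_period: float = 240.0  # seconds to wait before failing VMs over
    quarantine_threshold: int = 3     # isolations in the window that count as flapping
    flap_window: float = 3600.0       # seconds over which isolations are counted
    state: NodeState = NodeState.UP
    isolated_at: float = 0.0
    isolation_times: list = field(default_factory=list)

    def heartbeat_lost(self, now: float) -> NodeState:
        """Heartbeat timed out: isolate the host, or quarantine it if it is flapping."""
        if self.state is not NodeState.UP:
            return self.state
        # Forget isolations that fall outside the flapping window.
        self.isolation_times = [
            t for t in self.isolation_times if now - t < self.flap_window
        ]
        self.isolation_times.append(now)
        if len(self.isolation_times) >= self.quarantine_threshold:
            self.state = NodeState.QUARANTINED
        else:
            self.state = NodeState.ISOLATED
            self.isolated_at = now
        return self.state

    def tick(self, now: float) -> NodeState:
        """Time passes: an isolated host that stays silent too long loses its VMs."""
        if (self.state is NodeState.ISOLATED
                and now - self.isolated_at > self.resiliency_period):
            self.state = NodeState.DOWN
        return self.state

    def heartbeat_restored(self, now: float) -> NodeState:
        """Host is responsive again: leave isolation if the period has not expired."""
        self.tick(now)
        if self.state is NodeState.ISOLATED:
            self.state = NodeState.UP
        return self.state


# The cable gets plugged back in after a minute: the VMs never moved.
node = NodeModel()
node.heartbeat_lost(now=0)
print(node.heartbeat_restored(now=60))

# Nobody notices the mistake in time: the VMs are failed over.
node = NodeModel()
node.heartbeat_lost(now=0)
print(node.tick(now=300))
```

The point of the model is the ordering: isolation comes first and failover only happens once the waiting period runs out, which is exactly the trade-off between option A and option B.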