Most of us have dealt with some piece of infrastructure that is flapping, be it a switch port that’s causing issues, or a driver that’s causing a server to bug-check. These are disruptive issues. Cluster Compute Resiliency is a feature that prevents unwanted failovers when a host is having a transient issue. But what if that transient issue is repetitive? For example, what if a cluster keeps going into network-isolation and the VMs are therefore going offline too often?
If a clustered host goes into isolation too many times within a set time frame then the cluster will place this host into quarantine. The cluster will move virtual machines from a quarantined host, ideally using your pre-defined migration method (defaulting to Live Migration, but allowing you to set Quick Migration for a priority of VM or selected VMs).
The cluster will not place further VMs onto the quarantined host and this gives administrators time to fix whatever the root cause is of the transient issues.
This feature, along with Cluster Compute Resiliency are what I would call “maturity features”. They’re the sorts of features that make life easier … and might lead to fewer calls at 3am when a host misbehaves because the cluster is doing the remediation for you.