Speakers: Elden Christensen & Ned Pyle, Microsoft
A pretty full room to talk fundamentals.
Stretching clusters has been possible since Windows 2000, but it relied on partner replication solutions. WS2016 makes it possible to do this without those partners, and it’s more than just HA – it’s also a DR solution. There is built-in volume replication so you don’t need SAN or 3rd-party replication technologies, and you can use different storage systems between sites.
Assuming: You know about clusters already – not enough time to cover this.
Goal: To use clusters for DR, not just HA.
RTO & RPO
- RTO: Accepted amount of time that services are offline
- RPO: Accepted amount of data loss, measured in time.
- Automated failover: manual invocation, but automated process
- Automatic failover: a heartbeat failure automatically triggers a failover
- Stretch clusters can achieve low RPO and RTO
- Can offer disaster avoidance (new term) ahead of a predicted disaster. Use clustering and Hyper-V features to move workloads.
- Stretch cluster: what used to be called a multi-site cluster, metro cluster or geo cluster.
Stretch Cluster Network Considerations
Clusters are very aggressive out of the box: one heartbeat per second, and 5 missed heartbeats = the node is considered down and failover begins. For a stretch cluster, relax this via PowerShell: (Get-Cluster).SameSubnetThreshold = 10 and (Get-Cluster).CrossSubnetThreshold = 20
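A quick sketch of those settings, assuming you run this on a cluster node with the FailoverClusters module loaded:

```powershell
# Relax heartbeat sensitivity for a stretch cluster.
# Defaults are aggressive: 1-second heartbeats, 5 missed = node considered down.
(Get-Cluster).SameSubnetThreshold  = 10   # nodes in the same site
(Get-Cluster).CrossSubnetThreshold = 20   # nodes across the inter-site link
```

The trade-off: higher thresholds mean slower detection of a genuinely dead node, so don’t raise them further than your WAN latency actually requires.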
Different data centers = different subnets. They are using Network Name Resources for things like file shares; the NNR has IP address A and IP address B, and whichever address matches the site where the resource is active gets registered in DNS. Note that DNS registrations need to replicate and cached records have to expire (TTL), so if you fail over something like a file share there will be some RTO while clients pick up the new address.
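That DNS-driven RTO can be shrunk by tuning the network name resource. A sketch, assuming a clustered role whose network name resource is called "FS1" (a hypothetical name):

```powershell
# Shorten the DNS TTL on the network name so clients re-resolve sooner
# after a cross-site failover (default is 1200 seconds).
Get-ClusterResource "FS1" |
    Set-ClusterParameter -Name HostRecordTTL -Value 300

# Optionally register both sites' IP addresses at once, so capable clients
# can try each address instead of waiting for a new DNS registration.
Get-ClusterResource "FS1" |
    Set-ClusterParameter -Name RegisterAllProvidersIP -Value 1

# The resource must be cycled for the parameter changes to take effect.
Stop-ClusterResource "FS1"
Start-ClusterResource "FS1"
```

RegisterAllProvidersIP only helps clients that know how to try multiple addresses (e.g. SQL Server connections using MultiSubnetFailover), so test it with your actual workload first.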
If you are stretching Hyper-V clusters then you can use HNV to abstract the IPs of the VMs after failover.
Another strategy is that you prefer local failover. HA scenario is to failover locally. DR scenario is to failover remotely.
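Local-first failover is expressed through preferred owners. A minimal sketch, assuming a group "FS1" and hypothetical node names where the A nodes are local and the B nodes are in the DR site:

```powershell
# List same-site nodes first so the cluster tries a local failover (HA)
# before moving the role to the remote site (DR).
Set-ClusterOwnerNode -Group "FS1" -Owners NodeA1, NodeA2, NodeB1, NodeB2
```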
You can stretch VLANs across sites – your network admins will stop sending you Christmas cards.
There are network abstraction devices from the likes of Cisco, which offer the same kind of IP abstraction that HNV offers.
(Get-Cluster).SecurityLevel = 2 will encrypt cluster traffic on untrusted networks.
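For reference, the three SecurityLevel values:

```powershell
# Cluster inter-node traffic security:
#   0 = clear text, 1 = signed (the default), 2 = encrypted.
(Get-Cluster).SecurityLevel = 2
```

Encryption costs some CPU, so 2 is mainly worth it when cluster traffic crosses networks you don’t fully trust, such as the inter-site link.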
When nodes cannot talk to each other, they need a way to reconcile who stays up and who “shuts down” cluster activities. Votes are assigned to each node and to a witness. When a site fails, a large block of votes disappears simultaneously. Plan for this to ensure that quorum is still possible.
In a stretch cluster you ideally want a witness in a third site (Site C), reached via a network connection that is independent of the Site A to Site B link. That way the witness is still available if one site goes offline or the A-B link goes down. Traditionally this witness is a file share witness. The common objection: “we don’t have a 3rd site”.
In WS2016, you can use a cloud witness in Azure. It’s a blob over HTTP in Azure.
Demo: Created a storage account in Azure and got the access key. A container contains a sequence number, just like a file share witness. Configure the cluster quorum as usual: choose Select a Witness, then select Configure a Cloud Witness, enter the storage account name and paste in the key. Now the cluster starts using Azure as the 3rd-site witness. Very affordable solution using a teeny bit of Azure storage. The cluster manages the permissions of the blob file, and the blob stores only a sequence number; there is no sensitive private information. For an SME, a single Azure credit ($100) might last a VERY long time. In testing, they haven’t been able to get a charge of even $0.01 per cluster!
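The same thing can be scripted instead of clicking through the wizard. A sketch, where the storage account name and key are placeholders:

```powershell
# Configure an Azure cloud witness (WS2016).
# "mystorageacct" and the key below are placeholders for your own values.
Set-ClusterQuorum -CloudWitness `
    -AccountName "mystorageacct" `
    -AccessKey "<storage-account-access-key>"
```

The cluster nodes need outbound HTTPS access to Azure blob storage for the witness to work.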
Clustering in WS2012 R2 can survive a 50% loss of votes at once. One site is automatically elected to win; it’s random by default, but you can configure it. You can configure manual failover between sites by toggling the votes in the DR site – remove the votes from the DR site nodes. You can set preferred owners for resources too.
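The vote toggling and tie-break configuration look like this in PowerShell (node names are hypothetical, with the B nodes in the DR site):

```powershell
# Manual DR failover: strip votes from the DR-site nodes so the primary
# site always retains quorum; flip them back when you want to fail over.
(Get-ClusterNode "NodeB1").NodeWeight = 0
(Get-ClusterNode "NodeB2").NodeWeight = 0

# WS2012 R2: control which side loses the tie-break on a 50/50 vote split
# by pointing at the ID of a node in the site you want to sacrifice.
(Get-Cluster).LowerQuorumPriorityNodeID = (Get-ClusterNode "NodeB1").Id
```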
Elden hands over to Ned. Ned will cover Storage Replica. I have to leave at this point … but Ned is covering this topic in full length later on today.