Day 2: Windows Server 2008 Failover Cluster Troubleshooting & Tips

The speaker is David Dion from MS.

Windows Server 2008 is the last release available for x86 (32-bit). In W2008 clustering, the nodes do not all need to be exactly identical.

Cluster Validation

Lots of problems in deployments of previous editions of Windows clustering (MSCS) were caused by configuration issues. The Cluster Validation tool addresses this. It is built into W2008 and tests the servers, OS and storage to check that the configuration is valid. It should be run before building the cluster and after any change: adding a node, adding drivers, applying patches, updating firmware or BIOS (server or device), etc. You can also run the validation tool when troubleshooting; it should be your first course of action.

Very easy to use; it’s just a wizard. Best to run all of the tests. However, the full set of storage tests can take hours when there are hundreds of disks, e.g. on a 16 node Hyper-V cluster. A report is generated as an MHT file that opens in IE. Each test gets a pass, pass with warning or fail. The report is stored in the Windows\Cluster\Reports folder.

Do not assume the hardware configuration will be fine; run the validation utility to test it.

Concerns:

Validation of storage requires that the storage be offline. Be careful with Hyper-V: schedule a full cluster maintenance window.
Running validate with a single node is pointless.

W2003 clustering required that the H/W was on a clustering HCL. That meant niche H/W, which was therefore expensive. Everyone hated it. The HCL is not used in W2008. The validation tool is your cluster certification. Purchase gear with the W2008 logo. Run the tool and if you get a pass then you’re certified. Keep a copy of the report for PSS.
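Since the report doubles as your certification, it is worth copying it somewhere safe after every run. A minimal Python sketch, assuming the reports are `.mht` files in a single folder; the function name and the example paths are my own illustration, not something from the session:

```python
import shutil
from pathlib import Path


def archive_latest_report(reports_dir, archive_dir):
    """Copy the newest .mht validation report into archive_dir.

    Returns the path of the copy, or None if no report was found.
    """
    # Sort by modification time so the last entry is the most recent run.
    reports = sorted(Path(reports_dir).glob("*.mht"),
                     key=lambda p: p.stat().st_mtime)
    if not reports:
        return None
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves the file's timestamps.
    return Path(shutil.copy2(reports[-1], archive / reports[-1].name))


# Example call (both paths are assumptions; adjust to your environment):
# archive_latest_report(r"C:\Windows\Cluster\Reports", r"D:\ClusterCerts")
```

Run it after each validation pass so the copy you hand to PSS matches the current configuration.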

MS recommends you purchase "Failover Cluster Configuration Program" solutions from vendors, i.e. the pricey niche solutions, e.g. a cluster kit. Interestingly, HP is not one of the 9 partners in the program. Dell and IBM are.

Event Viewer

Check the Microsoft\Windows\FailoverClustering log. Event logs are no longer replicated across all nodes in the cluster. You should use the MMC to view events from all nodes. You can also build event queries there. You can filter events for applications and resources. Because events are pooled from every node, beware of running the MMC remotely from the cluster: it can kill the WAN. Normally we only see critical and warning events. By enabling the operational log you can see information events as well.

Start with events if looking at non-configuration issues on the cluster.

Cluster Debug Logging

Lots of information and not user friendly. The legacy cluster log file no longer exists. Logging now goes to an event trace session: "Microsoft-Windows-FailoverClustering". The log is enabled by default. You can produce a human readable log using the "Cluster.exe log" command.

Tracerpt.exe can be used to dump the trace session: to .EVTX to view the file in Event Viewer, or to .XML for you scripting freaks or to open in IE. Cluster.exe can raise or reduce the level of logging: 3 is the default, 1 is low, 5 is high. Running this command on one node configures all the nodes. Changing the size of the file causes historical logs to be lost. Copy them safely before doing this. It’s quite verbose at level 5. Running at level 3 (default) is recommended.

This is the last troubleshooting tool you should reach for. Retaining 72 hours of data as a minimum is recommended. What size of log is 72 hours? How long is a piece of string? File shares are quiet. Exchange is noisy. Hyper-V probably could be as well if VMs are moving about. Change the log size first, then set the required verbosity. Cluster logs are always in the GMT time zone. You’ll have to mentally convert the times when comparing with Windows Event Viewer if you are in a different time zone.
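Two bits of arithmetic here are easy to get wrong by hand, so here is a rough Python sketch of both: scaling the log size to a retention target, and shifting a GMT cluster log timestamp to local time. The function names are mine, and the timestamp layout is an assumption about how the generated log looks; check it against your own log before trusting it.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout, e.g. "2009/03/12-23:30:00.500"; adjust if yours differs.
CLUSTER_LOG_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def log_size_for_retention(current_size_mb, hours_covered, target_hours=72):
    """Scale the current log size so it covers target_hours at the same logging rate."""
    if hours_covered <= 0:
        raise ValueError("hours_covered must be positive")
    return math.ceil(current_size_mb * target_hours / hours_covered)


def cluster_time_to_local(stamp, utc_offset_hours):
    """Convert a GMT cluster log timestamp to local time for a fixed UTC offset."""
    utc = datetime.strptime(stamp, CLUSTER_LOG_FORMAT).replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))


# A 100 MB log that only reaches back 18 hours needs about 400 MB for 72 hours.
print(log_size_for_retention(100, 18))                        # 400
# 23:30 GMT is 18:30 on the same day at UTC-5.
print(cluster_time_to_local("2009/03/12-23:30:00.500", -5))
```

The size estimate only holds if the workload and verbosity stay the same, so measure at the level you intend to run. The fixed offset ignores daylight saving changes; pass the offset that applied when the event happened.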

Windows Server 2008 R2

The Validation Tool includes best practices tests: quorum configuration, status of cluster resources, network name settings in multi-site clusters.
Performance Counters are added into perfmon for clustering.
There will be PowerShell support.
There is a read only mode for the console.

Best Practices For Now

Try to use identical hardware on all nodes. Especially storage: HBA, firmware, driver, cables, etc.
Run the validation tool.
Don’t add resources to the Cluster Group or the Available Storage Group.
Keep regular system state backups. This includes the cluster database automatically.
Use "preferred owners" and "possible owners" to balance the cluster.
Multi-site clusters are more complex so check out the MS site for a whitepaper.

Quorum:

Node and Disk Majority where there is shared storage. Use a small disk, 512MB at least. Only use it for the quorum. Use it as a GUI drive to discourage alternate usage. No need to back up the quorum disk.
Node and File Share Majority: use one file server for many clusters but dedicate 1 share to each cluster. OK to use a clustered file server but keep it in a different cluster (chicken and egg). File server should be in the same forest as the cluster. Avoid DFS namespaces.
More information available.
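
The reasoning behind both majority modes is simple vote counting: each node has one vote, the witness (disk or file share) adds one more, and the cluster stays up while more than half of the votes are present. A small Python sketch of that arithmetic; it is a simplified model with function names of my own, not how the cluster service is implemented:

```python
def votes_needed(total_votes):
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def tolerated_failures(nodes, witness=False):
    """How many voters can be lost while the remaining ones still hold a majority."""
    total = nodes + (1 if witness else 0)
    return total - votes_needed(total)


# Two nodes alone cannot lose anything; adding a witness lets one voter fail.
print(tolerated_failures(2))                 # 0
print(tolerated_failures(2, witness=True))   # 1
# Four nodes: one failure without a witness, two with it.
print(tolerated_failures(4))                 # 1
print(tolerated_failures(4, witness=True))   # 2
```

This is why the witness matters most when the node count is even: it breaks the tie.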

Old 2003 best practices that are gone:

You can add nodes as you want – nodes do not need to be powered off.
No NIC teaming restrictions any more.
No need to stagger boot times, e.g. W2003 required 30-60 second gaps.
Clustering runs as local system now. No password to change for the service.
Keep an eye on the hotfixes page for clustering.
