Recommended Updates For WS2012R2 Hyper-V

A wiki page has been posted to list the hotfixes for Windows Server 2012 R2 Hyper-V.  At this point, it only contains the GA update, but that is sure to change.

I have had a look for the equivalent page for Failover Clustering but nothing has come up in my results yet.  I’ll update this post if I find something.

KB2779069 – Hotfix To Determine Which Cluster Node Is Blocking GUM Updates In WS2008R2 & WS2012

First some background …

A cluster is made up of (normally) 2 or more servers.  They use a distributed database to keep a synchronised copy of the configuration of the HA resources, e.g. HA VMs on a Hyper-V cluster.  Something called the Global Update Manager (GUM) is used to coordinate consistent updates to resource configurations across the cluster nodes. 

When a node has an update that must be shared with the other nodes, the initiator node first obtains a GUM lock.  Then, the node shares the update by sending a Multicast Request Reply (MRR) message to the other nodes.  After this update is sent, the initiator node waits for a response from the other nodes before it continues.  However, in certain conditions, one of the nodes does not reply to the GUM request in time because the node is stuck for some reason.  Previously, there was no mechanism to determine which node was stuck and not replying to the GUM request.

That’s just changed thanks to a now-available hotfix that adds two new cluster control codes to help you determine which cluster node is blocking a GUM update in Windows Server 2008 R2 and Windows Server 2012.

After you install this hotfix, two new cluster control codes are added to help the administrator resolve the problem. One of the cluster control codes returns the GUM lock owner, and the other control code returns the nodes that are stuck. Therefore, the administrator can restart the stuck nodes to resolve the problem. For more information about the new control codes, see the following Microsoft notes:

Notes

  • The Cluster service has a facility that is called GUM. GUM is used to distribute a global state throughout the cluster.
  • Only one cluster node can send GUM messages at any time. This node is called the GUM lock owner.
  • The GUM lock owner sends an MRR message to a subset of cluster nodes, and then waits for the nodes to send message receipt confirmations.
  • Run some iterations of these control codes to confirm that the node is stuck.
  • After the CLUSCTL_CLUSTER_GET_GUM_LOCK_OWNER control code is called, you have to close the cluster handle. Then, you reopen the cluster handle by using the GUM lock owner node name that is returned by the control code. If you do not perform this action, the CLUSCTL_NODE_GET_STUCK_NODES control code may return an incorrect result.

You can get this hotfix from here.

KB2905412 – Stop Error 0xD1 On Windows-Based Computer With Multiple Processors

Not strictly a Hyper-V issue, but you’ll understand why I am blogging about this one: a hotfix has been released for Stop error 0xD1 on a Windows-based computer with multiple processors.

Symptoms

Your multiprocessor Windows-based computer crashes every two to three days. Additionally, Stop error 0xD1 is generated when the computer crashes.

Cause

This problem occurs because of a race condition that exists in the TCP/IP driver in a multiprocessor environment. If duplicate TCP segments are received on different processors, they may be sequenced incorrectly, and this triggers the crash.

A hotfix is available

KB2908415 – CSVs Go Offline Or Cluster Service Stops During VM Backup On WS2012 Hyper-V

Another hotfix from Microsoft, this one for when Cluster Shared Volumes go offline or the Cluster service stops during VM backup on a Windows Server 2012 Hyper-V host server.

Symptoms

Consider the following scenario:

  • You have a Windows Server 2012 Hyper-V host server.
  • You have the server in a cluster environment, and you use cluster shared volumes.
  • You try to back up a virtual machine (VM).

In this scenario, you may find that the cluster shared volumes go offline, and resource failover occurs on the other cluster nodes. Then, other VMs also go offline, or the Cluster service stops.

Cause

This problem occurs when there are many snapshots in the VM. This causes the Plug and Play (PnP) functionality on the host to be overwhelmed, and other critical cluster activity cannot finish.

A supported hotfix is available from Microsoft Support.

KB2902014 – Guest System Time Incorrect After VM Crashes On Win8 or WS2012 Hyper-V Host

This is a busy month for hotfixes!  Microsoft has released a fix for when the system time of a virtual machine becomes incorrect after it crashes or resets on a 64-bit Windows 8-based or Windows Server 2012-based Hyper-V host.

Symptoms

Consider the following scenario:

  • You create a virtual machine (VM) on a Hyper-V host that runs 64-bit Windows 8 or Windows Server 2012.
  • You disable the Hyper-V time synchronization integration service on the VM.
  • The VM crashes or resets.

In this situation, the system time of the VM is incorrect when it starts again.

Cause

This issue occurs because the time information of the VM is not saved to the VHD as expected when the VM crashes or resets. When the VM starts again, it uses the old, out-of-date time information that was saved.

A hotfix is available to prevent this issue.

KB2894485 – Cross-Page Memory Operation Crashes VM on Win8 or WS2012 Hyper-V Host

Microsoft has released a hotfix for when a cross-page memory read or write operation crashes a virtual machine that runs on a 64-bit Windows 8-based or Windows Server 2012-based Hyper-V host.

Symptoms

Assume that you install a Windows virtual machine on a Hyper-V host that runs 64-bit Windows 8 or Windows Server 2012. You have an application that runs on the virtual machine. This application performs a memory read or write operation that touches a Memory Mapped Input/Output (MMIO) region, and the operation crosses a page boundary. In this situation, the virtual machine crashes.

A hotfix is available to prevent this problem.

Migrating Two Non-Clustered Hyper-V Hosts To A Failover Cluster (With DataOn & Storage Spaces)

At work we have a small number of VMs to operate the business.  For our headcount, we actually have quite a lot of VMs, because distribution requires lots of systems for lots of vendors.  I generally have very little to do with our internal IT, but I’ll get involved with some engineering stuff from time to time.

2 non-clustered hosts (HP DL380 G6) were set up before I joined the company.  I upgraded/migrated those hosts to WS2012 earlier this year (networking = 4 * 1 GbE NIC team with virtualized converged networking for the management OS and Live Migration).

We decided to migrate the non-clustered hosts to create a Hyper-V cluster.  This was made affordable thanks to Storage Spaces, running on a shared JBOD.  We distribute DataOn, so we went with a single DNS-1640, attached to both servers using LSI 9207-8e dual-port SAS cards.

Yes, we’re doing the small biz option where two Hyper-V hosts are directly connected to a JBOD where Storage Spaces is running.  If we had more than 2 hosts, we would have used the SMB 3.0 architecture of Scale-Out File Server (SOFS).  Here is the process we have followed so far (all going perfectly up to now):

Step 1 – Upgrade RAM

Each host had enough RAM for its solo workload.  In a cluster, a single node must be capable of handling all VMs after a failover.  In our case, we doubled the RAM in each of the two servers.

Step 2 – Drain VMs from Host1

Using Shared-Nothing Live Migration, we moved VMs from Host1 to Host2.  This allows us to operate on a host for an extended period without affecting production VMs.

Note that this only worked because we had already upgraded the RAM (step 1) and we had sufficient free disk space on Host2.
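For the record, the drain can be scripted.  This is a minimal sketch using the standard Move-VM cmdlet; the host names and the D:\VMs destination path are examples, not our actual values.

```powershell
# Shared-nothing live migration: move every VM from Host1 to Host2,
# including its storage. Assumes you have rights on both hosts.
Get-VM -ComputerName Host1 | ForEach-Object {
    Move-VM -ComputerName Host1 -Name $_.Name -DestinationHost Host2 `
        -IncludeStorage -DestinationStoragePath "D:\VMs\$($_.Name)"
}
```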

Step 3 – Connect Host1

We added an LSI card into Host1.  We racked the JBOD.  And then we connected Host1 to the JBOD, one SAS cable going to port1/module1 in the JBOD, and the other SAS cable going to port1/module2 in the JBOD (for HA).

Host1 was booted up.  I downloaded the drivers, firmware, and BIOS from LSI for the adapter (never, ever use the drivers for anything that come on the Windows media if there is an OEM driver) and installed them.

Step 4 – Create Cluster

I installed two Windows features on Host1:

  • Failover Clustering
  • MPIO

I added SAS support in MPIO, which required a reboot.
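If you prefer PowerShell for this step, something like the following should do it (a sketch; verify the feature names on your build):

```powershell
# Install Failover Clustering (with management tools) and MPIO
Install-WindowsFeature -Name Failover-Clustering, Multipath-IO -IncludeManagementTools

# Claim SAS-attached devices for MPIO - this is what forces the reboot
Enable-MSDSMAutomaticClaim -BusType SAS
Restart-Computer
```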

An additional vNIC called Cluster2 was added to the management OS.  I then renamed the Live Migration network to Cluster1.  QoS was configured so that the virtual switch has 25% in the default bucket, and each of the 3 vNICs in the management OS gets 25%.
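Roughly, that networking looks like this in PowerShell.  The switch name (ConvergedSwitch) and the Management vNIC name are examples, and this assumes the switch was created with -MinimumBandwidthMode Weight:

```powershell
# Add a second cluster vNIC to the management OS
Add-VMNetworkAdapter -ManagementOS -SwitchName ConvergedSwitch -Name Cluster2

# Rename the Live Migration vNIC to Cluster1
Rename-VMNetworkAdapter -ManagementOS -Name "Live Migration" -NewName Cluster1

# Weight-based QoS: 25% in the default bucket, 25% per management OS vNIC
Set-VMSwitch ConvergedSwitch -DefaultFlowMinimumBandwidthWeight 25
Set-VMNetworkAdapter -ManagementOS -Name Management -MinimumBandwidthWeight 25
Set-VMNetworkAdapter -ManagementOS -Name Cluster1 -MinimumBandwidthWeight 25
Set-VMNetworkAdapter -ManagementOS -Name Cluster2 -MinimumBandwidthWeight 25
```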

SMB Multichannel constraints were configured for Cluster1 and Cluster2 on all servers.  That’s to control which NICs are used by SMB Multichannel (used by redirected IO).
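A constraint looks something like this (a sketch; the vEthernet interface aliases and server name are examples, so check yours with Get-NetAdapter first):

```powershell
# Restrict SMB Multichannel traffic to the cluster vNICs when talking to Host2
New-SmbMultichannelConstraint -ServerName Host2 `
    -InterfaceAlias "vEthernet (Cluster1)", "vEthernet (Cluster2)"
```

Repeat for each server name that SMB will connect to.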

I then created a single node cluster and configured it.  Then it was time for more patching from Windows Update.
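Creating the single-node cluster is nearly a one-liner (the cluster name and IP address here are examples):

```powershell
# Validate first, then create the cluster with just Host1 as a member
Test-Cluster -Node Host1
New-Cluster -Name HVC1 -Node Host1 -StaticAddress 192.168.1.50
```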

Step 5 – Hotfixes

I downloaded the recommended updates for WS2012 Hyper-V and Failover Clustering (not found on Windows Update) using a handy PowerShell script.  Then I installed them on Host1 and rebooted.

Step 6 – Storage Spaces

In Failover Cluster Manager I configured a new storage pool.  We’re still on WS2012, so a single hot spare disk was assigned.  Note that I strongly recommend WS2012 R2 and not assigning a hot spare; parallelized restore is a much faster and better option.
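The same pool can be built in PowerShell.  This is a sketch with example friendly names; on a stand-alone pair like ours the target is the local Storage Spaces subsystem:

```powershell
# Pool every available JBOD disk
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName Pool1 `
    -StorageSubSystemFriendlyName (Get-StorageSubSystem).FriendlyName `
    -PhysicalDisks $disks

# WS2012-era practice: mark one disk as a hot spare (disk name is an example)
Set-PhysicalDisk -FriendlyName PhysicalDisk12 -Usage HotSpare
```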

3 virtual disks (LUNs) were created:

  • Witness for the cluster
  • CSV1
  • CSV2

Rule of thumb: create 1 CSV per node in the cluster that is connected by SAS to the Storage Pool.
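The three virtual disks can be created like this (sizes are examples; we’re using mirrored resiliency):

```powershell
# Small witness disk plus two mirrored CSV-to-be virtual disks
New-VirtualDisk -StoragePoolFriendlyName Pool1 -FriendlyName Witness `
    -ResiliencySettingName Mirror -Size 1GB
New-VirtualDisk -StoragePoolFriendlyName Pool1 -FriendlyName CSV1 `
    -ResiliencySettingName Mirror -Size 2TB
New-VirtualDisk -StoragePoolFriendlyName Pool1 -FriendlyName CSV2 `
    -ResiliencySettingName Mirror -Size 2TB
```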

Step 7 – Configure Cluster Disks

The cluster is still single-node, so configuring a witness disk for quorum will cause alerts.  You can do it, but be aware of the alerts.

Each of the CSV virtual disks was converted to a CSV and renamed (CSV1 and CSV2), including the mount points.
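In PowerShell, the conversion and renames look roughly like this (the "Cluster Disk n" and Volume n names are examples; check yours in Failover Cluster Manager first):

```powershell
# Convert the two clustered disks to CSVs
Add-ClusterSharedVolume -Name "Cluster Disk 2"
Add-ClusterSharedVolume -Name "Cluster Disk 3"

# Rename the cluster resources ...
(Get-ClusterSharedVolume -Name "Cluster Disk 2").Name = "CSV1"
(Get-ClusterSharedVolume -Name "Cluster Disk 3").Name = "CSV2"

# ... and the mount points to match
Rename-Item C:\ClusterStorage\Volume1 CSV1
Rename-Item C:\ClusterStorage\Volume2 CSV2
```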

Step 8 – Test

Using Shared-Nothing Live Migration, a VM was moved to the cluster and placed on a CSV. 

This is where we are now, and we’re observing the performance/health of the new infrastructure.

Step 9 – Shared-Nothing Live Migration From Host2

All of the VMs will be moved from the D: drive of Host2 to the cluster, running on Host1, and spread evenly across the two CSVs.  This will leave Host2 drained.

Remember to reconfigure backups to back up VMs from the cluster!

Step 10 – Finish The Job

We will:

  1. Reconfigure the networking of Host2 as above (I’ve saved the PowerShell)
  2. Insert the LSI card in Host2 and connect it to the JBOD
  3. Install all the LSI drivers & updates on Host2 as we did on Host1
  4. Add the Failover Cluster and MPIO roles to Host2
  5. Add Host2 as a node in the cluster
  6. Patch up Host2
  7. Test Live Migration
  8. Plan out VM failover prioritization
  9. Configure Cluster Aware Updating self-updating for lunch time on the second Monday of every month – that’s a full month after Patch Tuesday, giving MSFT plenty of time to fix any broken updates (I’m thinking of Cumulative Updates/Update Rollups).
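The CAU schedule in item 9 would be configured along these lines (a sketch; the cluster name and start date are examples, and you should verify that the -WeeksOfMonth parameter is available on your build):

```powershell
# Self-updating CAU: second Monday of the month, kicking off at lunch time
Add-CauClusterRole -ClusterName HVC1 -DaysOfWeek Monday -WeeksOfMonth 2 `
    -StartDate "12/01/2013 12:00" -EnableFirewallRules -Force
```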

And that should be that!

KB2898774 – Data Loss Occurs On SCSI Disk That Turns Off In WS2012-Based Failover Cluster

Microsoft has released a KB article to avoid data loss occurring when a SCSI disk turns off in a Windows Server 2012-based failover cluster.

Symptoms

Consider the following scenario:

  • You deploy a Windows Server 2012-based failover cluster. The cluster contains two nodes (node A and node B).
  • A SCSI disk is used for the failover cluster. The disk is a shared disk and is accessible by both node A and node B.
  • Node A restarts or crashes. Then, the cluster fails over to node B.
  • Node A comes back online.
  • Node B is shut down and the cluster fails over to node A.
  • You write some data to the disk.
  • The disk turns off unexpectedly. For example, the device loses power.

In this scenario, the data that you write to the disk is lost.
Notes

  • This issue also occurs when the cluster contains more than two nodes.
  • This issue does not occur if the SCSI disk supports the SCSI Primary Commands – 4 (SPC-4) standard.

To resolve this issue, install update rollup 2903938.  It’s an update rollup, so update rollup rules apply – either test like nuts in a lab or wait a month before you approve/deploy it.

Another Windows Phone Hatchet Job? – Lumia 1020

I’ve been using this phone for 3 weeks.  Camera: excellent.  Social experience: excellent.  Apps: need some work, but improving.  The only issue I’m having is when I’m listening to something in the car via the phone and I get a text: the audio stops until I acknowledge the text – that’s a little unsafe.

Best thing I can say: this is the longest I’ve used Windows Phone as my personal handset without once getting annoyed at it :D  I’ve no plans to switch off of it for now.

I think that counts as high praise!?!?!?


Flow Of Storage Traffic In Hyper-V Over SMB 3.0 to WS2012 R2 SOFS

I thought I’d write a post on how traffic connects and flows in a Windows Server 2012 R2 implementation of Hyper-V with the storage being Hyper-V over SMB 3.0 on a WS2012 R2 Scale-Out File Server (SOFS).  There are a number of pieces involved.  Understanding what is going on will help you in your design, implementation, and potential troubleshooting.

The Architecture

I’ve illustrated a high-level implementation below.  Mirrored Storage Spaces are being used as the back-end storage.  Two LUNs are created on this storage.  A cluster is built from FS1 and FS2, and connected to the shared storage and the 2 LUNs.  Each LUN is added to the cluster and converted to Cluster Shared Volume (CSV).  Thanks to a new feature in WS2012 R2, CSV ownership (the CSV coordinator automatically created and managed role) is automatically load balanced across FS1 and FS2.  Let’s assume, for simplicity, that CSV1 is owned by FS1 and CSV2 is owned by FS2.

The File Server for Application Data role (SOFS) is added to the cluster and named as SOFS1.  A share is added to CSV1 called CSV1-Share, and a share called CSV2-Share is added to CSV2.

[Diagram: high-level architecture – FS1 and FS2 clustered over mirrored Storage Spaces, with CSV1/CSV1-Share and CSV2/CSV2-Share, and Host1 connecting via SMB 3.0]

Any number of Hyper-V hosts/clusters can be permitted to use both or either share.  For simplicity, I have illustrated just Host1.

Name Resolution

Host1 wants to start up a VM called VM1.  The metadata of VM1 says that it is stored on \\SOFS1\CSV1-Share.  Host1 will do a DNS lookup for SOFS1 when it performs an initial connection.  This query will return all of the IP addresses of the nodes FS1 and FS2.

Tip: Make sure that the storage/cluster networks of the SOFS nodes are enabled for client connectivity in Failover Cluster Manager.  You’ll know that this is done because the NICs’ IP addresses will be registered in DNS with additional A records for the SOFS CAP/name.

Typically in this scenario, Host1 will have been given 4-6 addresses for the SOFS role.  It will perform a kind of client based round robin, randomly picking one of the IP addresses for the initial connection.  If that fails, another one will be picked.  This process continues until a connection is made or the process times out.
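You can see exactly what the host gets back from this lookup with Resolve-DnsName (SOFS1 being the example CAP name from the diagram):

```powershell
# Every A record registered for the SOFS client access point
Resolve-DnsName -Name SOFS1 -Type A
```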

Now SMB 3.0 kicks in.  The SMB client (host) and the SMB server (SOFS node) will negotiate capabilities such as SMB Multichannel and SMB Direct.

Tip: Configure SMB Multichannel Constraints to control which networks will be used for storage connectivity.

Initial Connection

There are two scenarios now.  Host1 wants to use CSV1-Share so the best possible path is to connect to FS1, the owner of the CSV that the share is stored on.  However, the random name resolution process could connect Host1 to FS2.

Let’s assume that Host1 connects to FS2.  They negotiate SMB Direct and SMB Multichannel, and Host1 connects to the storage of the VM and starts to work.  The data flow will be as illustrated below.

Mirrored Storage Spaces offer the best performance.  Parity Storage Spaces should not be used for Hyper-V.  Repeat: Parity Storage Spaces SHOULD NOT BE USED for Hyper-V.  However, Mirrored Storage Spaces in a cluster, such as a SOFS, are in permanent redirected IO mode.

What does this mean?  Host1 has connected to SOFS1 to access CSV1-Share via FS2.  CSV1-Share is on CSV1.  CSV1 is owned by FS1.  This means that Host1 will connect to FS2, and FS2 will redirect the IO destined for CSV1 (where the share lives) via FS1 (the owner of CSV1).

[Diagram: redirected IO – Host1 connects to FS2, and FS2 forwards CSV1 traffic over the cluster network to FS1]

Don’t worry; this is just the initial connection to the share.  This redirected IO will be dealt with in the next step.  And it won’t happen again to Host1 for this share once the next step is done.

Note: if Host1 had randomly connected to FS1 then we would have direct IO and nothing more would need to be done.

You can see why the cluster networks between the SOFS nodes need to be at least as fast as the storage networks that connect the hosts to the SOFS nodes.  In reality, we’re probably using the same networks, converged to perform both roles, making the most of the investment in 10 GbE or faster, possibly with RDMA.

SMB Client Redirection

There is another WS2012 R2 feature that works alongside CSV balancing.  The SMB server, running on each SOFS node, will redirect SMB client (Host1) connections to the owner of the CSV being accessed.  This is only done if the SMB client has connected to a non-owner of a CSV.

After a few moments, the SMB server on FS2 will instruct Host1 that for all traffic to CSV1, Host1 should connect to FS1.  Host1 seamlessly redirects and now the traffic will be direct, ending the redirected IO mode.
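One way to verify which node the host has ended up talking to is to inspect the SMB connections from the Hyper-V host; after the redirection settles, the server-side IPs shown should belong to the CSV owner:

```powershell
# Run on the Hyper-V host: lists each SMB Multichannel connection,
# including the server-side IP address in use
Get-SmbMultichannelConnection
```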

[Diagram: direct IO – Host1 now connects straight to FS1, the owner of CSV1]

TIP: Have 1 CSV per node in the SOFS.

What About CSV2-Share?

What if Host1 wants to start up VM2, stored on \\SOFS1\CSV2-Share?  This share is stored on CSV2, and that CSV is owned by FS2.  Host1 will again connect to the SOFS for this share, and will be redirected to FS2 for all traffic related to that share.  Now Host1 is talking to FS1 for CSV1-Share and to FS2 for CSV2-Share.

TIP: Balance the placement of VMs across your CSVs in the SOFS.  VMM should be doing this for you anyway if you use it.  This will roughly balance connectivity across your SOFS nodes.

And that is how SOFS, SMB 3.0, CSV balancing, and SMB redirection give you the best performance with clustered mirrored Storage Spaces.