Microsoft News Summary – 28 July 2014

It was a quiet weekend. Note a useful script by Jose Barreto for health checking a Scale-Out File Server (SOFS).

A Kit/Parts List For A WS2012 R2 Hyper-V Cluster With DataOn SMB 3.0 Storage

I’ve had a number of requests to specify the pieces of a solution where there is a Windows Server 2012 R2 Hyper-V cluster that uses SMB 3.0 to store virtual machines on a Scale-Out File Server with Storage Spaces (JBOD). So that’s what I’m going to try to do with this post. Note that I am not going to bother with pricing:

  • It takes too long to calculate
  • Prices vary from country to country
  • List pricing is usually meaningless; work with a good distributor/reseller and you’ll get a bid/discount price.
  • Depending on where you live in the channel, you might be paying distribution price, trade price, or end-customer price, and that determines how much margin has been added to each component.
  • I’m lazy

Scale-Out File Server

Remember that an SOFS is a cluster that runs a special clustered file server role for application data. A cluster requires shared storage. That shared storage will be one or more Mini-SAS-attached JBOD trays (on the Storage Spaces HCL) with Storage Spaces supplying the physical disk aggregation and virtualization (normally done by SAN controller software).
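
Just to make that concrete, here’s a minimal PowerShell sketch (all names and the IP are made up) of forming the SOFS cluster and adding the Scale-Out File Server role once the nodes, networking and storage are in place:

  # Form the 2-node cluster and add the SOFS role (hypothetical names/IP).
  New-Cluster -Name "SOFSCluster1" -Node "SOFS-Node1","SOFS-Node2" -StaticAddress 192.168.1.50
  Add-ClusterScaleOutFileServerRole -Name "Demo-SOFS1" -Cluster "SOFSCluster1"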

On the blade versus rack server question: I always go rack server. I’ve been burned by the limited flexibility and high costs of blades. Sure you can get 64 blades into a rack … but at what cost!?!?! FlexFabric-like solutions are expensive, and strictly speaking, not supported by Microsoft – not to mention they limit your bandwidth options hugely. The massive data centres that I’ve seen, and been in, use 1U and 2U rack servers. I like 2U rack servers over 1U because 1U rack servers such as the R420 have only one full-height and one half-height PCI expansion slot. That half-height slot makes for tricky expansion.

For storage (and more) networking, I’ve elected to go with RDMA networking. Here you have two good choices:

  • iWARP: More affordable and running at 10 GbE – what I’ve illustrated here. Your vendor choice is Chelsio.
  • Infiniband: Amazing speeds (56 Gbps with faster to come) but more expensive. Your vendor choice is Mellanox.

I’ve ruled out RoCE. It’s too damned complicated – just ask Didier Van Hoye (@workinghardinit).
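
Whichever rNIC you go with, it’s worth verifying that RDMA is actually being used once the kit is racked. A quick sketch (adapter names are made up):

  Get-NetAdapterRdma -Name "rNIC1","rNIC2"   # Enabled should be True on both ports
  Get-SmbClientNetworkInterface              # on an SMB client (e.g. a Hyper-V host) the rNICs should show as RDMA capable
  Get-SmbMultichannelConnection              # with traffic flowing, the RDMA-capable paths should be in use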

There will be two servers:

  • 2 x Dell R720: Dual Xeon CPU, 6 GB RAM, rail kits, on-board quad port 1 GbE NICs. The dual CPU gives me scalability to handle lots of hosts/clusters. The 4 x 1 GbE NICs are teamed (dynamic load distribution) for management functionality – see the teaming and MPIO sketch after this list. I’d upgrade the built-in iDRAC Essentials to the Enterprise edition to get the KVM console and virtual media features. A pair of disks in RAID1 configuration are used for the OS in each of the SOFS nodes.
  • 10 x 1 GbE cables: This is to network the 4 x 1 GbE onboard NICs and the iDRAC management port. Who needs KVM when you’ve already bought it in the form of iDRAC?
  • 2 x Chelsio T520-CR: Dual port 10 GbE SFP+ iWARP (RDMA) NICs. These two rNICs are not teamed (teaming is not compatible with RDMA). They will reside on different VLANs/subnets for SMB Multichannel (cluster requirement). The role of these NICs is to converge SMB 3.0 storage and cluster communications. I might even use these networks for backup traffic.
  • 4 x SFP+ cables: These are to connect the two servers to the two SFP+ 10 GbE switches.
  • 2 x LSI 9207-8e Mini-SAS HBAs: These are dual port Mini-SAS adapters that you insert into each server to connect to the JBOD(s). Windows MPIO provides the path failover.
  • 2 x Windows Server Standard Edition: We don’t need virtualization rights on the SOFS nodes. Standard edition includes Failover Clustering.
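
Here’s roughly how the teaming and MPIO bits would be done on each SOFS node – a sketch with made-up NIC names, not a full build script:

  # Team the four on-board 1 GbE NICs for management. Do NOT team the rNICs.
  New-NetLbfoTeam -Name "MgmtTeam" -TeamMembers "NIC1","NIC2","NIC3","NIC4" `
      -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic

  # MPIO provides path failover across the two Mini-SAS ports on the LSI HBA.
  Install-WindowsFeature Multipath-IO
  Enable-MSDSMAutomaticClaim -BusType SAS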

Regarding the JBODs:

Only use devices on the Microsoft HCL for your version of Windows Server. There are hardware features in these “dumb” JBODs that are required. And the testing process will probably lead to the manufacturer tweaking their hardware.

Note that although “any” dual channel SAS drive can be used, some firmware revisions are better than others. DataOn Storage maintain their own HCL of tested HDDs, SSDs and HBAs. Stick with the list that your JBOD vendor recommends.

How many and what kind of drives do you need? That depends. My example is just that: an example.

How many trays do you need? Enough to hold your required number of drives 😀 Really though, if I know that I will scale out to fill 3 trays then I will buy those 3 trays up front. Why? Because 3 trays is the minimum required for tray fault tolerance with 2-way mirror virtual disks (LUNs). Simply going from 1 tray to 2 and then 3 won’t do because data does not relocate.
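
If you do buy the three trays up front, enclosure awareness is what spreads the two mirror copies across them. A hedged example (pool and virtual disk names are made up):

  Get-StorageEnclosure    # should list all three JBOD trays
  New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "VDisk1" `
      -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
      -IsEnclosureAware $true -Size 2TB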

Also remember that if you want tiered storage then there is a minimum number of SSDs (STRONGLY) recommended per tray.

Regarding using SATA drives: DON’T DO IT! The available interposer solution is strongly discouraged, even by DataOn.  If you really need SSD for tiered storage then you really need to pay (through the nose).

Here’s my EXAMPLE configuration:

  • 3 x DataOn Storage DNS-1640D: 24 x 2.5” disk slots in each 2U tray, each with a blank disk caddy for a dual channel SAS SSD or HDD drive. Each has dual boards for Mini-SAS connectivity (A+B for server 1 and A+B for server 2), and A+B connectivity for tray stacking. There is also dual PSU in each tray.
  • 18 x Mini-SAS cables: These are used to connect the LSI cards in the servers to the JBODs and to daisy chain the trays. I’m not looking at a diagram, so 18 might not be the exact number, but the connections are fault tolerant – hence the high count. They’re short cables because the servers sit on top of/under the JBOD trays and the entire storage solution is just 10U in height.
  • 12 x STEC S842E400M2 400GB SSD: Go google the price of these for a giggle! These are not your typical (or even “enterprise”) SSD that you’ll stick in a laptop.  I’m putting 4 into each JBOD, the recommended minimum number of SSDs in tiered storage if doing 2-way mirroring.
  • 48 x Seagate ST900MM0026 900 GB 10K SAS HDD: This gives us the bulk of the storage. There are 20 slots free (after the SSDs) in each JBOD and I’ve put 16 HDDs into each. That gives me loads of capacity and some wiggle room to add more disks of either type.

And that’s the SOFS, servers + JBODs with disks.
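
Before creating the pool, it’s worth a quick sanity check that every drive is visible, poolable, and on the firmware your JBOD vendor recommends – something along these lines:

  Get-PhysicalDisk -CanPool $true | Group-Object MediaType | Format-Table Name, Count   # expect 12 SSD and 48 HDD
  Get-PhysicalDisk | Sort-Object MediaType | Format-Table FriendlyName, MediaType, FirmwareVersion, Size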

Just to remind you: it’s a sample spec. You might have one JBOD, you might have 4, or you might go with the 60 disk slot models. It all depends.

Hyper-V Hosts

My hosting environment will consist of one Hyper-V cluster with 8 nodes. This could be:

  • A few clusters, all sharing the same SOFS
  • One or more clusters with some non-clustered hosts, all sharing the same SOFS
  • Lots of non-clustered hosts, all sharing the same SOFS

One of the benefits of SMB 3.0 storage is that a shared folder is more flexible than a CSV on a SAN LUN. Many hosts and clusters can use the same share, which means that Live Migration can cross what used to be a storage boundary without resorting to Shared-Nothing Live Migration.
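
Remember that Hyper-V hosts access the share as their Active Directory computer accounts, so those accounts (and the host cluster’s account) need full control on both the share and the folder. A sketch, with made-up domain, computer and share names:

  New-SmbShare -Name "VMStore1" -Path "C:\ClusterStorage\Volume1\Shares\VMStore1" `
      -FullAccess 'DEMO\HV01$','DEMO\HV02$','DEMO\HVCluster1$','DEMO\Hyper-V Admins'
  Set-SmbPathAcl -ShareName "VMStore1"   # copies the share permissions onto the NTFS folder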

Regarding host processors, the L2/L3 cache plays a huge role in performance. Try to get as new a processor as possible. And remember, it’s all Intel or all AMD; do not mix the brands.

There are lots of possible networking designs for these hosts. I’m going to use the design that I’ve implemented in the lab at work, and it’s also one that Microsoft recommends. A pair of rNICs (iWARP) will be used for the storage and cluster networking, residing on the same two VLANs/subnets as the cluster/storage networks that the SOFS nodes are on. Then two other NICs are going to be used for host and VM networking. These two NICs could be 1 GbE or 10 GbE or faster, depending on the needs of your VMs. I’ve got 4 pNICs to play with so I will team them.
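
On the hosts that looks something like this – another hedged sketch with made-up names: team the on-board NICs as on the SOFS nodes, then put a Hyper-V switch on the team and add a management vNIC.

  New-NetLbfoTeam -Name "HostTeam" -TeamMembers "NIC1","NIC2","NIC3","NIC4" `
      -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
  New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "HostTeam" `
      -AllowManagementOS $false -MinimumBandwidthMode Weight
  Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ConvergedSwitch"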

    • 8 x Dell R720: Dual Xeon CPU, 256 GB RAM, rail kits, on-board quad port 1 GbE NICs. These are some big hosts. Put lots of RAM in because that’s the cheapest way to scale. CPU is almost never the 1st or even 2nd bottleneck in host capacity. The 4 x 1 GbE NICs are teamed (dynamic load distribution) for VM networking and management functionality. I’d upgrade the built-in iDRAC Essentials to the Enterprise edition to get the KVM console and virtual media features. A pair of disks in RAID1 configuration are used for the management OS.
    • 40 x 1 GbE cables: This is to network the 4 x 1 GbE onboard NICs and the iDRAC management port in each host. Who needs KVM when you’ve already bought it in the form of iDRAC?
    • 8 x Chelsio T520-CR: Dual port 10 GbE SFP+ iWARP (RDMA) NICs. The two rNICs in each host are not teamed (teaming is not compatible with RDMA). They will reside on the same two VLANs/subnets as the SOFS nodes. The role of these NICs is to converge SMB 3.0 storage, SMB 3.0 Live Migration (you gotta see it to believe it!), and cluster communications. I might even use these networks for backup traffic.
    • 16 x SFP+ cables: These are to connect the eight hosts to the two SFP+ 10 GbE switches.
    • 8 x Windows Server Datacenter Edition: The Datacenter edition gives us unlimited rights to install Windows Server into VMs that will run on these licensed hosts, making it the economical choice. Enabling Automatic Virtual Machine Activation in the VMs will simplify VM guest OS activation.
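
On the AVMA point: the Datacenter-licensed host does the activating, and inside each guest you just install Microsoft’s published AVMA key for that guest edition (the VM’s Data Exchange integration service must be enabled). A hedged sketch – the key below is only a placeholder:

  # Run inside the guest OS; replace the placeholder with the published AVMA key for the guest's edition.
  cscript //B "$env:windir\system32\slmgr.vbs" /ipk XXXXX-XXXXX-XXXXX-XXXXX-XXXXX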

There are no HBAs in the Hyper-V hosts; the storage (SOFS) is accessed via SMB 3.0 over the rNICs.
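
Pointing a host at the SOFS is just a matter of using UNC paths, and Live Migration can be told to use SMB (and therefore the rNICs). A rough sketch with made-up names:

  # On each host: allow incoming migrations and use SMB as the transport.
  Enable-VMMigration
  Set-VMHost -VirtualMachineMigrationPerformanceOption SMB

  # Create a VM whose configuration and VHDX live on the SOFS share.
  New-VM -Name "VM01" -Generation 2 -MemoryStartupBytes 4GB `
      -Path "\\Demo-SOFS1\VMStore1" `
      -NewVHDPath "\\Demo-SOFS1\VMStore1\VM01\VM01.vhdx" -NewVHDSizeBytes 60GB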

Other Stuff

Hmm, we’re going to need:

  • 2 x SFP+ 10 GbE Switches with DCB support: Data Center Bridging really is required to do QoS of RDMA traffic – see the QoS sketch after this list. You would need PFC (Priority Flow Control) support if using RoCE for RDMA (not recommended – do either iWARP or Infiniband). Each switch needs at least 12 ports – allow for scalability. For example, you might put your backup server on this network.
  • 2 x 1 GbE Switches: You really need a pair of 48 port top-of-rack switches in this design due to the number of 1 GbE ports being used and the need for growth.
  • Rack
  • PDU
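
On the host side, that QoS might be configured along these lines – a hedged sketch; the switch configuration itself is vendor specific, and the PFC piece only matters if you were doing RoCE:

  Install-WindowsFeature Data-Center-Bridging
  # Tag SMB Direct (port 445) traffic and reserve bandwidth for it.
  New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
  New-NetQosTrafficClass -Name "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
  Enable-NetAdapterQos -Name "rNIC1","rNIC2"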

And there’s probably other bits. For example, you might run a 2-node cluster for System Center and other management VMs. The nodes would have 32-64 GB RAM each. Those VMs could be stored on the SOFS or even on a JBOD that is directly attached to the 2 nodes with Storage Spaces enabled. You might run a server with lots of disk as your backup server. You might opt to run a pair of 1U servers as physical domain controllers for your infrastructure.

I recently priced up a kit, similar to above. It came in much cheaper than the equivalent blade/SAN configuration, which was a nice surprise. Even better was that the SOFS had 3 times more storage included than the SAN in that pricing!

The Effects Of WS2012 R2 Storage Spaces Write-Back Cache

In this post I want to show you the amazing effect that Write-Back Cache can have on the write performance of Windows Server 2012 R2 Storage Spaces.  But before I do, let’s fill in some gaps.

Background on Storage Spaces Write-Back Cache

Hyper-V, like many other applications and services, does something called write-through. In other words, it bypasses the write caches of your physical storage. This is to avoid corruption. Keep this in mind while I move on.

In WS2012 R2, Storage Spaces introduces tiered storage. This allows us to mix one tier of HDD (giving us bulk capacity) with one tier of SSD (giving us performance). Normally a heat map process runs at 1am (as a scheduled task, and therefore customisable) and moves 1 MB slices of files to the hot SSD tier or to the cold HDD tier, based on demand. You can also pin entire files (maybe a VDI golden image) to the hot tier.
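
For example, pinning a golden image to the SSD tier and kicking off the tiering job ahead of the nightly schedule might look like this (file path and tier name are made up; the task name is the default for that 1am job):

  $ssdTier = Get-StorageTier -FriendlyName "SSDTier"
  Set-FileStorageTier -FilePath "C:\ClusterStorage\Volume1\Gold\Win81Gold.vhdx" -DesiredStorageTier $ssdTier
  Get-ScheduledTask -TaskName "Storage Tiers Optimization" | Start-ScheduledTask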

In addition, WS2012 R2 gives us something called Write-Back Cache (WBC). Think about this … SSD gives us really fast write speeds. Write caches are there to improve write performance. Some applications use write-through to avoid storage caches because they need the acknowledgement to mean that the write really went to disk.

What if abnormal increases in write behaviour led to the virtual disk (a LUN in Storage Spaces) using its allocated SSD tier to absorb that spike, and then demoting the data to the HDD tier later on if the slices are measured as cold?

That’s exactly what WBC, a feature of Storage Spaces with tiered storage, does.  A Storage Spaces tiered virtual disk will use the SSD tier to accommodate extra write activity.  The SSD tier increases the available write capacity until the spike decreases and things go back to normal.  We get the effect of a write cache, but write-through still happens because the write really is committed to disk rather than sitting in the RAM of a controller.

Putting Storage Spaces Write-Back Cache To The Test

What does this look like?  I set up a Scale-Out File Server that uses a DataOn DNS-1640D JBOD.  The 2 SOFS cluster nodes are each attached to the JBOD via dual port LSI 6 Gbps SAS adapters.  In the JBOD there is a tier of 2 * STEC SSDs (4-8 SSDs is a recommended starting point for a production SSD tier) and a tier of 8 * Seagate 10K HDDs.  I created 2 * 2-way mirrored virtual disks in the clustered Storage Space:

  • CSV1: 50 GB SSD tier + 150 GB HDD tier with 5 GB write cache size (WBC enabled)
  • CSV2: 200 GB HDD tier with no write cache (no WBC)

Note: I have 2 SSDs (sub-optimal starting point but it’s a lab and SSDs are expensive) so CSV1 has 1 column.  CSV2 has 4 columns.
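
For reference, two virtual disks like these could be created along the following lines – a sketch where the pool and tier names are made up, and the sizes match the description above:

  $ssd = New-StorageTier -StoragePoolFriendlyName "Pool1" -FriendlyName "SSDTier" -MediaType SSD
  $hdd = New-StorageTier -StoragePoolFriendlyName "Pool1" -FriendlyName "HDDTier" -MediaType HDD

  # CSV1: tiered, 2-way mirror, with a 5 GB write-back cache.
  New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "CSV1" `
      -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
      -StorageTiers $ssd,$hdd -StorageTierSizes 50GB,150GB -WriteCacheSize 5GB

  # CSV2: HDD only, 2-way mirror, write-back cache disabled.
  New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "CSV2" `
      -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
      -Size 200GB -WriteCacheSize 0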

Each virtual disk was converted into a CSV, CSV1 and CSV2. A share was created on each CSV and shared as \\Demo-SOFS1\CSV1 and \\Demo-SOFS1\CSV2. Yeah, I like naming consistency 🙂

Then I logged into a Hyper-V host where I have installed SQLIO.  I configured a couple of params.txt files, one to use the WBC-enabled share and the other to use the WBC-disabled share:

  • Param1.TXT: \\demo-sofs1\CSV1\testfile.dat 32 0x0 1024
  • Param2.TXT: \\demo-sofs1\CSV2\testfile.dat 32 0x0 1024

I pre-expanded the test files that would be created in each share by running:

  • "C:Program Files (x86)SQLIOsqlio.exe" -kW -s5 -fsequential -o4 –b64 -F"C:Program Files (x86)SQLIOparam1.txt"
  • "C:Program Files (x86)SQLIOsqlio.exe" -kW -s5 -fsequential -o4 -b64 -F"C:Program Files (x86)SQLIOparam2.txt"

And then I ran a script that ran SQLIO with the following flags to write random 64 KB blocks (similar to VHDX) for 30 seconds:

  • "C:Program Files (x86)SQLIOsqlio.exe" -BS -kW -frandom -t1 -o1 -s30 -b64 -F"C:Program Files (x86)SQLIOparam1.txt"
  • "C:Program Files (x86)SQLIOsqlio.exe" -BS -kW -frandom -t1 -o1 -s30 -b64 -F"C:Program Files (x86)SQLIOparam2.txt"

That gave me my results:

[Screenshot: SQLIO output from the two test runs]

To summarise the results:

The WBC-enabled share ran at:

  • 2258.60 IOs/second
  • 141.16 Megabytes/second

The WBC-disabled share ran at:

  • 197.46 IOs/second
  • 12.34 Megabytes/second

Storage Spaces Write-Back Cache enabled the share on CSV1 to run 11.44 times faster than the non-enhanced share!!! Everyone’s mileage will vary depending on the number of SSDs versus HDDs, assigned cache size per virtual disk, speed of SSD and HDD, number of columns per virtual disk, and your network. But one thing is for sure: with just a few SSDs, I can efficiently cater for brief spikes in write operations by the services that I am storing on my Storage Pool.

Credit: I got help on SQLIO from this blog post on MS SQL Tips by Andy Novick (MVP, SQL Server).