Notes: Continuously Available File Server – Under The Hood

Here are my notes from TechEd NA session WSV410, by Claus Joergensen.  A really good deep session – the sort I love to watch (very slowly, replaying bits over).  It took me 2 hours to watch the first 50 or so minutes 🙂

image

For Server Applications

The Scale-Out File Server (SOFS) is not for direct sharing of user data.  MSFT intend it for:

  • Hyper-V: store the VMs via SMB 3.0
  • SQL Server database and log files
  • IIS content and configuration files

This required a lot of work by MSFT: changing old things and creating new things.

Benefits of SOFS

  • Share management instead of LUNs and Zoning (software rather than hardware)
  • Flexibility: Dynamically reallocate servers in the data centre without reconfiguring network/storage fabrics (SAN fabric, DAS cables, etc)
  • Leverage existing investments: you can reuse what you have
  • Lower CapEx and OpEx than traditional storage

Key Capabilities Unique to SOFS

  • Dynamic scale with active/active file servers
  • Fast failure recovery
  • Cluster Shared Volume cache
  • CHKDSK with zero downtime
  • Simpler management

Requirements

Client and server must be WS2012:

  • SMB 3.0
  • It is for application workloads, not user workloads.

Setup

I’ve done this a few times.  It’s easy enough:

  1. Install the File Server and Failover Clustering features on all nodes in the new SOFS
  2. Create the cluster
  3. Create the CSV(s)
  4. Create the File Server role – a clustered role that has its own CAP (including associated computer object in AD) and IP address.
  5. Create file shares in Failover Clustering Management.  You can manage them in Server Manager.

Simple!

Personally speaking: I like the idea of having just 1 share per CSV.  Keeps the logistics much simpler.  Not a hard rule from MSFT AFAIK.

And here’s the PowerShell for it:

image
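
Roughly, and with made-up node/cluster/share names, the equivalent PowerShell looks something like this:

# Run on each node (or use -ComputerName): add the file server and clustering features
Install-WindowsFeature FS-FileServer, Failover-Clustering -IncludeManagementTools

# Build the cluster, add a CSV, then create the SOFS role and a share
New-Cluster -Name SOFS-Clus1 -Node FS1, FS2 -StaticAddress 172.16.1.50
Add-ClusterSharedVolume -Name "Cluster Disk 1"
Add-ClusterScaleOutFileServerRole -Name SOFS1

New-Item -Path C:\ClusterStorage\Volume1\VMs -ItemType Directory
New-SmbShare -Name VMs -Path C:\ClusterStorage\Volume1\VMs -FullAccess "DEMO\Hyper-V-Hosts"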

CSV

  • Fundamental and required.  It’s a cluster file system that is active/active.
  • Supports most of the NTFS features.
  • Direct I/O support for file data access: whatever node you come in via has direct access to the back-end storage.
  • Caching of CSVFS file data (controlled by oplocks)
  • Leverages SMB 3.0 Direct and Multichannel for internode communication

Redirected IO:

  • Metadata operations – hence not for end user data direct access
  • For data operations when a file is being accessed simultaneously by multiple CSVFS instances.

CSV Caching

  • Windows Cache Manager integration: Buffered read/write I/O is cached the same way as NTFS
  • CSV Block Caching – read only cache using RAM from nodes.  Turned on per CSV.  Distributed cache guaranteed to be consistent across the cluster.  Huge boost for pooled VDI deployments – esp. during boot storm.

CHKDSK

Seamless with CSV.  Scanning is online and separated from repair.  CSV repair is online.

  • Cluster checks once/minute to see if chkdsk spotfix is required
  • Cluster enumerates NTFS $corrupt (contains listing of fixes required) to identify affected files
  • Cluster pauses the affected CSVFS to pend I/O
  • Underlying NTFS is dismounted
  • CHKDSK spotfix is run against the affected files for a maximum of 15 seconds (usually much quicker)  to ensure the application is not affected
  • The underlying NTFS volume is mounted and the CSV namespace is unpaused

The only time an application is affected is if it had a corrupted file.

If it could not complete the spotfix of all the $corrupt records in one go:

  • Cluster will wait 3 minutes before continuing
  • Enables a large set of corrupt files to be processed over time with no app downtime – assuming the apps’ files aren’t corrupted – where obviously they would have had downtime anyway

Distributed Network Name

  • A CAP (client access point) is created for an SOFS.  It’s a DNS name for the SOFS on the network.
  • Security: creates and manages AD computer object for the SOFS.  Registers credentials with LSA on each node

The actual cluster nodes are used in SOFS for client access.  All of their IP addresses are registered with the CAP.

DNN & DNS:

  • The DNN registers the node IPs for all nodes.  A virtual IP is not used for the SOFS (unlike previous versions)
  • The DNN updates DNS when: the resource comes online (and every 24 hours); a node is added to/removed from the cluster; a cluster network is enabled/disabled as a client network; a node’s IP address changes.  Use dynamic DNS … static DNS means a lot of manual work.
  • DNS will round robin the lookups: the response is a sorted list of addresses for the SOFS CAP with IPv6 first and IPv4 second.  Each iteration rotates the addresses within the IPv6 and IPv4 blocks, but IPv6 is always before IPv4.  Crude load balancing.
  • When a client does a lookup it gets the list of addresses, and it will try each address in turn until one responds (see the lookup example after this list).
  • A client will connect to just one cluster node per SOFS.  Can connect to multiple cluster nodes if there are multiple SOFS roles on the cluster.
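
A quick way to see what a client actually gets back from that round robin (assuming the CAP is called SOFS1 in a zone called demo.internal):

# Run it a few times to watch the address order rotate
Resolve-DnsName -Name SOFS1.demo.internal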

SOFS

Responsible for:

  • Online shares on each node
  • Listen to share creations, deletions and changes
  • Replicate changes to other nodes
  • Ensure consistency across all nodes for the SOFS

It can take the cluster a couple of seconds to converge changes across the cluster.

SOFS implemented using cluster clone resources:

  • All nodes run an SOFS clone
  • The clones are started and stopped by the SOFS leader – why am I picturing Homer Simpson in a hammock while Homer Simpson mows the lawn?!?!?
  • The SOFS leader runs on the node where the SOFS resource is actually online – this is just the orchestrator.  All nodes run independently – a move or crash doesn’t affect the shares’ availability.

Admin can constrain what nodes the SOFS role is on – possible owners for the DNN and SOFS resource.  Maybe you want to reserve other nodes for other roles – e.g. asymmetric Hyper-V cluster.

Client Redirection

SMB clients are distributed at connect time by DNS round robin.  No dynamic redistribution.

SMB clients can be redirected manually to use a different cluster node:

image
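
The manual move is done with Move-SmbWitnessClient – for example (client and node names are made up):

# Ask the Witness service to move HOST1's SMB connections to cluster node FS2
Move-SmbWitnessClient -ClientName HOST1 -DestinationNode FS2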

Cluster Network Planning

  • Client Access: clients use the cluster nodes’ client-access-enabled public networks

CSV traffic – IO redirection occurs when:

  • Metadata updates – infrequent
  • CSV is built using mirrored storage spaces
  • A host loses direct storage connectivity

Redirected IO:

  • Prefers cluster networks not enabled for client access
  • Leverages SMB Multichannel and SMB Direct
  • iSCSI Networks should automatically be disabled for cluster use – ensure this is so to reduce latency.

Performance and Scalability

image

image

SMB Transparent Failover

Zero downtime with small IO delay.  Supports planned and unplanned failovers.  Resilient for both file and directory operations.  Requires WS2012 on client and server with SMB 3.0.

image

Client operation replay – if a failover occurs, the SMB client reissues those operations.  Done with certain operations.  Others, like a delete, are not replayed because they are not safe.  The server maintains persistence of file handles.  All write-throughs happen straight away – doesn’t affect Hyper-V.

image

The Resume Key Filter fences off file handle state after failover to prevent other clients grabbing files that the original clients expect to still have access to when they are failed over by the witness process.  Protects against namespace inconsistency – e.g. a file rename in flight.  Basically it deals with handles for activity that might be lost/replayed during failover.

Interesting: when a CSV comes online initially or after failover, the Resume Key Filter locks the volume for a few seconds (less than 3 seconds) for a database (state info stored in a system volume folder) to be loaded from a store.  Namespace protection then blocks all rename and create operations for up to 60 seconds to allow local file handles to be established.  Create is blocked for up to 60 seconds as well to allow remote handles to be resumed.  After all this (up to a total of 60 seconds) all unclaimed handles are released.  Typically, the entire process is around 3-4 seconds.  The 60 seconds is a per-volume configurable timeout.

Witness Protocol (do not confuse with Failover Cluster File Share Witness):

  • Faster client failover.  Normal SMB time out could be 40-45 seconds (TCP-based).  That’s a long timeout without IO.  The cluster informs the client to redirect when the cluster detects a failure.
  • Witness does redirection at client end.  For example – dynamic reallocation of load with SOFS.

Client SMB Witness Registration

  1. Client SMB connects to share on Node A
  2. Witness on client obtains list of cluster members from Witness on Node A
  3. Witness client removes Node A as the witness and selects Node B as the witness
  4. Witness registers with Node B for notification of events for the share that it connected to
  5. The Node B Witness registers with the cluster for event notifications for the share

Notification:

  1. Normal operation … client connects to Node A
  2. Unplanned failure on Node A
  3. Cluster informs Witness on Node B (thanks to registration) that there is a problem with the share
  4. The Witness on Node B notifies the client Witness that Node A went offline (no SMB timeout)
  5. Witness on client informs SMB client to redirect
  6. SMB on client drops the connection to Node A and starts connecting to another node in the SOFS, e.g. Node B
  7. Witness starts all over again to select a new Witness in the SOFS. Will keep trying every minute to get one in case Node A was the only possibility
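
You can check the registrations from any node of the SOFS cluster:

# Shows each connected client and the node acting as its witness
Get-SmbWitnessClient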

Event Logs

All under Applications and Services Logs – Microsoft – Windows:

  • SMBClient
  • SMBServer
  • ResumeKeyFilter
  • SMBWitnessClient
  • SMBWitnessService
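
A quick way to trawl these from PowerShell without remembering the exact channel names (the wildcards below are an assumption – check what -ListLog returns on your build):

# Find the SMB/ResumeKeyFilter channels and read the latest entries from each
Get-WinEvent -ListLog *SMB*, *ResumeKeyFilter* |
    Where-Object { $_.RecordCount -gt 0 } |
    ForEach-Object { Get-WinEvent -LogName $_.LogName -MaxEvents 10 }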

Notes: Microsoft Virtual Machine Converter Solution Accelerator

These are my notes from the TechEd NA recording of WCL321 with Mikael Nystrom.

Virtual Machine Converter (VMC)

VMC is a free-to-download Solution Accelerator that is currently in beta.  Solution Accelerators are glue between 2 MSFT products to provide a combined solution.  MAP, MDT are other examples.  They are supported products by MSFT.

The purpose of the tool is to convert VMware VMs into Hyper-V VMs.  It can be run as standalone or it can be integrated into System Center, e.g. Orchestrator Runbooks.

It offers a GUI and command line interface (CLI).  Nice quick way for VMware customers to evaluate Hyper-V – convert a couple of known workloads and compare performance and scalability.  It is a low risk solution; the original VM is left untouched.

It will uninstall the VMware tools and install the MSFT Integration components.

The solution also fixes drive geometries to sort out possible storage performance issues – basic conversion tools don’t do this.

VMware Support

It supports:

  • vSphere 4.1 and 5.0
  • vCenter 4.1 and 5.0
  • ESX/ESXi

Disk types from VMware supported include:

  • VMFS Flat and Sparse
  • Stream optimised
  • VMDK flat and sparse
  • Single/multi-extent

Microsoft Support

Beta supports Windows VMs:

  • Server 2003 SP2 x64/x86
  • 7 x64/x86
  • Server 2008 R2 x64
  • Server 2008 x64 (RC)
  • Vista x86 (RC)

Correct; no Linux guests can be converted with this tool.

In the beta the Hyper-V support is:

  • Windows Server 2008 R2 SP1 Hyper-V
  • VHD Fixed and Dynamic

In the RC they are adding:

  • Windows Server 2012 and Windows 8 Hyper-V
  • VHDX (support to be added in RTM)

Types of Conversion

  • Hot migration: no downtime to the original VM.  Not what VMC does.  But check the original session recording to see how Mikael uses scripts and other MSFT tools to get one.
  • Warm: start with running VM.  Create a second instance but with service interruption.  This is what VMC does.
  • Cold: Start with offline VM and convert it.

VMC supports Warm and Cold.  But there are ways to use other MSFT tools to do a Hot conversion.

Simplicity

MSFT deliberately made it simple and independent of other tools.  This is a nice strategy.  Many VMware folks want Hyper-V to fail.  Learning something different/new = “complexity”, “Microsoft do it wrong” or “It doesn’t work”.  Keeping it simple defends against this attitude from the stereotypical chronic denier. 

Usage

Run it from a machine.  Connect to ESXi or vCenter machine (username/password).  Pick your VM(s).  Define the destination host/location.  Hit start and monitor.

  1. The VM is snapshotted. 
  2. The VMware Tools are removed. 
  3. The VM is turned off. 
  4. The VMDK is transferred to the VMC machine
  5. The VMDK is converted.  You will need at least twice the size of the VMDK file … plus some space (VHD will be slightly larger).  Remember that Fixed VHD is full size in advance.
  6. The VHD is copied to the Hyper-V host. 
  7. The new Hyper-V VM is built using the VM configuration on the VMware host.
  8. The drive is added to the VM configuration.
  9. The VM is started. 
  10. The Hyper-V integration components are installed.

The conversion will create a Hyper-V VM without a NIC.  This is supposed to prevent a split-brain situation where the source and target VMs are both online at the same time.  I’d rather have a tick box.

If a snapshot is being used … then you will want any services on that VM offline …. file shares, databases, etc.  But offline doesn’t mean powering down the VM …. we need it online for the VMware tools removal.

The Wizard

A VM must have an FQDN to be converted.  Install the VMware tools and that makes the VM convertible.  This is required to make it possible to … uninstall the VMware tools 🙂

It will ask for your credentials to log into the guest OS for the VMware tools uninstall. 

Maybe convert the VM on an SSD to speed things up.

TechEd Europe 2012 Day 1 Keynote Notes #TEE12

Great that TechEd is back in Amsterdam.  I wish I was there.  Berlin is a nice city, but the Messe is a hole.

Brad Anderson

Mentions the Yammer acquisition, Windows Phone 8, and the new Surface tablets.  He’s talking about change.  Is it chaos or is it opportunity?  Pitching the positive spin of innovation in change.

Think of storage, compute, and network as one entity, and manage it as such.  In other words: Windows Server 2012, System Center 2012, and Azure are integrated into a single solution – you pick and choose the ingredients that you want in the meal.

Patrick Lownds has tweeted a great word: convergence.  This is beyond hybrid cloud; this is converged clouds.

Design with the knowledge that failures happen.  That’s how you get uptime and continuous availability of the service.  Automation of process allows scalability.

Hyper-V: “no workload that you cannot virtualise and run on Hyper-V”.  We’re allegedly going to see the largest ever publicly demonstrated virtual machine.

Jeff Woolsey

The energetic principal PM for Windows Server virtualisation.  “Extend to the cloud on your terms”.  Targeted workloads that were not virtualisable: dozens of cores, hundreds of GB of RAM, massive IOPS requirements.  This demo (40 SSDs) is the same as 10 full sized fully populated racks of traditional SAN disk.  MSFT are using SSD in this demo.  VMware: up to 300,000 IOPS.  Hyper-V now beats what it did at TechEd USA: over 1,000,000 (1 million) IOPS from a Hyper-V VM.

Iometer

Now we see the Cisco Nexus 1000v Hyper-V Switch extension (not a switch replacement like in VMware).  Shows off easy QoS policy deployment.

PowerShell:  Over 2400 cmdlets in WS2012.  Now we’re going to see Hyper-V Replica management via System Center 2012 Orchestrator.  A Site Migration runbook.  It verifies source/destination, and then it brings up the VMs in the target location in the order defined by the runbook.  And we see lots of VMs power up.

Once again, we see System Center 2012 App Controller integrating with a “hosting company” and enabling additional VM hosting capacity beyond the private cloud.

I’m wrapping up here … looks like the keynote is mostly the same as the USA one (fine for 99% of the audience who aren’t hooked to their Twitter/RSS like myself) and I have to head to work.

This keynote recording will be available on Channel 9, and the USA one is already there.  Enjoy!

Windows Server 2012 NIC Teaming and Multichannel

Notes from TechEd NA 2012 WSV314:

image

Terminology

  • It is a Team, not NIC bonding, etc.
  • A team is made of Team Members
  • Team Interfaces are the virtual NICs that can connect to a team and have IP stacks, etc.  You can call them tNICs to differentiate them from vNICs in the Hyper-V world.

image

Team Connection Modes

Most people don’t know the teaming mode they select when using OEM products.  MSFT are clear about what teaming does under the covers.  Connection mode = how do you connect to the switch?

  • Switch Independent can be used where the switch doesn’t need to know anything about the team.
  • Switch dependent teaming is when the switch does need to know something about the team. The switch decides where to send the inbound traffic.

There are 2 switch dependent modes:

  • LACP (Link Aggregation Control Protocol) is where the host and switch agree on who the team members are. IEEE 802.1ax
  • Static Teaming is where you configure it on the switch.

image

Load Distribution Modes

You also need to know how you will spread traffic across the team members in the team.

1) Address Hash comes in 3 flavours:

  • 4-tuple (the default): Uses RSS on the TCP/UDP ports. 
  • 2-tuple: If the ports aren’t available (encrypted traffic such as IPsec) then it’ll go to 2-tuple where it uses the IP address.
  • MAC address hash: If not IP traffic, then MAC addresses are hashed.

2) We also have Hyper-V Port, where it hashes the port number on the Hyper-V switch that the traffic is coming from.  Normally this equates to per-VM traffic.  No distribution of traffic.  It maps a VM to a single NIC.  If a VM needs more pipe than a single NIC can handle then this won’t be able to do it.  Shouldn’t be a problem because we are consolidating after all.

Maybe create a team in the VM?  Make sure the vNICs are on different Hyper-V Switches. 

SR-IOV

Remember that SR-IOV bypasses the host stack and therefore can’t be teamed at the host level.  The VM bypasses it.  You can team two SR-IOV enabled vNICs in the guest OS for LBFO.

Switch Independent – Address Hash

Outbound traffic in Address Hashing will spread across NICs. All inbound traffic is targeted at a single inbound MAC address for routing purposes, and therefore only uses 1 NIC.  Best used when:

  • Switch diversity is a concern
  • Active/Standby mode
  • Heavy outbound but light inbound workloads

Switch Independent – Hyper-V Port

All traffic from each VM is sent out on that VM’s physical NIC or team member.  Inbound traffic also comes in on the same team member.  So we can maximise NIC bandwidth.  It also allows for maximum use of VMQs for better virtual networking performance.

Best for:

  • Number of VMs well exceeds number of team members
  • You’re OK with VM being restricted to bandwidth of a single team member

Switch Dependent Address Hash

Sends on all active members by using one of the hashing methods.  Receives on all ports – the switch distributes inbound traffic.  No association between inbound and outbound team members.  Best used for:

  • Native teaming for maximum performance and switch diversity is not required.
  • Teaming under the Hyper-V switch when a VM needs to exceed the bandwidth limits of a single team member.  Not as efficient with VMQ because we can’t predict the traffic.

Best performance for both inbound and outbound.

Switch Dependent – Hyper-V Port

Sends on all active members using the hashed port – 1 team member per VM.  Inbound traffic is distributed by the switch on all ports, so there is no correlation between inbound and outbound.  Best used when:

  • When number of VMs on the switch well exceeds the number of team members AND
  • You have a policy that says you must use switch dependent teaming.

When using Hyper-V you will normally want to use Switch Independent & Hyper-V Port mode. 

When using native physical servers you’ll likely want to use Switch Independent & Address Hash.  Unless you have a policy that can’t tolerate a switch failure.

Team Interfaces

There are different ways of interfacing with the team:

  • Default mode: all traffic from all VLANs is passed through the team
  • VLAN mode: Any traffic that matches a VLAN ID/tag is passed through.  Everything else is dropped.

Inbound traffic only passes through to one team interface.

image

The only supported configuration for Hyper-V is shown above: Default mode passing through all traffic to the Hyper-V Switch.  Do all the VLAN tagging and filtering on the Hyper-V Switch.  You cannot mix other interfaces with this team – the team must be dedicated to the Hyper-V Switch.  REPEAT: This is the only supported configuration for Hyper-V.

A new team has one team interface by default. 

Any team interfaces created after the initial team creation must be VLAN mode team interfaces (bound to a VLAN ID).  You can delete these team interfaces.

Get-NetAdapter: Get the properties of a team interface

Rename-NetAdapter: rename a team interface

Team Members

  • Any physical ETHERNET adapter with a Windows Logo (for stability reasons and promiscuous mode for VLAN trunking) can be a team member.
  • Teaming of InfiniBand, Wifi, WWAN not supported.
  • Teams made up of teams not supported.

You can have team members in active or standby mode.

Virtual Teams

Supported if:

  • No more than 2 team members in the guest OS team

Notes:

  • Intended for SR-IOV NICs but will work without it.
  • Both vNICs in the team should be connected to different virtual switches on different physical NICs

If you try to team a vNIC that is not on an External switch, it will show up fine and OK until you try to team it.  Teaming will shut down the vNIC at that point. 

You also have to allow teaming in a vNIC in Advanced Properties – Allow NIC teaming.  Do this for each of the VM’s vNICs.  Without this, failover will not succeed. 
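
The same setting can be made from the host with PowerShell – a one-liner, assuming a VM called VM01:

# Enables the "Allow teaming" advanced property on every vNIC of VM01
Set-VMNetworkAdapter -VMName VM01 -AllowTeaming On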

PowerShell CMDLETs for Teaming

The UI is actually using POSH under the hood.  You can use the NIC Teaming UI to remotely manage/configure a server using RSAT for Windows 8.  WARNING: Your remote access will need to run over a NIC that you aren’t altering because you would lose connectivity.

image
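
For reference, creating and inspecting a team from PowerShell looks roughly like this (NIC and team names are examples; the mode/algorithm combination is the Switch Independent / Hyper-V Port one recommended above for Hyper-V):

# Create a two-member team for a Hyper-V switch
New-NetLbfoTeam -Name HostTeam -TeamMembers "NIC1", "NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort

# View the team and its default team interface (tNIC)
Get-NetLbfoTeam
Get-NetLbfoTeamNic -Team HostTeam

# For a native (non-Hyper-V) team you could add an extra VLAN-bound tNIC
Add-NetLbfoTeamNic -Team HostTeam -VlanID 101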

Supported Networking Features

NIC teaming works with almost everything:

image

TCP Chimney Offload, RDMA and SR-IOV bypass the stack so obviously they cannot be teamed in the host.

Limits

  • 32 NICs in a team
  • 32 teams
  • 32 team interfaces in a team

That’s a lot of quad port NICs.  Good luck with that! 😉

SMB Multichannel

An alternative to a team in an SMB 3.0 scenario.  Can use multiple NICs with same connectivity, and use multiple cores via NIC RSS to have simultaneous streams over a single NIC (RSS) or many NICs (teamed, not teamed, and also with RSS if available).  Basically, leverage more bandwidth to get faster SMB 3.0 throughput.

Without it, a 10 GbE NIC would only be partly used by SMB – a single CPU core trying to transmit.  RSS makes it multi-threaded/multi-core, and therefore many connections are used for the data transfer.

Remember – you cannot team RDMA.  So another case for getting an LBFO effect is to use SMB Multichannel … or I should say “use” … SMB 3.0 turns it on automatically if multiple paths are available between client and server.

SMB 3.0 is NUMA aware.

Multichannel will only use NICs of same speed/type.  Won’t see traffic spread over a 10 GbE and a 1 GbE NIC, for example, or over RDMA-enabled and non-RDMA NICs. 

In tests, the throughput on RSS enabled 10 GbE NICs (1, 2, 3, and 4 NICs), seemed to grow in a predictable near-linear rate.

SMB 3.0 uses a shortest queue first algorithm for load balancing – basic but efficient.

SMB Multichannel and Teaming

Teaming allows for faster failover.  MSFT recommend teaming where applicable.  Address hash port mode with Multichannel can be a nice solution.  Multichannel will detect a team and create multiple connections over the team.

RDMA

If RDMA is possible on both client and server then SMB 3.0 switches over to SMB Direct.  Net monitoring will see negotiation, and then … “silence” for the data transmission.  Multichannel is supported across single or multiple NICs – no NIC teaming, remember!

Won’t Work With Multichannel

  • Single non-RSS capable NIC
  • Different type/speed NICs, e.g. 10 GbE RDMA favoured over 10 GbE non-RDMA NIC
  • Wireless can be failed from but won’t be used in multi-channel

Supported Configurations

Note that Multichannel over a team of NICs is favoured over multichannel over the same NICs that are not in a team.  Added benefits of teaming (types, and fast failover detection).  This applies, whether the NICs are RSS capable or not.  And the team also benefits non-SMB 3.0 traffic.

image

Troubleshooting SMB Multichannel

image
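
The in-box SMB cmdlets are the place to start when checking whether Multichannel is actually kicking in:

Get-SmbClientNetworkInterface    # what the SMB client thinks of each NIC (RSS/RDMA capable, speed)
Get-SmbServerNetworkInterface    # the same view from the file server side
Get-SmbMultichannelConnection    # the channels actually established per server
Get-SmbConnection                # dialect (should be 3.00) and share for each connection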

Plenty to think about there, folks!  Where does it apply in Hyper-V?

  • NIC teaming obviously applies.
  • Multichannel applies in the cluster: redirected IO over the cluster communications network
  • Storing VMs on SMB 3.0 file shares

Windows Server 2012 High-Performance, Highly-Available Storage Using SMB

Notes from TechEd NA 2012 session WSV303:

image

One of the traits of the Scale-Out File Server is Transparent Failover for server-server apps such as SQL Server or Hyper-V.  During a host power/crash/network failure, the IO is paused briefly and flipped over to an alternative node in the SOFS.

image

Transparent Failover

The Witness Service and state persistence enable Transparent Failover in SMB 3.0 SOFS.  The Witness plays a role in unplanned failover.  Instead of waiting for a TCP timeout (40 seconds, causing application issues), it speeds up the process.  It tells the client that the server it was connected to has failed and that it should switch to a different server in the SOFS.

image

NTFS Online Scan and Repair

  • CHKDSK can take hours/days on large volumes.
  • Scan done online
  • Repair is only done when the volume is offline
  • Zero downtime with CSV with transparent repair

Clustered Hardware RAID

Designed for when using JBOD, probably with Storage Spaces.

image

Resilient File System (ReFS)

A new file system as an alternative to NTFS (which is very old now).  CHKDSK is not needed at all.  This will become the standard file system for Windows over the course of the next few releases.

image

Comparing the Performance of SMB 3.0

Wow! SMB 3.0 over 1 Gbps network connection achieved 98% of DAS performance using SQL in transactional processing.

image

If there are multiple 1 Gbps NICs then you can use SMB Multichannel which gives aggregated bandwidth and LBFO.  And go extreme with SMB Direct (RDMA) to save CPU.

VSS and SMB 3.0 File Shares

You need a way to support remote VSS snapshots for SMB 3.0 file shares if supporting Hyper-V.  We can do app consistent snapshots of VMs stored on a WS2012 file server.  Backup just works as normal – backing up VMs on the host.

image

  1. Backup talks to backup agent on host. 
  2. Hyper-V VSS Writer reaches into all the VMs and ensures everything is consistent. 
  3. VSS engine is then asked to do the snapshot.  In this case, the request is relayed to the file server where the VSS snapshot is done. 
  4. The path to the snapshot is returned to the Hyper-V host and that path is handed back to the backup server. 
  5. The backup server can then choose to either grab the snapshot from the share or from the Hyper-V host.

Data Deduplication

Dedup is built into Windows Server 2012.  It is turned on per-volume.  You can exclude folders/file types.  By default files not modified in 5 days are deduped – SO IT DOES NOT APPLY TO RUNNING VMs.  It identifies redundant data, compresses the chunks, and stores them.  Files are deduped automatically and reconstituted on the fly.

image

REPEAT: Deduplication is not intended for running virtual machines.
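
Where it does apply (file shares, libraries, software repositories), turning it on is only a few lines of PowerShell – a sketch, assuming the data sits on an E: volume:

Install-WindowsFeature FS-Data-Deduplication

Enable-DedupVolume -Volume E:
Set-DedupVolume -Volume E: -MinimumFileAgeDays 5     # the 5-day policy mentioned above
Start-DedupJob -Volume E: -Type Optimization         # kick off an optimisation run now
Get-DedupStatus                                      # check the savings per volume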

Unified Storage

The iSCSI target is now built into WS2012 and can provide block storage for Hyper-V before WS2012. ?!?!?!  I’m confused.  Can be used to boot Hyper-V hosts – probably requiring iSCSI NICs with boot functionality.

image
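
The target is manageable with PowerShell too – a rough sketch with made-up names (treat the exact parameters as a starting point and check Get-Help on your build):

# Feature + a target restricted to one initiator IQN
Install-WindowsFeature FS-iSCSITarget-Server
New-IscsiServerTarget -TargetName HyperVBoot -InitiatorIds "IQN:iqn.1991-05.com.microsoft:host1.demo.internal"

# Create a VHD-backed LUN and map it to the target
New-IscsiVirtualDisk -Path C:\iSCSIVirtualDisks\Host1Boot.vhd -Size 40GB
Add-IscsiVirtualDiskTargetMapping -TargetName HyperVBoot -Path C:\iSCSIVirtualDisks\Host1Boot.vhd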

Building a Highly Available Failover Cluster Solution With WS2012 From The Ground Up

Some notes taken from TechEd NA 2012 WSV324:

image

I won’t blog too much from this session.  I’ve more than covered a lot of it in the recent months.

Cluster Validation Improvements

  • Faster storage validation
  • Includes Hyper-V cluster validation tests
  • Granular control to validate a specific LUN
  • Verification of CSV requirements
  • Awareness of replicated hardware for multi-site clusters

CSV Improvements

  • No external authentication dependencies for improved performance and resiliency
  • Multi-subnet support (multi-site clusters)

Asymmetric Cluster

image

BitLocker on CSV

This will get the BitLocker status of the CSV:

manage-bde -status C:\ClusterStorage\Volume1

This will enable BitLocker on a CSV:

manage-bde -on C:\ClusterStorage\Volume1 -RecoveryPassword

You get a warning if you try to run this with the CSV online.  You need the volume to be offline (Turn On Maintenance Mode under More Actions when you right-click the CSV) … so plan this in advance.  Otherwise be ready to do lots of Storage Live Migration or have VM downtime. 

NOTE! A recovery password is created for you.  Make sure you record this safely in a place independent from the cluster that is secure and reliable.

Get the status again to check the progress.

It’s critically important that you add the security descriptor for the cluster so that the cluster can use the now encrypted CSV.  Get that by:

get-cluster

Say that returns the name HV-Cluster1.

Now run the following, and note the $ at the end of the security descriptor (indicating computer account for the cluster):

manage-bde -protectors -add C:\ClusterStorage\Volume1 -sid HV-Cluster1$

That can be done while the CSV is encrypting.  Once encrypted, you can take it out of maintenance mode.

AD Integration

  • You now can intelligently place Cluster Name Objects (CNO) and Virtual Computer Objects (VCO) in desired OUs. 
  • AD-less Cluster Bootstrapping allows you to run/start a cluster with no physical domain controllers.  This gets justifiable applause 🙂 It’s great news for branch offices and SMEs.
  • Repair action to automatically recreate VCOs
  • Improved logging and diagnostics
  • RODC support for DMZ and branch office deployments

Node Vote Weight

  • In a stretch or multi-site cluster, you can configure which nodes have votes in determining quorum.
  • Configurable with 1 or 0 votes.  All nodes have a vote by default.  Does not apply in Disk Only quorum model.
  • In the multi-site cluster model, this allows the primary site to have the majority of votes.

Dynamic Quorum

  • It is now the default quorum choice in WS2012 Failover Clustering
  • Works in all quorum models except Disk Only Quorum.
  • Quorum changes dynamically based on nodes in active membership
  • Numbers of votes required for quorum changes as nodes go inactive
  • Allows the cluster to stay operational with >50% node count failure

Thoughts:

  • I guess it is probably useful for extremely condensed cluster dynamic power optimisation (VMM 2012)
  • Also should enable cluster to reconfigure itself when there are node failures

Configuration:

EnableDynamicQuorum – a cluster common property, edit it to enable dynamic quorum

DynamicWeight – a node private property, view it to see a node’s current vote weight
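
From PowerShell the same settings show up as cluster and node properties – a quick look (the property names here are as I understand them, so verify on your build):

(Get-Cluster).DynamicQuorum                          # 1 = dynamic quorum enabled (the WS2012 default)
Get-ClusterNode | Format-Table Name, NodeWeight, DynamicWeight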

Cluster Scheduled Tasks

3 types:

  • Cluster wide: On all nodes
  • Any node: On a random node
  • Resource specific: On the node that owns the resource

PowerShell:

  • Register-ClusteredScheduledTask
  • Unregister-ClusteredScheduledTask
  • Set-ClusteredScheduledTask
  • Get-ClusteredScheduledTask
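
A minimal example of registering a cluster-wide task (the task name, script path, and schedule are all made up):

$action  = New-ScheduledTaskAction -Execute "C:\Scripts\Cleanup.cmd"
$trigger = New-ScheduledTaskTrigger -Daily -At 3am

# ClusterWide runs on all nodes; AnyNode and ResourceSpecific are the other TaskTypes
Register-ClusteredScheduledTask -TaskName NightlyCleanup -TaskType ClusterWide -Action $action -Trigger $trigger

Get-ClusteredScheduledTask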

Windows Server 2012 Cluster-In-A-Box, RDMA, And More

Notes taken from TechEd NA 2012 session WSV310:

image

Volume Platform for Availability

Huge amount of requests/feedback from customers.  MSFT spent a year focusing on customer research (US, Germany, and Japan) with many customers of different sizes.  They came up with Continuous Availability – zero data loss and transparent failover – to succeed High Availability.

Targeted Scenarios

  • Business in a box Hyper-V appliance
  • Branch in a box Hyper-V appliance
  • Cloud/Datacenter high performance storage server

What’s Inside A Cluster In A Box?

It will be somewhat flexible.  MSFT giving guidance on the essential components so expect variations.  MSFT noticed people getting cluster networking wrong so this is hardwired in the box.  Expansion for additional JBOD trays will be included.  Office level power and acoustics will expand this solution into the SME/retail/etc.

image

Lots of partners can be announced and some cannot yet:

  • HP
  • Fujitsu
  • Intel
  • LSI
  • Xio
  • And more

More announcements to come in this “wave”.

Demo Equipment

They show some sample equipment from two Original Design Manufacturers (they design and sell into OEMs for rebranding).  One with SSD and InfiniBand is shown.  A more modest one is shown too:

image

That bottom unit is a 3U cluster in a box with 2 servers and 24 SFF SAS drives.  It appears to have additional PCI expansion slots in a compute blade.  We see it in a demo later and it appears to have JBOD (mirrored Storage Spaces) and 3 cluster networks.

RDMA aka SMB Direct

Been around for quite a while but mostly restricted to the HPC space.  WS2012 will bring it into wider usage in data centres.  I wouldn’t expect to see RDMA outside of the data centre too much in the coming year or two.

RDMA enabled NICs also known as R-NICs.  RDMA offloads SMB CPU processing in large bandwidth transfers to dedicated functions in the NIC.  That minimises CPU utilisation for huge transfers.  Reduces the “cost per byte” of data transfer through the networking stack in a server by bypassing most layers of software and communicating directly with the hardware.  Requires R-NICs:

  • iWARP: TCP/IP based.  Works with any 10 GbE switch.  RDMA traffic routable.  Currently (WS2012 RC) limited to 10 Gbps per NIC port.
  • RoCE (RDMA over Converged Ethernet): Works with high-end 10/40 GbE switches.  Offers up to 40 Gbps per NIC port (WS2012 RC).  RDMA not routable via existing IP infrastructure.  Requires DCB switch with Priority Flow Control (PFC).
  • InfiniBand: Offers up to 54 Gbps per NIC port (WS2012 RC). Switches typically less expensive per port than 10 GbE.  Switches offer 10/40 GbE uplinks. Not Ethernet based.  Not routable currently.  Requires InfiniBand switches.  Requires a subnet manager on the switch or on the host.

RDMA can also be combined with SMB Multichannel for LBFO.

image

Applications (Hyper-V or SQL Server) do not need to change to use RDMA; the decision to use SMB Direct is made at run time.

Partners & RDMA NICs

  • Mellanox ConnectX-3 Dual Port Adapter with VPI InfiniBand
  • Intel 10 GbE iWARP Adapter For Server Clusters NE020
  • Chelsio T3 line of 10 GbE Adapters (iWARP), have 2 and 4 port solutions

We then see a live demo of 10 Gigabytes (not Gigabits) per second over Mellanox InfiniBand.  They pull 1 of the 2 cables and throughput drops to around 6,000 Megabytes per second.  Pop the cable back in and flow returns to normal.  CPU utilisation stays below 5%.

Configurations and Building Blocks

  • Start with single Cluster in a Box, and scale up with more JBODs and maybe add RDMA to add throughput and reduce CPU utilisation.
  • Scale horizontally by adding more storage clusters.  Live Migrate workloads, spread workloads between clusters (e.g. fault tolerant VMs are physically isolated for top-bottom fault tolerance).
  • DR is possible via Hyper-V Replica because it is storage independent.
  • Cluster-in-a-box could also be the Hyper-V cluster.

This is a flexible solution.  Manufacturers will offer new refined and varied options.  You might find a simple low cost SME solution and a more expensive high end solution for data centres.

Hyper-V Appliance

This is a cluster in a box that is both Scale-Out-File Server and Hyper-V cluster.  The previous 2 node Quanta solution is set up this way.  It’s a value solution using Storage Spaces on the 24 SFF SAS drives.  The space are mirrored for fault tolerance.  This is DAS for the 2 servers in the chassis.

What Does All This Mean?

SAN is no longer your only choice, whether you are SME or in the data centre space.  SMB Direct (RDMA) enables massive throughput.  Cluster-in-a-Box enables Hyper-V appliances and Scale-Out File Servers in ready made kits, that are continuously available and scalable (up and out).

Cluster Shared Volumes Reborn in WS2012: Deep Dive

Notes from TechEd North America 2012 session WSV430:

image

New in Windows Server 2012

  • File services is supported on CSV for application workloads.  Can leverage SMB 3.0 and be used for transparent failover Scale-Out File Server (SOFS)
  • Improved backup/restore
  • Improved performance with block level I/O redirection
  • Direct I/O during backup
  • CSV can be built on top of Storage Spaces

New Architecture

  • Antivirus and backup filter drivers are now compatible with CSV.  Many are already compatible.
  • There is a new distributed application consistent backup infrastructure.
  • ODX and spot fixing are supported
  • BitLocker is supported on CSV
  • AD is no longer a dependency (!?) for improved performance and resiliency.

Metadata Operations

Lightweight and rapid.  Relatively infrequent with VM workloads.  Require redirected I/O.  Includes:

  • VM creation/deletion
  • VM power on/off
  • VM mobility (live migration or storage live migration)
  • Snapshot creation
  • Extending a dynamic VHD
  • Renaming a VHD

Parallel metadata operations are non-disruptive.

Flow of I/O

  • For non-metadata IO: Data sent to the CSV Proxy File System.  It then routes to the disk via CSV VolumeMgr via direct IO.
  • For metadata redirected IO (see above): We get SMB redirected IO on non-orchestrator (not the CSV coordinator/owner for the CSV in question) nodes.  Data is routed via SMB redirected IO by the CSV Proxy File System to the orchestrator via the cluster communications network so the orchestrator can handle the activity.

image

Interesting Note

You can actually rename C:\ClusterStorage\Volume1 to something like C:\ClusterStorage\CSV1.  That’s supported by CSV.  I wonder if things like System Center support this?
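
It really is just a rename of the mount point folder, e.g.:

Rename-Item -Path C:\ClusterStorage\Volume1 -NewName CSV1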

Mount Points

  • Used custom reparse points in W2008 R2.  That meant backup needed to understand these.
  • Switched to standard Mount Points in WS2012.

Improved interoperability with:

  • Performance counters
  • OpsMgr (never had free space monitoring before)
  • Free space monitoring (speak of the devil!)
  • Backup software can understand mount points.

CSV Proxy File System

Appears as CSVFS instead of NTFS in Disk Management.  NTFS under the hood.  Enables applications and admins to be CSV aware.

Setup

No opt-in any more.  CSV enabled by default.  Appears in normal storage node in FCM.  Just right click on available storage to convert to CSV.

Resiliency

CSV enables fault-tolerant file handles.  Storage path fault tolerance, e.g. HBA failure.  When a VM opens a VHD, it gets a virtual file handle that is provided by CSVFS (metadata operation).  The real file handle is opened under the covers by CSV.  If the HBA that the host is using to connect the VM to the VHD fails, then the real file handle needs to be recreated.  This new handle is mapped to the existing virtual file handle, and therefore the application (the VM) is unaware of the outage.  We get transparent storage path fault tolerance.  The fault tolerant SAN connectivity (remember that the direct connection via HBA has failed and would otherwise have failed the VM’s VHD connection) is re-routed by Redirected IO via the Orchestrator (CSV coordinator), which “proxies” the storage IO to the SAN.

image

If the Coordinator node fails, IO is queued briefly and the orchestration role fails over to another node.  No downtime in this brief window.

If the private cluster network fails, the next available network is used … remember you should have at least 2 private networks in a CSV cluster … the second private network would be used in this case.

Spot-Fix

  • Scanning is separated from disk repair.  Scanning is done online.
  • Spot-fixing requires the volume to be offline only for the repair.  It is based on the number of errors to fix rather than the size of the volume … could be 3 seconds.
  • This offline does not cause the CSV to go “offline” for applications (VMs) using that CSV being repaired.  CSV proxy file system virtual file handles appear to be maintained.

This should allow for much bigger CSVs without chkdsk concerns.

CSV Block Cache

This is a distributed write-through cache.  Un-buffered IO is targeted.  This is excluded by the Windows Cache Manager (buffered IO only).  The CSV block cache is consistent across the cluster.

This has a very high value for pooled VDI VM scenario.  Read-only (differencing) parent VHD or read-write differencing VHDs.

You configure the memory for the block cache on a cluster level.  512 MB per host appears to be the sweet spot.  Then you enable CSV block cache on a per CSV basis … focus on the read-performance-important CSVs.
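
The configuration is done with cluster properties – a sketch based on the property names as I noted them (verify them with Get-ClusterParameter on your build):

# Cluster-wide cache size in MB (512 MB per node is the suggested sweet spot)
(Get-Cluster).SharedVolumeBlockCacheSizeInMB = 512

# Then enable the cache per CSV, e.g. the CSV that holds the pooled VDI parent disks
Get-ClusterSharedVolume "Cluster Disk 1" | Set-ClusterParameter CsvEnableBlockCache 1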

Less Redirected IO

  • New algorithm for detecting type of redirected IO required
  • Uses oplocks as a distributed locking mechanism to determine if IO can go via the direct path

Comparing speeds:

  • Direct IO: Block level IO performance parity
  • Redirected IO: Remote file system (SMB 3.0)  performance parity … can leverage multichannel and RDMA

Block Level Redirection

This is new in WS2012 and provides a much faster redirected IO during storage path failure and redirection.  It is still using SMB.  Block level redirection goes directly to the storage subsystem and provides 2x disk performance.  It bypasses the CSV subsystem on the coordinator node – SMB redirected IO (metadata) must go through this.

image

You can speed up redirected IO using SMB 3.0 features such as Multichannel (many NICs and RSS on single NICs) and RDMA.  With all the things turned on, you should get 98% of the performance of direct IO via SMB 3.0 redirected IO – I guess he’s talking about Block Level Redirected IO.

VM Density per CSV

  • Orchestration is done on a cluster node (parallelized) which is more scalable than file system orchestration.
  • Therefore there are no limits placed on this by CSV, unlike in VMFS.
  • How many IOPS can your storage handle, versus how many IOPS do your VMs need?
  • Direct IO during backup also simplifies CSV design.

If your array can handle it, you could (and probably won’t) have 4,000 VMs on a 64 node cluster with a single CSV.

CSV Backup and Restore Enhancements

  • Distributed snapshots: VSS based application consistency.  Created across the cluster.  Backup applications query the CSV to do an application consistent backup.
  • Parallel backups can be done across a cluster: Can have one or more concurrent backups on a CSV.  Can have one or more concurrent CSV backups on a single node.
  • CSV ownership does not change.  There is no longer a need for redirected IO during backup.
  • Direct IO mode for software snapshots of the CSV – when there is no hardware VSS provider.
  • Backup no longer needs to be CSV aware.

Summary: We get a single application consistent backup snapshot of multiple VMs across many hosts using a single VSS snapshot of the CSV.  The VSS provider is called on the “backup node” … any node in the cluster.  This is where the snapshot is created.  Will result in less data being transmitted, fewer snapshots, quicker backups.

How a CSV Backup Works in WS2012

  1. Backup application talks to the VSS Service on the backup node
  2. The Hyper-V writer identifies the local VMs on the backup node
  3. Backup node CSV writer contacts the Hyper-V writer on the other hosts in cluster to gather metadata of files being used by VMs on that CSV
  4. CSV Provider on the backup node contacts the Hyper-V Writer to quiesce the VMs
  5. Hyper-V Writer on the backup node also quiesces its own VMs
  6. VSS snapshot of the entire CSV is created
  7. The backup tool can then backup the CSV via the VSS snapshot

image

More VMware Compete Wins For Hyper-V

VMware made a cute video to defend themselves against Windows Server 2012 Hyper-V.  But MSFT continues to hand out a GTA IV style baseball beat down at TechEd.

This post would have been impossible without the tweeted pictures by David Davis at http://www.vmwarevideos.com

General Feature Comparison

Does your business have an IT infrastructure so you can play, or to run applications?  What features have you got to improve those services?

Capability | vSphere Free | vSphere 5.0 Ent+ | WS2012 Hyper-V
Incremental backups | No | Yes | Yes
Inbox VM replication | No | No | Yes
NIC teaming | Yes | Yes | Yes
Integrated High Availability | No | Yes | Yes
Guest OS Application Monitoring | N/A | No | Yes
Failover Prioritization | N/A | Yes | Yes
Affinity & Anti-Affinity Rules | N/A | Yes | Yes
Cluster-Aware Updating | N/A | Yes | Yes

So Hyper-V has more application integrations.

Live Migration

Capability | vSphere Free | vSphere 5.0 Ent+ | WS2012 Hyper-V
VM Live Migration | No | Yes | Yes
1 GbE Simultaneous Live Migrations | N/A | 4 | Unlimited
10 GbE Simultaneous Live Migrations | N/A | 8 | Unlimited
Live Storage Migration | No | Yes | Yes
Shared Nothing Live Migration | No | No | Yes
Network Virtualisation | No | Partner | Yes

Shared-nothing Live Migration is actually a big deal.  We know that 33% of businesses don’t cluster their hosts, and another 33% have a mix of clustered and non-clustered hosts.  Shared-Nothing Live Migration enables mobility across these platforms.  Flexibility is the #2 reason why people virtualise (see Network Virtualisation later on).

Clustering

Can you cluster hosts, and if so, how many?  How many VMs can you put on a host cluster?  Apps require uptime too, because VMs need to be patched, rebooted, and occasionally crash.

Capability | vSphere Free | vSphere 5.0 Ent+ | WS2012 Hyper-V
Nodes/Cluster | N/A | 32 | 64
VMs/Cluster | N/A | 3000 | 4000
Max Size iSCSI Guest Cluster | N/A | 0 | 64 Nodes
Max Size Fibre Channel Guest Cluster | 2 Nodes | 2 Nodes | 64 Nodes
Max Size File Based Guest Cluster | 0 | 0 | 64 Nodes
Guest Clustering with Live Migration Support | N/A | No | Yes
Guest Clustering with Dynamic Memory Support | No | No | Yes

Based on this data, WS2012 Hyper-V is the superior platform for scalability and fault tolerance.

Virtual Switches

In a cloud, the virtual switch plays a huge role.  How do they stack up against each other?

Capability | vSphere Free | vSphere 5.0 Ent+ | WS2012 Hyper-V
Extensible Switch | No | Replaceable | Yes
Confirmed partner extensions | No | 2 | 4
PVLAN | No | Yes | Yes
ARP/ND Spoofing Protection | No | vShield/Partner | Yes
DHCP Snooping Protection | No | vShield/Partner | Yes
Virtual Port ACLs | No | vShield/Partner | Yes
Trunk Mode to VMs | No | No | Yes
Port Monitoring | Per Port Group | Yes | Yes
Port Mirroring | Per Port Group | Yes | No

Another win for WS 2012 Hyper-V.  Note that vShield is an additional purchase on top of vSphere.  Hyper-V is the clear feature winner in cloud networking.

Network Optimisations

Capability | vSphere Free | vSphere 5.0 Ent+ | WS2012 Hyper-V
Dynamic Virtual Machine Queue (DVMQ) | NetQueue | NetQueue | Yes
IPsec Task Offload | No | No | Yes
SR-IOV | DirectPath I/O | DirectPath I/O | Yes
Storage Encryption (CSV vs VMFS) | No | No | Yes
  • NetQueue supports a subset of the VMware HCL
  • Apparently DirectPath I/O VMs cannot vMotion (Live Migrate) without certain Cisco UCS (blade server centres) configurations
  • No physical security for VMFS SANs in the data center or colocated hosting

Hyper-V wins on the optimisation side of things for denser and higher throughput network loads.

VMware Fault Tolerance

FT feature: run a hot standby VM on another host, which takes over if the primary VM’s host should fail.

Required sacrifices:

  • 4 FT VMs per host with no memory overcommit: expensive because of low host density
  • 1 vCPU per FT VM: Surely VMs that require FT would require more than one logical processor (physical thread of execution)?
  • EPT/RVI (SLAT) disabled: No offloaded memory management.  This boosts VM performance by around 20% so I guess this FT VM doesn’t require performance.
  • Hot-plug disabled: no hot adding devices such as disks
  • No snapshots: not such a big deal for a production VM in my opinion
  • No VCB (VSS) backups: This is a big deal, because now you have to do a traditional “iron” backup of the VM, requiring custom backup policy, discarding the benefits of storage level backup for VMs

If cost reduction is the #1 reason for implementing virtualisation, then VMware FT seems like a complete oxymoron to me.  VMware FT is a chocolate kettle.  It sounds good, but don’t try to boil water with it.

VMware Autodeploy

Centrally deploy a Hypervisor from a central console.

We have System Center 2012 Virtual Machine Manager for bare metal deployment.  Yes, it’s a bit more complex to set up.  B-u-t … with converged fabrics in WS2012, Hyper-V networking is actually getting much easier.

And even with System Center 2012 Datacenter, the MSFT solution is way cheaper than the vSphere alternative, and provides a complete cloud in the package, whereas vSphere is only the start of your vTaxation for disparate point solutions that contradict desires for a deeply integrated, automated, connected, self-service infrastructure.

More Stuff

I didn’t see anything on SRM versus Hyper-V Replica but I guess it was probably discussed.  SRM is allegedly $250-$400 per VM.  Hyper-V Replica is free and even baked into the free Hyper-V Server.  And Hyper-V Replica works with cloud vendors as well as internal sites.  Orchestration of failover can be done manually, by very simple PowerShell scripts, or with System Center 2012 Orchestrator (demonstrated in day 1 keynote).

I don’t know anything about vSphere support for Infiniband and RDMA, both supported by WS2012.  In fact, today it was reported that WS2012 RC Hyper-V benchmarked at 10.36 GigaBYTES/second (not Gbps) with 4.6% CPU overhead.

I also don’t know if VMware supports network abstraction, as in Hyper-V Network Virtualisation, essential for mobility between different networks and cloud consolidation/migration.

Take some time to review the new features in WS2012 Hyper-V.

TechEd North America 2012 Day 2 Keynote

Antoine Leblond, Corporate Vice President is speaking, and the topic is Windows 8.

Over 600,000,000 copies of Windows 7 have been sold.  The enterprise features of Windows 8 are based on, but evolved from, Windows 7.  We have moved on from the desktop-centric world when Windows 7 was launched.  Over 75% of consumer machines being bought in the USA this year are laptops.  Next year it is projected that tablets will outsell PCs.  More machines will run off the battery than off AC power.  Every microwatt of power saved extends the battery life of the machine.  Tablets = touch UI.  If the projections are right, then touch becomes the primary UI.

Connectivity is ubiquitous.  We have moved from a world of local content to a world of multi-cloud stored data: flickr, facebook, Skydrive, Office365, and many others.

The hard split between how I use a machine at home and how I use a machine at work has been blurred or completely dissolved.  Users have reimagined how they use PCs, and Microsoft has reimagined Windows.

Demo Business Apps

We see a bunch of bespoke apps with live tiles.  Info is flashed up so the user can see current status.  The dev has used semantic zoom … a conceptual zoom rather than a graphic zoom.

A CRM app uses GPS sensor to find out where the sales person is, and then shows the location of customers in a map.  Clever.

Linda Averett

Demo on Samsung Ultrabook with mouse/keyboard and a “modern touchpad”.  The Windows 8 gestures are recognised by the touchpad.  Kind of Mac-like I guess, handy if you don’t have touch screen – or are one of those OCD people who hates fingerprints on their screen.

A NewEgg app is shown, with search, filter and contracts being shown off.

Antoine Leblond

Now we see a sales pipeline automation app that is a beta/test app by SAP.  Looks very sexy … and it’s by SAP!  What an oxymoron!  Using touch, the user can explore the data that is graphically presented, changing variables and seeing the results.  Don’t think columns and rows of numbers.  It was all imagery that was designed for exploring and touch.

Linda Averett – Business Features

She has a Lenovo laptop, but it has a touch screen.  Windows 7 is running in a Hyper-V VM on Windows 8.  As you should know by reading here, Hyper-V is in Windows 8 Pro and Enterprise.  It seems to get the biggest cheer of anything in the keynotes so far (the audience has been very quiet these 2 days).  Cut The Rope is running in IE 9 in the Win7 VM.

BitLocker (AES256 full disk encryption) is shown off – it and BitLocker-To-Go are now in Windows 8 Pro, not just in the Enterprise edition.  Great for customers – not great for those of us trying to sell Software Assurance 🙂

Then lots of dev stuff and then the end of the keynote.
