Looking Back on Day 3 at Build Windows … Plus More!

Today was storage day at Build for me.  I attended 1.5 Hyper-V networking sessions and filled out the rest of the day with clustering and storage (which are pretty much one and the same now).  The highlights:

  • CSV backup in Windows Server 8 does not use Redirected I/O
  • The storage vendors were warned to increase the size of their SCSI-3 persistent reservation tables (much bigger cluster support now from Microsoft, and more opportunity to use the SAN)
  • Storage Pool and File Share Clustering … well let me dig deeper ….


Investing in a virtualisation cluster is a pricey deal for anyone because of the cost of SAS/iSCSI/FC SANs.  Even a starter kit with just a few TB of disk will be the biggest investment in IT that most small/medium businesses will ever make.  And it requires a bunch of new skills, management systems, and procedures.  The process of LUN deployment can slow down a cloud’s ability to respond to business demands.

Microsoft obviously recognised this several years ago and started working on Storage Pools and Spaces.  The idea here is that you can take a JBOD (just a bunch of disks, which can be internal or DAS) or disks on an existing SAN, and create a storage pool.  That is an aggregation of disks.  You can have many of these for isolation of storage class, administrative delegation, and so on.  From the pool, you create Storage Spaces.  These are VHDX files AFAIK on the disk, and they can be mounted as volumes by servers.

In this new style of Hyper-V cluster design, you can create a highly available File Server cluster with transparent failover.  That means failover is instant, thanks to a Witness (which informs the server connecting to the cluster when a node fails and tells it to connect to an alternative).  For something like Hyper-V, you can set your cluster up with active-active clustering of the file shares, and this uses CSV (CSV is no longer just for storing Hyper-V VMs).  The connecting clients (which are servers) can be load balanced using PowerShell scripting (could be a scheduled task).

Note: active/passive file share clustering (not using CSV) is recommended when there are lots of little files, when implementing end user file shares, and when there is a lot of file metadata activity.

Now you can create a Hyper-V cluster which uses the UNC paths of the file share cluster to store VMs.
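
To make that concrete, here’s a rough PowerShell sketch of what storing a VM on a file share might look like (the cmdlet and parameter names are my assumption based on what was shown and could change before release; the share, cluster, and VM names are made up):

# On the file server cluster: share out a folder with continuous availability
New-SmbShare -Name "VMStore" -Path "C:\ClusterStorage\Volume1\Shares\VMStore" -FullAccess "DEMO\HyperVHosts" -ContinuouslyAvailable $true

# On a Hyper-V host: the VM's files live on the UNC path instead of a LUN
New-VM -Name "Web01" -MemoryStartupBytes 1GB -Path "\\FSCluster\VMStore" -NewVHDPath "\\FSCluster\VMStore\Web01\Web01.vhdx" -NewVHDSizeBytes 60GB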

This is all made possible by native NIC teaming, SMB 2.2, RDMA, and offloading technologies.

The result is actually a much cheaper storage solution than you could get with a starter kit SAN, and probably would include much more storage space.  It is more flexible, and more economical.  One of the examples we were shown had the file server cluster also hosting other shares for SQL Server files and end user file shares.

Brian Ehlert (@BrianEh) said it best: file servers are now cool.

Asymmetric Hyper-V Cluster

Elden Christensen briefly mentioned this one in his talk and I asked him about it at Ask The Experts.  The idea is that you take the above design, but only a single Windows cluster is used.  It is used to cluster the VMs and to cluster the file share(s).  This flattens the infrastructure, reduces the number of servers, and thus reduces the cost.  This one would be of great interest to small and medium businesses, as well as corporate branch offices.

Self Healing CSV

Didier van Hoye (@workinghardinit) and I once had a chat about sizing of CSV.  He brought up the point that no one wanted to take a CSV offline for a weekend to chkdsk a multi-terabyte CSV volume.  True!

Microsoft have now implemented a solution in Windows Server 8:

  • Every 60 seconds, the health of the CSV volume is assessed.
  • If a fault is found, Windows will target that fault for a fix.
  • Windows will dismount the volume, and start caching VM write activity.
  • With the CSV offline, Windows will start fixing the fault.  It has an 8 second window.
  • If the fault is fixed the volume is brought back online and the storage activity cache is pushed out.
  • If the fault is not fixed, the volume is brought back online, and Windows will take another 8 second attempt at fixing the fault later.  Eventually the fault is fixed over one or more cumulative 8 second attempts.

VDI Changes

It seems like the VDI management/broker architecture will be getting much simpler.  We’re also getting some performance boosts to deal with the 9am disk storm.  Pooled VMs will be based on a single VHD.  Each created pooled VM will actually be a differencing disk.  When a pooled VM is booted up on a host, a differencing disk is created and cached on the host.  The disk is stored on an SSD in the host.  Because it’s a differencing disk, it should be tiny, holding probably no more than the user’s state.  Using local high IOPS SSD massively improves performance over accessing AVHDs on the SAN, and takes care of the 9am storage storm.

Designing Systems for Continuous Availability – Multi-Node with Remote File Storage

The speakers are Jim Pinkerton and Claus Jorgensen

Topic is on using SMB for remote storage of application files. Servers access their files on UNC file paths. Example: VM VHDs, SQL Server database and log files. Easier to provision and manage shares than LUNs. More flexible with dynamic server relocation. No need for specialised hardware/network knowledge or infrastructure. LOWER cost.

Basic idea of architecture: some shared storage (e.g. Storage Spaces), file server cluster with shares, Hyper-V cluster hosts, SQL, or other servers store files on those shares.

Transparent Failover
In W2008 R2 a failover is not transparent. There is brief downtime to take down, move over, and bring up the clustered service or role. 99% uptime at best.

Failover in W8 is transparent to the server application. Supports planned and unplanned failovers, e.g. maintenance, failures, and load balancing. Requires Windows Failover Clustering, and both server and client must be running Windows Server 8. All operations, not just IO, must be continuous and transparent – transparent for file and directory operations.

This means we can have an application cluster that places data on a back end file server cluster. Both can scale independently.

Changes to Windows Server 8 to make transparent failover possible:
– New protocol: SMB 2.2
– SMB 2.2 Client (redirector): client operation replay, end-to-end for replay of idempotent and non-idempotent operations
– SMB 2.2 Server: support for network state persistence, a single share spans multiple nodes (active/active shares – wonder if this is made possible by CSV?), files are always opened write-through.
– Resume Key: used to resume handle state after a planned or unplanned failover, fence handle state information, and mask some NTFS issues. This fences file locks.
– Witness protocol: enables faster unplanned failover because clients do not wait for timeouts, enables dynamic reallocation of load (nice!). Witness tells the client that a node is offline and tells it to redirect.

SMB2 Transparent Failover Semantics:
Server side: state persistence until the client reconnects. Example: delete a file. The file is opened, a flag is set to delete on close, and you close the file -> it’s deleted. Now you try to delete the file on a clustered file share. A planned failover happens. The node closes the file and it deletes. But after reconnecting, the client tries to close the file to delete it, but it’s gone. This sort of circumstance is handled.

In Hyper-V world, we have “surprise failover” where a faulty VM can be failed over. The files are locked on file share by original node with the fence. A new API takes care of this.

SMB2 Scale Out
In W2008 R2 we have active-passive clustered file shares. That means a share is only ever active on 1 node, so it’s not scalable. Windows Server 8 has scale out via active-active shares. The share can be active on all nodes. Targeted for server/server applications like SQL Server and Hyper-V. Not aimed at client/server applications like Office. We also get fewer IP addresses and DNS names. We only need one logical file server with a single file system namespace (no drive letter limitations), and no cluster disk resources to manage.

We now have a new file server type called File Server For Scale-Out Application Data. That’s the active/active type. It does not support NFS and certain role services such as FSRM or DFS Replication. The File Server for General Use is the active/passive one for client/server, but it also supports transparent failover.

VSS for Windows Server 8 File Shares
Application consistent shadow copy of server application data that is stored on Windows Server 8 file shares. A backup agent on the application server triggers the backup. VSS on the app server acts with the File Share Shadow Copy Provider. It hits the File Share Shadow Copy Agent on the file server via RPC, and that then triggers the VSS on the file server to create the shadow copy. The backup server can read the snapshot directly from the file server, saving on needless data transfer.

Performance for Server Applications
SMB2.2 makes big changes. Gone from 25% to 97% of DAS performance. MSFT used the same DAS storage for local and file share storage with SQL Server to get these numbers. NIC teaming, TCP offloads and RDMA improved performance.

Perfmon counters are added to help admins troubleshoot and tune. IO size, IO latency, IO queue length, etc. Can separately tune the SQL data file or log file.

Demo:
Scale-out file server in the demo. 4 clients accessing 2 files, balanced across 2 nodes in the scale out file server cluster. A node in the cluster is killed. The witness service sees this, knows which clients were using it, and tells them to reconnect – no timeouts, etc. The clients do come back online on the remaining node.

Platforms
– Networking: 2+ interfaces … 1 GbE, 10 GbE optionally with RDMA, or InfiniBand with RDMA
– Server: 2+ servers … “cluster in a box” (a self contained cluster appliance) or 2+ single node servers.
– Storage: Storage Spaces, Clustered PCI RAID (both on Shared JBOD SAS), FC/iSCSI/SAS fabric (on arrays)

Sample Configurations
– Lowest cost: cluster in a box with shared JBOD SAS using 1 GbE and SAS HBA. Or use the same with Clustered PCI RAID for better performance instead of the SAS HBA. An external port to add external storage to scale out. Beyond that, look at 10 GbE
– Discrete servers: 1/10 GbE with SAS HBA to Shared JBOD SAS. Or use advanced SANs.

Note: This new storage solution could radically shake up how we do HA for VMs or server applications in the small/mid enterprise. It’s going to be cheaper and more flexible. Even the corporations might look at this for low/mid tier services. MSFT did a lot of work on this and it shows IMO; I am impressed.

Designing Systems for Continuous Availability – Multi-Node with Block Storage

Speakers: Elden Christensen and Mallikarjun Chadalapaka

This session will focus on block based storage. It’s a clustering session. It seems like failover clustering is not optimised for the cloud. *joking*

Sneak Peek at Failover Clustering
– scale up to 4,000 VMs in a cluster
– scale out to 63 nodes in a cluster
– 4 x more than W2008 R2

Note: more SCSI-3 persistent reservations going to the SAN!

Multi-Machine Management with Server Manager, Featuring Cluster Integration
– Remote server management
– Server groups to manage sets of machines – single click to affect all nodes at once (nice!)
– Simplified management
– Launch clustering management from Server Manager

New Placement Policies
– Virtual Machine Priority: start with the most important VMs first (start the back end first, then the mid tier, then the front tier). Ensure the most important VMs are running – shut down low priority VMs to allow high priority VMs to get access to constrained resources (see the sketch after this list)
– Enhanced Failover Placement: Each VM is placed on the node with the best available memory resources. Memory requirements are determined on a per-VM basis – it finds the best node based on how Dynamic Memory is configured. NUMA aware.
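
For illustration, this is roughly how I’d expect VM priority to be set with the clustering PowerShell module (the property name and values are my assumption rather than anything shown in the session, and the VM group name is invented):

Import-Module FailoverClusters
# 3000 = High, 2000 = Medium, 1000 = Low, 0 = No Auto Start
(Get-ClusterGroup -Name "SQL-VM1").Priority = 3000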

VM Mobility
– Live Migration Queuing
– Storage Live Migration
– Concurrent Live Migrations – multiple simultaneous LMs for a given source or target
– Hyper-V Replica is integrated with clustering

Cluster Management
Demo: The demo cluster has 4001 VMs and 63 nodes (RDP into Redmond). In the FCM, it is smooth and fast. You can see the priority of each VM. You can search for VMs with basic and complex queries. The thumbnail of the VM is shown in the FCM.

Guest Clustering – Increased Storage Support
– Most common scenario is SQL Server
– Could previously only be done with iSCSI. Now we have a virtual fibre channel HBA

VM Monitoring
– Application level recovery: Service Control Manager or event triggered
– Guest Level HA Recovery – FC reboots the VM
– Host level HA recovery – FC fails over VM to another node
– Generic health monitoring for any application: Service Control Manager and generation of specific event IDs
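
A quick sketch of how I imagine the generic monitoring gets wired up in PowerShell (the cmdlet name is my assumption from later builds; the VM and service names are made up):

Import-Module FailoverClusters
# Watch the Print Spooler service inside the guest; repeated failures let the
# cluster reboot the VM (guest level) or fail it over to another node (host level)
Add-ClusterVMMonitoredItem -VirtualMachine "VM1" -Service "Spooler"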

VM Monitoring VS Guest Clustering
– VM Monitoring: Application monitoring, simplified configuration and event monitoring – good for tier 2 apps
– Guest clustering: application health monitoring, application mobility (for scheduled maintenance) – still for tier 1 apps

Automated Node Draining
Like VMM maintenance mode. Click a node to drain it of hosted roles (VMs).

Cluster Aware Updating
CAU updates all cluster nodes in an automated fashion without impacting service availability. It is an end to end orchestration of updates. Built on top of WUA. Patching does not impact cluster quorum. Workflow:

– Scan nodes to ID appropriate updates
– ID the node with the fewest workloads
– Place node into maintenance mode to drain
– WSUS update
– Rinse and repeat

The workloads return to their original node at the end of the process.

Note: The machine managing this is called the orchestrator. That might be a little confusing because SC Orchestrator can do this stuff too.
Note: I wonder how well this will play with updates in VMM 2012?

There is extensibility to include firmware, BIOS, etc, via updates, via 3rd party plugin.

Demo: Streaming video from a HA VM. The cluster is updated, the workflow runs, and the videos stay running. The wizard gives you the PSH. You can save that and schedule it. No dedicated WSUS needed by the looks of it.
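
I’d guess the saved PowerShell looks something like this (cmdlet names are my assumption from how CAU appears in later builds; the cluster name and schedule are invented):

Import-Module ClusterAwareUpdating
# One-off run: the machine you run this on acts as the orchestrator
Invoke-CauRun -ClusterName "HVC1" -MaxFailedNodes 0 -MaxRetriesPerNode 3 -RequireAllNodesOnline

# Or add the self-updating role so the cluster patches itself on a schedule
Add-CauClusterRole -ClusterName "HVC1" -DaysOfWeek Sunday -WeeksOfMonth 2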

Cluster Shared Volume
Redirected I/O is b-a-d.

Windows Server 8: Improved backup / restore of CSV. Expanded CSV to include more roles. CSV expands out to 63 nodes. Enables zero downtime for planned and unplanned failures of SMB workloads. Provides interoperability with file system mini-filter drivers (a/v and backup), and lots more.

CSV no longer needs to be enabled. Just right click on a disk to make it a CSV. The file system now appears as CSVFS. It is NTFS under the covers. It enables applications to know they are on CSV and ensure their compatibility.
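
For reference, a one-liner sketch of the new model (assuming the clustering cmdlets; the disk name is made up):

Import-Module FailoverClusters
# Convert an available cluster disk into a CSV – no cluster-wide enable step
Add-ClusterSharedVolume -Name "Cluster Disk 2"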

AV, Continuous data protection, backup and replication all use filter drivers to insert themselves in the CSV pseudo-file system stack.

High speed CSV I/O redirection will have negligible impact. CSV is integrated with SMB multi-channel. Allows streaming CSV traffic across multiple networks. Delivers improved performance when in redirected mode. CSV takes advantage of SMB 2 Direct and RDMA

BitLocker is now supported on traditional shared nothing disks and CSV. The Cluster Name Object (CNO) ID is used.

Cluster Storage Requirements Are:
– FC
– SAS RBOD
– Storage Spaces
– RAID HBA/SAS JBOD
– SMB
– iSCSI
– FCoE

Data Replication storage requirements:
– Hardware
– Software replication
– Application Replication (Exchange, SQL Denali AlwaysOn)

SCSI Command requirements: storage must support SCSI-3 SPC-3 compliant SCSI Commands.

Cost Effective & Scale Out with Storage Spaces. Integrated and supported by clustering and CSV.

Redirected I/O is normally file level. There is now a block level variant – not covered in this talk.

What if your Storage Spaces servers were in the same cluster as the Hyper-V hosts? High speed block level redirected IO. Simplified management. Single CSV namespace accessible on all nodes. Unified security model. Single cluster to manage. VMs can run anywhere.

Note: Wow!

Called an asymmetric configuration.

CSV Backup
Support for parallel backups on same or different CSV volumes, or on same or different cluster nodes. Improved I/O performance. Direct IO mode for snapshot and backup operations. (!!!) Software snapshots will stay in direct IO mode (!!!!) CSV volume ownership does not change during backup. Improved filter driver support for incremental backups. Backup applications do not need to be CSV aware. Fully compatible with W2008 R2 “requestors”.

Distributed App Consistent VM Shadow Copies:
Say you have a LUN with VMs scattered across lots of hosts. You can now snap the entire LUN using an orchestrated snapshot.

Comparing Backup With W2008 R2
– Backup app: W2008 R2 requires a CSV aware backup app
– IO performance: No redirected IO for backup
– Locality of CSV volume: Snapshot can be created on any node
– Complexity: Cluster coordinates the backup process

Note: I’m still trying to get over that we stay in direct IO during a system VSS provider backup of a CSV.

Cluster.exe is deprecated. Not there by default but you can install it in Server Manager. Use PSH instead.

SCSI Inquiry Data (page 83h) is now changed from recommended to required.

Designing Systems for Continuous Availability and Scalability.

An extra session that I ran to in this slot after the previous one ended very early.  This one is on storage pools and spaces.  The speaker has a Dell 1U server with a bunch of internal unallocated disks.  He uses PSH to:

  1. New-StoragePool (Get-StorageSubsystem and Get-PhysicalDisk).  The command pools all un-pooled disks.  The disks disappear from Disk Manager because they are pooled.
  2. A space (which is a virtual disk) is created: New-VirtualDisk
  3. Initialize-Disk is run to initialise it.
  4. New-Partition formats the disk, which is visible in Disk Manager and can be explored.  Note that it has a drive letter.  (See the sketch below.)
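
My reconstruction of what that script probably looks like (not the presenter’s exact commands – cmdlet names are as I understand them, and the pool/space names and sizes are invented):

# 1. Pool every disk that is eligible for pooling
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "Pool1" -StorageSubSystemFriendlyName "*Spaces*" -PhysicalDisks $disks

# 2. Create a space (virtual disk) from the pool
New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "Space1" -Size 500GB -ResiliencySettingName Mirror

# 3. Initialise the new disk
Get-VirtualDisk -FriendlyName "Space1" | Get-Disk | Initialize-Disk -PartitionStyle GPT

# 4. Partition and format it so it shows up in Disk Manager with a drive letter
Get-VirtualDisk -FriendlyName "Space1" | Get-Disk | New-Partition -UseMaximumSize -AssignDriveLetter | Format-Volume -FileSystem NTFS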

Optimized Space Utilisation

  • On-demand provisioning with trim (h/w command that gives space back to the pool when files are deleted) support – for NTFS, Hyper-V, and apps like SQL.
  • Elastic capacity expansion by just adding more disks.  You’ll get alerts when nearly full.
  • Defrag optimized to work with Storage Pools

Resiliency:

  • Mirrored spaces and Parity Spaces with integrated journaling supported.
  • Per-pool hot spare disk supported
  • Application driven intelligent error correction: SQL and Exchange should be able to take advantage of this.

Not very well explained – sorry. 

Demo: he plays a video that is stored on a resilient space and pulls a disk from it.  The video is uninterrupted. 

Spaces have granular access control.  Could be good for multi-tenant deployment – I’m hesitant of that because it means giving visibility of the back end system to untrusted customers (rule #1 is users are stupid).

You can base SLA on the type of disks in your JBOD, e.g. SSD, 15K or SATA.  Your JBOD could be connected to a bunch of servers.  They can create spaces for themselves.  E.g. a file server could have spaces, and use the disk space to store clustered VMs.

Questions to sfsquestions@microsoft.com

Enabling Multi-Tenancy and Converged Fabric for the Cloud Using QoS

Speakers: Charley Wen and Richard Wurdock

Pretty demo intensive session.  We start off with a demo of “fair sharing of bandwidth” where PSH is used with a minimum bandwidth setting to provide equal weight to a set of VMs.  One VM needs to get more bandwidth but can’t get it.  A new policy is deployed by script and it gets a higher weight, so it can then access more of the pipe.  Maximum bandwidth would have capped the VM so it couldn’t access idle b/w.

Minimum Bandwidth Policy

  • Enforce bandwidth allocation –> get performance predictability
  • Redistribute unused bandwidth –> get high link utilisation

The effect is that VMs get an SLA.  They always get the minimum if they require it.  They consume nothing if they don’t use it, and that b/w is available to others to exceed their minimum.

Min BW % = Weight / Sum of Weights

Example of a 1 Gbps pipe (the numbers assume the weights on the switch sum to 10 – see the sketch after this example):

  • VM 1 = 1 = 100 Mbps
  • VM 2 = 2 = 200 Mbps
  • VM 3 = 5 = 500 Mbps
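
A sketch of how those weights might be applied per VM (this assumes the virtual switch was created in weight-based minimum bandwidth mode; the cmdlet and parameter names are my assumption from later builds and the VM names are invented):

# Assumes the switch was created with New-VMSwitch ... -MinimumBandwidthMode Weight
Set-VMNetworkAdapter -VMName "VM1" -MinimumBandwidthWeight 1
Set-VMNetworkAdapter -VMName "VM2" -MinimumBandwidthWeight 2
Set-VMNetworkAdapter -VMName "VM3" -MinimumBandwidthWeight 5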

If you have NIC teaming, there is no way to guarantee minimum b/w of total potential pipe. 

Maximum Bandwidth

For example, you have an expensive WAN link.  You can cap a customer’s ability to use the pipe based on what they pay.

How it Works Under the Covers

A bunch of VMs are trying to use a pNIC.  The pNIC reports its speed.  It reports when it sends a packet.  This is recorded in a capacity meter.  That feeds into the traffic meter, which determines the classification of the packet.  Using that, it figures out if traffic exceeds the capacity of the NIC.  The peak bandwidth meter is fed by the latter and it stops traffic (the draining process).

Reserved bandwidth meter guarantees bandwidth. 

All of this is software, and it is h/w vendor independent. 

With all this you can do multi-tenancy without over-provisioning.

Converged Fabric

Simple image: two fabrics: network I/O and storage I/O across iSCSI, SMB, NFS, and Fiber Channel.

Expensive, so we’re trying to converge onto one fabric.  QoS can be used to guarantee service of various functions of the converged fabric, e.g. run all network connections through a single hyper-v extensible switch, via 10 Gbps NIC team.

Windows Server 8 takes advantage of hardware where available to offload QoS.

We get a demo where a Live Migration cannot complete because a converged fabric is saturated (no QoS).  In the demo a traffic class QoS policy is created and deployed.  Now the LM works as expected … the required b/w is allocated to the LM job.  The NIC in the demo supports h/w QoS so it does the work.
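
The policy itself wasn’t shown in detail; my guess at what a Live Migration traffic class could look like with the DCB cmdlets (the names, priority, and percentage are mine, and the NIC/switch must support DCB):

# Tag Live Migration traffic with 802.1p priority 5
New-NetQosPolicy -Name "LiveMigration" -LiveMigration -PriorityValue8021Action 5

# Reserve 30% of the converged link for that priority class (ETS)
New-NetQosTrafficClass -Name "LiveMigration" -Priority 5 -BandwidthPercentage 30 -Algorithm ETS

# Apply the DCB settings on the physical NIC
Enable-NetAdapterQos -Name "10GbE-1"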

Business benefit: reduced capital costs by using fewer switches, etc.

Traffic Classification:

  • You can have up to 8 traffic classes – 1 of them is storage, by default by the sound of it.
  • Appears that DCB is involved with the LAN miniport and iSCSI miniport is traffic QoS with traffic classification.  My head hurts.

Hmm, they finished after using only half of their time allocation.

Platform Storage Evolved

“Windows 8 is the most cost effective HA storage solution”

  • Storage Spaces: virtualised storage
  • Offloaded data transfer (ODX)
  • Data deduplication

File System Availability

Confidently deploy 64 TB NTFS volumes with Windows 8 with Online scan and repair:

  • Online repair
  • Online scan and corruption logging
  • Scheduled repair
  • Downtime proportional only to number of logged corruptions: scans don’t mean downtime now
  • Failover clustering & CSV integration
  • Better manageability via Action Center, PowerShell and Server Manager

Note: this means bigger volumes aren’t the big maintenance downtime problem they might have been for Hyper-V clusters. 
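
As I understand it, the same model surfaces in PowerShell roughly like this (the cmdlet name is my assumption; the drive letter is made up):

# Online scan: find and log corruptions without taking the volume offline
Repair-Volume -DriveLetter D -Scan

# Spot fix: a brief, targeted repair of only the logged corruptions
Repair-Volume -DriveLetter D -SpotFix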

Operational Simplicity

Extensible storage management API:

  • WMI programmatic interfaces
  • PSH for remote access and scripting – easy E2E provisioning
  • All new in-box application using one new API
  • Foundational infrastructure for reducing operations expenditure

Multi-vendor interoperability – common interface for IHVs

  • SMI-S standards conformant: proxy service enables broad interoperability with existing SMI-S storage h/w – standards based approach … wonder if the storage manufacturers know that
  • Storage Management Provider interface enables host-based extensibility

Basically everything uses one storage management interface to access vendor arrays, SMI-S compliant arrays, and Storage Spaces compatible JBOD.  The Windows 8 admin tools use this single API via WMI and PowerShell.

We are shown a 6 line PSH script to create a disk pool, create a virtual disk, configure the virtual disk, mount it on the server, and format it with NTFS.

Storage Spaces

New category of cost effective, scalable, available storage, with operational simplicity for all customer segments.  Powerful new platform abstractions:

  • Storage pools: units of aggregation (of disks), administration and isolation
  • Storage spaces (virtual disks): resiliency, provisioning, and performance

Target design point:

  • Industry standard interconnects: SATA or (shared) SAS
  • Industry standard storage: JBODs

You take a bunch of disks and connect them to the server with (shared or direct) SAS (best) or direct SATA (acceptable).  The disks are aggregated into pools.  Pools are split into spaces.  You can do CSV, NFS, or Windows Storage Management.  Supports Hyper-V.

Shared SAS allows a single JBOD to be attached to multiple servers to make a highly available and scalable storage fabric.

Capabilities:

  • Optimized storage utilisation
  • Resiliency and application drive error correction
  • HA and scale out with Failover Clustering and CSV
  • Operational simplicity

Demo:

Iometer is running to simulate storage workloads.  40x Intel x25-M 160 GB SSDs connected to a Dell T710 (48 GB RAM, dual Intel CPU) server with 5 * LSI HBAs.  Gets 880580.06 read IOPS with this developer preview pre-beta release.

Changes demo to a workload that needs high bandwidth rather than IOPS.  This time he gets 3311.04 MB per second throughput.

Next demo is a JBOD with a pool (CSV).  A pair of spaces are created in the pool, each assigned to virtual machines.  Both VMs have VHDs.  The VHDs are stored in the spaces.  Both are running on different Hyper-V nodes.  Both nodes access the space via CSV.  In the demo, we see that both nodes can see both spaces.  The spaces appear in Explorer with drive letters (Note: I do not like that – indicates a return to 2008 days?).  For some reason he used Quick Migration – why?!?!?  A space is only visible in Explorer on a host if the VM is running on that host – they follow when VMs are migrated between nodes. 

Offloaded Data Transfer (ODX)

Co-developed with partners, e.g. Dell Equalogic.  If we copy large files on the SAN between servers, the source server normally has had to do the work (data in, CPU and SAN utilisation), send it over a latent LAN, and then the destination server has to write it to the SAN again (CPU and data out).  ODX offloads the work to a compatible SAN which can do it more quickly, and we don’t get the needless cross LAN data transfer or CPU utilisation.  E.g. Host A wants to send data to Host B.  Token is passed between hosts.  Host A sends job to SAN with the token.  SAN uses this token to sync with host B, and host B reads direct from the SAN, instead of getting data from host A across the LAN.  This will be a magic multi-site cluster data transfer solution.

In a demo, he copies a file from SAN A in Redmond to SAN B in Redmond on his laptop in Anaheim.  With ODX, runs at 250 Mbps with zero data transfer on his laptop, takes a few minutes.  With no ODX, it wants to copy data to Anaheim from SAN A and then copy data from Anaheim to SAN B, would take over 17 hours.

Thin Provisioning Notifications

Can ID thinly provisioned virtual disks. 

Data Deduplication 

Transparent to primary server workload.  Can save over 80% of storage for VHD library, around 50% for general file share.  Deduplication scope is the volume.  It is cluster aware.  It is integrated with BranchCache for optimised data transfer over the WAN.

The speakers run out of time.  Confusing presentation: I think the topics covered need much more time.

Designing the Building Blocks for a Windows Server 8 Cloud

Speakers: Yigal Edery and Ross Ortega from Microsoft.

Windows Server 8 apparently is cloud optimized.  That rings a bell … I expect some repetition so I’ll blog the unique stuff.

There is no one right cloud architecture.  The architecture depends on the environment and the requirements.  Don’t take from this that there are no wrong cloud architectures.  “Building an optimized cloud requires difficult decisions and trade-offs among an alphabet soup of options”.  This session will try to provide some best practices.

Requirements

  • Cost
  • Scalability
  • Reliability
  • Security
  • Performance
  • High availability

Balance these and you get your architecture: workloads, networking, storage and service levels.

Which workloads will run in my cloud?

You need to understand your mission.

  • Cloud aware apps or legacy/stateful apps? Are you IaaS or PaaS or SaaS?
  • Are workloads trusted?  This is an important one for public clouds or multi-tenant clouds.  You cannot trust the tenants and they cannot trust each other.   This leads to some network security design decisions.
  • Compute-bound or Storage-bound?  This will dictate server and storage design … e.g. big hosts or smaller hosts, big FC SAN or lower end storage solution.
  • Workload size?  And how many per server?  Are you running small apps or big, heavy apps?  This influences server sizing too.  Huge servers are a big investment, and will cost a lot of money to operate while they are waiting to be filled with workloads.

Networking

  • Are you isolating hoster traffic from guest traffic?  Do you want them on the same cable/switches?  Think about north/south (in/out datacenter) traffic and east/west (between servers in datacenter) traffic.  In MSFT datacenters, 70% is east/west traffic.
  • Will you leverage existing infrastructure?  Are you doing green field or not?  Green field gives you more opportunity to get new h/w that can use all Windows Server 8 features.  But trade-off is throwing out existing investment if there is one.
  • Will you have traffic management?

InfiniBand vs 10 GbE vs 1 GbE

10 GbE:

  • Great performance
  • RDMA optional for SMB 2.2
  • Offers QoS (DCB) and flexible bandwidth allocation
  • New offloads
  • But physical switch ports are more expensive
  • New tech appears on 10 GbE NICs rather than on 1 GbE

InfiniBand (32 Gb and 56 Gb):

  • Very high performance and low latency
  • RDMA included for SMB 2.2 file access
  • But network management is different from Ethernet.  Can be expensive and requires a different skillset.  Can be hard to find staff, requires specific training.  Not many installations out there.

1 GbE:

  • Adequate for many workloads
  • If investing in new equipment for long life, then invest in 10 GbE to safeguard your investment

Price of WAN traffic is not reducing.  It is stable/stuck.  Datacenter north/south WAN links can be a fraction of the bandwidth of east/west LAN links.

How many NICs should be in the server? 

We are shown a few examples:

Physical Isolation with 4 NICs:

  • Live Migration –1
  • Cluster/Storage – 1
  • Management – 1
  • Hyper-V Extensible Switch – 2 bound together by Windows 8 NIC teaming, use Port ACLs for the VMs

Many people choose 10 GbE to avoid managing many NICs.  Windows Server 8 resolves this with NIC teaming so now you can use the b/w for throughput.

2 NICs with Management and guest isolation:

  • Live Migration, Cluster/Storage, Management (all on different subnets) – 1
  • Hyper-V Extensible Switch – 1 NIC, use Port ACLs for the VMs

1 * 10 GbE NIC:

  • Live Migration, Cluster/Storage, Management all plug into the Hyper-V Extensible Switch.
  • VMs plug into the Hyper-V Extensible Switch
  • 1 * 10 GbE NIC for the Hyper-V Extensible Switch
  • Use QoS to manage bandwidth
  • Use Port ACLs for all ports on the Hyper-V Extensible Switch to isolate traffic
  • This is all done with PowerShell (a rough sketch follows this list)
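
Roughly what that PowerShell might look like (my sketch of the converged design, not anything shown in the session; the switch, NIC, and vNIC names plus the weights are invented):

# One external switch on the single 10 GbE NIC, using weight-based QoS
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "10GbE-1" -MinimumBandwidthMode Weight -AllowManagementOS $false

# Parent partition vNICs for Management, Cluster/Storage, and Live Migration
Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ConvergedSwitch"
Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"
Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "ConvergedSwitch"

# Carve up the pipe by weight
Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 10
Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 20
Set-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -MinimumBandwidthWeight 30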

Windows Server 8 NIC Scalability and Performance Features

  • Data Center Bridging (DCB)
  • Receive Segment Coalescing (RSC)
  • Receive Side Scaling (RSS)
  • Remote Direct Memory Access (RDMA)
  • Single Root I/O Virtualisation (SR-IOV)
  • Virtual Machine Queue (VMQ)
  • IPsec Offload (IPsecTO)

Note: no mention of failover or Hyper-V cluster support of the features.  E.g. We don’t recommend TOE in W2008 R2 … not supported.

Using Network Offloads for Increased Scale

  • NIC with RSS for native (parent) traffic: Live Migration, Cluster/Storage, Management
  • NIC with VMQ for virtualisation traffic: Hyper-V Extensible Switch

Note: RSS and VMQ cannot be enabled on the same NIC.  RSS not supported on the Hyper-V switch.

  • Raw performance: RDMA and SR-IOV:
  • Flexibility and scalability: Hyper-V extensible switch, network virtualisation, NIC teaming, RSS, VMQ, IPsecTO

Notes:

  • SR-IOV and RSS work together.
  • Offloads require driver and possibly BIOS support.
  • When you are working with 1 or restricted number of NICs, you need to pick and choose which features you use because of support statements.

Storage

HBAs vs NICs.  An HBA (FC, iSCSI, or SAS) bypasses the networking stack and has lower CPU utilisation.

Storage Architectures

2 possible basic solutions:

  • Internal/DAS disk: cheap with disk bound VMs
  • External disk: expensive but mobile VMs, can grow compute and storage capacity on 2 different axes, compute bound VMs, storage offloading

The Great Big Hyper-V Survey of 2011 findings are that the breakdown in the market is 33% use A, 33% use B, and 33% use both.

Service Levels

  • What performance guarantees do you give to the customers?  More guarantees = more spending
  • How important is performance isolation?
  • What are the resiliency promises?  This is the challenging one: in-datacenter or inter-datacenter. 

More on the latter:

  • Some failure is acceptable.  You can offer cheaper services with storage/compute bound VMs.  Often done by hosters.  Windows Server 8 trying to offer mobility with non HA Live Migration.
  • Failure is not acceptable: Failover clustering: make everything as HA as possible.  Dual power, dual network path, N fault tolerant hosts, etc.  Maybe extend this to another data center.  Often done in private cloud and legacy apps, rarely done by hosters because of the additional cost.  Windows Server 8 trying to reduce this cost with lower cost storage options.

Representative Configurations by Microsoft

Tested in MS Engineering Excellence Center (EEC).  Optimized for different cloud types.  Guidance and PowerShell script samples.  These will be released between now and beta.

Start with:

The traditional design with 4 NICs (switch, live migration, cluster, and parent) + HBA: physically isolated networks, HBA, and W2008 R2 guidance.

Enable Support for Demanding Workloads:

  • Put Hyper-V switch on 10 GbE. 
  • Enable SR-IOV for better scale and lower latency

Enable 10 GbE for Storage:

  • Enable RSS
  • Fast storage
  • Ethernet so you have single skill set and management solution

Converge 10 GbE if you have that network type:

  • Use the NIC for Live Migration, Cluster/Storage/Management.  Enable QoS with DCB and RSS.  MSFT saying they rarely see 10 GbE being fully used.
  • Switches must support DCB
  • QoS and DCB traffic classes ensure traffic bandwidth allocations

Use File Servers:

  • Share your VM storage using a file server instead of a SAN controller.  Use JBOD instead of expensive SAN.
  • Enable RDMA on file server NIC and converged 10 GbE NIC on host
  • RDMA is high speed, low latency, reduced CPU overhead solution.
  • “Better VM mobility”: don’t know how yet

High Availability and Performance with 3 * 10 GbE NICs

  • 2 teamed NICs for Live Migration, cluster/storage, and parent with DCB and RSS (no RDMA)
  • File server has 10 GbE
  • Hyper-V Switch on 10 GbE

Sample Documented Configuration:

  • 10 GbE NIC * 2 teamed for Live Migration, Cluster/Storage, and parent with DCB, RSS, and QoS.
  • 1 * 1 GbE with teaming for Hyper-V switch.
  • File server with 2 by 10 GbE teamed NICs with RSS, DCB, and QoS.
  • File server has FC HBA connected to back end SAN – still have SAN benefits but with fewer FC ports required and simpler configuration (handy if doing auto host deployment)

Damn, this subject could make for a nice 2 day topic.

Windows Server 8 Hyper-V Day 1 Look Back

I’ve just been woken up from my first decent sleep (jetlag) by my first ever earthquake (3.5) and I got to thinking … yesterday (Hyper-V/Private Cloud day) was incredible.  Normally when I live blog I can find time to record what’s “in between the lines” and some of the spoken word of the presenter.  Yesterday, I struggled to take down the bullet points from the slides; there was just so much change being introduced.  There wasn’t any great detail on any topic, simply because there just wasn’t time.  One of the cloud sessions ran over the allotted time and they had to skip slides.

I think some things are easy to visualise and comprehend because they are “tangible”.  Hyper-V Replica is a killer headline feature.  The increased host/cluster scalability gives us some “Top Gear” stats: just how many people really have a need for a 1,000 BHP car?  And not many of us really need 63 host clusters with 4,000 VMs.  But I guess Microsoft had an opportunity to test and push the headline ahead of the competition, and rightly took it.

Speaking of Top Gear metrics, one interesting thing was that the vCPU:pCPU ratio of 8:1 was eliminated with barely a mention.  Hyper-V now supports as many vCPUs as you can fit on a host without compromising VM and service performance.  That is excellent.  I once had a quite low end single 4 core CPU host that was full (memory, before Dynamic Memory) but CPU only averaged 25%.  I could have reliably squeezed on way more VMs, easily exceeding the ratio.  The elimination of this limit by Hyper-V will further reduce the cost of virtualisation.  Note that you still need to respect the vCPU:pCPU ratio support statements of applications that you virtualise, e.g. Exchange and SharePoint, because an application needs what it needs.  Assessment, sizing, and monitoring are critical for squeezing in as much as possible without compromising on performance.

The lack of native NIC Teaming was something that caused many concerns.  Those who needed it used the 3rd party applications.  That caused stability issues, new security issues (check using HP NCU and VLANing for VM isolation), and I also know that some Microsoft partners saw it as enough of an issue to not recommend Hyper-V.  The cries for native NIC teaming started years ago.  Next year, you’ll get it in Windows 8 Server.

One of the most interesting sets of features is how network virtualisation has changed.  I don’t have the time or equipment here in Anaheim to look at the Server OS yet, so I don’t have the techie details.  But this is my understanding of how we can do network isolation.


Firstly, we are getting Port ACLs (access control lists).  Right now, we have to deploy at least 1 VLAN per customer or application to isolate them.  N-tier applications require multiple VLANs.  My personal experience was that I could deploy customer VMs reliably in very little time.  But I had to wait quite a while for one or more VLANs to be engineered and tested.  It stressed me (customer pressure) and it stressed the network engineers (complexity).  Network troubleshooting (Windows Server 8 is bringing in virtual network packet tracing!) was a nightmare, and let’s not imagine replacing firewalls or switches.

Port ACLs will allow us to say what a VM can or cannot talk to.  Imagine being able to build a flat VLAN with hundreds or thousands of IP addresses.  You don’t have to subnet it for different applications or customers.  Instead, you could (in theory) place all the VMs in that one VLAN and use Port ACLs to dictate what they can talk to.  I haven’t seen a demo of it, and I haven’t tried it, so I can’t say more than that.  You’ll still need an edge firewall, but it appears that Port ACLs will isolate VMs behind the firewall.


Port ACLs have the potential to greatly simplify physical network design with fewer VLANs.  Equipment replacement will be easier.  Troubleshooting will be easier.  And now we have greatly reduced the involvement of the network admins; their role will be to customise edge firewall rules.

Secondly we have the incredibly hard to visualise network or IP virtualisation.  The concept is that a VM or VMs are running on network A, and you want to be able to move them to a different network B, without changing IP addresses and without downtime.  The scenarios include:

  • A company’s network is being redesigned as a new network with new equipment.
  • One company is merging with another, and they want to consolidate the virtualisation infrastructures.
  • A customer is migrating a virtual machine to a hoster’s network.
  • A private cloud or public cloud administrator wants to be able to move virtual machines around various different networks (power consolidation, equipment replacement, etc) without causing downtime.

[Diagram: VMs 10.1.1.101 and 10.1.1.102 running on Network A (10.1.1.0/24) before the move]

Any of these would normally involve an IP address change.  You can see above that the VMs (10.1.1.101 and 10.1.1.102) are on Network A with IPs in the 10.1.1.0/24 network.  That network has its own switches and routers.  The admins want to move the 10.1.1.101 VM to the 10.2.1.0/24 network which has different switches and routers.

Internet DNS records, applications (that shouldn’t, but have) hard coded IP addresses, other integrated services, all depend on that static IP address.  Changing that on one VM would cause mayhem with accusatory questions from the customer/users/managers/developers that make you out to be either a moron or a saboteur.  Oh yeah; it would also cause business operations downtime.  Changing an IP address like that is a problem. In this scenario, 10.1.1.102 would lose contact with 10.1.1.101 and the service they host would break.

Today, you make the move and you have a lot of heartache and engineering to do.  Next year …

[Diagram: after the move, the VM still appears to be 10.1.1.101 but is actually running on Network B (10.2.1.0/24) as 10.2.1.101]

Network virtualisation abstracts the virtual network from the physical network.  IP address virtualisation does similar.  The VM that was moved still believes it is on 10.1.1.101.  10.1.1.102 can still communicate with the other VM.  However, the moved VM is actually on the 10.2.1.0/24 network as 10.2.1.101.  The IP address is virtualised.  Mission accomplished.  In theory, there’s nothing to stop you from moving the VM to 10.3.1.0/24 or 10.4.1.0/24 with the same successful results.

How important is this?  I worked in the hosting industry and there was a nightmare scenario that I was more than happy to avoid.  Hosting customers pay a lot of money for near 100% uptime.  They have no interest in, and often don’t understand, the intricacies of the infrastructure.  They pay not to care about it.  The host hardware, servers and network, had 3 years of support from the manufacturer.  After that, replacement parts would be hard to find and would be expensive.  Eventually we would have to migrate to a new network and servers.  How do you tell customers, who have applications sometimes written by the worst of developers, that they could have some downtime and then that there is a risk that their application would break because of a change of IP.  I can tell you the response: they see this as being caused by the hosting company and any work the customers need to pay for to repair the issues will be paid by the hosting company.  And there’s the issue.  IP address virtualisation with expanded Live Migration takes care of that issue.

For you public or private cloud operators, you are getting metrics that record the infrastructure utilisation of individual virtual machines.  Those metrics will travel with the virtual machine.  I guess they are stored in a file or files, and that is another thing you’ll need to plan (and bill) for when it comes to storage and storage sizing (it’ll probably be a tiny space consumer).  These metrics can be extracted by a third party tool so you can analyse them and cross charge (internal or external) customers.

We know that the majority of Hyper-V installations are smaller, with the average cluster size being 4.78 hosts.  In my experience, many of these have a Dell Equalogic or HP MSA array.  Yes, these are the low end of hardware SANs.  But they are a huge investment for customers.  Some decide to go with software iSCSI solutions which also add cost.  Now it appears like those lower end clusters can use file shares to store virtual machines with support from Microsoft.  NIC teaming with RDMA gives massive data transport capabilities and gives us a serious budget solution for VM storage.  The days of the SAN aren’t over: they still offer functionality that we can’t get from file shares.

I’ve got more cloud and Hyper-V sessions to attend today, including a design one to kick off the morning.  More to come!

A Deep Dive Into Hyper-V Networking

See-Mong Tan and Pankaj Garg are the speakers.

Apparently Windows Server 8 is the most cloud optimised operating system yet. I did not know that.

Customers want availability despite faults, and predictability of performance, when dealing with networking. Admins want scalability and density vs customers wanting performance. Customers want specialisation with lots of choice, for firewalls, monitoring, and physical fabric integration.

Windows Server 8 gives us:
– Reliability
– Security
– Predictability
– Scalability
– Extensibility
– … all with manageability

Reliability:
Windows Server 8 gives us NIC teaming to protect against NIC or network path failure. Personal experience is that the latter is much more common, e.g. switch failure.

The LBFO provider sits on top of the bound physical NICs (using the IM MUX and virtual miniport). The Hyper-V Extensible Switch sits on top of that. You use the LBFO Admin GUI (via the LBFO Configuration DLL) to configure the team.

– Multiple modes: Switch dependent and Switch independent
– Hashing modes: port and 4-tuple
– Active/Active and Active/Passive
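
For illustration, a team like that might be built with something like this (cmdlet and parameter names are my assumption from the shipping product; the NIC names are made up):

# Switch independent team of two NICs, hashing on the Hyper-V switch port
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort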

Windows Server 8 provides security features to host multi tenant workloads in a hybrid cloud. You run multiple virtual networks on a physical network. Each virtual network has the illusion that it is running as a physical fabric, the only physical network … just like a VM thinks it is the entire piece of physical hardware – that’s the analogy that MSFT is using. You decouple the virtual or tenant networks from the physical network. This is where the IP address virtualisation appears to live too. Other features:

– Port ACLs: allow you to do ACLs on IP range or MAC address … like firewall rules. And can do metering with them.
– PVLAN: Bind VMs to one uplink
– DHCP Guard: Ban VMs from being DHCP servers – very useful in cloud where users have local admin rights … users are stupid and destructive.
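
A sketch of how the Port ACL and DHCP Guard settings might be applied (cmdlet names are my assumption; the VM name and IP range are invented):

# Only let VM1 talk to its own tenant subnet, and block everything else
Add-VMNetworkAdapterAcl -VMName "VM1" -RemoteIPAddress "10.0.1.0/24" -Direction Both -Action Allow
Add-VMNetworkAdapterAcl -VMName "VM1" -RemoteIPAddress "0.0.0.0/0" -Direction Both -Action Deny

# Stop VM1 answering DHCP requests, even if someone installs a DHCP server in it
Set-VMNetworkAdapter -VMName "VM1" -DhcpGuard On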

QoS provides predictable performance in a multi-tenant environment. You can do maximum and minimum, and/or absolute vs weight.

Demo of QoS maximum bandwidth:
He runs a PSH script to implement a bandwidth rate limiting cap on some badly behaving VMs to limit their impact on the physical network. Set-VMNetworkAdapter -VMName VM1 -MaximumBandwidth 1250000.

Scalability:
Performance features mean more efficient cloud operations. Also get reduced power usage.

SR-IOV
Single Root I/O Virtualisation is a PCI group hardware technology. A NIC has features that can be assigned to a VM. Without it, the virtual switch does routing, VLAN filtering, and data copy of incoming data to the VM, which then has to process the packet. Lots of CPU. SR-IOV bypasses the Hyper-V switch and sends the packet direct to the VM Virtual Function. This requires an SR-IOV NIC. You can Live Migrate a VM from a host with SR-IOV to a host without SR-IOV. Apparently, VMware cannot do this. SR-IOV is a property of the virtual switch, and a property of the VM vNIC (tick boxes). The VM actually uses the driver of the SR-IOV NIC. We are shown a demo of a Live Migration to a non SR-IOV non-clustered host, with no missed pings.

D-VMQ is Dynamic Virtual Machine Queue
If the CPU is processing VM network traffic then you can use this to dynamically span processing VM n/w traffic across more than one CPU. It will automatically scale up and scale down the CPU utilisation based on demand. Static VMQ is limiting in high tide. No VMQ is limited to single CPU.

Receive Side Coalescing (RSC) allows a VM to receive larger, coalesced packets. IPsec Task Offload means a VM performs really well when running IPsec (a CPU eater). There’s a call to action for NIC and server vendors to support these features.

Extensibility:
The idea here is that partners can develop those specialised features that MSFT cannot do.

Partners can extend the Hyper-V extensible switch with their own features. There’s a set of APIs for them to use. Switch vendors should extend to provide unified management of physical and virtual switches.

Manageability:
Features without management are useless. Windows Server 8 is designed to manage large clouds. Metering allows chargeback, e.g. on network usage. Metrics are stored with the VM and are persistent after a VM move or migration.

PowerShell for Hyper-V. Unified tracing for network troubleshooting: trace packets from the VM, to the switch, through the vendor extensions, and onto the network. Port Mirroring: a standard switch feature to redirect switch traffic for analysis.

And this is where I need to wrap up … the session is about to end anyway.

Using Windows Server 8 for Building Private and Public IaaS Clouds

Speakers: Jeff Woolsey and Yigal Edery of Microsoft.

Was the cloud optimization of Windows Server 8 mentioned yet? Probably not, but it’s mentioned now.

– Enable multi tenant clouds: isolation and security
– High scale and low cost data centres
– Manageable and extensible: they are pushing PowerShell here

Windows Server 8 should make building an IaaS much easier.

Evolution of the data centre (going from least to most scalable):

1) Dedicated servers, no virtualisation, and benefit of hardware isolation
2) Server virtualisation, with benefits of server consolidation, some scale out, and heterogeneous hardware
3) Cloud with Windows 8: Shared compute, storage, network. Multi-tenancy, converged network and hybrid clouds. Benefits of infrastructure utilization increase, automatic deployment and migration of apps, VMs, and services. Scaling of network/storage.

Enable Multi-Tenant Cloud
What is added?
– Secure isolation between tenants: Hyper-V extensible switch (routing, etc), Isolation policies (can define what a VM can see in layer 2 networking), PVLANs
– Dynamic Placement of Services: Hyper-V network virtualisation, complete VM mobility, cross-premise connectivity (when you move something to the cloud, it should still appear on the network as internal for minimal service disruption)
– Virtual Machine Metering: Virtual Machine QoS policies, resource meters (measure the activity of a VM over time, and those metrics stay with the VM when it is moved), performance counters

Requirements:
– Tenant wants to easily move VMs to and from the cloud
– Hoster wants to place VMs anywhere in the data center
– Both want: easy onboarding, flexibility and isolation

The Hyper-V extensible switch has PVLAN functionality. But managing VLANs is not necessarily the way you want to go. 4095 maximum VLANs. An absolute nightmare to maintain, upgrade, or replace. IP address management is usually controlled by the hoster.

Network virtualisation aims to solve these issues. A VM has two IPs: one it thinks it is using, and one that it really is using. “Each virtual network has the illusion it is running as a physical fabric”. The abstraction of the IP address makes the VM more mobile. Virtualisation unbinds the server and app from the physical hardware. Network virtualisation unbinds the server and app from the physical network.

Mobility Design
Rule 1: no new features that preclude Live Migration
Rule 2: maximise VM mobility with security

Number 1: recommendation is Live Migration with High Availability
Number 2: SMB Live Migration
Number 3: Live Storage Migration

Live Storage Migration enables:
– Storage load balancing
– No downtime servicing
– Leverages Hyper-V Offloaded Data Transfer (ODX): pass a secure token to a storage array to get it to move large amounts of data for you. Possibly up to 90% faster.

You can Live Migrate a VM with just a 1 Gbps connection and nothing else. VHDX makes deployment easier. Get more than 2040 GB in a vDisk without the need to do passthrough disk which requires more manual and exceptional effort. Add in the virtual fibre channel HBA with MPIO and you reduce the need for physical servers for customer clusters in fibre channel deployments.

Bandwidth management is an option in the virtual network adapter. You can restrict bandwidth for customers with this. IPsec offload can be enabled to reduce CPU utilisation.

Up to 63 nodes in a cluster, with up to 4,000 VMs. That’s one monster cluster.

QoS and Resource Metering
Network: monitor incoming and outgoing traffic per IP address
Storage: high water mark disk allocation
Memory: high and low water mark memory, and average

We get a demo of resource meters being used to right size VMs.
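
Presumably something like this under the covers (cmdlet names are my assumption from later builds; the VM name is made up):

# Start collecting resource data for the VM, then read the report back later
Enable-VMResourceMetering -VMName "VM1"
Measure-VM -VMName "VM1"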

Dynamic Memory gets a new setting: Minimum RAM. Startup RAM could give a VM 1024MB, but the VM could reduce to Minimum RAM of 512MB if there is insufficient pressure.
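
Something like this, I’d guess (the parameter names are my assumption; the sizes are just the example figures):

# Startup at 1 GB, but let the VM fall back to 512 MB when there is little memory pressure
Set-VMMemory -VMName "VM1" -DynamicMemoryEnabled $true -StartupBytes 1GB -MinimumBytes 512MB -MaximumBytes 4GB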

High scale and low cost data centres:
– The vCPU:pCPU ratio limit has been removed from Hyper-V support… just squeeze in what you can without impacting VM performance
– Up to 160 logical processors
– Up to 2 TB RAM

Networking:
– Dynamic VMQ
– Single root I/O virtualisation (SR-IOV): dedicate a pNIC to a VM
– Receive side scaling (RSS)
– Receive side coalescing (RSC)
– IPsec task offload

Storage
– ODX
– RDMA
– SMB 2.2
– 4K native disk support

HA and Data Protection
– Windows NIC teaming across different vendors of NIC!
– Hyper-V Replica for DR to a secondary site – either one I own or a cloud provider
– BitLocker: Physically safeguard customers’ data. Even if you lose the disk the data is protected by encryption. You can now encrypt cluster volumes. TPMs can be leveraged for the first time with Hyper-V cluster shared disks. The Cluster Name Object (CNO) is used to lock and unlock disks.

Managable and Extensible
– PowerShell for Hyper-V by MSFT for the first time. Can use WMI too, as before.
– Workflows across many servers.
– Hyper-V Extensible switch to get visibility into the network
– WMIv2/CIM, OData, Data Center TCP

go.microsoft.com/fwlink/p/?LinkID=228511 is where a whitepaper will appear in the next week on this topic.