My New Intel NUC PC

I recently purchased an Intel NUC, NUC8i7HNK, to use as my home office PC. Here’s a little bit of information about my experience.

Need To Upgrade

I’ve been using an HP micro-tower for around 6-7 years as my home office PC. It was an i5 with 16 GB RAM, originally purchased as part of a pair to use as a lab kit when I started writing Mastering Hyper-V 2012 R2. After that book, I re-purposed the machine as my home PC, and it’s where many of my articles have been written and where I work when I’m working from home.

When Microsoft introduced a workaround security fix for Meltdown/Spectre, I noticed the slowdown quite a bit. Over the past year, the PC has just felt slower and slower. I don’t do anything that unusual with it: I don’t use it for development and it’s not running Hyper-V. Office, Chrome, Visio, and VS Code are my main tools of the trade. The machine is 6-7 years old, so it was time to upgrade.

Options

Some will ask, “wasn’t the Surface Studio the perfect choice?”. No, not for me. The price is crazy, the Studio 1 needs a hard disk replacement, the Studio 2 isn’t available yet, and I need a nice dual monitor setup – I don’t like working with mismatched monitors, and Microsoft doesn’t make additional matching monitors for the Studio.

I did look at Dell/Lenovo/HP but nothing there suited me. Some were too low spec. Some had the spec, but a Surface-like price to go with it. I considered home-builds; most of the PCs I have owned have been either home-built or customised, but I don’t have time for that malarkey. I looked at custom-builds, but they are expensive options for gamers – and I don’t have time to play the Xbox games that I already have.

At work, we use Intel NUCs for our training room. They’re small, high spec, and have an acceptable price. So that’s what I went for.

NUC8i7HNK

One of my colleagues showed me some of the new 8th generation NUC models and I opted for the NUC8i7HNK (Amazon USA / Amazon UK). A machine with an i7, Radeon graphics instead of the usual Intel HD, USB C and Thunderbolt, TPM 2.0 (not listed on the Intel site, I found), and oodles of ports. Here’s the front:

[Image: the front of the NUC8i7HNK]

And here’s the back:

[Image: the back of the NUC8i7HNK]

Look at that: 2 x HDMI, 2 x mini-DP, USB C, 6 x USB 3.0, 2 x Ethernet, and there’s the Radeon graphics, speaker, built-in mic, and more. It supports 2 x M.2/NVMe disks and has 2 x DIMM slots for up to 32 GB RAM.

The machine is quite tidy and small. It comes with a plate allowing you to mount it to the back of a monitor – if the monitor supports mounting.

My Machine

The NUC kits come built, but you have to add your disk and RAM. I went with:

  • Adata SX6000 M.2 SSD, capable of up to 1000 MB/s read and 800 MB/s write.
  • 2 x Adata DDR4 2400 8 GB RAM

I installed Windows 10 1809 in no time, added the half dozen or so required Windows updates, and installed Office 365 from the cloud. A quick blast of Ninite and I had most of the extra bits that I needed. In terms of data, all of my docs are either in OneDrive or Office 365, so there was no data migration. My media (movies & photos) are on a pair of USB 3.0 drives configured with Storage Spaces, so all I’ll have to do is move the drives over. To be honest, the biggest thing I have to do is buy a pair of video cables to replace my old ones!
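
As an aside, a mirrored Storage Spaces set like my USB pair only takes a few cmdlets to build. Here’s a rough sketch, assuming the two drives report as poolable and using names that I’ve made up:

    # Grab the drives that are eligible for pooling
    $disks = Get-PhysicalDisk -CanPool $true
    # Create the pool from those drives
    New-StoragePool -FriendlyName "Media" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks
    # Create a mirrored virtual disk using all available space, then initialize, partition, and format it
    New-VirtualDisk -StoragePoolFriendlyName "Media" -FriendlyName "MediaMirror" -ResiliencySettingName Mirror -UseMaximumSize |
        Get-Disk | Initialize-Disk -PassThru | New-Partition -AssignDriveLetter -UseMaximumSize | Format-Volume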

Going with a smaller machine will clear up a lot of space from under my desk, and help reduce some of the clutter – there’s a lot of clutter to clear!

Got a Surface Pro

As you might have noticed, my wife and I have started a new Azure training business called Cloud Mechanix. The thing I fear the most, as a presenter, is my laptop dying. I don’t want to use my employer’s device (a Surface Book) because that would be a conflict of interest. My personal laptop is a 4-year old Lenovo Thinkpad Yoga, which still runs well, but is showing its age … Thinkpads have a great build reputation, but the rubber feel and logos were all gone in 18 months. Many moons ago, I had a laptop die in England the night before I was to present at an MVP event. I ended up having to borrow a machine, and that’s not a position that I can tolerate as a trainer. So the Yoga will be my backup machine, and I needed something new and suitable for presentations.

Choice

My requirements were:

  • Weight: I wanted this machine to be light because I will be travelling light with no checked-in bags.
  • Moderate performance: An i5 was fine, 8-16 GB RAM. I’m not running Visual Studio or games, but I want the machine to run and age well.
  • Touch: I use touch when I’m reading.
  • Stylus: I whiteboard a lot. Hotels charge a fortune for things like flipcharts, and I prefer to use Windows 10 inking, e.g. Microsoft Whiteboard, because it’s being projected onto a big screen. I often draw over my PowerPoint for convenience.

So, that left me with plenty of options. Lenovo was ruled out because of build quality and price – see above. I really liked the look of the recent Dell XPS 13, until I saw what Dell had done with the webcam. Imagine doing Skype calls when everyone is looking up your nose! HP have some nice machines that are similar to the Dell XPS 13. I was tempted by USB-C, but then I thought … how many devices will I hang from my presentation machine? My office machine has 8 on-board USB 2.0 ports and an additional 4 x PCI USB 3.0 ports, most of which are used. But I will be travelling light, so all I’ll need are:

  • Video out
  • USB 2.0/3.0 for a clicker
  • USB 3.0 for a gigabit network adapter

FYI, Acer, Asus, and Samsung were all ruled out because of terrible post-sales hardware support.

That left me with Microsoft: Surface Laptop and Surface Pro. I like the Surface Laptop. It’s thin, light, and pretty much the Surface Pro in laptop form. I was tempted – if it had been a convertible then I would have pulled the trigger. But what did it for me was the ability to remove or flip up the keyboard of the Surface Pro. From time to time, I have been known to connect to the screen/projector via Miracast, pick up my device, and walk around while presenting. It’s also handy in a meeting when whiteboarding on screen – get the keyboard out of the way and draw/talk; the flexible stand helps there too.

Post-Sale

The purchase was easy; Cloud Mechanix, as a service provider, is able to buy from my employer (a distributor) at trade prices, plus support would be easy for me. The OOBE setup of Windows 10 was interesting:

  • The OOBE was defaulting to UK English/UK as the location so Cortana was there. She walked me through the setup. I had never heard Cortana during setup before, and I never even knew it was possible.
  • I was forced to do Windows updates at the end of the OOBE. A 3 GB download/install was required (I guess 1709 was not in the image). That started at around 4PM and finished sometime after 9PM – I actually left it running in the back of the car while I was driving home from work.

The Surface Pro has 1 x USB 3.0 port, which is not enough for my basic presentation requirements. That’s easily solved. I added a Macally U3HUBGBA USB/Ethernet hub – also purchased through work via trade. From a single (shared bandwidth) USB 3.0 port, I get 3 more ports and a “Gigabit” Ethernet adapter. That’s all of my connectivity requirements sorted:

[Image: the Macally U3HUBGBA USB/Ethernet hub]

I added the cobalt blue stylus and a Signature keyboard. The alcantara of the keyboard doesn’t feel like a fabric; it feels more like what it is: the result of 2 chemical companies cooperating on something. It feels smooth to the touch and like it will wear well. The keyboard is rigid enough to work well, and I haven’t had any issues typing on it – issues I often have with some Lenovo and HP machines when they get funky with keyboard layouts, e.g. moving CTRL or ALT.

It’s only been a few days, so a review isn’t justifiable, and others wrote reviews last year.

Microsoft Azure Started Patching Reboots Yesterday

Contrary to a previous email that I received, Microsoft started rebooting Azure VMs yesterday, instead of the 9th. Microsoft also confirmed that this is because of the Intel CPU security flaw. The following email was sent out to customers:

Dear Azure customer,

An industry-wide, hardware-based security vulnerability was disclosed today. Keeping customers secure is always our top priority and we are taking active steps to ensure that no Azure customer is exposed to these vulnerabilities.

The majority of Azure infrastructure has already been updated to address this vulnerability. Some aspects of Azure are still being updated and require a reboot of some customer VMs for the security update to take effect.

You previously received a notification about Azure planned maintenance. With the public disclosure of the security vulnerability today, we have accelerated the planned maintenance timing and began automatically rebooting the remaining impacted VMs starting at PST on January 3, 2018. The self-service maintenance window that was available for some customers has now ended, in order to begin this accelerated update.

You can see the status of your VMs, and if the update completed, within the Azure Service Health Planned Maintenance Section in the Azure Portal.

During this update, we will maintain our SLA commitments of Availability Sets, VM Scale Sets, and Cloud Services. This reduces the impact to availability, and only reboots a subset of your VMs at any given time. This ensures that any solution that follows Azure’s high availability guidance remains available to your customers and users. Operating system and data disks on your VM will be retained during this maintenance.

You should not experience noticeable performance impact with this update. We’ve worked to optimize the CPU and disk I/O path and are not seeing noticeable performance impact after the fix has been applied. A small set of customers may experience some networking performance impact. This can be addressed by turning on Azure Accelerated Networking (Windows, Linux), which is a free capability available to all Azure customers.

This Azure infrastructure update addresses the disclosed vulnerability at the hypervisor level and does not require an update to your Windows or Linux VM images. However, as always, you should continue to apply security best practices for your VM images.

For more information, please see the Azure blog post.

That email reads like Microsoft has done quite a bit of research on the bug, the fix, and the effects of bypassing the flawed CPU performance feature. It also sounds like the only customers that might notice a problem are those with large machines with very heavy network usage.

Accelerated networking is Azure’s implementation of Hyper-V’s SR-IOV. The virtual switch (in user mode in the host parent partition) is bypassed, and the NIC of the VM (in kernel mode) connects directly to a physical function (PF) on the host’s NIC via a virtual function (VF) or physical NIC driver in the VM’s guest OS. There are fewer context switches because there is no loop from the NIC, via the VM bus, to the virtual switch, and then back to the host’s NIC drivers. Instead, with SR-IOV/Accelerated Networking, everything stays in kernel mode.

[Image: Azure Accelerated Networking / SR-IOV architecture diagram]

If you find that your networking performance is impacted, and you want to enable Accelerated Networking, then there are a few things to note:

Thanks to Neil Bailie of P2V for spotting that I’d forgotten something in the below, stricken out, points:
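
For reference, flipping the feature on for an existing NIC is only a few lines of AzureRM PowerShell. This is a minimal sketch with made-up resource names – the VM must be deallocated first and running on a supported size:

    # Deallocate the VM; Accelerated Networking can't be toggled while it is running
    Stop-AzureRmVM -ResourceGroupName "myRG" -Name "myVM"
    # Enable the flag on the NIC and save the change
    $nic = Get-AzureRmNetworkInterface -ResourceGroupName "myRG" -Name "myNIC"
    $nic.EnableAcceleratedNetworking = $true
    $nic | Set-AzureRmNetworkInterface
    # Boot the VM back up
    Start-AzureRmVM -ResourceGroupName "myRG" -Name "myVM"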

Was This Post Useful?

If you found this information useful, then imagine what 2 days of training might mean to you. I’m delivering a 2-day course in Amsterdam on April 19-20, teaching newbies and experienced Azure admins about Azure Infrastructure. There’ll be lots of in-depth information, covering the foundations, best practices, troubleshooting, and advanced configurations. You can learn more here.

Intel CPU Security Bug

Gossip started to swirl in the last few days about what was driving both Azure and AWS to push out updates at relatively short notice. And news leaked over the last day that Intel has discovered a significant security flaw in nearly all (or all) Intel processors manufactured in the last decade.

Intel has issued an embargo to partners on sharing the news while fixes are being produced, but the news has leaked, and it affects everything using Intel’s processors: Windows, MacOS, Linux, AWS, Azure, and probably VMware too. It sounds like the error is a hardware error that cannot be fixed using a microcode update by Intel. This means that the hypervisors and operating systems on top of the processors must bypass the flaw in the processor. And here’s where the bad news is.

We can expect Microsoft to issue a security fix very quickly. According to Gizmodo, a redacted form of the fix appeared in the Linux kernel recently. But the fix will bypass the flaw, which resides in a performance feature of the processor. My limited understanding is that the feature helps make the switch between user mode and kernel mode less disruptive by tweaking the handling of secure kernel memory. The flaw makes it possible for processes in user mode to scan kernel memory. To work around the flaw, the performance enhancement has to be bypassed, and this could cause anywhere between a “5 and 30 percent” performance hit, according to several news sites, but I don’t know how reliable that number is.

Typical end users won’t notice this. But heavily loaded systems will notice. So if your CPU is heavily used, you can expect that the security fix will cause you problems.

The timing of this flaw/fix and the timing of Azure’s and AWS’s updates cannot be a coincidence.

The iPhone 8 – After 1 Week of Ownership

I’ve been using the HTC One (M7 and then M9) for the last 4 years on the Three network in Ireland. I liked Android, but problems that both I and my wife had with the M9, and the lousy camera, convinced me to change handsets. And the awful degradation of the Three network and their rubbish outside-EU roaming offers made me want to go elsewhere.

I reviewed my phone options. The Samsung S8 and the Google Pixel are the best of Android. The Pixel isn’t officially available here, but grey market handsets can be had at a steep price. The S8 … I hate what Samsung does to Android. Prior to going Android, I had an iPhone 4. I didn’t like iTunes, but the platform was stable, and Apple puts pretty good cameras into their phones. That convinced me – I wanted a great camera for family snapshots. Along came news of the iPhone 8. My employer happens to be a distributor of Apple products, so I bought the entry level model (64 GB storage) on the first morning of release.

That was a busy day! I was packing for 2 weeks of travel in the USA (Microsoft Ignite in Orlando, FL, and then to an MS partner bootcamp in Bellevue, WA), but I wanted to change phone carriers. I went with Vodafone Ireland on a SIM-only plan and activated their €2.99 roaming package for outside the EU. With that package, for €2.99/day, I get 200 MB of data and free calls/texts home from the USA.

I loaded up apps, and hit the “road” on Saturday morning, heading to MS Ignite. Google Maps was pre-loaded with maps. I had a rental car waiting in the USA and used maps to navigate quite a bit – to my hotel, and then out west on the Sunday to visit with a friend. All week long I was navigating, listening to Audible, taking photos, tweeting, phoning, texting (SMS/iMessage), FaceTiming home, and calling home. The phone is being used … and the battery is easily out-performing the HTC One M9 that I previously owned. The camera is amazing compared to the rubbish in the HTC One – whether it’s a snap, a zoomed-in shot of a screen using Office Lens, or a panorama (gloriously easy to use).

Apple’s decision to start at 64 GB was a good one. I was struggling with 32 GB on the previous phone, and even though I use OneDrive, I like to keep photos offline, as well as maps, audio books, and music.

The phone is much smaller than I expected. That’s causing me some issues with getting used to the keyboard. My wife went with the iPhone 8 Plus. I feared that it would be too big for my pocket but it’s not. However, I like not having to adjust my pocket contents when I sit down – and I’m less black & blue in the nether regions!

I’m very happy with the hardware. iOS 11 … we’ve all heard the grumbling. I installed Outlook and set it up for my work and personal email, both on O365. It works well and I’ve never looked at the Apple mail app. I’ve not had any problems with the software.

The switch to Vodafone has also worked out well. I have a data signal all the way between home and work – there’s a mobile antenna at the end of the road that I live on, and I could barely get a 1 bar signal with Three in the house, which does not have the latest signal-blocking insulation. Roaming has been the real test. I love having that 200 MB per day. No; it’s not much in a modern world, but I have something. Most of the time, I’ve been near wi-fi, but hotel/conference wi-fi can suck. Only a little while ago, I wanted to see my family and the hotel wi-fi was crapping out. I jumped off the wi-fi, and had a perfect mobile signal to see my family on Facetime.

This week I’ve been repeatedly asked why I didn’t wait until November for the iPhone X. Well … I’m neither stupid nor a poser. There is no way on earth that I was going to pay nearly €1,300 for animated emojis, or to be that plonker that puts their phone on the bar table, waiting for people to tell them how much better they are than everyone else. Seriously!

One week is not a long time, but I’ve used the phone quite a bit this week, more than I normally would. It’s worked out well, and I’m happy with it and the carrier decision that I’ve made.

My Hands-Off Review of Surface Studio

I don’t have a Surface Studio. My access to one was limited to a 10 minute play in a Microsoft Store in Bellevue, WA last month. But I did have that limited hands-on, I know the specs, and I’ve listened to & read other reviews. So I have my opinions on this headline-making PC from Microsoft and here they are.

Styling & Form

If it was possible to give a 12 out of 10 score, then I’d do it. The Surface Studio is a beautifully engineered machine, making all those beige and black cuboid PCs of the past look like dumpster fires. I love the form-factor – I was a fan of a similar machine that Lenovo launched several years ago with Windows 8, the A730, which often appears in TV shows such as The Flash.

[Image: The Lenovo A730]

When word of a Microsoft PC leaked, I hoped it would look something like the Lenovo. And Microsoft exceeded that, with a machine that is perfectly designed on the exterior. The screen tilt is perfectly balanced; you can pull down or push up the screen with just one finger, and the motion is smooth. That quality makes you think of a €300,000 hand-made car. In “draft mode” with the screen at a low angle, the Studio is perfect for drawing on. The stylus experience is as you’d expect, fluid and responsive.

The Screen

In my opinion, this is the star feature of the Studio: a big bright, contrasty, colour-popping 28” screen that makes all others look like rubbish. I actually went up the escalator to the Apple store to do an eyeball comparison after playing with the Studio. Apple’s stock paled in comparison in my untrained and un-calibrated opinion. As a hobbyist photographer, the Studio’s monitor would be my choice. Now, there are pros out there that will point out some niche editing monitors with better contrast, colour ranges, hoods for blocking reflections, and all that jazz, but those things cost a freaking fortune, and few creatives ever use them. And the Studio’s big win … you don’t need some drawing pad from the likes of Wacom (professional ones can cost in excess of $1500) because the PixelSense monitor on the Studio is a touch screen that supports a stylus, and the screen tilts down to a suitable angle for editing and drawing.

The Peripherals

The keyboard and mouse are stylish and match the design of the machine. The choice of mice/keyboards is usually a personal thing; I hate small keyboards and flat mice so I would prefer to use something like the 2000 combo from Microsoft – which I use at home. Yes, I would “ruin the styling” at my desk, but these devices suit me better.

 

[Image: The 2000 keyboard/mouse from Microsoft, which I prefer]

Of course, the talking point peripheral is the Dial. The Dial is revolutionary. You press down to activate a menu, twist to select an option, and then twisting the dial controls how much/little or forward/back the current edit goes. For a righty, you have the stylus in your right hand, and the dial in your left on the screen (so you can see your press-down menu options), and editing is just a natural process. If you are editing, you can draw while resizing the brush, changing the tone, lightening/darkening the mask, or undoing/redoing your changes. It’s an extremely natural device to use, and the news that it works with other devices is great for all you graphic artists or photo editors that want a faster way to work.

The Spec

This is where things aren’t 12/10. I’m a big fan of the idea, the styling, the screen and the interaction with the Studio. But the spec has some issues. The first of these is the graphics card. I’m no PC gamer, so graphics cards aren’t something I pay attention to. But I sit beside two graphics artists at work. They LOVED the appearance of the Studio when it was announced, but then they saw the card spec, and were disappointed. The Studio includes a mobile GPU, not a PC one, so performance was sacrificed for form. I would not have been upset if the machine was a few millimetres thicker or wider to get a better card in there.

The other issue is that the machine has a 5400 RPM hard drive (!!!!) with an M.2 SSD cache; in other words, a hybrid drive. The prices of flash storage have plummeted. There is no excuse for putting such a dreadful storage solution into a premium machine like the Studio. Hybrid drives, in my opinion, are a waste. The cache just doesn’t impact performance enough to matter – I know, because I replaced a similar 1 TB hybrid drive in my Lenovo Yoga with a 1 TB Samsung Evo SSD. And the reason was identical to what Leo Laporte of TWiT reported on Windows Weekly a couple of weeks ago.

I might take 1,000 photos on a successful day of wildlife photography – not that dissimilar to what a wedding or news photographer might do. A 36 megapixel photo might be around 60 MB in size; 1,000 of those is 58 GB – well beyond the 32 GB SSD cache of a hybrid drive. Let’s say I import those 1,000 photos into Adobe Lightroom on my imaginary Surface Studio. The first thing that a photography creative will do is browse through the photos, rate them, and remove what they don’t want to keep. Each photo is pretty large, so loading it from a 5400 RPM HDD will be tedious … 4-8 seconds for each photo! Yes; that’s what Leo Laporte reported on Windows Weekly, and that’s what I’d expect from such a drive.

Microsoft should never have put such a cheap storage solution into a PC for creatives – that’s like putting a 1 litre engine from a Fiat Punto into a Rolls Royce. If you’re getting a Studio then allow for a couple of hundred dollars to replace the drive (which can be done) with an SSD.

Everything else is great … lots of memory in the choices, and fast CPUs. It’s a pity that the memory is not expandable, but as Apple have realized, that’s creating manufacturing costs and complexity for the 1% of your target market, and it just isn’t worth it.

The Price

There are 3 available specs of Surface Studio:

  • $2,999 plus tax: 1 TB / Core i5 / 8 GB RAM / 2 GB GPU
  • $3,499 plus tax: 1 TB / Core i7 / 16 GB RAM / 2 GB GPU
  • $4,199 plus tax: 2 TB / Core i7 / 32 GB RAM / 4 GB GPU

Your first reaction: whoah! But you need to realize that this is not a PC for everyone. Microsoft is aiming this machine at creative professionals that view their PC as a tool. And like all tool-using professionals, the quality of the tool impacts the effectiveness of their work processes, so professionals are willing to pay for better equipment. Let’s do a comparison with what these people have been purchasing up to now that offers a similar solution:

  • Apple Mac Pro, the Apple PC that hasn’t been improved in 3 years: 256 GB SSD / Quad Core Intel Xeon / 12 GB RAM / 2 x 2 GB GPU …. $2,999 plus tax.
  • Apple Mac Pro, the Apple PC that hasn’t been improved in 3 years: 256 GB SSD / 6-Core Intel Xeon / 16 GB RAM / 2 x 3 GB GPU …. $3,999 plus tax.

The graphics adapters are an advantage for Apple. I think the CPU is a wash because Apple has old hardware versus the Studio’s newer Core i7 (creatives shouldn’t bother with the entry level machine from Microsoft). Apple includes pathetically small storage and the screens neither tilt nor support touch/stylus. This means you need additional capacity:

  • Professional NAS: $1,000 plus Tax for a Netgear device on Amazon.com that came up first in my search for “Apple NAS”.
  • A professional Wacom stylus solution: The Cintiq 27QHD 27” costs $2,550 plus tax on Amazon.com.

So the entry level option from Apple will cost: $2,999 + $1,000 + $2,550 = $6,549 plus tax. The top model from Microsoft will cost $4,199 plus an SSD, plus tax. Hmm, that’s around a $2,000 saving, plus I get a cleaner working experience, modern hardware, and tools (Dial and tilt screen) designed for how I work.

The Impact of Surface Studio

My employer (one of the few authorized Surface distributors in the world) got calls about supplying Surface Studio the morning after the launch. The sad news is that the Studio is limited to the USA and it doesn’t look like that will change anytime soon. My personal opinion is that Microsoft accomplished exactly what they wanted with the Studio. The Studio was a concept, much like a Bugatti Veyron or similar. This was an “ultimate machine” designed not to be a profit center, but a highlight, an example of what can be accomplished. By launching a desktop PC, Microsoft risked further angering their OEM partners like Dell, HP, Acer, Asus, and so on. But by making this a very expensive, niche (creatives), and relatively unavailable (tiny supply to a single market) machine, Microsoft created a light in the dark instead of a competitor to their partners.

The Surface Studio is a lighthouse. It has shone a light on what can be done with Windows 10, and most importantly, made the media and the customer aware that Microsoft still exists and is still relevant. That plan was a complete success. Even the most ardent Apple-fanboys in the media were convinced that Microsoft has won the title of “most cool” versus Apple, especially after the poorly timed and underwhelming Apple MacBook Pro “touch” launch. Apple customers were all over forums and social media saying that Microsoft has scored a huge win. Share values of Microsoft have stayed high. And hopefully, the OEM partners have seen what can be done, and will mimic the Studio with cheaper clones (with SSD storage!).

Ignite 2016 – Storage Spaces Direct

Read the notes from the session recording (original here) on Windows Server 2016 (WS2016) Storage Spaces Direct (S2D) and hyper-converged infrastructure, which was one of my most anticipated sessions of Microsoft Ignite 2016. The presenters were:

  • Claus Joergensen, Program Manager
  • Cosmos Darwin, Program Manager

Definition

Cosmos starts the session.

Storage Spaces Direct (S2D) is software-defined, shared-nothing storage.

  • Software-defined: Use industry standard hardware (not proprietary, like in a SAN) to build lower cost alternative storage. Lower cost doesn’t mean lower performance … as you’ll see.
  • Shared-nothing: The servers use internal disks, not shared disk trays. HA and scale are achieved by pooling disks and replicating “blocks”.

Deployment

There’s a bunch of animated slides.

  1. 3 servers, each with internal disks, a mix of flash and HDD. The servers are connected over Ethernet (10 GbE or faster, RDMA)
  2. He runs some PowerShell to query the disks on a server. The server has 4 x SATA HDD and 2 x SATA SSD. Yes, SATA. SATA is more affordable than SAS. S2D uses a virtual SAS bus over the disks to deal with SATA issues.
  3. They form a cluster from the 3 servers. That creates a single “pool” of nodes – a cluster.
  4. Now the magic starts. They will create a software-defined pool of virtually shared disks, using Enable-ClusterStorageSpacesDirect. That cmdlet does some smart work for us, identifying caching devices and capacity devices – more on this later.
  5. Now they can create a series of virtual disks, each of which will be formatted with ReFS and mounted by the cluster as CSVs – shared storage volumes. This is done with one cmdlet, New-Volume, which does all the lifting – see the sketch below. Very cool!
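
Stitching the demo together, the whole deployment is only a handful of cmdlets. A minimal sketch, with server and volume names that are mine rather than the presenters’:

    # Check which internal disks are eligible for pooling
    Get-PhysicalDisk -CanPool $true
    # Form the cluster from the 3 servers, with no traditional shared storage
    New-Cluster -Name "S2D-Cluster" -Node "Server1","Server2","Server3" -NoStorage
    # Enable S2D; caching and capacity devices are identified automatically
    Enable-ClusterStorageSpacesDirect
    # Create a ReFS-formatted virtual disk, automatically mounted as a CSV
    New-Volume -FriendlyName "Volume1" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName "S2D*" -Size 1TB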


There are two ways we can now use this cluster:

  • We expose the CSVs using file shares to another set of servers, such as Hyper-V hosts, and those servers store data, such as virtual machine files, using SMB 3 networking.
  • We don’t use any SMB 3 or file shares. Instead, we enable Hyper-V on all the S2D nodes, and run compute and storage across the cluster. This is hyper-converged infrastructure (HCI).


A new announcement: a 3rd (supported) scenario is SQL Server 2016. You install SQL Server 2016 on each node, and store database/log files on the CSVs (no SMB 3 file shares).


Scale-Out

So your S2D cluster was fine, but now your needs have grown and you need to scale out your storage/compute? It’s easy. Add another node (with internal storage) to the cluster. In moments, S2D will claim the new data disks. Data will be re-balanced over time across the disks in all the nodes.
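
A sketch of that scale-out, assuming a new node name – the rebalance is automatic, but you can watch it:

    # Add the new server (and its internal disks) to the cluster
    Add-ClusterNode -Name "Server4"
    # S2D claims the new eligible disks; monitor the background rebalance
    Get-StorageJob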

Time to Deploy?

Once you have the servers racked/cabled, OS installed, and networking configured, you’re looking at under 15 minutes to get S2D configured and ready. You can automate a lot of the steps in SCVMM 2016.

Cluster Sizing

The minimum number of required nodes is an “it depends”.

  • Ideally you have a 4-node cluster. This offers HA, even during maintenance, and supports the most interesting form of data resilience that includes 3-way mirroring.
  • You could do a 3-node cluster, but that’s limited to 2-way mirroring.
  • And now, as of Ignite, you can do a 2-node cluster.

Scalability:

  • 2-16 nodes in a single cluster – add nodes to scale out.
  • Over 3PB of raw storage per cluster – add drives to nodes to scale up (JBODS are supported).
  • The bigger the cluster gets, the better it will perform, depending on your network.

The procurement process is easy: add servers/disks

Performance

Claus takes over the presentation.

1,000,000 IOPS

Earlier in the week (I blogged this in the WS2016 and SysCtr 2016 session), Claus showed some crazy numbers for a larger cluster. He’s using a more “normal” 4-node (Dell R730xd) cluster in this demo. There are 4 CSVs. Each node has 4 NVMe flash devices and a bunch of HDDs. There are 80 VMs running on the HCI cluster. They’re using an open source stress test tool called VMFleet. The cluster is doing just over 1 million IOPS: over 925,000 read and 80,000 write. That’s 4 x 2U servers … not a rack of Dell Compellent SAN!

Disk Tiering

You can do:

  • SSD + HDD
  • All SSD

You must have some flash storage. That’s because HDD is slow at seek/read. “Spinning rust” (7200 RPM) can only do about 75 random IOs per second (IOPS). That’s pretty pathetic.

Flash gives us a built-in, always-on cache. One or more caching devices (flash disks) are selected by S2D. Caching devices are not pooled. The other disks, capacity devices, are used to store data, and are pooled and dynamically (not statically) bound to a caching device. All writes up to 256 KB and all reads up to 64 KB are cached – random IO is intercepted, and later sent to capacity devices as optimized IO.

Note the dynamic binding of capacity devices to caching devices. If a server has more than one caching device, and one fails, the capacity devices of the failed caching device are dynamically re-bound.

Caching devices are deliberately not pooled – this allows their caching capability to be used by any pool/volume in the cluster – the flash storage can be used where it is needed.


The result (in Microsoft’s internal testing) was that they hit 600+ IOPS per HDD … that’s how perfmon perceived it … in reality, the caching devices were greatly boosting the performance of the “spinning rust”.

NVMe

WS2016 S2D supports NVMe. This is a PCIe bus-connected form of very fast flash storage, that is many times faster than SAS HBA-connected SSD.

Comparing costs per drive/GB using retail pricing on NewEgg (a USA retail site):

[Image: cost per drive/GB comparison]

Comparing performance, not price:

[Image: performance comparison]

If we look at the cost per IOP, NVMe becomes a very affordable acceleration device:

[Image: cost per IOPS comparison]

Some CPU assist is required to move data to/from storage. Comparing SSD and NVMe, NVMe leaves more CPU for Hyper-V or SQL Server.


The highest IOPS number that Microsoft has hit, so far, is over 6,000,000 read IOPS from a single cluster, which they showed earlier in the week.

1 Tb/s Throughput (New Record)

IOPS are great, but IOPS are much like horsepower in a car; we care more about miles/KMs per hour – the amount of data we can actually push in a second. Microsoft recently hit 1 terabit per second. The cluster:

  • 12 nodes
  • All Micron NVMe
  • 100 GbE Mellanox RDMA network adapters
  • 336 VMs, stress tested by VMFleet.

Thanks to RDMA and NVMe, the CPU consumption was only 24-27%.

1 terabit per second. Wikipedia (English) is 11.5 GB. They can move English Wikipedia 14 times per second.

Fault Tolerance

Soooo, S2D is cheaper storage, but the performance is crazy good. Maybe there’s something wrong with fault tolerance? Think again!

Cosmos is back.

Failures are not an edge case – they’re a critical design point. Failures happen, so Microsoft wants to make them easy to deal with.

Drive Fault Tolerance

  • You can survive up to 2 simultaneous drive failures. That’s because each chunk of data is stored on 3 drives. Your data stays safe and continuously (better than highly) available.
  • There is automatic and immediate repair (self-healing: parallelized restore, which is faster than classic RAID restore).
  • Drive replacement is a single-step process.

Demo:

  1. 3 node cluster, with 42 drives, 3 CSVs.
  2. 1 drive is pulled, and it shows a “Lost Communication” status.
  3. The 3 CSVs now have a Warning health status – remember that each virtual disk (LUN) consumes space from each physical disk in the pool.
  4. Runs: Get-StorageSubSystem Cluster* | Debug-StorageSubSystem … this cmdlet for S2D does a complete cluster health check. The fault is found, devices identified (including disk & server serial), the fault explained, and a recommendation is made. We never had this simple debug tool in WS2012 R2.
  5. Runs: $Volumes | Debug-Volume … returns health info on the CSVs, and indicates that drive resiliency is reduced. It notes that a restore will happen automatically.
  6. The drive is automatically marked as retired.
  7. S2D (Get-StorageJob) starts a repair automatically – this is a parallelized restore, writing across many drives instead of just to 1 replacement/hot-spare drive.
  8. A new drive is inserted into the cluster. In WS2012 R2 we had to do some manual steps. But in WS2016 S2D, the disk is added automatically. We can audit this by looking at jobs.
  9. A rebalance job will automatically happen, to balance data placement across the physical drives.

So what are the manual steps you need to do to replace a failed drive?

  1. Pull the old drive
  2. Install a new drive

S2D does everything else automatically.
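
For the curious, the demo’s troubleshooting flow can be strung together like this – a sketch, not the presenters’ exact script:

    # Cluster-wide health check: identifies the fault, serials, and a recommendation
    Get-StorageSubSystem Cluster* | Debug-StorageSubSystem
    # Per-volume health: shows reduced resiliency and the pending automatic repair
    Get-Volume | Debug-Volume
    # Watch the parallelized repair/rebalance jobs run
    Get-StorageJob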

Server Fault Tolerance

  • You can survive up to 2 node failures (4+ node cluster).
  • Copies of data are stored in different servers, not just different drives.
  • Able to accommodate servicing and maintenance – because data is spread across the nodes. So not a problem if you pause/drain a node to do planned maintenance.
  • Data resyncs automatically after a node has been paused/restarted.

Think of a server as a super drive.

Chassis & Rack Fault Tolerance

Time to start thinking about fault domains, like Azure does.

You can spread your S2D cluster across multiple racks or blade chassis. This is to create the concept of fault domains – different parts of the cluster depend on different network uplinks and power circuits.


You can tag a server as being in a particular rack or blade chassis. S2D will respect these boundaries for data placement, and therefore for disk/server fault tolerance.
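
The tagging itself is done with the WS2016 fault domain cmdlets. A minimal sketch, with made-up names:

    # Define a rack as a fault domain, then place a server inside it
    New-ClusterFaultDomain -Name "Rack01" -Type Rack
    Set-ClusterFaultDomain -Name "Server1" -Parent "Rack01"
    # S2D will now spread copies of data across racks, not just servers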

Efficiency

Claus is back on stage.

Mirroring is Costly

Everything so far about fault tolerance in the presentation has been about 3-copy mirror. And mirroring is expensive – this is why we encounter so many awful virtualization deployments on RAID5. If 2-copy mirror (like RAID 10) gives us 1/2 of the raw storage as usable storage, and 3-way mirroring gives us only 1/3, this is too expensive.

2-way and 3-way mirroring give us the best performance, but parity/erasure coding/RAID5 give us the best usable storage percentage. We want performance, but we want affordability too.


We can do erasure coding with 4 nodes in an S2D cluster, but there is a performance hit.


Issues with erasure coding (parity or RAID 5):

  • To rebuild from one failure, you have to read every column (all the disks), which ties up valuable IOPS.
  • Every write incurs an update of the erasure coding, which ties up valuable CPU. Actively written data means calculating the encoding over and over again. This easily doubles the computational work involved in every write!

Local Reconstruction Codes

A product of Microsoft Research. It enables much faster recovery of a single drive by grouping bits. They XOR the groups and restore the required bits instead of an entire stripe. It reduces the number of devices that you need to touch to do a restore of a disk when using parity/erasure coding. This is used in Azure and in S2D.


This allows Microsoft to use erasure coding on SSD, as do many HCI vendors, but also on HDDs.

The below depicts the levels of efficiency you can get with erasure coding – note that you need 4 nodes minimum for erasure coding. The more nodes that you have, the better the efficiencies.

[Image: erasure coding efficiency at different cluster sizes]

Accelerated Erasure Coding

S2D optimizes the read-modify-write nature of erasure coding. A virtual disk (a LUN) can combine mirroring and erasure coding!

  • Mirror: hot data with fast write
  • Erasure coding: cold data – fewer parity calculations

The tiering is real time, not scheduled like in normal Storage Spaces. And ReFS metadata handling optimizes things too – you should use ReFS on the data volumes in S2D!

Think about it. A VM sends a write to the virtual disk. The write is done to the mirror and acknowledged. The VM is happy and moves on. Underneath, S2D is continuing to handle the persistently stored updates. When the mirror tier fills, the aged data is pushed down to the erasure coding tier, where parity is done … but the VM isn’t affected because it has already committed the write and has moved on.

And don’t forget that we have flash-based caching devices in place before the VM hits the virtual disk!

As for updates to the parity volume, ReFS is very efficient, thanks to its way of abstracting blocks using metadata, e.g. accelerated VHDX operations.

The result here is that we get the performance of mirroring for writes and hot data (plus the flash-based cache!) and the economies of parity/erasure coding.

If money is not a problem, and you need peak performance, you can always go all-mirror.


Storage Efficiency Demo (Multi-Resilient Volumes)

Claus does a demo using PoSH.


Note: 2-way mirroring can lose 1 drive/system and is 50% efficient, e.g. 1 TB of usable capacity has a 2 TB footprint of raw capacity.

  1. 12 node S2D cluster, each has 4 SSDs and 12 HDDs. There is 500 TB of raw capacity in the cluster.
  2. Claus creates a 3-way mirror volume of 1 TB (across 12 servers). The footprint is 3 TB of raw capacity. 33% efficiency. We can lose 2 systems/drives.
  3. He then creates a parity volume of 1 TB (across 12 servers). The footprint is 1.4 TB of raw capacity. 73% efficiency. We can lose 2 systems/drives.
  4. 3 more volumes are created, with different mixtures of 3-way mirroring and erasure coding.
  5. The 500 GB mirror + 500 GB dual parity virtual disk has 46% efficiency with a 2.1 TB footprint.
  6. The 300 GB mirror + 700 GB dual parity virtual disk has 54% efficiency with a 1.8 TB footprint.
  7. The 100 GB mirror + 900 GB dual parity virtual disk has 65% efficiency with a 1.5 TB footprint.

Microsoft is recommending that 10-20% of the usable capacity in “hybrid volumes” should be 3-way mirror.

If you went with the 100/900 balance for a light write workload in a hybrid volume, then you’ll get the same performance as a 1 TB 3-way mirror volume, but by using half of the raw capacity (1.5 TB instead of 3 TB).
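
Creating one of those hybrid, multi-resilient volumes is still just one New-Volume call. A sketch of the 100/900 example, assuming the default tier names that Enable-ClusterStorageSpacesDirect creates:

    # 100 GB of 3-way mirror for hot writes + 900 GB of dual parity for cold data
    New-Volume -FriendlyName "Hybrid01" -FileSystem CSVFS_ReFS -StoragePoolFriendlyName "S2D*" `
        -StorageTierFriendlyNames Performance,Capacity -StorageTierSizes 100GB,900GB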

CPU Efficiency

S2D is embedded in the kernel. It’s deep down low in kernel mode, so it’s efficient (fewer context switches to/from user mode). A requirement for this efficiency is using Remote Direct Memory Access (RDMA), which gives us the ultra-efficient SMB Direct.

There’s lots of replication traffic going on between the nodes (east-west traffic).


RDMA means that:

  • We use less CPU when doing reads/write
  • But we can also increase the amount of read/write IOPS because we have more CPU available
  • The balance is that we have more CPU for VM workloads in a HCI deployment

Customer Case Study

I normally hate customer case studies in these sessions because they’re usually an advert. But this quick presentation by Ben Thomas of Datacom was informative about real world experience and numbers.

They switched from using SANs to using 4-node S2D clusters with 120 TB usable storage – a mix of flash/SATA storage. Expansion was easy compared to compute + SAN – just buy a server and add it to the cluster. Their network was all Ethernet (even the really fast 100 Gbps Mellanox stuff is Ethernet-based), so they didn’t need fibre networks for the SAN anymore. Storage deployment was easy. With a SAN, you create the LUN, zone it, etc. In S2D, 1 cmdlet creates a virtual disk with the required resilience/tiering, formats it, and it appears as a replicated CSV across all the nodes.

Their storage ended up costing them $0.04 / GB or $4 / 1000 IOPS. The IOPS was guaranteed using Storage QoS.
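
That guaranteed IOPS piece is standard WS2016 Storage QoS. A minimal sketch, with policy values I’ve made up:

    # Create a policy with minimum (guaranteed) and maximum IOPS
    $policy = New-StorageQosPolicy -Name "Silver" -MinimumIops 500 -MaximumIops 5000
    # Bind a VM's virtual hard disks to the policy
    Get-VM -Name "VM01" | Get-VMHardDiskDrive | Set-VMHardDiskDrive -QoSPolicyID $policy.PolicyId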

Manageability

Cosmos is back.

You can use PowerShell and FCM, but mid-large customers should use System Center 2016. SCVMM 2016 can deploy your S2D cluster on bare metal.

Note: I’m normally quite critical of SCVMM, but I’ve really liked how SCVMM simplified Hyper-V storage in the past.

If you’re doing an S2D deployment, you do a Hyper-V deployment and check a single box to enable S2D, and that’s it: you get an HCI cluster instead of a compute cluster that requires storage from elsewhere. Simple!

SCOM provides the monitoring. They have a big dashboard to visualize alerts and usage of your S2D cluster.


Where is all that SCOM data coming from? You can get this raw data yourself if you don’t have System Center.

Health Service

New in WS2016. S2D has a health service built into the OS. This is the service that feeds info to the SCOM agents. It has:

  • Always-on monitoring
  • Alerting with severity, description, and call to action (recommendation)
  • Root-cause analysis to reduce alert noise
  • Monitoring software and hardware from SLA down to the drive (including enclosure location awareness)

We actually saw the health service information in an earlier demo when a drive was pulled from an S2D cluster.


It’s not just health. There are also performance, utilization, and capacity metrics. All this is built into the OS too, and accessible via PowerShell or API: Get-StorageSubSystem Cluster* | Get-StorageHealthReport
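
For example – a sketch; I’m assuming the -Count parameter to limit the number of report samples returned:

    # Live cluster-wide metrics: IOPS, throughput, latency, capacity, CPU, memory
    Get-StorageSubSystem Cluster* | Get-StorageHealthReport -Count 1
    # Current faults, with severity and recommended actions
    Get-StorageSubSystem Cluster* | Debug-StorageSubSystem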

DataON MUST

Cosmos shows a new tool from DataON, a manufacturer of Storage Spaces and Storage Spaces Direct (S2D) hardware.

If you are a reseller in the EU, then you can purchase DataON hardware from my employer, MicroWarehouse (www.mwh.ie) to resell to your customers.

DataON has made a new tool called MUST for management and monitoring of Storage Spaces and S2D.

Cosmos logs into a cloud app, must.dataonstorage.com. It has a nice, bright, colourful, and informative dashboard with details of the DataON hardware cluster. The data is live and updating in the console, including animated performance graphs.


There is an alert for a server being offline. He browses to Nodes. You can see a healthy node with all its networking, drives, CPUs, RAM, etc.


He browses to the dead machine – and it’s clearly down.

Two things that Cosmos highlights:

  • It’s a browser-based HTML5 experience. You can access this tool from any kind of device.
  • DataON showed a prototype to Cosmos – a “call home” feature. You can opt in to get a notification sent to DataON of a h/w failure, and DataON will automatically have a spare part shipped out from a relatively local warehouse.

The latter is the sort of thing you can subscribe to get for high-end SANs, and very nice to see in commodity h/w storage. That’s a really nice support feature from DataON.

Cost

So, controversy first: you need WS2016 Datacenter Edition to run S2D. You cannot do this with Standard Edition. Sorry, small biz that was considering this with a 2-node cluster for a small number of VMs – you’ll have to stick with a cluster-in-a-box.

Me: And the h/w is rack servers with RDMA networking – you’ll be surprised how affordable the half-U 100 GbE switches from Mellanox are – each port breaks out to multiple cables if you want. Mellanox prices up very nicely against Cisco/HPE/Dell/etc, and you’ll easily cover the cost with your SAN savings.

Hardware

Microsoft has worked with a number of server vendors to get validated S2D systems in the market. DataON will have a few systems, including an all-NVMe one and this 2U model with 24 x 2.5” disks:

[Image: DataON 2U S2D server with 24 x 2.5” disks]

You can do S2D on any hardware with the right pieces, but Microsoft really wants you to use the right, validated and tested, hardware. You know, you can put a loaded gun to your head, release the safety, and pull the trigger, but you probably shouldn’t. Stick to the advice, and use specially engineered & tested hardware.

Project Kepler-47

One more “fun share” by Claus.

2-node clusters are now supported by S2D, but Microsoft wondered “how low can we go?”. Kepler-47 is a proof-of-concept, not a shipping system.

These are the pieces. Note that the motherboard is mini-ITX; the key thing was that it had a lot of SATA connectors for drive connectivity. They installed Windows on a USB3 DOM. 32 GB RAM/node. There are 2 SATA SSDs for caching and 6 HDDs for capacity in each node.

[Image: the Kepler-47 component parts]

There are two nodes in the cluster.

[Image: the two Kepler-47 nodes]

It’s still server + drive fault tolerant. They use either a file share witness or a cloud witness for quorum. It has 20 TB of usable mirrored capacity. Great concept for remote/branch office scenarios.
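
Configuring quorum for a 2-node cluster like this is a single cmdlet either way. A sketch, with assumed names:

    # Option 1: a cloud witness in an Azure storage account
    Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" -AccessKey "<storage account key>"
    # Option 2: a file share witness
    Set-ClusterQuorum -FileShareWitness "\\FileServer1\Witness"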

Both nodes together are 1 cubic foot, 45% smaller than 2U of rack space. In other words, you can fit this cluster into a carry-on bag on an airplane! Total hardware cost (retail, online), excluding drives, was $2,190.

The system has no HBA, no SAS expander, and no NIC, switch or Ethernet! They used Thunderbolt networking to get 20 Gbps of bandwidth between the 2 servers (using a PoC driver from Intel).

Summary

My interpretation:

Sooooo:

  • Faster than SAN
  • Cheaper than SAN
  • Probably better fault tolerance than SAN thanks to fault domains
  • And the same level of h/w support as high end SANs with a support subscription, via hardware from DataON

Why are you buying SAN for Hyper-V?

Webinar Today: Reducing Costs By Switching From VMware to Hyper-V on DataON Cluster-in-a-Box

I’m presenting in a webinar by DataON Storage later today at 6PM UK/Irish time, 7PM Central Europe, and 1 PM Eastern. The focus is on how small-medium businesses can switch to an all-Microsoft server stack on DataON hardware and greatly reduce costs, while simplifying the deployment and increasing performance.


There are a number of speakers, including me, DataON, a customer that made that jump recently, and HGST (manufacturer of enterprise class flash storage).

You can register here.

Webinar Recording – Clustering for the Small/Medium Enterprise & Branch Office

I recently did another webinar for work, this time focusing on how to deploy an affordable Hyper-V cluster in a small-medium business or a remote/branch office. The solution is based on Cluster-in-a-Box hardware and Windows Server 2012 R2 Hyper-V and Storage Spaces. Yes, it reduces costs, but it also simplifies the solution, speeds up deployment times, and improves performance. Sounds like a win-win-win-win offering!


We have shared the recording of the webinar on the MicroWarehouse site, and that page also includes the slides and some additional reading & viewing.

The next webinar has been scheduled: on August 25th at 2PM UK/Irish time (there is a calendar link on the page), I will be doing a session on what’s new in WS2016 Hyper-V, and I’ll be doing some live demos. Join us even if you don’t want to learn anything about Windows Server 2016 Hyper-V, because it’s live demos using a Technical Preview build … it’s bound to all blow up in my face.

Webinar – Affordable Hyper-V Clustering for the Small/Medium Enterprise & Branch Office

I will be presenting another MicroWarehouse webinar on August 4th at 2PM (UK/Ireland), 3 PM (central Europe) and 9AM (Eastern). The topic of the next webinar is how to make highly available Hyper-V clusters affordable for SMEs and large enterprise branch offices. I’ll talk about the benefits of the solution, and then delve into what you get from this hardware + software offering, which includes better up-time, more affordability, and better performance than the SAN that you might have priced from HPE or Dell.


Interested? Then make sure that you register for our webinar.