I have written a 40+ page document on how to use the free iSCSI target to build a Hyper-V cluster. It will be available very soon so watch this space.
Tag: Failover Clustering
First Impressions: Free Microsoft iSCSI Target for W2008 R2
Today I downloaded and installed the free iSCSI target for Windows Server 2008 R2 that was just released. I needed something free and lightweight for the lab in work. We’re using a pair of HP DL165 G7s as clustered hosts and a DL180 G6 with “cheap” SATA disk as the “SAN”. I was planning on using Windows Storage Server 2008 R2, but then I saw the tweet by Microsoft’s Jose Barreto that announced the release. Perfect – that was one less ISO I would have to download.
I deployed W2008 R2 from the WDS VM in the lab and downloaded the compressed setup file. After it was extracted, I installed the target. That gives you a simple enough tool to use.
The service creates targets. Each target is a collection of disks (fixed-size VHDs that are stored on the iSCSI target server) and you permission the target using IQN, MAC address, IP address … and I can't remember if DNS name was one of the options or not.
I needed two targets. One would be for the VMM library. For my lab, VMM would be running as a VM on a standalone host (another DL165 G7). I set up a target with a disk and permitted the iSCSI addresses of the standalone host to connect.
On the standalone host I added the MPIO feature and enabled support for iSCSI devices in MPIO. In the initiator, I added the target portal IP address, connected with multipath enabled, and added the volume. All I had to do now was format it in Disk Management.
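If you would rather script that initiator setup than click through the GUI, a rough sketch of the equivalent on W2008 R2 looks something like this. I did it in the GUI, so double-check the mpclaim and iscsicli syntax on your own build; the portal address 10.0.1.10 is just an example, and the -r on mpclaim will reboot the host.

Import-Module ServerManager
Add-WindowsFeature Multipath-IO                     # add the MPIO feature
Set-Service msiscsi -StartupType Automatic          # make sure the iSCSI initiator service runs
Start-Service msiscsi
mpclaim.exe -r -i -d "MSFT2005iSCSIBusType_0x9"     # claim iSCSI devices for MPIO (reboots the host)
iscsicli.exe QAddTargetPortal 10.0.1.10             # example: the target server's portal address
iscsicli.exe ListTargets                            # note the IQN that the target advertises
# iscsicli.exe QLoginTarget <IQN from ListTargets>  # quick (non-persistent) login to the target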
For my Hyper-V cluster (all the networking was set up), I set up a second target, and permitted the 4 iSCSI NIC IP addresses of the 2 hosts to connect. The first disk I created was a 1GB VHD. This would be for the cluster witness.
Back on each clustered host, I added the Hyper-V role, and added the MPIO and Failover Clustering features. Once again, I enabled support for iSCSI devices in MPIO. On each host, I connected to the target portal IP address and enabled multipath. It found the second (cluster storage) target and did not find the first (VMM storage) target; that's because the VMM storage target did not permit the IP addresses of the clustered hosts' iSCSI NICs to connect. The witness disk appeared on both hosts.
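By the way, the role and feature installation on each node can also be done with the ServerManager PowerShell module instead of the wizard; a minimal sketch (the -Restart is there because Hyper-V and MPIO want a reboot):

Import-Module ServerManager
Add-WindowsFeature Hyper-V, Multipath-IO, Failover-Clustering -Restart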
Now I set up the cluster. The witness disk was added and I renamed it to “Witness Disk” in Failover Clustering.
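For the record, the cluster creation and the witness disk rename can also be scripted with the FailoverClusters PowerShell module. This is a minimal sketch with made-up host names, cluster name, and IP address; I used the GUI, so verify the actual disk resource name before renaming it:

Import-Module FailoverClusters
Test-Cluster -Node Host1, Host2                                # run validation first
New-Cluster -Name HVC1 -Node Host1, Host2 -StaticAddress 192.168.1.50
Get-ClusterResource                                            # find the name of the 1GB disk resource
(Get-ClusterResource "Cluster Disk 1").Name = "Witness Disk"   # assuming it came in as "Cluster Disk 1"
Set-ClusterQuorum -NodeAndDiskMajority "Witness Disk"          # make it the witness if it isn't already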
Now I needed some storage for VMs. Back in the iSCSI target admin console, I created another disk on the "SAN" server of the required size. It was associated with the second (cluster storage) target so the clustered hosts could now see it in Disk Management. I formatted the volume, labelling it as "CSV1", and added it into Failover Clustering, renaming it as "CSV1" in there. CSV was enabled in Failover Clustering, and the CSV1 disk was added as CSV storage.
I repeated that process to create CSV2.
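If you want to script that CSV work, the FailoverClusters module covers it too. A rough sketch, assuming the new disk is picked up as "Cluster Disk 2"; verify that resource name, and the EnableSharedVolumes property, on your own cluster before relying on it:

Import-Module FailoverClusters
Get-ClusterAvailableDisk | Add-ClusterDisk            # add the newly presented disk to the cluster
(Get-ClusterResource "Cluster Disk 2").Name = "CSV1"  # rename it to match the volume label
(Get-Cluster).EnableSharedVolumes = "Enabled"         # turn on Cluster Shared Volumes
Add-ClusterSharedVolume -Name "CSV1"                  # add the disk as CSV storage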
A couple of VMs later and I had a fully functioning Hyper-V cluster working with a free Microsoft iSCSI target, running on relatively economical storage.
I found the iSCSI target to be really easy to set up and use. You just need to get used to the idea that you are sharing VHDs instead of LUNs to your iSCSI clients. The performance is OK – it's never going to match a dedicated appliance like a Compellent, P4000, or a CLARiiON. But it sure does beat them on price and quick availability. I had no complaints, but I intend this lab to be a lab, not a production private cloud with hundreds of VMs.
I was asked if I would run performance benchmarks. I thought this would be pointless – you cannot compare something that is intended to run on a huge variety of economical platforms (I'm using a non-dedicated HP 1 Gbps switch in the lab, along with slow SATA disk on a budget storage server) with a pre-set collection of gear like you get with a HP P4000 bundle. Everyone's performance experience of this solution will vary wildly.
This sort of solution is going to be of use in two scenarios:
- Demonstrations and training labs: If you need to try something out quickly or show clustering in action, you can't beat something that you can run even on a laptop and is free to download and use.
- Low end, budget production clusters: No, it cannot match a storage appliance or even other paid-for iSCSI software solutions for features or performance, but I bet you that many low end, 2 or 3 node cluster owners would prefer economy over features. Not everyone needs snapshots replicating to a remote site, you know!
Give it a look-see and find out for yourself what it can do. You might have an EVA 8000 series or some monster Hitachi SAN for production – but maybe something like this could be useful in a test lab?
Some Downloads For You To Consider: iSCSI and SCVMM
Microsoft was busy yesterday and released a bunch of downloads that you might be interested in.
Microsoft iSCSI Software Target 3.3
One of the challenges of trying out things like Hyper-V and clustering in a lab is the storage. SANs are expensive. There are solutions like Windows Storage Server (sold as OEM on storage appliances) and StarWind (an economical and highly regarded iSCSI target that you install on Windows Server).
Now, if you want a simple iSCSI target that you can download and install, you can do it. Jose Barreto blogged about (with instructions) the Microsoft iSCSI Software Target 3.3 being available to the general public. This was previously only available as a part of Storage Server. Now you can download it and install it on a Windows Server 2008 R2 machine to create a simple iSCSI storage solution. So if you want a quick and cheap "SAN" to try out clustering … you got it!
This isn’t limited to the lab either. The iSCSI target is supported in production usage. So if you need a cheap shared storage solution for a cluster, this is one way you can go. Sure, it won’t match a SAN appliance for functionality or performance, and the likes of StarWind and Datacore offer other features, but this opens up some opportunities at the lower end of the market.
SCVMM 2012 MpsRpt Beta Tool
A lot of people are trying out the beta for System Center Virtual Machine Manager 2012. I keep telling people that the virtualisation folks in Microsoft are serious about gathering and acting on feedback. This is evidence of that. This tool will enable support for collecting trace logs in SCVMM 2012 Beta. Documentation is available here.
System Center Virtual Machine Manager 2008, 2008 R2, and 2008 R2 SP1 Configuration Analyzer
This tool has been updated to add support for SCVMM 2008 R2 SP1.
“The VMMCA is a diagnostic tool you can use to evaluate important configuration settings for computers that either are serving or might serve VMM roles or other VMM functions. The VMMCA scans the hardware and software configurations of the computers you specify, evaluates them against a set of predefined rules, and then provides you with error messages and warnings for any configurations that are not optimal for the VMM role or other VMM function that you have specified for the computer.
Note: The VMMCA does not duplicate or replace the prerequisite checks performed during the setup of VMM 2008, VMM 2008 R2, or 2008 R2 SP1 components.”
Event: Private Cloud Academy – DPM 2010
The next Private Cloud Academy event, co-sponsored by Microsoft and System Dynamics, is on next Friday, 25th March 2011. At this free session, you'll learn all about using System Center Data Protection Manager (DPM) 2010 to back up your Hyper-V compute cluster and the applications that run on it. Once again, I am the presenter.
I'm going to spend maybe a third of the session talking about Hyper-V cluster design, focusing particularly on the storage. Cluster Shared Volume (CSV) storage-level backups are convenient, but there are things you need to be aware of when you design the compute cluster … or face the prospect of poor performance, blue screens of death, and a P45 (pink slip, aka getting fired). This affects Hyper-V when being backed up by anything, not just DPM 2010.
With that out of the way, I’ll move on to very demo-centric DPM content – I’m spending most of next week building the demo lab. I’ll talk about backing up VMs and their applications, and the different approaches that you can take. I’ll also be looking at how you can replicate DPM backup content to a secondary (DR) site, and how you can take advantage of this to get a relatively cheap DR replication solution.
Expect this session to last the usual 3-3.5 hours, starting at 09:30 sharp. Note that the location has changed; we’ll be in the Auditorium in Building 3 in Sandyford. You can register here.
Hyper-V Cluster: Be Careful With Your Protocol Bindings
Failover clustering isn't exactly fussy about what networks it uses. That can be troublesome, especially when people are buying servers with lots and lots of NICs. Document everything, and only use what you need. Here are just a few tips:
Tip #1
Label your network connections with something descriptive such as “Parent”, “CSV”, “VM1”, “LM”, or “iSCSI1”, instead of the useless “Local Area Connection 2”. This allows you to track what is doing what.
Tip #2
Disable unused NICs. They just clutter up stuff all over the place. And they can cause a nightmare when they are patched into DHCP networks.
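For tips #1 and #2, netsh will do the renaming and disabling if you'd rather script it than click around in Network Connections; the connection names below are obviously just examples:

netsh interface set interface name="Local Area Connection 2" newname="iSCSI1"
netsh interface set interface name="Local Area Connection 7" admin=disabled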
Tip #3
Do not disable IPv6, even if you have no IPv6 on your network. MS will support you, but it's recommended that IPv6 is left bound to all the physical NICs in your cluster nodes. I recently had that discussion with MCS on a customer site; reaching out to Redmond gave us this recommendation.
Tip #4
Disable everything except for the Hyper-V virtual switch protocol on the host NICs that are used for VM networking, once you verify that they are patched into the right network(s). This is to prevent the host being an accidental participant on a guest's network if that VLAN has a DHCP scope. It also keeps things tidy.
Tip #5
Unbind everything except for TCP on the iSCSI network (which should be a dedicated network for iSCSI with dedicated switches). I found that you can get some weird funnies like CSV suddenly not cooperating if you don’t.
CIOs Delaying Virtualisation Because They Don't Trust Backup
With some incredulity, I just read a story on TechCentral.ie where Veeam says that:
“44% of IT directors say they avoid using virtualisation for mission-critical workloads because of concerns about backup and recovery. At the same time, only 68% of virtual servers are, on average, backed up, according to the study of 500 IT directors across Europe and the US”.
That's pretty damned amazing. Why do I say that? Because I know of at least one MS partner here in Ireland that sells Hyper-V precisely because it makes backups easier and more reliable.
Hyper-V features a Volume Shadow Copy Service (VSS) provider. This allows compatible backup solutions (there are plenty out there) to safely back up VMs at the host level. This means that backing up a VM, its system state, its applications, and its data is a simple backup of a few files (it's a bit more complicated than that under the hood). From the admin's perspective, it's just like backing up a few Word documents on a file server.
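You can see this for yourself on a host with the Hyper-V role installed: list the VSS writers from an elevated prompt and you should find a "Microsoft Hyper-V VSS Writer" entry in the output.

vssadmin list writers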
Here's the cool bit. When a Hyper-V VM is quiesced, the VSS writers within the VM also kick in. Any file services, Exchange services, SQL, and so on, are all put into a safe state to allow a backup to take place with no service interruption. Everything is backed up in a safe, consistent, and reliable manner. The result is that the organisation has a backup of the entire VM that can be restored very quickly.
Now compare restoring a VM from a few files with doing a complete restoration of a physical server when some 2-5 year old piece of tin dies. You won't get identical hardware and will have lots of fun restoring it.
BTW, if a physical piece of tin suddenly dies in a Hyper-V cluster then the VM just fails over to another host and starts working there. There’s no comparison in the physical world. Sure you can cluster there but it’ll cost you a whole lot more than a virtualisation cluster and be a lot more complicated.
Sounds good? It gets better. Backing up a Hyper-V cluster at the host level is actually not a good idea (sounds odd that something good starts with something bad, eh?). This is because a CSV will go into redirected mode during the backup to allow the CSV owner complete access to the file system. You get a drop in performance as host I/O is redirected over the CSV network via the CSV owner to the SAN storage. We can eliminate all of that and simplify backup by using VSS-enabled storage. That means choosing storage with hardware VSS providers. Now you back up LUNs on the SAN instead of disks on a host. The result is quicker and more reliable backups, with less configuration. Who wouldn't like that?
CA Report on Downtime
I've just read a news story on Silicon Republic that discusses a CA press release. CA are saying that European businesses are losing €17 billion (over $22 billion) a year in IT downtime. I guess their solution is to use CA software to prevent this. But based on my previous experience working for a CA reseller, being certified in their software, and knowing what their pre-release testing/patching is like, I would suspect that using their software will simply swap "downtime" for "maintenance windows" *ducks flying camera tripods*.
What causes downtime?
Data Loss
The best way to avoid this is to back up your data. Let's start with file servers. Many administrators either don't know about VSS or have decided not to turn it on to snapshot the volumes containing their file shares. If a user (or power user) or helpdesk admin can easily right-click to recover a file, then why the hell wouldn't you use this feature? You can quickly recover a file without even launching a backup product console or recalling tapes.
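Turning it on takes a couple of minutes, either on the Shadow Copies tab of the volume or from an elevated prompt. A rough sketch for a D: data volume, with an example storage size; check the vssadmin syntax on your own build, and schedule the snapshot command with Task Scheduler (for example at 07:00 and 12:00):

vssadmin add shadowstorage /for=D: /on=D: /maxsize=10GB
vssadmin create shadow /for=D: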
Backup is still being done direct to tape with the full/incremental model. I still see admins collecting those full/incremental tapes in the morning and sending them offsite. How do you recover a file? Well VSS is turned off so you have to recall the tapes. The file might not be in last night’s incremental so you have to call in many more tapes. Tapes need to be mounted, catalogued, etc, and then you hope the backup job ran correctly because the “job engine” in the backup software keeps crashing.
Many backup solutions now use VSS to allow backups to disk, to the cloud, to disk->tape, to disk->cloud, or even to disk->DR site disk->tape. That means you can recover a file with a maximum of 15 minutes loss (depending on the setup) and not have to recall tapes from offsite storage.
High Availability
Clustering. That word sends shivers down many spines. I started doing clustering on Windows back in 1997 or thereabouts, using third party solutions and then Microsoft Wolfpack (NT 4.0 Advanced Server or something). I was a junior consultant and used to set up demo labs for making SQL and the like highly available. It was messy and complex. Implementing a cluster took days and specialist skills. Our senior consultant would set up clusters in the UK and Ireland, taking a week or more, and charging the highest rates. Things pretty much stayed like that until Windows Server 2008 came along. With that OS, you can set up a single-site cluster in 30 minutes once the hardware is set up. Installing the SQL service pack takes longer than setting up a cluster now!
You can cluster applications that are running on physical servers. That might be failover clustering (SQL), network load balancing (web servers), or using in-built application high availability (SQL replication, Lotus Domino clustering, or Exchange DAG).
The vast majority of applications should now be installed in virtual machines. For production systems, you really should be clustering the hosts. That gives you host hardware fault tolerance, allowing virtual machines to move between hosts for scheduled maintenance or in response to faults (move after failure or in response to performance/minor fault issues).
You can also implement things like NLB or clustering within virtual machines, because even clustered hosts leave an internal single point of failure: the guest OS and its services. NLB can be done using the OS or using devices (use static MAC addresses). Using iSCSI, you can present LUNs from a SAN to your virtual machines so that they can run failover clustering, which allows the services that they run to become highly available. So now, if a host fails, the virtualisation clustering allows the virtual machines to move around. If a virtual machine fails, then the service can fail over to another virtual machine.
Monitoring
It is critical that you know an issue is occurring or about to occur. That’s only possible with complete monitoring. Ping is not enterprise monitoring. Looking at a few SNMP things is not enterprise monitoring. You need to be able to know how healthy the hardware is. Virtualisation is the new hardware so you need to know how it is doing. How is it performing? Is the hardware detecting a performance issue? Is the storage (most critical of all) seeing a problem? Applications are accessed via the network so is it OK? Are the operating systems and services OK? What is the end user experience like?
I’ve said it before and I’ll say it again. Knowing that there is a problem, knowing what it is, and telling the users this will win you some kudos from the business. Immediately identifying a root cause will minimize downtime. Ping won’t allow you to do that. Pulling some CPU temperature from SNMP won’t get you there. You need application, infrastructure and user intelligence and only an enterprise monitoring solution can give you this.
Core Infrastructure
We're getting outside my space, but this is the network and power systems. Critical systems should have A+B power and networking. Put in dual firewalls and dual paths from them to the servers. Put in a diesel generator (with fuel!), a UPS, etc. Don't forget your aircon; you need fault tolerance there too. And it's no good just leaving it there: these systems need to be tested. I've seen a major service provider have issues when these things did not kick in as expected due to some freakishly simple circumstances.
Disaster Recovery Site
That’s a whole other story. But virtualisation makes this much easier. Don’t forget to test!
Windows Server 2008 R2 Hyper-V CSV and NTLM
I went to my first IT conference in April 2004 – it was WinConnections in Vegas. It was there I heard people like Mark Minasi, Steve Riley, and Jeremy Moskowitz speaking for the first time. It was there that I started thinking beyond the off-the-shelf text book and training course. One of the things that came up was authentication security. Active Directory could use NTLM, NTLMv2, or Kerberos, with the latter being the most secure, and the former being not so good (I think they put it in stronger terms).
The advice was to disable NTLM authentication across the network using GPO. I’ve heard it dozens of times since. It seems to be accepted best practice. I’ve seen it deployed countless times.
We Hyper-V engineers/administrators are going to have a problem with that. Cluster Shared Volume (CSV, the Windows Server 2008 R2 shared file system for clustered Hyper-V hosts) uses NTLM authentication between the hosts. Enabling a policy to disable NTLM will break CSV and cause the following alert:
- ID: 5121
- Source: Microsoft-Windows-FailoverClustering
- Version: 6.1
- Symbolic Name: DCM_VOLUME_NO_DIRECT_IO_DUE_TO_FAILURE
- Message: Cluster Shared Volume ‘%1’ (‘%2’) is no longer directly accessible from this cluster node
This is another situation where security auditors will try to enforce a policy that will break things for us (the other is antivirus on the host). You will need an exception to this policy for the clustered Hyper-V hosts. You can do this by using a security group to filter the offending policy so that it does not apply to the clustered hosts' computer objects; that may mean splitting the setting out into a GPO of its own. Alternatively, you can create and link another GPO that applies just to the clustered hosts and leaves NTLM enabled for them.
Hyper-V Cluster with Different Capacity Hosts
Last week I was asked how you would introduce new, bigger Hyper-V hosts to a cluster. For example, there was a time (not long ago) when the sweet spot for RAM in a host was 32GB. You might have a number of these hosts in a cluster; a cluster with 8 or fewer hosts would typically have 1 redundant node with 32GB RAM. If one host fails, then the redundant host can take up the slack.
In reality, virtual machines will be running across all of the hosts in a load balanced environment. You will have the equivalent of 1 host in redundant capacity.
A cluster with 9-16 nodes will probably have 2 redundant hosts (or equivalent capacity).
Say I have 5 hosts with 32GB RAM, with 1 of those being redundant. Now I can purchase hosts with 64GB of RAM at a decent price because servers have many more memory slots and I don't need to buy the exponentially more expensive 8GB or 16GB memory boards. Can I buy just one of those servers and add it to the existing cluster of 32GB RAM hosts? Sure you can. But you will have trouble when you add more VMs to it than you could add to a 32GB RAM host.
Let's put it this way: say I have 2 * 1 gallon buckets and 1 gallon of water. I can put half a gallon in each bucket, or I can pour everything into one bucket so I can wash or repair the other. Either way, I always have 1 gallon of water … clustered between my 2 * 1 gallon buckets. Now I want to carry much more water, so I buy a 2 gallon bucket to add to my collection of buckets. I have 1 * 1 gallon bucket that is full. I have 1 * 2 gallon bucket that is full. I have 1 empty 1 gallon bucket as a spare. But it can only be a spare for the other 1 gallon bucket. If I need to wash or repair the 2 gallon bucket and pour its contents into the spare 1 gallon bucket, I will have to throw away 1 gallon.
The same goes for Hyper-V hosts (purely on RAM capacity). A 32GB RAM host cannot offer full redundancy for a 64GB RAM host. Half of the virtual machines can migrate and stay running/reboot, but the other half will not be able to start.
Here’s what you can do in a growth scenario:
- Add a single 64GB RAM host. You don't need to add more hosts while it hosts no more than the capacity of a 32GB RAM host (probably around 28GB of committed virtual machine RAM; I say committed because Dynamic Memory makes things a little more complicated).
- Once the 64GB RAM host exceeds the capacity of a 32GB RAM host, you will need to add a second 64GB RAM host. This will provide you with the capacity to offer redundancy for all running virtual machines on the original 64GB RAM host.
I haven’t got the h/w to test this out but I suspect that VMM will scream at you about loss of redundancy in the cluster if you don’t do this right.
Cluster Resource Anti-Affinity
I recently learned from Hans Vredevoort that it is actually possible to define anti-affinity for Hyper-V virtual machines on a cluster. For example, you might want to force load-balanced virtual web servers to be on different nodes. You can do this by running commands such as:
cluster.exe group "VirtualWebServer1" /prop AntiAffinityClassNames="NLBCluster1"
cluster.exe group "VirtualWebServer2" /prop AntiAffinityClassNames="NLBCluster1"
This puts both virtual web servers into an anti-affinity class called NLBCluster1 and tries to prevent them from being placed on the same Hyper-V host in the cluster. Note that a failover with reduced capacity can override this in order to keep the virtual machines running when there aren't enough hosts left to meet demand.
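If you prefer PowerShell over cluster.exe, the FailoverClusters module can set the same property. AntiAffinityClassNames is a multi-valued string, hence the StringCollection below; the group names are the same example names as above:

Import-Module FailoverClusters
$aa = New-Object System.Collections.Specialized.StringCollection
$aa.Add("NLBCluster1") | Out-Null
(Get-ClusterGroup "VirtualWebServer1").AntiAffinityClassNames = $aa
(Get-ClusterGroup "VirtualWebServer2").AntiAffinityClassNames = $aa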