Windows Server Boot From Fibre Channel SAN Whitepaper

Microsoft has made available a whitepaper that discusses the concept of booting from SAN:

“Booting from a storage area network (SAN), rather than from local disks on individual servers, can enable organizations to consolidate their IT resources, minimize their equipment costs, and realize considerable management benefits by centralizing the boot process. This white paper covers boot from SAN technology as deployed with Microsoft® Windows Server 2008 and Microsoft® Windows Server 2003. It describes the advantages and complexities of the technology, and a number of key SAN boot deployment scenarios”.

This is something we’ve done at my last two jobs using HP Blades and SAN/Ethernet Virtual Connects (VC).  All the SAN and Ethernet connections come into the back of the chassis and terminate on the VCs.  There you define the networks and SAN connections.  You build a profile for each server that defines which cable/logical connection is presented to each physical interface on the server.  This profile is mobile, i.e. you can move it from one server to another with the click of a button.  In the VC, the profile defines the WWN and MAC addresses that the SAN fabric and the Ethernet network will see, instead of the ones physically assigned to the server.  This virtualises the connections, allowing the profile to move without losing access to anything: Ethernet ARP tables remember the MAC address from the VC profile rather than the blade server’s own MAC address, and the SAN fabric is zoned using the WWN from the VC profile.
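
To make the idea of a mobile profile concrete, here’s a minimal sketch of it as a data structure.  This is purely illustrative Python; the class and method names are mine, not HP’s Virtual Connect API.

```python
# A minimal, purely illustrative model of a Virtual Connect server profile.
# The class and method names are invented for this sketch; they are not HP's API.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ServerProfile:
    name: str
    virtual_macs: List[str]        # MACs the Ethernet fabric sees, not the blade's burnt-in ones
    virtual_wwns: List[str]        # WWNs the SAN fabric zones against, not the HBA's factory WWNs
    bay: Optional[int] = None      # enclosure bay the profile is currently assigned to

@dataclass
class Enclosure:
    bays: Dict[int, Optional[ServerProfile]] = field(default_factory=dict)

    def assign(self, profile: ServerProfile, bay: int) -> None:
        """Bind a profile to a bay; the blade in that bay now presents the
        profile's virtual MAC and WWN addresses to the LAN and SAN."""
        if self.bays.get(bay) is not None:
            raise ValueError("bay %d already has a profile" % bay)
        profile.bay = bay
        self.bays[bay] = profile

    def move(self, profile: ServerProfile, new_bay: int) -> None:
        """Reassign the profile to another bay. Because ARP tables and SAN
        zoning key on the profile's virtual addresses, the new blade looks
        identical to the old one from the network's point of view."""
        if profile.bay is not None:
            self.bays[profile.bay] = None
        self.assign(profile, new_bay)
```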

Now I configure the blade to boot from its SAN HBA mezzanine card instead of the internal SCSI.  I prefer QLogic over Emulex; I had a very high failure rate with Emulex in the previous job.  That requires setting up a disk in the SAN which will be the boot disk.  This disk is only presented to the server … via the WWN defined in the profile in the VC … which is what the SAN sees the HBA using, thanks to the VC virtualising the communications.  I install the OS and it sits on the SAN disk.  The blade has no internal disks.  It’s now just a dumb, replaceable appliance with no data.
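
The key point is that the array presents the boot LUN to the virtual WWN from the profile, not to the HBA’s factory WWN.  A rough sketch of that masking idea follows; the WWN and LUN names are made-up examples, not values from any real array.

```python
# Rough sketch of LUN masking for boot from SAN. The WWN and LUN names below
# are invented examples, not real values from any array.
BOOT_LUN_MASKING = {
    # initiator WWN (taken from the VC profile, not the physical HBA)
    "50:06:0b:00:00:c2:62:00": ["LUN_00_boot_host01"],
}

def luns_presented_to(initiator_wwn):
    """Return the LUNs the array presents to a given initiator WWN.
    Whichever blade carries the profile (and therefore this WWN) sees
    the same boot disk; a blade without it sees nothing."""
    return BOOT_LUN_MASKING.get(initiator_wwn, [])

print(luns_presented_to("50:06:0b:00:00:c2:62:00"))   # -> ['LUN_00_boot_host01']
```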

Why do this?  Well, as I said, the server is now an appliance.  Other than its capital value it has no other value to the business, unlike a server that contains internal disks and whose physical MAC and WWN are presented to the network/SAN.  Replacing a traditional server with new hardware is a big ordeal involving lots of downtime.  For me, I can either keep a hot spare server or get HP to bring in a replacement (in 4 hours).  If I keep a hot spare I can log into the VC, move the profile of the failed server over to the hot spare and boot it up via ILO.  It then becomes the old server: the OS is on the SAN, the WWN and the MAC are defined in the VC profile, and I’ve moved all definitions and connections over to the new server via the VC profile.  Alternatively I can pull out the old server and insert a new one.  The profile is associated with a blade enclosure bay, so the new server then becomes the old one automatically.
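
For what it’s worth, the hot-spare swap boils down to a short, repeatable sequence.  The function below is just an outline of those steps; the names are placeholders I’ve invented, not VC or iLO APIs, and in practice each step is a click in the VC/iLO consoles.

```python
# Outline of the hot-spare swap described above. Nothing here calls a real
# HP API; each "step" is a click in Virtual Connect or iLO in practice.
def fail_over_to_hot_spare(profile_name, failed_bay, spare_bay):
    steps = [
        "confirm the blade in bay %d is down (or power it off via iLO)" % failed_bay,
        "unassign profile '%s' from bay %d in Virtual Connect" % (profile_name, failed_bay),
        "assign profile '%s' to the hot spare in bay %d" % (profile_name, spare_bay),
        "power on bay %d via iLO; it boots the same SAN LUN with the same MAC/WWN" % spare_bay,
    ]
    for number, step in enumerate(steps, start=1):
        print("%d. %s" % (number, step))

fail_over_to_hot_spare("web01", failed_bay=3, spare_bay=16)
```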

 

Primer on iSCSI and HP BladeSystem

When I think about blade servers and storage, to be honest, I think about Fibre Channel (FC) SAN.  The Virtual Connect (VC) technology is pretty powerful.  Even today, when talking about high-uptime options for a client, we decided on a “hot spare” blade where we could flip over the VC profile if the original machine died.

Fibre Channel SAN isn’t an option for everyone.  iSCSI is a powerful option, especially with 10Gb Ethernet or Flex-10, as HP brands it in their BladeSystem.  There are a lot of questions you might have about iSCSI and HP BL Proliant servers, so HP has published a handy 3-page FAQ that goes through support and options.  With something like iSCSI, HP Blades and Flex-10 you could possibly set up blade hosts to run Windows Server 2008 R2 Hyper-V with the parent partition on internal SAS disks and the cluster shared volume running on iSCSI.


Power Utilisation Comparison Of Rack VS Blade Servers

This blog post by Data Center Strategies reports on a publication by HP.  HP compared the power usage of DL rack-mounted servers and BL blade servers.  It was … interesting.  When idle, the blades used significantly less power, but when busy there was little difference.

So …

  • If you are building a small to medium power-intensive server farm you might be tempted to go with rack servers instead of blades.  There’s a big cost saving to be made.  Server prices have increased over the last year to compensate for the drop in sales … we need fewer physical boxes because we are virtualising.  Server capacity is up, though.
  • Blades do have some nice features.  There’s a lot less cabling, and hardware virtualisation enables boot from SAN, which turns your physical servers into anonymous, replaceable appliances.  All the intelligence is in the chassis and all the OS/data is on the SAN.
  • As committed to blades/SAN as we are at work, there are still times where we’ve found DL rack servers to be more appropriate, both functionally and cost-wise.

I’ve not looked at the cost of the C3000 “Shorty”.  There’s some cool stuff you can now do with their Flex-10 10Gb networking that enables you to use the C3000 for virtualisation.  The C3000 has 8 slots for server, tape and storage blades.  The problem with the Shorty blades is that they only take one mezzanine card.  That means you can’t do complex virtualisation clusters that could require 6 NICs or more per server.  With Flex-10 you get 10Gb networking in the backplane.  You can divide that up and create virtual NICs on your blades.  Potentially (don’t ask me about support for this because I don’t know) you could have 8 NICs per blade for virtualisation … 2 for the parent partition, 2 for the heartbeat, 2 for VMotion/Live Migration and 2 for the virtual switches.  This could be fine in small deployments, e.g. a branch office.  AFAIK, you could then use iSCSI to mount the shared storage for VMFS/CSV.
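
To illustrate the carving idea, here’s a small sketch that splits a pair of 10Gb ports into the eight virtual NICs described above and checks nothing is oversubscribed.  The role names and bandwidth figures are just example numbers I’ve picked, not a supported or recommended Flex-10 configuration.

```python
# Illustrative Flex-10 style carve-up: two 10Gb ports per blade, each split
# into four virtual NICs. The role names and bandwidth figures are examples only.
PORT_CAPACITY_GB = 10.0

flexnics = {
    # (physical port, role): allocated bandwidth in Gb
    ("LOM1", "parent partition"): 1.0,
    ("LOM1", "cluster heartbeat"): 0.5,
    ("LOM1", "live migration"): 4.0,
    ("LOM1", "virtual switch"): 4.5,
    ("LOM2", "parent partition"): 1.0,
    ("LOM2", "cluster heartbeat"): 0.5,
    ("LOM2", "live migration"): 4.0,
    ("LOM2", "virtual switch"): 4.5,
}

def check_allocation(nics, capacity):
    """Sum the virtual NIC allocations per physical port and flag any port
    that has been carved beyond its 10Gb of backplane bandwidth."""
    totals = {}
    for (port, _role), gb in nics.items():
        totals[port] = totals.get(port, 0.0) + gb
    for port, total in sorted(totals.items()):
        status = "OK" if total <= capacity else "OVERSUBSCRIBED"
        print("%s: %.1f Gb of %.1f Gb allocated (%s)" % (port, total, capacity, status))

check_allocation(flexnics, PORT_CAPACITY_GB)
```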

But you know, if I was building a virtual server farm now with a traditional, known growth limit (not like in hosting, where the growth is hopefully endless) then I’d go with normal rack servers.  There’s a big investment in a blade chassis that is hard to justify now.  On the HP storage side, the LeftHand iSCSI stuff looks very tempting for DR implementations.  It is pricey but it would make DR very easy.

EDIT #1

As expected, HP’s marketing was not very happy with this report.  Some investigations were done and it turns out the rack server configurations weren’t on a par with the blade configurations.  The rack servers only had one power supply and had redundant NICs disabled.  Anything that could be done to reduce their power consumption had been done.

IBM VS HP: Hardware and Service

I’ve been using HP servers and storage most of the time since 2003.  I’ve experienced their support via two channels: partner maintenance contracts and direct support contracts.  Has it always been perfect?  No, but I’ve gotten things sorted.  Typically the issue is resolved within 4 hours, which is perfect.  What I love about HP hardware is how easy it is to manage.  If you use their setup DVD, you can install your OS with all the HP management software and agents.  This lets you configure every aspect of the hardware (with no sacrifices to the gods or black magic required) and the SIM agent will detect any hardware fault.  You can use HP’s free software, their paid software or even their management pack for MOM 2005/OpsMgr 2007 to get alerts.  Heck, the agent can be configured to send SMTP alerts directly.  Each server has an HTTPS-based service running on TCP 2381 that lets you inspect the exact hardware issue, part number and serial number.  The HP event log also gives you a clear explanation of what’s happened.
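
As an aside on those SMTP alerts: an agent-side email alert is conceptually very simple.  The snippet below is only a generic illustration of the idea using Python’s standard library; the hostnames and addresses are invented placeholders, and it is not HP’s agent code.

```python
# Generic illustration of an agent-style SMTP hardware alert using only the
# Python standard library. Hostnames and addresses are invented placeholders.
import smtplib
from email.message import EmailMessage

def send_hardware_alert(component, detail, smtp_host="mail.example.local"):
    msg = EmailMessage()
    msg["Subject"] = "Hardware fault: %s" % component
    msg["From"] = "sim-agent@host01.example.local"
    msg["To"] = "ops-alerts@example.local"
    msg.set_content("Component: %s\nDetail: %s" % (component, detail))
    with smtplib.SMTP(smtp_host) as smtp:   # an agent would use its configured relay
        smtp.send_message(msg)

# Example: what an agent might report for a degraded memory board.
# send_hardware_alert("Memory board 2", "Correctable ECC error threshold exceeded")
```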

At work, we have 6 IBM X servers and 5 IBM DAS storage units.  They were bought before I joined the company and installed by IBM and one of their Irish preferred partners (think of a shape to guess the name – no, it isn’t a rhombus).  The first thing I did was inspect the installation.  I wasn’t familiar with IBM hardware or systems management, so I didn’t fully know what to look for or expect.

Aside from the 13 configuration issues I found on this 6-server installation, e.g. the single domain controller for a mission critical service configured to use an external ISP’s DNS servers as its primary DNS server, I noticed I had no way of locally inspecting the health of the IBM hardware.  I had no way of configuring the disks.  BTW, we spent endless hours fixing those 13 issues and I added Active Directory fault tolerance.

We use OpsMgr 2007 for health and performance monitoring.  HP offers a management pack to integrate with the SIM agent on Proliant servers.  IBM claimed to offer a management pack for IBM Director agents.  I searched high and low for it.  A friend in Holland was also doing the same for a site he was in.  Neither of us could find it.  Every link on the IBM site was dead.  I contacted a sales guy in Ireland.  He sent me a link.  It turns out IBM published it on their Intranet but not on the Internet.  The link wouldn’t work.  Eventually we got the MP after many emails.

One year ago we had a failed disk in one of the IBM DAS storage units.  No worry; we had a support contract with IBM.  Or so I thought.  Only after a week of stress, getting our directors involved and screaming at local IBM sales people did we get our replacement disk.  Here’s the really worrying bit.  The IBM Director agent on the server connected to the DAS box did not pick up the failure.  I discovered the failure when we rebooted the server and it hung on the POST to say there was an issue.

In the meantime we bought HP blades and a SAN.  We had some memory board failures, etc.  Each time, I got an alert from the SIM agent via OpsMgr 2007.  We logged calls with HP via their portal and memory boards were replaced within 4 hours.

Back to April of this year.  One of our 5 IBM DAS units went offline.  One of our engineers logged the call.  IBM support wanted DSA logs before they’d progress the call.  Our engineer sent them in.  IBM Support continued to ask for the logs for the following 2 months!  In the meantime we had escalated the issue to local IBM staff.  Every single person in IBM refused to send anyone out.  We’d already sent in the logs; we resorted to sending them to local staff members.  After 2 months we finally got an engineer out.  The Megaraid controller firmware had a bug.  I wanted the controller replaced, so it was replaced; after 2 months of an outage caused by IBM hardware and “support” I wanted a complete resolution.

In the meantime, there was another memory degradation in a HP blade.  An engineer from a reseller was sent out by HP within 3 hours.  There was zero fuss or downtime (Hyper-V cluster).

A few weeks later we started updating firmware on the Megaraid controllers to avoid this issue.  The first one went OK.  The second one failed.  I got a message in POST about foreign configurations.  I had a choice of importing or continuing.  I didn’t know what to do – I’m not an IBM engineer.  I googled but had no joy.  Our storage was not visible on the server.  I called IBM support expecting this to be a 1 minute conversation.  Instead I was on the phone for 2 hours.  The support engineer barely spoke English.  He decided to have me go through a maze of POST configuration tools.  It was clear from the delays in his instructions that he didn’t know the solution; he was searching for answers on an Intranet portal.  I asked 3 times if he knew what he was doing.  “Yes” was the answer.  After the 3rd time I demanded to speak to his team leader.  More excuses followed from him.  If you know me, you can imagine what my temperament was like at this point.  I actually got the guy flustered enough to admit that no one on his team knew how to resolve this issue on the DAS unit.  Stunning!  I demanded an on-site engineer and one came out later that day.  He pressed 1 button to import the foreign configurations in the POST and the issue was resolved.  That’s all I wanted from the support desk … do I press that button or not?

IBM said they’d come out to upgrade the rest of the firmware to make up for our experience.  Fair enough; they did that within a few days.  During the process a disk failed in one of the DAS units.  Then I saw how my experience differed from theirs.  They logged a call and a disk was out in a day.

A week later (last week) I was in the data centre to do some network engineering.  I checked on the rack with the IBM gear.  Uh-oh.  There was another disk failure in a DAS unit, and another DAS chassis had an alert light.  IBM Director picked up neither issue.  I logged 2 calls, one for each issue.  That was Thursday.  A few hours later the IBM support desk called me.  A replacement for the failed disk was not in stock in Ireland; we’d get a replacement 2 days later once it was shipped from Holland.  What!!!!  Our MD wasn’t happy and went straight to IBM to complain.  Suddenly the disk was going to be replaced the following morning.  It seems to me that IBM was just delaying in some way to reduce shipping costs.

As for the alerting DAS chassis?  It’s now the following Wednesday and IBM still hasn’t followed up.  I sent in the DSA logs 17 minutes after IBM asked for them last Thursday.  They started this rubbish about not having received them.  Ah, but boys, didn’t you see that I CC’d another of our engineers, our MD and 3 people in IBM Ireland?  I ain’t accepting the BS you’re using to delay action.  According to an email from IBM, I should have used an FTP site to upload the logs as my first choice.  I tried.  Without logging in I had no access to the folder in question, and I was given no credentials.  So then I tried anonymous with my email address as the password (thank God for green-screen education in college).  I navigated to the folder but was refused permission to upload.  The delaying rubbish about the logs continued up to Monday.  Then an engineer called to ask for the logs.  I exploded over the phone.  I got onto his team leader (the same guy as before) and suggested that maybe the lot of them should be fired and that Lotus Notes was a pile of steaming ****.  I wasn’t sending in the logs again.  It had been done once, and I told him he could get them from one of the 3 IBM people in Ireland that I’d CC’d.  That went down well 🙂

30 minutes later the field service manager for IBM Ireland called me.  More of the same.  I really don’t care.  “Would I go to a meeting to learn more about IBM?”.  Why the f**k would I want to do that?  I have no time for that BS.  I don’t tolerate sales people; I don’t take their calls because I have no time for crap.  JUST FIX THE DAMNED DAS BOX!  He promised to forward the DSA logs to the support desk.  That was 2 days ago.  Nothing has happened since.

Oh sorry it has, that manager has tried to go above my head to our MD to get us out to talk about IBM.  Oh you sad bugger.  That was the wrong move.  In fact, that pushed me over the edge.  Trying to outmanoeuvre me while still not sending anyone out to fix the DAS unit is the sort of BS I don’t accept from anyone.

So here’s how I compare HP and IBM:

                          IBM                                  HP
Sales                     Awful                                Pretty good
Product quality           Awful                                Good
Management of hardware    Awful                                Excellent
Support                   Beyond awful – think BBC Watchdog    Very good

Planning Works Out

If you’ve ever seen the back of a server rack that I’ve cabled then you’d never let me even plug in a power lead to a kettle.  I am horrible at cabling.  Simply awful at it.  Those probably aren’t strong enough phrases to be honest.  That’s one of the reasons I like blade/SAN technology; there’s a minimal amount of cabling and it’s all usually done by an expert engineer who’s installing the blade chassis and the SAN.  When we put in our gear, I made sure it was!

The engineer did a nice job of labelling everything.  All lead placements were planned.  We’ve a network mesh going back to our access switches from the blade Ethernet Virtual Connects.  There’s a divergent path between the blade fibre Virtual Connects, the fibre switches and the SAN chassis units.  Each server has dual-channel HBA mezzanine cards, and power is split between circuit A and circuit B in each rack.  That means we can lose a circuit and still be operational.  Adding servers doesn’t require more cabling – only adding a chassis does, and then I’ll get the engineer to do the work 🙂
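
The point of all that planning is that every server keeps at least two independent paths for power, Ethernet and fibre channel.  A toy check of that property looks something like the sketch below; the component and path names are invented examples, not our actual layout.

```python
# Toy redundancy check: every component should have at least two independent
# paths for each fabric. Component and path names are invented examples.
paths = {
    ("blade-bay-01", "power"):    ["circuit A PDU", "circuit B PDU"],
    ("blade-bay-01", "ethernet"): ["VC-Enet module 1", "VC-Enet module 2"],
    ("blade-bay-01", "fibre"):    ["VC-FC module 1 -> fabric A", "VC-FC module 2 -> fabric B"],
}

def single_points_of_failure(path_map):
    """Return every (component, fabric) pair that would go dark if one
    path failed, i.e. anything with fewer than two independent paths."""
    return [key for key, routes in path_map.items() if len(routes) < 2]

spofs = single_points_of_failure(paths)
print("No single points of failure" if not spofs else "SPOFs: %s" % spofs)
```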

Note: We went with Brocade mezzanine cards instead of the Emulex ones.  At my last job we had 128 HP BL460C blades with Emulex HBAs.  I’d say at least a quarter of the HBAs had to be replaced in the month before we went into production.  I spoke with an engineer from the reseller recently and he said they were still failing regularly.  We haven’t had any issues with the Brocade ones.

We put the power and fibre channel fault tolerance to the test today.  We needed to replace 2 Power Distribution Units (PDUs).  They have management boards on them that the data centre doesn’t use; instead they have an out-of-band management system.  The management boards faulted, so we had annoying alarm lights and sirens.  We often bring people in for a show’n’tell during pre-sales, so alarms are not good, even when they mean nothing, which these did.  The data centre power management system and our OpsMgr 2007 HP management packs would have told us if we had a real power issue.

We scheduled the replacement for this afternoon.  Outages are out of the question for the mission critical services we provide to our managed server hosting customers.  We swapped out the PDUs with the alarms.  Not a single flicker of a problem was seen.  I watched the OpsMgr console for alerts while I was logged into a few VMs (stored on the SAN) running tests.  The MPIO fault tolerance (Windows Server 2008 SP2) and the power fault tolerance of the SAN/blades worked.

I was pretty confident there wouldn’t be an issue.  Everything was tested by the HP engineer when we did the installation last year.  All the hardware was looking healthy and the “board” was green in OpsMgr 2007.  This just shows how a little bit of planning before you plug things in, and a little testing afterwards, works in your favour.

IBM Sucks

Before I joined the company I work for, they’d bought some IBM servers and DAS storage units.  These were built up to host an application that is clustered to the point where we can lose 66% of the infrastructure and still be 100% operational with no loss in performance.

A little while ago, one of those DAS units went offline.  All the disks appeared offline.  I suspected either a dead backplane, a SCSI cable or the controller card in the attached server.  One of the engineers at work opened a support case with IBM.  You see, we paid for that 4 hour on-site support contract, so we expected to have that unit back online by the end of the day.  I should have learned from my previous experience with IBM last year, when it took a week to get a replacement disk sent out to us.

22 days after we opened the call, we finally got an engineer on-site.  The SCSI card had failed.  Heck, we were even told that the IBM SCSI cards “sometimes lose their configuration”.  WHAT THE F**K????????  I’m sorry, but in 16 years of working with servers from Amdahl, Fujitsu and HP I’ve never had that happen.  Never.  The way the guy said this to me made it sound like a fait accompli.  How in the H-E-Double-Hockey-Sticks (enough swearing in this post so far) is it in any way acceptable that a critical piece of mission critical hardware is allowed to fail like this, and that it’s just tolerated?

Luckily we do keep triplicate copies of all data on independent stores so there was zero risk of data loss.

22 days after we called on our “4 hour on site” contract an engineer finally came out to resolve the issue.  22 days.  22 DAYS!  Now that’s some fantastic support from Big Blue (can you smell the sarcasm?). 

It’s clear to me.  IBM sucks.  The hardware sucks.  Their support sucks.  I’ve called on their support twice in a year and both times they sucked.

I know of one company that recently had an awful experience with the IBM S series blade chassis.  Networking didn’t work.  Someone came out to try to fix it and couldn’t.  The chassis was replaced and it still didn’t work.  And IBM like to make comments about the very simple HP blade chassis backplane because it has no intelligence.  At least it has fewer parts to break and works reliably.

I’m amending my advice for buying IT products.  It generally came in the form of “never buy software in yellow boxes”.  My new piece of advice: “Never buy from a company that made typewriters”.  I’ve been using HP for 5 years now and I’ve never had an experience like the one I’ve had from IBM.

HP Proliant G6 Servers

Scott Lowe has posted an article on the HP Proliant G6 servers.  It looks like we’re getting more performance, new processors, better power/cooling management and interchangeable power supplies (yay!).

If asked to choose between the 4 major brands then here’s my list:

4) Fujitsu Siemens: Rubbish in my experience, and no cooperation with others, e.g. no Microsoft System Center integration when I last looked for it.  Back in 2003 we had a branch office that insisted on buying from this company.  My boss relented.  We told them everything had to be W2003 certified because that’s what we were installing.  The branch passed that on and bought the gear.  The onboard SCSI controllers were only W2K certified.

3) IBM: Awful stuff from a has-been.  I’ve had an awful time with their support desk.  I found it impossible to find their OpsMgr 2007 management pack.  And their native hardware monitoring is pitiful.  Who wants a server that fails to reboot because 1 disk in a RAID array has failed?  I’d rather it reboot and alert me.  There is also that lingering concern about them selling the server business to Lenovo.  They fall off the chart if that happens.

2) Dell: Not for blades, though; they can’t make up their minds if they are in or out of that market.  Excellent management through MS System Center.  Pretty economical.  Not sure about support.

1) HP: Support is not perfect – why can’t their India office (a) act professionally and (b) get a phone line that works properly?  But it’s the best hardware I’ve used, with excellent integration with Microsoft System Center.  Easy-to-use ILO (IP KVM), Virtual Connect (blade connection virtualisation), EVA Command View (SAN management) and Insight Manager agents that pick up everything.  I wish they’d catch up with MS and produce versions of everything that work on a Core installation.

No, neither Sun nor Cisco appeared here.  Not on my list.  I still remember the days when it was cheaper to buy a BMW than a stick of RAM for a Sun server.  And Cisco are too new in this market.

EDIT #1:

You can watch videos of the launch event here.