I inherited a number of IBM servers with this job. They perform a critical business service for our customers. Luckily, the architecture we use is very fault tolerant.
Over the weekend we deployed updates in a staged manner to our production network – after testing of course. On Sunday morning, I woke up to an email from System Center Operations Manager 2007 (gotta love it!) saying that one of the servers we patched on Saturday night was not responding to agent heartbeat requests. Uh oh! This was one of those IBM boxes. We have triplicate redundancy so I knew I could let it wait until Monday morning. To be safe, I suspended updates for the remaining production boxes. I didn’t suspect an update but I wasn’t taking any chances.
I came into the data centre this morning and found the server sitting on a BIOS prompt. Hmm. That’s not good. It had detected a problem with the external disk storage and was waiting for administrator approval to boot up. What? Hello? Note: the failure had nothing to do with the server’s internal boot disks.
I checked the Direct Attached Storage (DAS) and it was all green. I booted up the server and saw the DAS was not being connected. I shut down the server and powered down the DAS. I powered up the DAS and was greeted with beeping … non-stop beeping. The front panel now showed a chassis alert on the DAS and one of the disks in the RAID5 array was alerting as well. Huh!?! Why didn’t it tell me this when the server already knew there was a problem?
I powered up the server. Now it didn’t prompt me. But it did tell me the external disk was degraded. Fine, the hardware knows there’s a problem.
I logged in and found there were no hardware logs or any sort of interface into the IBM Director agent. Nothing. Sweet F.A. The consultants (before my time) who installed the hardware had set up an IBM Director console on another box for centralised monitoring. I logged into it and sure enough, there were no alerts. Hold on a *beep*ing minute; the hardware knows there’s a problem but the monitoring agent from the hardware vendor doesn’t have a clue?
OK, maybe it was the central console at fault? I’ve never trusted it. I went on to the SCOM console but found no alerts or health degradation on the IBM Director monitors. That made it certain in my mind: the IBM Director agent was clueless.
So, based on this little story, here’s my summary of why I would recommend that people steer clear of IBM hardware in an enterprise deployment:
- The DAS failed to show an alert on the front panel or disk despite the server not being able to boot up because it detected a failure.
- The IBM Director agent failed to report an incident of any kind.
- There’s no user interface to the IBM Director agent on the server.
- A failure of a single disk in a RAID5 array in a DAS caused a server not to boot up. That’s just stupid.
- We’ve all heard that Lenovo are taking over the server and storage business. My experience of their support was awful: a call was open for around 4 months, and for 2 of those months the regional director was taking a personal interest.
I’m now left wondering how long I’ve had a failed disk on this server considering it didn’t give any monitoring alert or visible notification until I reset the DAS chassis.
How would HP handle this?
- The SIM agent would have alerted on this and shown it in the HP SIM log and in the SIM web page on the server.
- The HP SCOM management pack for SIM would have alerted and sent all of the required/responsible administrators/operators/"business owners" a notification of the failure.
- The disk would have shown an alert light immediately.
- It’s unlikely that the server would have been prevented from booting up unless there was a complete failure of the boot disk.
- I would have had the storage back to a healthy state within 4 hours of opening a call with HP.
That’s a very different experience and one you expect to have from enterprise class servers and storage.
As you can guess, I was concerned with the lack of h/w monitoring that the IBM Director agent gave me. The horrid response from the MD was that we’d have to check manually, every day, that the logical disks in question were present. Yuk! I’d a better idea: let SCOM do the work for me. I’ve created a distributed application that includes all the dependencies I can think of for this service, including the presence and health of the logical disk in question.
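For anyone without SCOM to hand, the daily "is the logical disk still there?" check can at least be scripted rather than done by eye. What follows is only a rough sketch in Python, not what I built in SCOM: the function names are made up for illustration, and it only proves the volume is visible to the operating system, which says nothing about the health of the RAID array behind it.

```python
import os
import shutil


def check_volumes(paths):
    """Return a dict mapping each expected volume path to a status string.

    "ok"         - the path exists and the OS can report its disk usage
    "missing"    - the path is not present (for example, the DAS never attached)
    "unreadable" - the path exists but querying it raised an OS error
    """
    results = {}
    for path in paths:
        if not os.path.isdir(path):
            results[path] = "missing"
            continue
        try:
            # Asking for usage forces a real query against the volume.
            shutil.disk_usage(path)
        except OSError:
            results[path] = "unreadable"
        else:
            results[path] = "ok"
    return results


def failed_volumes(paths):
    """Return only the paths whose status is not "ok", in the order given."""
    statuses = check_volumes(paths)
    return [path for path in paths if statuses[path] != "ok"]
```

You would call something like `failed_volumes(["E:\\"])` from a scheduled task (the drive letter is a placeholder, not the real one) and raise an alert if the list comes back non-empty.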
It was funny to see that the HP management pack allowed me to include discovered HP hardware objects but there were no classes for IBM hardware. Come on IBM; you gotta play better with others! Not everyone wants to buy consultancy-ware like Tivoli.