I’ve just read a news story on Silicon Republic that discusses a CA press release. CA are saying that European businesses are losing €17 billion (over $22 billion) a year in IT downtime. I guess their solution is to use CA software to prevent this. But given my previous experience working for a CA reseller, being certified in their software, and knowing what their pre-release testing/patching is like, I suspect that using their software will simply swap “downtime” for “maintenance windows” *ducks flying camera tripods*.
What causes downtime?
The best way to avoid this is to back up your data. Let’s start with file servers. Most administrators either don’t know about VSS or have decided not to turn it on to snapshot the volumes containing their file shares. If a user (or power user) or helpdesk admin can easily right-click to recover a file then why the hell wouldn’t you use this feature? You can quickly recover a file without even launching a backup product console or recalling tapes.
Backup is still being done direct to tape with the full/incremental model. I still see admins collecting those full/incremental tapes in the morning and sending them offsite. How do you recover a file? Well, VSS is turned off so you have to recall the tapes. The file might not be in last night’s incremental so you have to call in many more tapes. Tapes need to be mounted, catalogued, etc., and then you hope the backup job ran correctly, because the “job engine” in the backup software keeps crashing.
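To make that pain concrete, here is a rough Python sketch of the recall process. It assumes there is no online catalogue telling you which tape holds the file, so you walk back from the most recent incremental until you find it or hit the last full backup. The `BackupJob` type, the tape labels and the file names are all made up for illustration; no real backup product is being modelled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackupJob:
    tape: str         # hypothetical tape label
    day: date         # the day the job ran
    kind: str         # "full" or "incremental"
    files: frozenset  # files written to this tape


def tapes_to_recall(jobs, filename, restore_day):
    """Return the tapes you must recall, newest first, to find `filename`.

    Walks back from `restore_day` and stops at the first tape that holds
    the file, or gives up at the most recent full backup.
    """
    recalled = []
    for job in sorted(jobs, key=lambda j: j.day, reverse=True):
        if job.day > restore_day:
            continue  # this job ran after the point we want to restore to
        recalled.append(job.tape)
        if filename in job.files:
            return recalled
        if job.kind == "full":
            break  # the full backup is the end of the chain
    raise LookupError(f"{filename} is not in any recalled backup")


# A made-up week: one full backup, then two incrementals.
jobs = [
    BackupJob("FRI-FULL", date(2010, 1, 1), "full",
              frozenset({"budget.xlsx", "plan.docx"})),
    BackupJob("MON-INC", date(2010, 1, 4), "incremental",
              frozenset({"plan.docx"})),
    BackupJob("TUE-INC", date(2010, 1, 5), "incremental",
              frozenset({"notes.txt"})),
]

print(tapes_to_recall(jobs, "budget.xlsx", date(2010, 1, 6)))
```

In this invented week, recovering `budget.xlsx` on the Wednesday means recalling all three tapes, because the file has not changed since Friday's full backup.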
Many backup solutions now use VSS to allow backups to disk, to the cloud, to disk->tape, to disk->cloud, or even to disk->DR site disk->tape. That means you can recover a file with a maximum of 15 minutes of data loss (depending on the setup) and not have to recall tapes from offsite storage.
Clustering. That word sends shivers down many spines. I started doing clustering on Windows back in 1997 or thereabouts using third party solutions and then with Microsoft Wolfpack (NT Server 4.0, Enterprise Edition). I was a junior consultant and used to set up demo labs for making SQL and the like highly available. It was messy and complex. Implementing a cluster took days and specialist skills. Our senior consultant would set up clusters in the UK and Ireland, taking a week or more, and charging the highest rates. Things pretty much stayed like that until Windows Server 2008 came along. With that OS, you can set up a single-site cluster in 30 minutes once the hardware is set up. Installing the SQL service pack takes longer than setting up a cluster now!
You can cluster applications that are running on physical servers. That might be failover clustering (SQL), network load balancing (web servers) or using in-built application high availability (SQL replication, Lotus Domino clustering, or Exchange DAG).
The vast majority of applications should now be installed in virtual machines. For production systems, you really should be clustering the hosts. That gives you host hardware fault tolerance, allowing virtual machines to move between hosts for scheduled maintenance or in response to faults (move after failure or in response to performance/minor fault issues).
Virtual machines on clustered hosts still have an internal single point of failure: the guest OS and services. You can deal with that by implementing things like NLB or clustering within the virtual machines. NLB can be done using the OS or using devices (use static MAC addresses). Using iSCSI, you can present LUNs from a SAN to your virtual machines that will run failover clustering. That allows the services that they run to become highly available. So now, if a host fails, the virtualization clustering allows the virtual machines to move around. If a virtual machine fails then the service can fail over to another virtual machine.
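The two layers of failover are easier to see side by side. What follows is a toy Python model and nothing more: `fail_over_host` and `service_owner` are names I've invented, the "least loaded host" placement rule is my own assumption, and real cluster software makes these decisions with far more information (and would normally keep clustered guests on separate hosts).

```python
from collections import Counter


def fail_over_host(placement, hosts, failed_host):
    """Host-level clustering: move VMs off a failed host.

    `placement` maps VM name -> host name. Each displaced VM goes to the
    surviving host that currently runs the fewest VMs (an assumed rule).
    """
    survivors = [h for h in hosts if h != failed_host]
    if not survivors:
        raise RuntimeError("no surviving hosts to fail over to")
    new_placement = {vm: h for vm, h in placement.items() if h != failed_host}
    displaced = sorted(vm for vm, h in placement.items() if h == failed_host)
    for vm in displaced:
        load = Counter(new_placement.values())
        new_placement[vm] = min(survivors, key=lambda h: (load[h], h))
    return new_placement


def service_owner(cluster_vms, healthy_vms):
    """Guest-level clustering: the first healthy VM in the list owns the service."""
    for vm in cluster_vms:
        if vm in healthy_vms:
            return vm
    return None  # every guest node is down, so the service is offline


hosts = ["host1", "host2", "host3"]
placement = {"sql-a": "host1", "sql-b": "host2", "web-1": "host1"}

# Layer 1: host1 dies, so its virtual machines move to the survivors.
print(fail_over_host(placement, hosts, "host1"))

# Layer 2: the guest OS in sql-a dies, so the SQL service moves to sql-b.
print(service_owner(["sql-a", "sql-b"], healthy_vms={"sql-b", "web-1"}))
```

The point of keeping the two functions separate is that they protect against different faults: the first one never notices a crashed guest OS, and the second one never notices a dead host until the guest on it stops responding.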
It is critical that you know an issue is occurring or about to occur. That’s only possible with complete monitoring. Ping is not enterprise monitoring. Looking at a few SNMP things is not enterprise monitoring. You need to be able to know how healthy the hardware is. Virtualisation is the new hardware so you need to know how it is doing. How is it performing? Is the hardware detecting a performance issue? Is the storage (most critical of all) seeing a problem? Applications are accessed via the network so is it OK? Are the operating systems and services OK? What is the end user experience like?
I’ve said it before and I’ll say it again. Knowing that there is a problem, knowing what it is, and telling the users this will win you some kudos from the business. Immediately identifying a root cause will minimize downtime. Ping won’t allow you to do that. Pulling some CPU temperature from SNMP won’t get you there. You need application, infrastructure and user intelligence and only an enterprise monitoring solution can give you this.
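Here is a small Python sketch of why ping alone doesn't get you to a root cause. It is an illustration only: the layer names come from the questions above, but the dependency ordering, the `probable_root_cause` function and the sample results are my own assumptions, not how any particular monitoring product works.

```python
# Layers ordered from the bottom of the stack to the top (assumed ordering).
LAYERS = [
    "hardware",
    "virtualisation",
    "storage",
    "network",
    "operating system",
    "services",
    "user experience",
]


def unmonitored_layers(status):
    """Return the layers that have no health data at all."""
    return [layer for layer in LAYERS if layer not in status]


def probable_root_cause(status):
    """Return the lowest layer reporting unhealthy, or None if none is.

    `status` maps layer name -> True (healthy) or False (unhealthy).
    Layers missing from `status` are skipped because nothing is known.
    """
    for layer in LAYERS:
        if status.get(layer) is False:
            return layer
    return None


# Ping only tells you about the network, so a storage fault is invisible.
ping_only = {"network": True}
print(probable_root_cause(ping_only))   # None
print(unmonitored_layers(ping_only))    # six blind spots

# Monitoring every layer points straight at the lowest failing one.
full_view = {
    "hardware": True,
    "virtualisation": True,
    "storage": False,
    "network": True,
    "operating system": True,
    "services": False,
    "user experience": False,
}
print(probable_root_cause(full_view))   # storage
```

With the ping-only view, the users are complaining and you have nothing to tell them. With the full view, the failing services and poor user experience trace back to storage in one step.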
We’re getting outside my space but this is the network and power systems. Critical systems should have A+B power and networking. Put in dual firewalls, dual paths from them to the servers. Put in a diesel generator (with fuel!), a UPS, etc. Don’t forget your Aircon. You need fault tolerance there too. And it’s no good just leaving it there. They need to be tested. I’ve seen a major service provider have issues when these things have not kicked in as expected due to some freak simple circumstances.
Disaster Recovery Site
That’s a whole other story. But virtualisation makes this much easier. Don’t forget to test!