{"id":10845,"date":"2010-09-14T09:25:30","date_gmt":"2010-09-14T09:25:30","guid":{"rendered":"https:\/\/aidanfinn.com\/?p=10845"},"modified":"2010-09-14T09:25:30","modified_gmt":"2010-09-14T09:25:30","slug":"ca-report-on-downtime","status":"publish","type":"post","link":"https:\/\/aidanfinn.com\/?p=10845","title":{"rendered":"CA Report on Downtime"},"content":{"rendered":"<p>I\u2019ve just read a news story on <a href=\"http:\/\/www.siliconrepublic.com\/strategy\/item\/17696-european-firms-losing-17bn\/\" target=\"_blank\">Silicon Republic<\/a> that discusses a CA press release.\u00a0 CA are saying that European businesses are losing \u20ac17 billion (over $22 billion) a year in IT downtime.\u00a0 I guess their solution is to use CA software to prevent this.\u00a0 But given my previous experience working for a CA reseller, being certified in their software, and knowing what their pre-release testing\/patching is like, I would suspect that using their software will simply swap \u201cdowntime\u201d for \u201cmaintenance windows\u201d *ducks flying camera tripods*.<\/p>\n<p>What causes downtime?<\/p>\n<p><strong><span style=\"text-decoration: underline;\">Data Loss<\/span><\/strong><\/p>\n<p>The best way to avoid this is to back up your data.\u00a0 Let\u2019s start with file servers.\u00a0 Many administrators either don\u2019t know about VSS or have decided not to turn it on to snapshot the volumes containing their file shares.\u00a0 If a user (or power user) or helpdesk admin can easily right-click to recover a file then why the hell wouldn\u2019t you use this feature?\u00a0 You can quickly recover a file without even launching a backup product console or recalling tapes.<\/p>\n<p>Backup is still being done direct to tape with the full\/incremental model.\u00a0 I still see admins collecting those full\/incremental tapes in the morning and sending them offsite.\u00a0 How do you recover a file?\u00a0 Well, VSS is turned off so you have to recall the tapes.\u00a0 The file might not be in last 
night\u2019s incremental so you have to call in many more tapes.\u00a0 Tapes need to be mounted, catalogued, etc., and then you <em>hope<\/em> the backup job ran correctly because the \u201cjob engine\u201d in the backup software keeps crashing.<\/p>\n<p>Many backup solutions now use VSS to allow backups to disk, to the cloud, to disk-&gt;tape, to disk-&gt;cloud, or even to disk-&gt;DR site disk-&gt;tape.\u00a0 That means you can recover a file with a maximum of 15 minutes of data loss (depending on the setup) and not have to recall tapes from offsite storage.<\/p>\n<p><strong><span style=\"text-decoration: underline;\">High Availability<\/span><\/strong><\/p>\n<p>Clustering.\u00a0 That word sends shivers down many spines.\u00a0 I started doing clustering on Windows back in 1997 or thereabouts using third-party solutions and then with Microsoft Wolfpack (NT 4.0 Advanced Server or something).\u00a0 I was a junior consultant and used to set up demo labs for making SQL and the like highly available.\u00a0 It was messy and complex.\u00a0 Implementing a cluster took days and specialist skills.\u00a0 Our senior consultant would set up clusters in the UK and Ireland, taking a week or more, and charging the highest rates.\u00a0 Things pretty much stayed like that until Windows Server 2008 came along.\u00a0 With that OS, you can set up a single-site cluster in 30 minutes once the hardware is set up.\u00a0 Installing the SQL service pack takes longer than setting up a cluster now!<\/p>\n<p>You can cluster applications that are running on physical servers.\u00a0 That might be failover clustering (SQL), network load balancing (web servers) or use of in-built application high availability (SQL replication, Lotus Domino clustering, or Exchange DAG).<\/p>\n<p>The vast majority of applications should now be installed in virtual machines.\u00a0 For production systems, you really should be clustering the hosts.\u00a0 That gives you host hardware fault tolerance, allowing virtual machines to move between 
hosts for scheduled maintenance or in response to faults (move after failure or in response to performance\/minor fault issues).<\/p>\n<p>You can implement things like NLB or clustering within virtual machines.\u00a0 They still have an internal single point of failure: the guest OS and services.\u00a0 NLB can be done using the OS or using devices (use static MAC addresses).\u00a0 Using iSCSI, you can present LUNs from a SAN to your virtual machines that will run failover clustering.\u00a0 That allows the services that they run to become highly available.\u00a0 So now, if a host fails, the virtualization clustering allows the virtual machines to move around.\u00a0 If a virtual machine fails then the service can fail over to another virtual machine.<\/p>\n<p><strong><span style=\"text-decoration: underline;\">Monitoring<\/span><\/strong><\/p>\n<p>It is critical that you know an issue is occurring or about to occur.\u00a0 That\u2019s only possible with complete monitoring.\u00a0 Ping is not enterprise monitoring.\u00a0 Looking at a few SNMP things is not enterprise monitoring.\u00a0 You need to be able to know how healthy the hardware is.\u00a0 Virtualisation is the new hardware so you need to know how it is doing.\u00a0 How is it performing?\u00a0 Is the hardware detecting a performance issue?\u00a0 Is the storage (most critical of all) seeing a problem?\u00a0 Applications are accessed via the network so is it OK?\u00a0 Are the operating systems and services OK?\u00a0 What is the end user experience like?<\/p>\n<p>I\u2019ve said it before and I\u2019ll say it again.\u00a0 Knowing that there is a problem, knowing what it is, and telling the users this will win you some kudos from the business.\u00a0 Immediately identifying a root cause will minimize downtime.\u00a0 Ping won\u2019t allow you to do that.\u00a0 Pulling some CPU temperature from SNMP won\u2019t get you there.\u00a0 You need application, infrastructure and user intelligence and only an enterprise monitoring 
solution can give you this.<\/p>\n<p><strong><span style=\"text-decoration: underline;\">Core Infrastructure<\/span><\/strong><\/p>\n<p>We\u2019re getting outside my space but this is the network and power systems.\u00a0 Critical systems should have A+B power and networking.\u00a0 Put in dual firewalls and dual paths from them to the servers.\u00a0 Put in a diesel generator (with fuel!), a UPS, etc.\u00a0 Don\u2019t forget your aircon.\u00a0 You need fault tolerance there too.\u00a0 And it\u2019s no good just leaving these systems there.\u00a0 They need to be tested.\u00a0 I\u2019ve seen a major service provider have issues when these things have not kicked in as expected due to some freak but simple circumstances.<\/p>\n<p><strong><span style=\"text-decoration: underline;\">Disaster Recovery Site<\/span><\/strong><\/p>\n<p>That&#8217;s a whole other story.\u00a0 But virtualisation makes this much easier.\u00a0 Don&#8217;t forget to test!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I\u2019ve just read a news story on Silicon Republic that discusses a CA press release.\u00a0 CA are saying that European businesses are losing \u20ac17 billion (over $22 billion) a year in IT down time.\u00a0 I guess their solution is to use CA software to prevent this.\u00a0 But my previous experience working for a CA reseller, &hellip; <a href=\"https:\/\/aidanfinn.com\/?p=10845\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;CA Report on 
Downtime&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[9],"tags":[175,63,181,83,195,117],"class_list":["post-10845","post","type-post","status-publish","format-standard","hentry","category-commentary","tag-dpm","tag-failover-clustering","tag-hyper-v","tag-operations-manager","tag-virtualisation","tag-windows-server-2008-r2"],"aioseo_notices":[],"jetpack_featured_media_url":"","amp_enabled":true,"_links":{"self":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts\/10845","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=10845"}],"version-history":[{"count":0,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts\/10845\/revisions"}],"wp:attachment":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=10845"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=10845"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=10845"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}