Something Has Gone Very Wrong With Microsoft Patch Testing

I suspect I’m going to get some “unwanted attention” for this post, but what I’m going to say has to be said publicly …

Something has gone wrong with the testing process for Microsoft hotfixes since the release of WS2012.  There have been a number of really bad releases in those 10 or so months.  The latest is KB2855336, aka the July 2013 update rollup, which causes hosts to bug check, as Hans Vredevoort and some of you have reported.  There is also a thread on the Hyper-V TechNet forum.

People like me, and Microsoft itself, are trying to encourage administrators to:

  • Install security hotfixes with minimal delay
  • Embrace a process of updating their Hyper-V hosts with fixes for Hyper-V and Failover Clustering to prevent issues

This string of updates that break hosts (and breaking a host is exponentially worse than breaking the occasional physical server here or there, because every VM on it goes down too) is embarrassing and dangerous.  The latest failure is in an update rollup that is issued via Windows Update.  This just feeds the argument that patching is ba-ad and shouldn’t be done … and that creates a security mess not just for those companies, but for everyone in the community.

I’d love to say I have a fix.  I’d love to say: hey, use the automatic approval process in System Center Configuration Manager, where we can (see the sketch after this list):

  • Delay approval of updates for X days, giving others time to find the bugs and Microsoft time to issue a superseding update
  • Force the deployment
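
To make that deferral concrete, here is a minimal sketch using the WSUS PowerShell module on WS2012 (WSUS is the plumbing underneath ConfigMgr software updates anyway).  The 14-day soak window and the “Hyper-V Hosts” target group are illustrative assumptions, not a tested process:

    # Approve security updates only after a soak period, giving the
    # community time to find the bugs and Microsoft time to ship a
    # superseding update. Assumes the WS2012 UpdateServices module,
    # run on the WSUS server; the 14-day window and the target group
    # name are illustrative.
    Import-Module UpdateServices

    $cutoff = (Get-Date).AddDays(-14)

    Get-WsusUpdate -Classification Security -Approval Unapproved -Status Any |
        Where-Object { $_.Update.ArrivalDate -lt $cutoff -and -not $_.Update.IsSuperseded } |
        Approve-WsusUpdate -Action Install -TargetGroupName "Hyper-V Hosts"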

That will work for non-clustered hosts but:

  • Folks with clusters will want to use Cluster-Aware Updating (CAU); see the sketch after this list.  ConfigMgr does not have a plug-in for CAU integration.  Someone in MSFT will respond with VMM baselines.  Tell ‘em to go take a long walk off a short pier; no one should have to do that amount of clicking every month.
  • Most businesses are SMEs and SMEs cannot afford System Center anymore.
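
For the clustered case, the orchestration we’d want a ConfigMgr plug-in to drive looks something like this on-demand CAU run; a minimal sketch assuming the WS2012 ClusterAwareUpdating module, with the cluster name and failure/retry limits made up for illustration:

    # An on-demand Cluster-Aware Updating run: drain each node, patch it
    # via the built-in Windows Update plug-in, resume it, then move on.
    # "HVC1" and the failure/retry limits are illustrative assumptions.
    Import-Module ClusterAwareUpdating

    Invoke-CauRun -ClusterName "HVC1" `
        -CauPluginName "Microsoft.WindowsUpdatePlugin" `
        -MaxFailedNodes 1 `
        -MaxRetriesPerNode 3 `
        -Force

This is exactly the kind of run a ConfigMgr plug-in should be able to schedule and report on for us.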

So what’s left?  Manual approval and patching.  And as I’ve said before: that means patching just does not happen … at all.  I’m not being cynical; I’m being pragmatic and basing this on experience in the real world.

Let me tell you a story …

I used to work for a consulting company that specialised in Computer Associates software.  I was certified and consulted in CA Unicenter, their huge enterprise monitoring system.  I also dabbled in a few of their other IT management products.  CA were shite when it came to product quality and patch management.  The process for installing a new product version was:

  • Install from the media
  • Test
  • Find the broken basic functionality
  • Log into Support and download lots of patches and install them for this new release
  • Find the broken functionality that had been patched/fixed 2 months before in the previous release
  • Open up support calls to get them to update the previous release’s fixes for the new release
  • Try to cover your ass with the angry customer

I once had a CA tester over in the office to introduce us to a new beta version of Unicenter.  I asked about the huge number of patches that would appear within a week of release because basic features didn’t work.  He explained that CA couldn’t possibly test more than 75% of features before release.  That’s why I’ve flat-out refused to work with CA software since 2001.

Let’s get back on track here.  The problem with KB2855336 is that it breaks hosts that (a sketch of the configuration follows this list):

  • Connect a virtual switch to a NIC team
  • Run VMs on different VLANs
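
For the record, here is roughly how that configuration is built with the in-box WS2012 cmdlets; the NIC, team, switch and VM names, plus the VLAN IDs, are illustrative assumptions:

    # Roughly the configuration that KB2855336 breaks, built with in-box
    # WS2012 cmdlets. NIC, team, switch, and VM names plus the VLAN IDs
    # are illustrative assumptions.

    # 1. Team two physical NICs.
    New-NetLbfoTeam -Name "HostTeam" -TeamMembers "NIC1", "NIC2" `
        -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort `
        -Confirm:$false

    # 2. Bind an external virtual switch to the team interface.
    New-VMSwitch -Name "External1" -NetAdapterName "HostTeam" -AllowManagementOS $false

    # 3. Connect VMs to the switch and place them in different VLANs.
    Connect-VMNetworkAdapter -VMName "VM01" -SwitchName "External1"
    Set-VMNetworkAdapterVlan -VMName "VM01" -Access -VlanId 101

    Connect-VMNetworkAdapter -VMName "VM02" -SwitchName "External1"
    Set-VMNetworkAdapterVlan -VMName "VM02" -Access -VlanId 102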

Hmm, seems like one of the most basic configurations for Hyper-V if you ask me.  How the hell was this not tested?

This litany of mistakes cannot continue.  We (the community) cannot continue to recommend fixes if they break stuff in basic or default configurations.  Microsoft, you want to be a cloud company; learn from how hosting companies have been very public with explanations and apologies.  That openness actually reassures their customers.  I once worked for a company that blacked out 1/3 of the hosted Irish internet for over a day, and it was that openness that saved the day; needless to say, I was amazed.  Something must change, Microsoft, and you must be very public with the apology and the explanation of the process changes.  Don’t just hide this in a forum response.

In the meantime, Microsoft should:

  • Remove this update rollup from the catalog to prevent further failures.  Hans reported just now (09:43 GMT) that this was done.  In the meantime, anyone who has already synced the rollup can decline it on their own WSUS server; see the sketch after this list.
  • Instruct employees to modify blog posts and retract recommendations to deploy this update rollup
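
That defensive decline is a one-liner; a minimal sketch assuming the WS2012 UpdateServices module, keying on the KB number:

    # Decline KB2855336 on your own WSUS server until a fixed rollup
    # ships. A sketch assuming the WS2012 UpdateServices module, run on
    # the WSUS server itself.
    Import-Module UpdateServices

    Get-WsusUpdate -Classification All -Approval AnyExceptDeclined -Status Any |
        Where-Object { $_.Update.KnowledgebaseArticles -contains "2855336" } |
        Deny-WsusUpdate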

I hope any now-angry persons in Microsoft understand that I am writing this in support of Microsoft.  A friend is honest with criticism and wants change for improvement.  I’m not writing this to score points.  I’m writing this because I care.

EDIT:

A fix was released (allegedly) in an updated version of the July 2013 update rollup.

EDIT (27/7/2013):

It looks like UR3 for DPM 2012 SP1 joins the ranks of bad updates.  Almost immediately, people reported that they could not upgrade their agents after upgrading their DPM servers.  The update was withdrawn several days later, as noted in these comments.

9 thoughts on “Something Has Gone Very Wrong With Microsoft Patch Testing”

  1. Not just the last 10 months. It’s been longer than that, and it’s a general QA issue that is causing problems in other areas too. We’ve been detailing the QA problems over at myITforum.com

  2. Well said Aidan. I for one thought the age where software updates caused serious faults was well behind us.

  3. This has been a scary stint of updates, frankly. Anything that can cause my AD catalogs to corrupt themselves (see Hans’s excellent write-up for more info) is bad news in my book.

  4. The number of problems is staggering, for Windows, System Center, and all related patches. I have always been at the front of the queue when it comes to installing patches and newer versions, for the sake of security, stability, and of course new and improved features. Lately it seems that this is no longer the way, due to the things you describe. MS cannot keep up with the pace they are trying to work at. There are too many versions and updates to keep an eye on now, plus the interaction with security and .NET fixes. The thing is, we are not talking about very strange scenarios where the combination of fixes goes wrong. These are default scenarios, which should have been tested and should also occur in the MS labs.

  5. I can so relate to the CA story; I reached the very same conclusion around 2002 with ArcServe 🙂

    Back in the days when Hyper-V v1 hit the market, it was bashed by many over having to run updates on a frequent basis, as it was “nothing more than just another Windows server”. At the time, VMware made some serious boo-boos in their ESX product, causing clustered systems in particular to crash. The Microsoft marketing machine jumped on it big time. Well, if you blame someone else for bad testing, make very sure you’re not screwing up yourself.

    And yes, we were a victim of the HV rollup; yes, we got caught by the DPM rollup; and there have been others too. I can only support the statement of your blog here…

    Get your act together, Microsoft: get this patching tested more thoroughly and don’t leave us with the broken pieces! (And yes, proper CAU in SCCM would be MUCH appreciated indeed, not only for Hyper-V, but also Exchange, SQL and File Clusters…)
