I suspect I’m going to get some “unwanted attention” for this post but what I’m going to say has to be said publicly …
Something has gone wrong with the testing process for Microsoft hotfixes since the release of WS2012. There have been a number of really bad releases in those 10 or so months. The latest is KB2855336, aka the July 2013 update rollup, which causes hosts to bug check as Hans Vredevoort and some of you reported. There is also a thread on the Hyper-V TechNet forum.
People like me and Microsoft are trying to encourage people to:
- Install security hotfixes with minimal delay
- Embrace a process of updating their Hyper-V hosts with fixes for Hyper-V and Failover Clustering to prevent issues
This string of updates that break hosts (and this is exponentially worse than breaking an occasional physical server here or there) is embarrassing and dangerous. The latest failure is in an update rollup that is issued via Windows Update. This just feeds the argument that patching is ba-ad and shouldn’t be done … and creates a security mess, not just for those companies, but for everyone in the community.
I’d love to say I have a fix. I’d love to say, hey use the automatic approval process in System Center Configuration Manager where we can:
- Delay approval of updates for X days – letting others find the bugs and Microsoft issue a superseding update
- Force the deployment
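As a rough illustration of that delay-then-deploy rule, here is a minimal sketch in Python. It is not ConfigMgr's API or anything Microsoft ships: the `Update` record, the `should_approve` function, and the `delay_days` parameter are hypothetical names made up for this example, and the KB number in the usage lines is a placeholder.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Update:
    """Hypothetical record describing one update (not a real ConfigMgr type)."""
    kb: str                               # placeholder identifier, e.g. "KB0000001"
    released: date                        # date the update was published
    superseded_by: Optional[str] = None   # KB of a newer replacement, if any
    withdrawn: bool = False               # True if the vendor pulled the update


def should_approve(update: Update, today: date, delay_days: int) -> bool:
    """Approve only updates that have aged past the delay and are still current."""
    # A pulled or replaced update should never be deployed automatically.
    if update.withdrawn or update.superseded_by is not None:
        return False
    # Otherwise wait until the update has been public for at least delay_days.
    return today - update.released >= timedelta(days=delay_days)


# Usage: an update published on 1 July with a 10-day delay.
candidate = Update(kb="KB0000001", released=date(2013, 7, 1))
too_early = should_approve(candidate, today=date(2013, 7, 5), delay_days=10)
old_enough = should_approve(candidate, today=date(2013, 7, 15), delay_days=10)
```

The point of the delay is simply that a bad update has time to be withdrawn or superseded before the rule ever approves it; how long X should be is a judgment call between exposure to bugs and exposure to unpatched vulnerabilities.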
That will work for non-clustered hosts but:
- Folks with clusters will want to use Cluster Aware Updating. ConfigMgr does not have a plug-in for CAU integration. Someone in MSFT will respond with VMM baselines. Tell ‘em to go take a long walk off of a short pier; no one should have to do that amount of clicking every month.
- Most businesses are SMEs and SMEs cannot afford System Center anymore.
So what’s left? Manual approval and patching. And as I’ve said before: that means patching just does not happen … at all. I’m not being cynical; I’m being pragmatic and basing this on experience in the real world.
Let me tell you a story …
I used to work for a consulting company that specialised in Computer Associates software. I was certified and consulted in CA Unicenter, their huge enterprise monitoring system. I also dabbled in a few x-IT management products. CA were shite when it came to product quality and patch management. The process for installing a new product version was:
- Install from the media
- Test
- Find the broken basic functionality
- Log into Support and download lots of patches and install them for this new release
- Find the broken functionality that had been patched/fixed 2 months before in the previous release
- Open up support calls to get them to update the previous release’s fixes for the new release
- Try to cover your ass with the angry customer
I once had a CA tester over in the office to introduce us to a new beta version of Unicenter. I asked about the huge number of patches that would appear within a week of release because basic features didn’t work. He explained that CA couldn’t possibly test more than 75% of features before release. That’s why I’ve flat-out refused to work with CA software since 2001.
Let’s get back on track here. The problem with KB2855336 is that it breaks hosts that:
- Connect a virtual switch to a NIC team
- Have VMs on different VLANs
Hmm, seems like one of the most basic configurations for Hyper-V if you ask me. How the hell was this not tested?
This litany of mistakes cannot continue. We (the community) cannot continue to recommend fixes if they break stuff in basic or default configurations. Microsoft, you want to be a cloud company; learn from how hosting companies have been very public with explanations and apologies. This actually reassures the customers of those hosting companies – I once worked for a company that blacked out 1/3 of the hosted Irish internet for over a day, and that openness saved the day. Needless to say, I was amazed. Something must change, Microsoft, and you must be very public with the apology and the explanation of the process changes – and don’t just hide this in a forum response.
In the meantime, Microsoft should:
- Remove this update rollup from the catalog to prevent further failures. Hans reported just now (09:43 GMT) that this was done.
- Instruct employees to modify blog posts and retract recommendations to deploy this update rollup
I hope any now-angry persons in Microsoft understand that I am writing this in support of Microsoft. A friend is honest with criticism and wants change for improvement. I’m not writing this to score points. I’m writing this because I care.
EDIT:
A fix was released (allegedly) in an updated version of the July 2013 update rollup.
EDIT (27/7/2013):
It looks like UR3 for DPM 2012 SP1 joins the ranks of bad updates. Almost immediately people reported that they could not upgrade their agents after upgrading DPM servers. The update was withdrawn several days later, as noted in these comments.