That Unmerged Snapshot Did More Than I Expected

Last month I blogged about how a Hyper-V snapshot had caused some difficulties.  I hadn’t realised just how much of an effect that unmerged snapshot had.

We run OpsMgr and use it not only for fault monitoring but also for performance monitoring.  I noticed that some time after we upgraded to OpsMgr 2007 R2, two of our agents stopped gathering performance stats.  I couldn’t see live performance information for them in the OpsMgr console, and the reports only contained data from before a certain date.  PerfMon on the servers themselves worked perfectly.

I repaired the agents and then re-installed them by hand, rebooting along the way.  The agents still refused to gather performance statistics.  This was probably back in August/September.

I opened a PSS call under our support program to get some help when I ran out of ideas.  The problem made no sense to the PSS engineers because fault monitoring was working fine.  The machines in question were healthy.  I gathered countless logs and did countless tests.  The call ended up getting escalated not just once, but twice.  A few weeks ago I did some SQL queries on behalf of a PSS engineer.  We could see that performance data stopped being stored in the OpsMgr reporting database some time after the upgrade.
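If you want to run a similar sanity check against your own reporting data warehouse, something along the lines of the sketch below will show you where the gap starts.  This is only a rough sketch: the warehouse view and column names (Perf.vPerfHourly, vManagedEntity) are what I recall from the OpsMgr 2007 R2 schema, and the server name and agent path are placeholders, so adjust them for your environment.

```python
# Rough sketch: count hourly performance samples per day for one agent in the
# OpsMgr reporting data warehouse, to find the date where the data gap begins.
# Assumptions: pyodbc is installed, the warehouse database is OperationsManagerDW,
# and the Perf.vPerfHourly / vManagedEntity views exist as in OpsMgr 2007 R2.
import pyodbc

SERVER = "SQLSERVER01"              # placeholder: your reporting SQL Server
AGENT_PATH = "server1.demo.local"   # placeholder: FQDN of the suspect agent

conn = pyodbc.connect(
    f"DRIVER={{SQL Server}};SERVER={SERVER};"
    "DATABASE=OperationsManagerDW;Trusted_Connection=yes"
)

query = """
SELECT CONVERT(varchar(10), p.DateTime, 120) AS SampleDay,
       COUNT(*)                              AS HourlyRows
FROM   Perf.vPerfHourly AS p
       JOIN vManagedEntity AS me
         ON me.ManagedEntityRowId = p.ManagedEntityRowId
WHERE  me.Path = ?
GROUP  BY CONVERT(varchar(10), p.DateTime, 120)
ORDER  BY SampleDay
"""

# Days that are missing from the output (or have unusually low counts)
# mark the start and extent of the gap.
for day, rows in conn.cursor().execute(query, AGENT_PATH):
    print(day, rows)
```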

Other agents were fine.  We started focusing on comparing working agents with the two non-working agents.  Everything checked out, so we started getting particularly paranoid about things like service packs and regional settings.  I really didn’t like that theory, because we hadn’t had any problems with these machines until maybe a month after we upgraded to OpsMgr 2007 R2.

I was getting ready to give up yesterday afternoon.

I don’t know why I did it, but I went into the OpsMgr console to have a peek at some performance stats for another agent.  One of the non-working agents was still selected from an earlier test.  Wait … I could see a graph for CPU utilisation.  The agent was working.  I checked more stats for disk and memory.  They worked.  I checked the other non-working agent.  It was working too.  Huh!

I fired up the reporting console and ran reports on the non-working machines for the last year.  I had a complete graph with no data gaps.  That’s strange.  I ran a report covering the period when I “knew” that data wasn’t being gathered.  I had complete graphs with correct-looking numbers of data samples.

So it appears that data was being gathered but it wasn’t being processed correctly.  Even when I couldn’t see the data in reports, graphs or SQL queries, the data was there somewhere in a pre-processing stage, waiting to be added into the relevant tables.

OK, what had changed in the last month or so since I had last tried one of these reports?  We had migrated from Windows Server 2008 Hyper-V to Windows Server 2008 R2 Hyper-V.  Could there be a change in the way that performance data was gathered in a VM?  Definitely not.  Had we made any changes at the VM level?  That’s when I remembered the issue in that blog post.

When I moved the OpsMgr VM, Hyper-V had to merge a snapshot that we had deleted some time beforehand.  The VM had been running on the AVHD (snapshot/checkpoint differential disk) for four months.  It started to affect the performance of the VM so badly that TCP connections were timing out.  The performance issues were virtual-storage related.  Could that have affected the database operations of the VM?  Of course it could, if things had got bad enough to break TCP.

NOTE: I have only ever used snapshots in production for VM internals upgrade scenarios.  I usually delete the snapshot once the upgrade is verified and allow the merge to take place.  That means there should be no impact on performance as long as you do things in a timely manner.  Somehow I must have forgotten to do that this time.
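If you want to make sure you haven’t left a VM running on a forgotten AVHD, a quick scan of the VM storage for old differential disks is enough.  Below is a rough sketch of what I mean; the storage path and the age threshold are placeholder assumptions, so change them to suit your environment.

```python
# Rough sketch: find AVHD (snapshot/checkpoint differential) files that have
# been sitting around longer than a threshold, i.e. snapshots never merged.
# The storage root and the age threshold are assumptions - adjust to taste.
import os
import time

VM_STORAGE_ROOT = r"D:\Hyper-V"   # placeholder: wherever your VM storage lives
MAX_AGE_DAYS = 7                  # arbitrary threshold for "forgotten"

now = time.time()
for root, _dirs, files in os.walk(VM_STORAGE_ROOT):
    for name in files:
        if name.lower().endswith(".avhd"):
            path = os.path.join(root, name)
            # On Windows, getctime returns the file creation time, which for
            # an AVHD is roughly when the snapshot was taken.
            age_days = (now - os.path.getctime(path)) / 86400
            if age_days > MAX_AGE_DAYS:
                print(f"{path}: snapshot disk is ~{age_days:.0f} days old and unmerged")
```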

So here’s what I suspect happened.  The OpsMgr agents actually worked perfectly.  They gathered the performance stats the entire time and sent them to OpsMgr.  I am guessing that OpsMgr caches the data for processing.  Due to the unmerged AVHD/snapshot performance issues, the data stopped being processed correctly and sat in that cache.  We know it didn’t make it to the point of being reportable because a direct SQL query showed a data gap.  The problem reared its ugly head around a month after the snapshot was taken.  The AVHD/snapshot was merged back in early November and that resolved the performance issue for this VM.  It also sorted out whatever hitch there was in performance processing for these agents.  The data that had been cached somewhere made its way into the reporting database and live graphs suddenly appeared for the two machines in the OpsMgr console.  That’s the funny bit; it only affected these two agents.

MS PSS is still curious.  The engineer seems to accept the explanation I’ve given him, but he wants to dig around and confirm everything, maybe even get details on what happened internally.  I’ve got to credit him for that; most support staff would just close the call and move on.

So once again:

Hyper-V snapshots or VMM checkpoints, i.e. AVHD differential disks, should not be used in production.  They are a form of differential VHD that doesn’t perform well at all.  They really do affect performance and I’ve seen the proof of that.  In fact, they affect functionality in the most unpredictable of ways because of that performance impact.  Use something like DPM instead for state captures via backup at the host level.  The lack of CSV support in DPM is an issue right now.  If you really need that capability today then have a look at 3rd party providers, or hold off on deploying CSV until DPM 2010 is released (approx April 2010).
