Let’s get something straight. Hyper-V snapshots should only ever be used in lab/demo environments. If you need a snapshot in production you really should be using DPM 2007 SP1.
Note (posted 2/1/2012): Hyper-V does support snapshots in production, but almost every app you’ll install in a VM doesn’t support virtualisation snapshots (that is, snapshots taken by the hypervisor, as opposed to VSS or SAN snapshots).
Note: CSV doesn’t have a DPM solution yet. DPM v3 will support CSV.
We don’t have DPM so I only ever use snapshots as a very temporary thing when doing an upgrade within a VM.
That’s what I did when I upgraded our Operations Manager VM from OpsMgr 2007 to OpsMgr 2007 R2. I took a snapshot and did the upgrade. I was certain that I had merged the snapshot after doing the upgrade but it appears I didn’t.
Over the weekend our OpsMgr started firing alerts out. They were all network related, e.g. failed heartbeats, inaccessible network devices, websites not responding. I was stunned at the quantity. I was quick to verify there were no outages. These were false alarms. Our network guys investigated. We thought we had found a cause, but that wasn’t it.
I got the noise under control and continued to work on the issue today at the TechEd Europe 2009 conference. I couldn’t find anything. Our network guys did identify that they were losing 3% of pings to the OpsMgr server. That would sure cause the issues we were seeing. I was concerned because our OpsMgr server is a VM. Could the hardware have an issue, and could this affect the other VMs? Some quick tests showed that all hardware and all other VMs were 100% fine. This problem was limited to our OpsMgr VM.
I decided on this plan:
- I would reboot the VM. If that didn’t fix it I’d go to step 2.
- I would cold migrate the VM using VMM 2008 R2 to our new 2008 R2 Hyper-V host for our management VMs. If that didn’t fix it I’d go on to long shot step 3.
- I would remove the virtual NIC and re-add it, making sure it was a synthetic NIC.
The reboot did nothing. I attempted a cold migrate. That failed almost as soon as the job started. The VHDs were locked. I logged into the host and fired up the Hyper-V console to have a look around. That’s when I saw a merge was taking place. A large AVHD was being merged back into the VHD.
What was happening? Remember that when you take a snapshot in Hyper-V, it creates a special differencing disk called an AVHD. That becomes the place where all new writes are stored. The original VHD becomes read-only and is used only for the old data. Running a VM like that gets pretty slow. I think that it must have degraded performance within the VM so much that it affected the TCP stack of the Windows installation running in the VM.
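To make the read/write split concrete, here is a toy model in Python. It is only a sketch of the idea and not how Hyper-V actually lays out an AVHD on disk: the class names, the block numbers and the string "data" are all made up for illustration.

```python
class BaseDisk:
    """Stands in for the original VHD: read-only once a snapshot exists."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)

    def read(self, block_no):
        return self.blocks.get(block_no)


class DifferencingDisk:
    """Stands in for the AVHD: holds only blocks written since the snapshot."""

    def __init__(self, parent):
        self.parent = parent
        self.changed = {}

    def write(self, block_no, data):
        # Every write lands in the child; the parent is left untouched.
        self.changed[block_no] = data

    def read(self, block_no):
        # Look in the child first, then fall through to the parent.
        if block_no in self.changed:
            return self.changed[block_no]
        return self.parent.read(block_no)

    def merge(self):
        # Fold the changes back into the parent. The work grows with the
        # number of changed blocks, so an old snapshot means a long merge.
        merged = len(self.changed)
        self.parent.blocks.update(self.changed)
        self.changed.clear()
        return merged


# Hypothetical example: snapshot taken, then an upgrade writes to block 1.
vhd = BaseDisk({0: "boot", 1: "opsmgr-2007"})
avhd = DifferencingDisk(vhd)
avhd.write(1, "opsmgr-2007-r2")
```

After the write, the VM sees the new data through `avhd.read(1)` while the parent still holds the old block. Only `avhd.merge()` copies the change back, and the longer the snapshot lives, the more there is to copy.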
Eventually the merge completed. It did take a while on our 15K EVA SAN disks. I did the cold migration and fired up the VM. I fired 500 pings at it with 0% loss. Monitoring has raised zero alerts 50 minutes later. It appears to me that I’ve fixed the problem. It appears to me that an overlooked snapshot made my life hell today.
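If you want to catch a forgotten snapshot before it bites, one rough approach is to scan the VM storage for leftover AVHD files. The Python sketch below is just that, a sketch: the function name is my own invention, it assumes the snapshot disks carry an `.avhd` extension, and you would have to point it at wherever your VM files actually live.

```python
from pathlib import Path


def find_avhd_files(root):
    """Return (path, size_in_bytes) for every .avhd file under root, largest first."""
    found = []
    for path in Path(root).rglob("*"):
        # Compare the extension case-insensitively: .avhd, .AVHD, etc.
        if path.is_file() and path.suffix.lower() == ".avhd":
            found.append((path, path.stat().st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Call it with the folder that holds your VMs and print what comes back. Any large file in the list is a snapshot that has been collecting writes for a while and deserves a look.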
That was my second night at the conference. It wasn’t a beerfest for me. In fact, I worked on this last night too. This one really baffled me. I was sure it was hardware. Turns out it was virtual hardware coupled with my feeble brain 🙂