PRO Tips In Action

We have a VM where the load has been slowly growing over time. Its peak season is right around now, and we started getting alerts from Operations Manager on Friday. The contents of the alert were:

Alert Monitor:  PRO CPU Utilization

Alert Description
Source:  MTGWSVR001  CPU utilization in the virtual machine has reached critical levels. The threshold monitor for this virtual machine has detected that the average of %Processor Time has been exceeded.

Summary
This monitor tracks the average CPU utilization for the virtual machine. The average Processor Time has exceeded the threshold. (The default threshold is 90 percent.)

Causes
The virtual machine is consuming too many CPU resources for its configuration.

Resolutions
Update the virtual machine configuration to allocate additional virtual CPU resources. For information about configuring the CPU requirements for a virtual machine, see Virtual Machine Manager 2008 R2 Help.

The monitor in question is the interesting bit. We have Virtual Machine Manager (2008 or later) running, and it is integrated with Operations Manager (2007 SP1 or later). We have a Windows Server 2008 R2 Hyper-V cluster which is managed by VMM. PRO (Performance and Resource Optimization) tips are enabled on the master host group (the top-level host group, containing child host groups). This allows OpsMgr to feed virtualisation performance alerts to VMM, and VMM will act on them.

As the VM’s resource demands increased, it needed to use more CPU. Eventually it got to the point where the CPU was being maxed out. The PRO tips monitor in question runs every 60 seconds and measures the CPU utilisation of the VM. If three sequential samples are greater than 90% CPU utilisation, the monitor creates an alert. That alert auto-resolves when things quieten down – it is a monitor, i.e. a state engine that is aware of both good and bad scenarios, unlike a basic rule.
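
That three-samples-then-alert behaviour can be sketched as a simple state engine. This is an illustrative sketch of the logic only, not OpsMgr code; the threshold, sample count and interval are the defaults described above.

```python
class CpuUtilisationMonitor:
    """Illustrative sketch of a state-based threshold monitor
    (not real OpsMgr code): three consecutive samples over the
    threshold raise an alert; one healthy sample auto-resolves it."""

    def __init__(self, threshold=90.0, consecutive=3):
        self.threshold = threshold
        self.consecutive = consecutive
        self.breaches = 0
        self.alerting = False

    def sample(self, cpu_percent):
        if cpu_percent > self.threshold:
            self.breaches += 1
            if self.breaches >= self.consecutive:
                self.alerting = True   # bad state: raise the alert
        else:
            self.breaches = 0
            self.alerting = False      # good state: auto-resolve
        return self.alerting

monitor = CpuUtilisationMonitor()
for cpu in (95, 97, 96):              # three sequential samples over 90%
    state = monitor.sample(cpu)
print(state)                          # True – alert raised
print(monitor.sample(40))             # False – alert auto-resolved
```

This is the key difference from a basic rule: a rule would fire once and forget, while a monitor knows when the bad state ends and can close its own alert.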

Because PRO tips were enabled, VMM was able to move the VM from its current host to another host. That move was done using Live Migration, so there was no downtime associated with moving the VM. This means that other VMs on the original host weren’t being deprived of resources. Moving the VM to another, less utilised host gave it more CPU resources that it could use. Which host was best? That was decided by VMM using Intelligent Placement, which I blogged about last week.

What I’ve just described was dynamic IT. A problem was automatically detected and resolved by two System Center products working closely together. I was alerted to the issue. I didn’t need to do anything right there and then because the alert auto-resolved immediately after PRO tips live migrated the VM. I talked to the customer of the VM and found out that this is peak season for them and CPU demands would be high. We scheduled a maintenance window for early this morning. The VM was powered down, an extra virtual CPU was added and the VM was powered back up again. Less than 5 minutes, and now the VM has all the CPU it needs.

SQL 2008 R2 Licensing

Emma Healey (MS Licensing person capable of speaking both English and Microsoft Licensing) has just posted about some changes coming with SQL 2008 R2.  The edition comparisons (available now) are:

[Edition comparison chart]

SQL Datacenter edition will continue to allow unlimited virtual machines to run SQL on a host. Enterprise edition changes: it will now allow up to 4 SQL instances to run on a host.


Mastering Virtual Machine Manager 2008 R2

A new book, Mastering Virtual Machine Manager 2008 R2, has been published by Sybex/Wiley. It is written by two members of the VMM product team, so the facts contained should be solid. The product description reads as follows:

“One-of-a-kind guide from Microsoft insiders on Virtual Machine Manager 2008 R2!

What better way to learn VMM 2008 R2 than from the high-powered Microsoft program managers themselves? This stellar author team takes you under the hood of VMM 2008 R2, providing intermediate and advanced coverage of all features.

  • Walks you through Microsoft’s new System Center Virtual Machine Manager 2008, a unified system for managing all virtual and physical assets; VMM 2008 not only supports Windows Server 2008 Hyper-V, but also VMware ESX as well!
  • Features a winning author team behind the new VMM
  • Describes all the new and enhanced features of VMM 2008 R2 and devotes ample time to how it also supports top competitor VMware ESX
  • Uses a hands-on approach, giving you plenty of practical examples to clarify concepts

Open this in-depth guide and discover techniques and processes you can put to immediate use”.

VMM 2008 R2 is a powerful tool. I work almost exclusively from within it and OpsMgr 2007 R2. The ability to manage a number of Hyper-V hosts, as well as ESXi/ESX and Virtual Server hosts, and to leverage the library to speed up otherwise boring, time-consuming and manual (i.e. error-prone) operations is worth the price alone. On top of that, it adds more to Hyper-V. I’ve seen several times over this past weekend how PRO tips and Live Migration have optimised the loads on our cluster when there were more-than-normal resource requirements.

If you’re interested in learning how to make the most of your Hyper-V platform then look into VMM.  If you want to learn about VMM 2008 R2 then a book written by members of the product team has to be the best place to start.

Looking Into Other Ways To Automate Maintenance Mode

I’m going to be looking at alternative ways to put computers and other monitored resources (e.g. web and port monitors) into maintenance mode in Operations Manager 2007 R2 this week. We pushed out patches this weekend. We warned customers that they might get one or two nuisance alerts. Sure enough, each of them got just a couple of alerts, but we got a LOT because we get all of them. I’ve tried a few batch script and task scheduler approaches and each of them has sucked.

I’m going to have to do this in PowerShell, I think. I’ll see how this week goes. Any non-customer engineering is frozen until the new year; I don’t want to make changes that may cause unwanted faults over the holidays. That gives me some time to do some work, I hope! Pre-sales is still busy and I’m even going out on-site with some hosting customers to do some work with them.

Post a comment to let me know how you get around scheduling maintenance mode in OpsMgr.


That Unmerged Snapshot Did More Than I Expected

Last month I blogged about how a Hyper-V snapshot had caused some difficulties.  I hadn’t realised how much effect that unmerged snapshot had.

We run OpsMgr and use it not only for fault monitoring but also for performance monitoring. I noticed that some time after we upgraded to OpsMgr 2007 R2, two of our agents stopped gathering performance stats. I couldn’t see live performance information in the OpsMgr console, nor in the reports (from a certain date onwards). PerfMon on the servers worked perfectly.

I repaired the agents and then re-installed them by hand.  Reboots were done.  The agents still refused to gather performance statistics. This was probably back in August/September.

I opened a PSS call under our support program to get some help when I ran out of ideas.  The problem made no sense to the PSS engineers because fault monitoring was working fine.  The machines in question were healthy.  I gathered countless logs and did countless tests.  The call ended up getting escalated not just once, but twice.  A few weeks ago I did some SQL queries on behalf of a PSS engineer.  We could see that performance data stopped being stored in the OpsMgr reporting database some time after the upgrade.

Other agents were fine. We started focusing on comparing working agents with the two non-working agents. Everything checked out, so we started getting particularly paranoid about things like service packs and regional settings. I really didn’t like that, because we hadn’t had any problems with these machines until maybe a month after we upgraded to OpsMgr 2007 R2.

I was getting ready to give up yesterday afternoon.

I don’t know why I did it, but I went into the OpsMgr console to have a peek at some performance stats for another agent. One of the non-working agents was still selected from previous tests. Wait … I could see a graph for CPU utilisation. The agent was working. I checked more stats for disk and memory. They worked. I checked the other non-working agent. It was working too. Huh!

I fired up the reporting console and ran reports on the non-working machines for the last year. I had a complete graph with no data gaps. That’s strange. I ran a report on the period when I “knew” that data wasn’t being gathered. I had complete graphs with correct-looking numbers of data samples.

So it appears that data was being gathered but it wasn’t being processed correctly.  Even when I couldn’t see the data in reports, graphs or SQL queries, the data was there somewhere in a pre-processing stage, waiting to be added into the relevant tables.

OK, what had changed in the last month or so since I had last tried one of these reports? We had migrated from Windows Server 2008 Hyper-V to Windows Server 2008 R2 Hyper-V. Could there be a change in the way that performance data was gathered in a VM? Definitely not. Had we made any changes at the VM level? That’s when I remembered the issue in that blog post.

When I moved the OpsMgr VM, Hyper-V had to merge a snapshot that we had deleted some time beforehand. The VM had been running with the AVHD (snapshot/checkpoint differential disk) for 4 months. This started to affect performance of the VM so badly that TCP was having timeouts. There were performance issues that were virtual storage related. Could this have affected database operations in the VM? Of course it could, if things had reached the point of messing up TCP.

NOTE: I have only ever used snapshots in production for VM internals upgrade scenarios. I usually delete the snapshot after success is verified and allow a merge to take place. That means there should be no impact on performance as long as you do things in a timely manner. Somehow I must have forgotten to do that this time.

So here’s what I suspect happened. The OpsMgr agents actually worked perfectly. They gathered the performance stats the entire time and sent them to OpsMgr. I am guessing that OpsMgr caches the data for processing. Due to the unmerged AVHD/snapshot performance issues, the data stopped being processed correctly and sat in that cache. We know it didn’t make it to the point of being reportable because a direct SQL query showed a data gap. The problem reared its ugly head around a month after the snapshot was taken. The AVHD/snapshot was merged back in early November, and that resolved the performance issue for this VM. It also sorted out whatever hitch there was in performance processing for these agents. The data that was cached somewhere made its way into the reporting database, and live graphs suddenly appeared for the two machines in the OpsMgr console. That’s the funny bit; it only affected these two agents.

MS PSS are still curious.  The engineer seems to accept the explanation I’ve given him but he’s still curious to dig around and confirm everything, maybe try to see if he can get details on what happened internally.  I’ve got to credit him for that; most support staff would just close the call and move on.

So once again:

Hyper-V snapshots or VMM checkpoints, i.e. AVHD differential disks, should not be used in production. They are a form of differential VHD that doesn’t perform well at all. They really do affect performance, and I’ve seen the proof of that. In fact, they affect functionality in the most unpredictable of ways due to their performance impact. Use something like DPM instead for state captures via backup at the host level. That’s an issue right now with the lack of CSV support in DPM. If you really need that right now, then have a look at 3rd party providers, or wait until DPM 2010 is released (approx. April 2010) before you deploy CSV.


Microsoft Acquires Opalis

It was announced today that Microsoft acquired a company called Opalis.  Opalis provides solutions to:

  • Cloud Bursting – automate public cloud provisioning to handle peak loads and prevent SLA violations
  • Cloud Cover – automate failover to public or private clouds
  • Private Cloud Operation – create and manage service driven, flexible capacity with automation
  • Sophisticated triggering – subscribe to external events to trigger workflow processes that add, reduce or fail-over to cloud resources according to policies and SLAs

I was wondering if this would be something that would be used solely in Azure. But two things say “no” to me on that. First is the System Center badge on the above site. Second is the line “automate failover to public or private clouds”. Think of Azure as a public cloud. Think of an MS hosting partner running a Hyper-V based private cloud. We already know that MS plans to add the ability to migrate VMs to Azure from private clouds using VMM. Now I guess they have technology to allow for an automated failover or DR plan, i.e. you can run your daily operations in one cloud and fail over to another cloud.

I can see the bursting and triggering tying in nicely with the OpsMgr/VMM integration provided by PRO tips, e.g. OpsMgr sees a bottleneck and Opalis technology in VMM triggers a new VM deployment to cater for the load. When demand goes down, the burst VMs are drained and withdrawn. Sounds like a cool idea!

I wouldn’t expect to see this stuff appear for another 2 years.  We’ve just gotten VMM 2008 R2 and the Software Assurance cycle will next kick in around October/November 2011 (GA date, RTM being around August 2011).


It Seems The Big Buzz Right Now Is …

I was talking to a few consultants last week, and lots of the CIOs they are meeting are talking about one thing right now: Virtual Desktop Infrastructure, or VDI. They’ve been hearing this term from many sources. VMware has made a bit of a push on it, Citrix has made a huge push on it (seeing their Presentation Server, or whatever the hell it’s called this week, getting squeezed out by MS), and MS has released Remote Desktop Services in Windows Server 2008 R2. It seems these CIOs want to talk about nothing else right now.

I can understand the thinking about VDI. It can solve branch office issues by placing the desktop beside the data and server applications in the data centre. Unlike Terminal Services, a helpdesk engineer can make changes to a VDI machine without change control. Instead of PCs you can use terminals, which should be cheaper and should have no OS to manage. It all sounds like costs should be lower and all that “nasty” PC management should disappear. Right?

*Ahem* Not quite.

  • Branch Offices: Yes, this is true. By placing the VM, the user’s execution environment, in the data centre you speed up access to data and services for remote users. Let me ask a question here. How much does it cost to buy a PC? Around €400 or thereabouts will do for a decent office PC. It even comes with an OEM license for Windows. How much does it cost for 2GB RAM in a server? Around €200, not to mention the cost of the server chassis, the rack space, the power and the cooling. How about storage? A PC comes with a SATA disk. A 250 GB SATA drive for a server is around €250. It seems to me that we’ve already exceeded the up-front cost of the PC. I have done detailed breakdowns on this stuff at work to compare VDI with Terminal Services. With VDI there is no memory or storage usage optimisation. You get this with Terminal Services. My opinion has changed over time. Now I say if you want to do end user computing in the data centre then Terminal Services is probably the way to go.
  • Change Control: On a very basic VDI system, yes, a helpdesk engineer can fix a problem for an end user without change control. Terminal Services absolutely requires change control because a change to software on the server affects everyone. However, if you are using pooled VDI or trash’n’burn VDI (a VM invoked when a user logs in and destroyed when they log out) then there’s a good chance the problem returns when the user logs in again, thus requiring second or third level engineering.
  • Terminals Cheaper than PCs: Hah! I went out of my way at a recent Citrix VDI event here in Dublin to talk to one of the sponsors about terminals and their costs. Their terminals were about the same cost as a PC or laptop, depending on the form factor.
  • Terminals have less management than PCs: Uh, wrong again. There is still an operating system to manage on these machines, and it’s one that has less elegant management solutions. It still needs to be populated and controlled. I’ve also been unable to get an answer from anyone on whether EasyPrint support is added into any of the terminals out there. Without EasyPrint you either have an awful cross-WAN printing experience or pay up for expensive 3rd party printing solutions.
  • Terminals cheaper, part 2: The user still needs a copy of Vista or Windows 7 for their virtual machine – where does that come from? You need to know that you cannot go out and use just any old Windows license in a VDI environment. It has to be a special one called Virtual Enterprise Centralised Desktop (VECD). This can only be purchased if you have Software Assurance on your desktop … uh … but we’re running terminals without a Windows Vista/7 license. Yeah, ask your LAR about that one! And we know SA adds around 33% to your costs every 2 to 3 years. That PC with an OEM install of Windows 7 Professional or Ultimate is sounding pretty sweet right about now.
  • VDI is easier to manage: How do you manage a PC? You have to put AV on it, patch it, deploy software to it, report on license usage, use group policy, etc. That’s everything you also have to do with VDI, using the exact same techniques and systems. I see nothing so far about hardware management. Let’s look at that. You have to have 2 power sockets, a network socket and cabling, and every now and then one breaks and has to be replaced/repaired. That sounds like everything you have to do with a terminal. OK; the operating system on the machine? I grant you that one. A terminal has a built-in OS. A PC has to be installed, but you can easily use MDT (network or media) to build PCs with almost no effort, and it’s free. You also have ConfigMgr and WDS as alternative approaches. WDS even allows people to build their own PCs from an access controlled image.
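
The branch office arithmetic above can be sketched out. These figures are the rough ones quoted in the first bullet (around €400 for a PC versus roughly €200 of server RAM plus €250 of server storage per VDI seat, before chassis, rack space, power and cooling), so treat them as illustrative assumptions rather than a formal TCO model:

```python
# Illustrative per-seat cost comparison using the rough figures
# quoted above; real numbers will vary, this is not a TCO model.
pc_cost = 400            # decent office PC, incl. OEM Windows license

vdi_ram = 200            # 2GB of server RAM for one VM
vdi_storage = 250        # 250 GB server-grade SATA drive
vdi_seat = vdi_ram + vdi_storage   # excludes chassis, rack, power, cooling

print(vdi_seat)              # 450 – already more than the PC
print(vdi_seat > pc_cost)    # True, before the hidden data centre costs
```

And that is before licensing (VECD plus SA) is added on the VDI side, which is what tips the balance so firmly for me.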

For me, VDI is just too expensive to be an option right now. Why do you think Microsoft hasn’t been singing from the heavens about Remote Desktop Services? Sure, it’s a messy-looking architecture, but they know that the PC is here to stay for a long time yet. The PC is relatively cheap to buy and own. TCO? Citrix have screamed about that one since the days of WinFrame and they haven’t managed to convert the world. Sure, Citrix/Terminal Services is in most organisations, but it’s more of an application deployment solution for remote users than a PC replacement solution.

And let’s not forget that the PC paradigm is changing.  It’s expected that the ownership of the business PC will change from the business to the end user.  In fact it’s already happening.  The business can still retain some sort of control and protect itself using things like NAP and port access control.

Feel free to post a comment on what you think about what’s going to happen.


How Does VMM Place Virtual Machines?

How does Virtual Machine Manager know where to locate a virtual machine when it needs to migrate one, or when you go to create one?

If you don’t have VMM to manage your Hyper-V hosts, then placement is either 100% manual (when you migrate or create a VM yourself) or 100% automated (when there is a failover). There is no in-system intelligence involved.

VMM does it very differently, using Intelligent Placement. The basic premise is that VMM monitors key resources on the Hyper-V hosts that it manages. Using an algorithm that is either the default or customised by you, it takes those resource measurements and decides where to place a VM in one of a number of scenarios:

  • (Automated Placement) When a VM is created in the self-service console, it is automatically placed on a host by VMM based on the host ratings.
  • (Manual Placement) When you create a VM in the admin console it will recommend a host for you to choose. 
  • (Automated Placement) When there is a host failure VMM will use Intelligent Placement to move the VM to the highest rated host.
  • (Manual Placement) When OpsMgr and PRO tips initiate an alert, VMM will use Intelligent Placement to relocate VMs to the host with the most available resources, i.e. the highest rated host.
  • (Automated Placement) When you drag and drop a VM to a host group the VM will be automatically placed on a host in that group based on host ratings.

You can alter how the Intelligent Placement algorithm works on your VMM server.  There are two basic models:

  1. Resource Maximisation: This is the model you take when you want VMM to make the very most out of each and every host. VMM will try to place as many VMs on a single host as is reasonable.
  2. Load Balancing: The goal here is to get the very best performance from your VMs that you can. VMM will locate VMs in an effort to balance the resource utilisation across all hosts.

There are 4 basic resource types that will be utilised in the algorithm.  There is a slider to allow you to prioritise these resources when they are evaluated:

  1. CPU
  2. Memory (RAM)
  3. Disk I/O capacity
  4. Network capacity

To be honest, I think most people will choose the load balancing model and will prioritise CPU. Disk I/O and network capacity probably come next, depending on where your bottlenecks are. Those few going with the maximisation model will probably prioritise memory, because it then likely becomes the bottleneck resource.

How do these host ratings get calculated?  VMM measures the resources of the host around every 10 minutes.  There are circumstances that change the available resources of a host and thus the rating of the host.  These are:

  • New Virtual Machine
  • Deploy Virtual Machine
  • Store Virtual Machine
  • Migrate Virtual Machine
  • Delete Virtual Machine
  • Virtual Machine Turned On
  • Virtual Machine Turned Off, Stopped, Paused, Saved State

The host is rated only when a VM is to be placed. The gathered information is used to compare the host against the resources required by the new/moved virtual machine. A rating is generated, anywhere from 0 stars to 5 stars in half-star increments. The host ratings do not involve comparing and contrasting hosts; they simply show how suitable each host will be, based on empirical data and an estimation of what resources the VM will require in the future. In automated scenarios the host with the highest rating will be chosen. In manual scenarios it’s up to the administrator to agree with or reject the recommendation.

A number of circumstances can cause a host’s rating to be zero stars, i.e. VMM believes the placement to be unsuitable:

  • There is not enough RAM available for the VM you want to place on a host.
  • There is not enough available storage for a VM, e.g. a Windows Server 2008 Hyper-V cluster does not have an available LUN for the VM.
  • The virtual network the VM is configured to use is not available on the host.
  • Some advanced VM configuration is not supported by the host, e.g. advanced networking or high availability. You can still force a placement in this scenario by changing that setting when prompted by the wizard.

Intelligent Placement is an estimation based on empirical data combined with the tuning of the algorithm. It’s up to you to tune that algorithm to suit the VMs, hosts and business requirements of your organisation. VMM will then do its best by making recommendations to you when you move/create a VM, or when an automated action must be performed by VMM.
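
To tie the pieces together, here is an illustrative sketch of how a rating like this could be computed: disqualifiers first, then a weighted score scaled to 0–5 stars in half-star steps. The weights, the data shapes and the scoring formula are my own simplification for illustration; VMM’s actual algorithm is not published in this form.

```python
# Hypothetical sketch of an Intelligent Placement style host rating.
# Weights mimic the console slider (CPU, memory, disk I/O, network);
# the scoring formula is a simplification, not VMM's real algorithm.

def rate_host(host, vm, weights):
    # Hard disqualifiers: any one of these yields zero stars.
    if host["free_ram_gb"] < vm["ram_gb"]:
        return 0.0
    if host["free_storage_gb"] < vm["storage_gb"]:
        return 0.0
    if vm["virtual_network"] not in host["virtual_networks"]:
        return 0.0

    # Weighted score from free-capacity fractions (load balancing
    # model: the less loaded the host, the higher the rating).
    score = (
        weights["cpu"] * (1 - host["cpu_load"])
        + weights["memory"] * (1 - host["ram_load"])
        + weights["disk"] * (1 - host["disk_load"])
        + weights["network"] * (1 - host["net_load"])
    ) / sum(weights.values())

    return round(score * 5 * 2) / 2   # 0–5 stars, half-star steps

weights = {"cpu": 4, "memory": 2, "disk": 2, "network": 1}  # CPU prioritised
host = {"free_ram_gb": 16, "free_storage_gb": 500,
        "virtual_networks": {"Prod"}, "cpu_load": 0.30,
        "ram_load": 0.50, "disk_load": 0.20, "net_load": 0.10}
vm = {"ram_gb": 2, "storage_gb": 40, "virtual_network": "Prod"}
print(rate_host(host, vm, weights))   # 3.5 stars with these weights
```

In an automated scenario you would simply pick the host with the highest non-zero rating; in a manual scenario the admin sees the stars and makes the call, which is exactly the split described in the placement scenarios above.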
