We have a VM where the load has been slowly growing over time. Its peak season is right around now and we started getting alerts from Operations Manager on Friday. The contents of the alert were:
“Alert Monitor: PRO CPU Utilization
Source: MTGWSVR001 CPU utilization in the virtual machine has reached critical levels. The threshold monitor for this virtual machine has detected that the average of %Processor Time has been exceeded.
This monitor tracks the average CPU utilization for the virtual machine. The average Processor Time has exceeded the threshold. (The default threshold is 90 percent.)
The virtual machine is consuming too many CPU resources for its configuration.
Update the virtual machine configuration to allocate additional virtual CPU resources. For information about configuring the CPU requirements for a virtual machine, see Virtual Machine Manager 2008 R2 Help”.
The monitor in question is the interesting bit. We have Virtual Machine Manager (2008 or later) running and it is integrated with Operations Manager (2007 SP1 or later). We have a Windows Server 2008 R2 Hyper-V cluster which is being managed by VMM. PRO (Performance and Resource Optimization) tips is enabled on the master host group (the top level host group, containing child host groups). This allows OpsMgr to feed virtualisation performance alerts to VMM and VMM will act on them.
When the VM started getting increased resource demands it needed to use more CPU. Eventually it got to the point where the CPU was being maxed out. The PRO tips monitor in question runs every 60 seconds. It measures the CPU utilisation of the VM. If 3 sequential samples are greater than 90% CPU utilisation the monitor will create an alert. That alert will auto resolve when things quieten down. That is because it is a monitor, which is a state engine, i.e. it is aware of healthy and unhealthy states, unlike a basic rule.
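To make the monitor versus rule distinction a bit more concrete, here is a rough Python sketch of that kind of state engine. It is not the management pack's actual logic: the `CpuMonitor` class and its method names are made up for illustration, and I am assuming that a single sample at or below the threshold is enough to auto resolve the alert, which the real monitor may well handle differently.

```python
class CpuMonitor:
    """Hypothetical sketch of a consecutive-sample threshold monitor."""

    def __init__(self, threshold=90.0, samples_needed=3):
        self.threshold = threshold            # percent, as quoted in the alert
        self.samples_needed = samples_needed  # sequential samples over threshold
        self.breaches = 0                     # current run of samples over threshold
        self.healthy = True                   # the state a basic rule would not keep

    def observe(self, cpu_percent):
        """Take one sample (one per 60 seconds in the post); return an event or None."""
        if cpu_percent > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0

        if self.healthy and self.breaches >= self.samples_needed:
            self.healthy = False
            return "alert"
        # Assumption: one sample at or below the threshold resolves the alert.
        if not self.healthy and self.breaches == 0:
            self.healthy = True
            return "resolved"
        return None


monitor = CpuMonitor()
samples = [95, 97, 85, 92, 95, 99, 99, 40]
events = [monitor.observe(s) for s in samples]
print(events)
```

The first two high samples do not raise anything because the third sample drops below 90%. The alert only fires once three high samples arrive in a row, it is not raised again while the VM stays busy, and it clears when the load drops.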
Because PRO tips was enabled, VMM was able to move the VM from its current host to another host. That move was done using Live Migration so there was no downtime associated with the move of the VM. This means that other VMs on the original host weren’t being deprived of resources. Moving the VM to another, less utilised host gave it more CPU resources that it could use. Which host was best? That was decided by VMM using Intelligent Placement, which I blogged about last week.
What I’ve just described was dynamic IT. A problem was automatically detected and resolved using two System Center products working closely together. I was alerted to the issue. I didn’t need to do anything right there and then because the alert auto resolved immediately after PRO tips live migrated the VM. I talked to the customer of the VM and found out that this is peak season for them and CPU demands would be high. We scheduled a maintenance window for early this morning. The VM was powered down, an extra virtual CPU was added and the VM was powered back up again. It took less than 5 minutes and now the VM has all the CPU it needs.