A Particularly Odd OpsMgr 2007 Problem (And Solution)

The Operations Manager 2007 agent and management server communicate with each other and perform mutual authentication using Kerberos.  They’re in the same forest and hence in the same Kerberos domain.  But what happens if you have agents outside the forest?  If you read anything from Microsoft (or the OpsMgr book I just bought) you’d be left under the impression that you must install the OpsMgr gateway.  You’d then install a custom X.509 cert (requiring a cert server running on Windows Enterprise Edition) on that machine and on the OpsMgr server.  There are two problems with this:

  • What if the untrusted network is a workgroup, e.g. a DMZ?  There’s no Kerberos domain that the agents on that network could use to authenticate with the Gateway.
  • What if you are monitoring many networks with only one or two agents on each network?  Are you going to install lots and lots of Gateways?

If you are persistent with your searches you will find that:

  • There is one mention by Microsoft in a downloadable Word document that you can install agents with the X.509 cert so that the agents can communicate directly with the management server.
  • There is an almost complete guide by Duncan McAlynn on how to install the certs using MOMCERTIMPORT /SUBJECTNAME (the subject name is the name of the cert in the certificate store).

Duncan appears to be the only person to have attempted to document this process so he deserves credit for it.  The MS documentation folks have done a poor job with OpsMgr, e.g. failing to cover this subject and failing to document complete management pack authoring.  The instructions for setting up the CA are in the OpsMgr 2007 Security Guide and Duncan walks you through installing the agent.  The only missing step is that you need to install and import the CA and agent certs on the OpsMgr management server(s) so that they have a means for mutual authentication with the agents.

I’d been doing this successfully on servers and then I hit one server where the agent could not use the cert.  I saw the following in the Operations Manager Event Log:

Source: OpsMgr Connector

Type: Error

Event ID: 21036

The certificate specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication.  The error is The credentials supplied to the package were not recognized (0x8009030D).

I reissued that cert, re-imported it and re-installed the agent half a dozen times.  I’d opened a call with MS (thanks to IT Pro Momentum) but the first PSS agent was not the Mae West to deal with.  He kept claiming that my CA was at fault but I knew it wasn’t because other agents were fine.  Finally the ticket got reassigned to Brian who was a pleasure to work with.

He started coming up with some new ideas straight away.  The first was that the cert store might be corrupt.  I tried a fix for that (CERTUTIL -F -REPAIRSTORE MY "<thumbprint of agent cert>") but that didn’t fix the problem.  Brian asked if we could look at the server together using "EasyAssist" … it’s MS’s answer to WebEx or LogMeIn so they can get Remote Assistance over web friendly protocols.  We poked around and saw something interesting.

  • The CA cert in Computer\Trusted Root Authorities was fine.
  • The agent cert in the Computer\Personal store was fine.  The certification path was fine.
  • When you run MOMCERTIMPORT it copies the cert into Computer\Operations Manager in the certificate store.  I had overlooked this.  Here, the certification path was invalid.  Weird, because it was fine in the Computer\Personal store.

We manually imported the cert into there and the certification path was still screwed.  We re-imported the CA cert but it was still screwed.  We re-imported the CA cert and the Operations Manager copy of the cert.  This time the certification path was fine but the agent didn’t appear to be using the cert.  We re-ran MOMCERTIMPORT and the certification path was invalid again.  OK … I thought we’d try this:

  • Delete all copies of the agent and CA certs from the certificate store.
  • Brian suggested restarting the Cryptographic Services and OpsMgr Health services.
  • I went through the process of re-importing: import the CA cert into Computer\Trusted Root Authorities, import the agent PFX into Computer\Personal, re-run MOMCERTIMPORT /SUBJECTNAME and restart the OpsMgr Health service.
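The order of those steps mattered, so here is a rough sketch of the sequence as a Python function that only builds the command lines and does not run anything.  It is not something Brian or I used on the day.  The service names HealthService and CryptSvc, the certutil switches (-addstore, -importPFX), the order of the service restarts and the file names in the example are my assumptions; only MOMCERTIMPORT /SUBJECTNAME comes from the steps above.  The delete step is left as a manual step because we did it by hand.

```python
from typing import List, Optional, Tuple

# A step is a description plus a command line; None means "do this by hand".
Step = Tuple[str, Optional[List[str]]]


def recovery_steps(ca_cert: str, agent_pfx: str, subject_name: str) -> List[Step]:
    """Return the recovery steps in order. Nothing is executed here."""
    return [
        # Done manually in the Certificates snap-in in the original fix.
        ("Delete all copies of the agent and CA certs", None),
        # Assumed service names: HealthService (OpsMgr Health service)
        # and CryptSvc (Cryptographic Services). The ordering is my choice.
        ("Stop the OpsMgr Health service", ["net", "stop", "HealthService"]),
        ("Stop Cryptographic Services", ["net", "stop", "CryptSvc"]),
        ("Start Cryptographic Services", ["net", "start", "CryptSvc"]),
        ("Start the OpsMgr Health service", ["net", "start", "HealthService"]),
        # Assumed certutil usage: CA cert into the machine Root store,
        # agent PFX into the machine Personal (My) store.
        ("Import the CA cert", ["certutil", "-addstore", "Root", ca_cert]),
        ("Import the agent PFX", ["certutil", "-importPFX", agent_pfx]),
        ("Register the cert with the agent",
         ["MOMCertImport", "/SubjectName", subject_name]),
        ("Stop the OpsMgr Health service", ["net", "stop", "HealthService"]),
        ("Start the OpsMgr Health service", ["net", "start", "HealthService"]),
    ]


if __name__ == "__main__":
    # Hypothetical file names and subject name, for illustration only.
    for text, cmd in recovery_steps("ca.cer", "agent.pfx", "agent01.example.com"):
        print(text, "->", " ".join(cmd) if cmd else "(manual)")
```

Printing the list first and running each line by hand is deliberate: if the certification path goes invalid again you can see exactly which step caused it.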

Lo and behold … it worked!  In fact, it worked so well that we detected a hardware fault on the server that we hadn’t known about.  Sweet; OpsMgr rules!

A big "Thank You" to Brian for helping out on that one.  For the most part, I’ve always had good dealings with MS PSS agents going back to 2003.  It was good to see this one being rescued so professionally.

Official: Support for Operations Manager 2007 on Windows Server 2008

Microsoft has just given us the green light to install OpsMgr 2007 on W2008.  We’ve been waiting since February but we finally have support and as I mentioned earlier today, we saw the first few management packs hit the streets. 

It’s a complicated process to be compliant before installing SCOM 2007 on Windows 2008.  You have to first install 3 updates:

Then you need to install a hotfix rollup.

Operations Manager Management Packs for Windows Server 2008

Finally!  Microsoft has released a set of management packs that include monitoring support for Windows Server 2008.  These include:

 

I haven’t seen anything on agent support for 2008 yet.  I was under the impression that a patch would be required.  Hold off on deploying agents to 2008 until you read something official from MS.

Service Level Dashboard Management Pack for SCOM 2007

Why is System Center Operations Manager 2007 different to everything else?  You’ve already heard about management packs: how they use state models instead of just traditional triggers and how they use the expertise of the monitored product’s vendor.  The other big difference is that SCOM recognises that IT is there to serve a customer.  Think of this from the ITIL or MOF point of view.  IT provides services to a customer, either someone in the same organisation or a client who subscribes to the service.  That customer doesn’t care about IIS sites, disk utilisation or CPU interrupt time.  They care about the uptime and performance of their service, e.g. the user who complains about there "being no Internet" doesn’t care if a network switch is dead.  Their service enables their business.  SCOM gives you the ability to model that service using a distributed application model.  Up to now, giving the customer visibility of their service was messy.

Microsoft has just released the Service Level Dashboard Management Pack for Operations Manager 2007.  This allows you to use an accelerator to present the availability and performance of the service to a customer in a more accessible manner.

You can watch a video on the subject on MSN.  There’s also an executive summary on TechNet (note the MS links are mostly dead for this one so use my link).

Here’s what Microsoft has to say:

"The Service Level Dashboard Management Pack for Operations Manager 2007 assists you in tracking, managing, and reporting on your line-of-business (LOB) application service level compliance. It displays a list of applications and their performance and availability against a target SLA.

The application or service is defined using the Operations Manager distributed application model. This model allows the user to define all components of the application or service that affect the health state and SLA calculation. When an application does not meet the defined performance or availability thresholds, it is placed into a warning or error state within Operations Manager. This state shows the current status of an application relative to its defined thresholds.

The Service Level Dashboard report uses the history of the state of an application to calculate the time the application was in each state over the duration of the report. Based on this information, the report derives a performance and availability percentage for the time period that the report covers".
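To make that last paragraph concrete, here is a minimal sketch of the arithmetic as I understand it from the description: add up the time spent in each state and express it as a percentage of the report period.  This is not the management pack’s actual code.  The state names, the example hours and the choice to count only "healthy" time towards the SLA are my assumptions.

```python
from typing import Dict, Iterable, Tuple


def state_percentages(history: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Turn (state, hours) records into the percentage of time per state."""
    totals: Dict[str, float] = {}
    for state, hours in history:
        totals[state] = totals.get(state, 0.0) + hours
    period = sum(totals.values())
    if period == 0:
        return {}
    return {state: 100.0 * hours / period for state, hours in totals.items()}


def meets_sla(history: Iterable[Tuple[str, float]], target_percent: float,
              healthy_state: str = "healthy") -> bool:
    """Assumption: only time in the healthy state counts towards the target."""
    return state_percentages(history).get(healthy_state, 0.0) >= target_percent


if __name__ == "__main__":
    # Hypothetical 30-day (720 hour) report period.
    history = [("healthy", 700.0), ("warning", 12.0), ("error", 8.0)]
    for state, pct in state_percentages(history).items():
        print(f"{state}: {pct:.2f}%")
    print("Meets a 99% target:", meets_sla(history, 99.0))
```

In this made-up example the application was healthy for 700 of 720 hours, which is a little over 97%, so it would miss a 99% target.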

Audit Collection Database and Disk Sizing Calculator for SCOM 2007

I went to TechEd for the first time in Amsterdam in 2004.  One of the cool things I heard about was a product in the works called Audit Collection Services.  This was going to be a free download from Microsoft (like WSUS) that would be an intelligent version of Syslog for Microsoft products.  Intelligent?  Have a look at the security logs on a Windows box when auditing is enabled and tell me if you can figure things out.  MS’s developers identified the important messages that allowed you to track those events and would gather them into a dedicated and centralised SQL database in near real time.

We waited and waited but nothing got released.  Nobody was talking.  Then the news came out: it was going to be in the next version of Microsoft Operations Manager (we were still at MOM 2005 at the time) and not a free download.  I first got to play with System Center Operations Manager 2007 while it was in beta back in 2006.  ACS was one of the components I was most interested in.  I listened to an MS webcast and immediately got scared.  They had no way to calculate how big the database for ACS would be.  It’s still a dedicated database, allowing auditors and security officers to have sole access.

Think about this for a moment.  Every network is different.  Some networks have normal amounts of user activity.  Some more and some less.  Some networks are Internet facing and are attacked a lot and some are quietly isolated.  There was no real way to calculate the disk requirements without significant empirical data.  All MS could say was that they used terabytes of disk every month, 8 I think (I could be wrong with that number – it was 2 years ago).

I’ve just read that a SCOM MVP called Pete Zerger has built an ACS requirements calculator using guesstimates.  According to the MOM Team blog, it looks pretty accurate compared to customer data that they are familiar with.
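I don’t know what formula Pete’s calculator uses, so the following is not his method.  It is just a back-of-the-envelope sketch of the kind of sum involved: event rate multiplied by average event size multiplied by retention period.  Every input below is a hypothetical figure that you would have to measure on your own network.

```python
def acs_database_gb(events_per_second: float, bytes_per_event: float,
                    retention_days: float) -> float:
    """Rough database size in GB: event rate x event size x retention.

    Ignores indexes, logs and growth headroom; all inputs are guesstimates.
    """
    seconds_retained = retention_days * 24 * 60 * 60
    total_bytes = events_per_second * bytes_per_event * seconds_retained
    return total_bytes / 1024 ** 3


if __name__ == "__main__":
    # Hypothetical inputs: 100 events/sec, 500 bytes each, kept for 14 days.
    size = acs_database_gb(events_per_second=100, bytes_per_event=500,
                           retention_days=14)
    print(f"Estimated ACS database size: {size:.1f} GB")
```

The point of the sketch is that the event rate dominates everything, which is exactly why a quiet isolated network and a heavily attacked Internet facing one need such different amounts of disk.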

ACS is a really cool tool.  If you’re using SCOM 2007 and need some sort of central security logging or auditing solution then it just makes so much sense to enable it.  Have a read about Audit Collection Services and see what you think.

Credit: Pete Zerger.