I wasn’t always an IT pro. In college I pretty much wanted to be on the development side of things. I coded in COBOL, Pascal, C and C++ on Unix, VMS and Windows. Very early on I learned an important lesson: when we're doing complex work, we can get so used to what we're doing that we stop seeing it. I would write a long piece of code, go to compile, and get some unhelpful syntax or logic error that I just couldn't find.

A great example came when I left college and started work. Our new team leader wanted to test us, so he gave us a network coding challenge. I finished early, so I decided to get fancy with my code. I had this mad Boolean calculation to decide whether or not to spin off a new listener … I was trying to put an entire function's worth of logic into the IF statement! I got one tiny little thing wrong but didn't notice. It compiled and linked, and I ran it. Everyone's session on the Solaris box died. It came back a few minutes later and I ran it again. Boom! Ten minutes later a very angry sysadmin came storming down wanting to know who was consuming every available TCP port on his server.
I went through my code and couldn’t find what was wrong. I called in some of the others and they saw it immediately. Sometimes a fresh pair of eyes can spot the simplest of issues while you’re looking for the complex one.
I had an issue with a pair of new OpsMgr agents. The machines run Windows Server 2008 x64 and sit in a firewalled workgroup, which means I need to use certificates. I installed the certs and used MOMCERTIMPORT /SUBJECTNAME <SUBJECTNAME> to load the cert. The cert was in the computer’s Personal store and the CA cert was in the Trusted Root Certification Authorities store, and the certification path was fine. I checked that MOMCERTIMPORT had copied the cert into the Operations Manager store, and it had. But when I restarted the agent, it couldn't find a cert to load. The agent did not appear in Pending Management in the Operations Manager 2007 console, and the following events appeared in the agent computer’s Operations Manager event log:
"SOURCE: OpsMgr Connector
EVENT ID: 21022
No certificate was specified. This Health Service will not be able to communicate with other health services unless those health services are in a domain that has a trust relationship with this domain. If this health service needs to communicate with health services in untrusted domains, please configure a certificate."
This is quickly followed by:
"SOURCE: HealthService
EVENT ID: 7006
The Health Service has published the public key [1E 48 38 90 8A 46 11 B2 43 17 DC 64 0D 5A F4 A5 ] used to send it secure messages to management group CInfinity. This message only indicates that the key is scheduled for delivery, not that delivery has been confirmed."
"SOURCE: OpsMgr Connector
EVENT ID: 21007
The OpsMgr Connector cannot create a mutually authenticated connection to cinwsvr003.cinfinity.ie because it is not in a trusted domain."
"SOURCE: OpsMgr Connector
EVENT ID: 21016
OpsMgr was unable to set up a communications channel to cinwsvr003.cinfinity.ie and there are no failover hosts. Communication will resume when cinwsvr003.cinfinity.ie is both available and allows communication from this computer."
I did everything I could think of to fix this. Eventually I gave up and called Microsoft PSS. One quick call with a very helpful engineer from India and we had identified the issue. I think he’d already figured it out before he called me, because he zeroed in on the diagnosis immediately. It helps if you give them EVERYTHING you can think of when logging the call; my ticket was probably two A4 pages long.
We fired up REGEDIT and went to HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Machine Settings. There should be a value in there called ChannelCertificateSerialNumber containing the serial number of the agent certificate. I didn’t have one at all, so the agent didn’t know which cert to load from the Operations Manager store. I’d used UAC to elevate my CMD prompt to administrator, so it wasn’t a permissions problem.
We then looked at MOMCERTIMPORT.EXE. Mine was around 44KB in size, and the engineer knew straight away what was wrong. I keep a copy of the installation media on my servers for troubleshooting. I opened the OpsMgr SP1 media and in Support Tools\AMD64 I could see that MOMCERTIMPORT.EXE was 51KB. I had used the 32-bit version of MOMCERTIMPORT on a 64-bit installation. It fails to create the registry value and does not log an error.
I copied the correct version onto the agent server, re-ran the command and restarted the OpsMgr Health Service. Almost immediately the agent appeared in Pending Management on the OpsMgr console and I approved it. Ten minutes later it was being monitored.
So the lesson is relearned. If you are bashing your head against the wall with a problem, get a fresh “pair of eyes” to analyse the issue; you might get lucky and they’ll spot the simple cause straight away. Oh … and make sure you only use the 64-bit version of MOMCERTIMPORT.EXE on 64-bit Windows 🙂
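Comparing file sizes worked here, but it's a fragile way to tell the two builds apart. A more direct check is to read the Machine field from the executable's PE header, which records whether the binary was built for x86 or x64. The Python sketch below does that. It isn't part of OpsMgr or anything the PSS engineer used, the function names are hypothetical, and only the machine-type constants come from the PE/COFF format itself.

```python
import struct
import sys

# Machine-type values defined by the PE/COFF format (only the common ones).
MACHINE_TYPES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x64 (64-bit)",
    0x0200: "Itanium (IA-64)",
}


def pe_machine(data: bytes) -> str:
    """Return the target architecture recorded in a PE image's COFF header."""
    # Every PE executable starts with the DOS "MZ" signature.
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The little-endian 32-bit value at offset 0x3C is the offset of "PE\0\0".
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The Machine field is the first 16-bit value after the PE signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_TYPES.get(machine, f"unknown (0x{machine:04X})")


def pe_machine_of_file(path: str) -> str:
    """Read an executable from disk and report its architecture."""
    with open(path, "rb") as f:
        return pe_machine(f.read())


if __name__ == "__main__":
    # Usage: python pe_arch.py <file.exe> [<file.exe> ...]
    for path in sys.argv[1:]:
        print(f"{path} -> {pe_machine_of_file(path)}")
```

Saved under a hypothetical name such as pe_arch.py and pointed at the copy of MOMCERTIMPORT.EXE from the Support Tools\AMD64 folder, it should report x64 (64-bit), while the 32-bit build should report x86 (32-bit).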