When I worked in the VM hosting business, we offered monitoring via System Center Operations Manager as a part of the service. It was great for us as a service provider because we were aware of everything that was happening. One of the things I tried to do for customers was website monitoring, using an agent to fire client perspective tests at the customers’ website(s) to see if they were responsive. On more than one occasion, a customer would upload new code, assume it was OK, and OpsMgr would see the code failure in the form of an offline website. The customer (and we) got the alerts, and they could quickly undo the change.
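The kind of client-perspective test described above boils down to a simple HTTP probe: fetch the site, record the status and the latency. Here is a minimal sketch of one such probe; this is purely illustrative and is not OpsMgr's actual synthetic transaction engine (the function name, return shape, and timeout are my own).

```python
import time
import urllib.request
from urllib.error import URLError

def probe(url, timeout=5.0):
    """Fetch a URL and report availability and latency from this vantage point."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            latency = time.monotonic() - start
            return {
                "url": url,
                "ok": 200 <= resp.status < 400,   # treat 2xx/3xx as "responsive"
                "status": resp.status,
                "latency_s": round(latency, 3),
            }
    except URLError as exc:
        # Connection refused, DNS failure, timeout, HTTP error, etc.
        return {
            "url": url,
            "ok": False,
            "error": str(exc),
            "latency_s": round(time.monotonic() - start, 3),
        }
```

In practice, an agent would run a probe like this on a schedule and raise an alert when `ok` flips to `False`, which is exactly how a bad code upload shows up as an offline website.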
When you work in hosting, you learn what a mess the Internet is. Consider this example. I worked for a hosting company in Dublin (that’s on the east coast of Ireland). Our helpdesk got a bunch of calls from customers saying that the services we were providing to them were “offline”. That sent the networking engineers into a bit of a tizzy – oh, did I mention this was happening just as 99% of the staff were leaving for our Christmas party? Nice timing! The strange thing was that not all customers were having a problem. That suggested a routing issue, and the networking folks started making calls. In the end, it turned out that only customers of a certain ISP were affected. Their route sent packets to a router in Dublin, possibly only a kilometre away from our data centre (almost all of the major data centres, including the Dublin “Azure” one, are on one glow-in-the-dark road in south-west Dublin). From there, packets were routed to Germany. They bounced around there and, normally, came back to Dublin to our data centre. Something went wrong in Germany, and packets went in a loop before timing out. From the customers’ perspective, we were offline. A simple traceroute test would have highlighted the issue, but most (not all) hosting customers are … hmm … how do I put this? … special.
Hosting (or as it’s called now, public cloud) customers typically sell services globally. They need their product available everywhere. That means you have routes all over the globe to contend with. Take the above example and turn it into a rat’s nest of ISPs and peering arrangements all over the world. Those globally available web services are typically not just simple websites hosted in a single site, either. Any service needing a responsive user experience must use content distribution. That throws another variable into the mix. Testing the availability of the website from a single location will not do. You need to test globally.
Using an older-style tool, such as the client perspective website monitoring in OpsMgr 2007, you could do this by renting VMs in globally distributed data centres and installing agents on them. The problems with this approach are:
- Increased complexity.
- A reliance on those global data centres – would you rely on the Virginia Amazon data centre that’s made lots of headlines in recent months? What about Honest Jose’s Hosting in Argentina?
- Renting VMs adds a cost for the hosting company that must be passed on to the customer, and every cent added to the per-month charge makes the cloud service less competitive.
System Center 2012 SP1 Operations Manager includes a new feature called Global Service Monitor (GSM). It’s an Azure-based service that will perform the synthetic web transactions of client perspective monitoring for you, from locations around the world. This is an invaluable feature for any public-facing service, such as a public cloud (IaaS, web, or SaaS). The hosting/service provider can see how available (uptime and performance) their service is to customers worldwide, whether a problem lies in the internal infrastructure or in ISP routing.
The most difficult helpdesk ticket is the “slow” website. Using traditional tools, you can only do so much. The data warehouse in OpsMgr can rule out disk, memory, and CPU bottlenecks, but that doesn’t satisfy the customer. I haven’t tried this yet, but apparently GSM adds 360-degree dashboards, offering you availability and performance information using internal (from the data centre) and external (from GSM) metrics. That would be very useful when troubleshooting performance issues; you can see where the slowness begins if it happens externally, and you can redirect the customer to their local ISP if the fault lies there.
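The troubleshooting logic those dashboards enable, comparing measurements taken inside the data centre with measurements taken from outside, can be sketched as a simple triage rule. This is purely my own illustrative heuristic, not GSM's actual logic; the function name and thresholds are made up.

```python
def locate_slowness(internal_ms: float, external_ms: float,
                    threshold_ms: float = 2000.0, margin: float = 1.5) -> str:
    """Illustrative triage for a "slow website" ticket.

    Compares response times measured inside the data centre (internal_ms)
    with those measured from an external vantage point (external_ms).
    Returns "healthy", "internal", "network", or "inconclusive".
    """
    ext_slow = external_ms > threshold_ms
    int_slow = internal_ms > threshold_ms
    if not ext_slow:
        return "healthy"       # customers are seeing acceptable performance
    if int_slow:
        return "internal"      # the infrastructure itself is slow
    if external_ms > internal_ms * margin:
        return "network"       # slowness appears only on the external path
    return "inconclusive"
```

If the rule returns "network", the fault likely lies with an ISP along the route, and the customer can be redirected to their provider instead of the hosting company chasing a phantom server problem.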
If I were still in the hosting business, GSM is one of the features that would have driven me to upgrade OpsMgr to 2012 SP1.
See these Microsoft TechNet posts for more:
- Global Service Monitor for System Center 2012: Observing application availability from an “outside in” perspective
- Using multistep web tests for monitoring with Global Service Monitor Beta release