TEE14 – Lessons From Scale

I am live-blogging this, so hit refresh to see more.

Speaker: Mark Russinovich, CTO of Azure

Stuff Everyone Knows About Cloud Deployment

  • Automate: necessary to work at scale
  • Scale out instead of scale up: leverage cheap compute to get capacity and fault tolerance
  • Test in production – DevOps
  • Deploy early, deploy often

But there are many more rules, and that’s what this session is about: case studies from “real big” customers on-boarding to Azure. He omits the names of these companies, but most are recognisable.

Customer Lessons

30-40% have tried Azure already. A few are considering Azure. The rest are here just to see Russinovich!

Election Tracking – Vote Early, Vote Often

The customer (a US state) created an election tracking system for a live tally of US, state, and local elections. Voters can see a live tally online. A regional election worked out well, but they were concerned because the system was a little shaky even under that light-load election. They called in MSFT to analyze the architecture/scalability. The system was PaaS based.

Each Traffic Manager load-balanced (active/passive) view resulted in 10 SQL transactions. They expected 6,000,000 views in the peak hour – nearly 17,000 queries per second. Azure SQL Database scales to 5,000 connections, 180 concurrent requests, and 1,000 requests per second.


MSFT CAT put a cache between the front end and the DB, capable of 40,000 requests per instance. Now the web roles hit the cache (now called Azure Redis Cache), and the cache hits the results Azure DB.

At peak load, the site hit 45,000 hits/sec, well over the planned 17,000. They did a post-mortem: the original architecture would have failed BADLY. Even with the cache, they barely made it through the peak demand. Buffering the database saved their bacon.
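The caching layer described above is the classic cache-aside (read-through) pattern: serve repeat views from the cache and only go to the database on a miss or expiry. A minimal sketch, with a plain dict standing in for the Redis tier (class and parameter names are my own, not from the talk):

```python
import time

class ReadThroughCache:
    """Cache-aside layer between the web tier and the results DB.

    Backed by a dict here; in the real system this would be the
    Azure (Redis) cache tier sitting in front of Azure SQL.
    """

    def __init__(self, ttl_seconds, db_query):
        self.ttl = ttl_seconds
        self.db_query = db_query      # fallback: one round trip to the DB
        self.store = {}               # key -> (value, expires_at)
        self.db_hits = 0              # instrumentation for the sketch

    def get(self, key):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]           # served from cache: no DB load
        value = self.db_query(key)    # miss or expired: hit the DB once
        self.db_hits += 1
        self.store[key] = (value, now + self.ttl)
        return value
```

With a short TTL, a burst of 45,000 identical tally views per second collapses into a handful of database queries per expiry window, which is why the buffered design survived the peak.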

To The Cloud

A customer that does CAD for buildings, plants, civil and geospatial engineering.

They went with PaaS: web roles on the front, app worker roles in the middle, and IaaS SQL (mirrored DB) on the back end. When they tested, the Azure system had 1/3 of the capacity of the on-premises system.

The web and app tiers were on the same server on-premises. Adding a network hop and serialization of data transfers in the Azure implementation reduced performance. So they merged the web role and worker roles in Azure, deciding that colocation in the same VMs was fine: they didn’t need independent scalability.

Then they found the IOPS of a single VHD in Azure was too slow. They striped multiple VHDs into Storage Spaces pools/vdisks for logs and databases, creating a 16-VHD pool with one LUN for DBs and logs. That got them 4 times the IOPS.

What Does The Data Say?

A company that does targeted advertising and digests a huge amount of data to report to advertisers.

Data sources were imported to Azure blobs, and Azure worker roles sucked the data into an Azure DB. They used HDInsight to report on 7 days of data. They imported 100 CSV files of between 10 MB and 1.4 GB each, averaging 50 GB/day. Ingestion took 37 hours (more than a day, so analysis kept falling behind).

  1. They moved to Azure DB Premium.
  2. They parallelized import/ingestion by having more worker roles.
  3. They created a DB table for each day. This allowed easy truncation of the oldest (8th) day’s data while the new day’s data was ingested.

Together these changes solved the problem: an ingestion now took 3 hours instead of 37.
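Step 3 above, a table per day, turns the painful "delete the 8th day" problem into a cheap metadata operation: you drop the whole stale table instead of deleting millions of rows. A sketch of the scheme, using SQLite to stand in for Azure SQL (table and column names are hypothetical):

```python
import sqlite3
from datetime import date, timedelta

def table_for(day):
    """One table per calendar day, e.g. events_20141028."""
    return f"events_{day:%Y%m%d}"

def ingest_day(conn, day, rows):
    """Load a day's CSV-derived rows into that day's own table."""
    t = table_for(day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {t} (ad_id TEXT, clicks INT)")
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?)", rows)

def truncate_old(conn, today, keep_days=7):
    """Drop the table that just fell out of the 7-day window.

    DROP TABLE is O(1)-ish metadata work; DELETE on day 8 would scan
    and log every row.
    """
    old = table_for(today - timedelta(days=keep_days))
    conn.execute(f"DROP TABLE IF EXISTS {old}")
```

The parallel-import part of the fix (step 2) is then just running `ingest_day` from many worker roles at once; since each day lands in its own table, the writers don’t contend with the reporting window.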

Catch Me If You Can

A movie company called Link Box or something. Pure PaaS streaming: a web role talking via WCF binary remoting over TCP to a multi-instance cache worker role tier, a movie metadata database, and the movies themselves in Azure blobs, cached by CDN.

If the cache role rebooted or updated, the web role would overwhelm the DB. They added a second layer of cache in the web roles themselves, which removed pressure from the cache worker roles and the dependency on those roles being “always on”.

Calling all Cars

A connected car services company did pure PaaS on Azure: a web role for admin and a web role for users. The cars connect to Azure Service Bus to submit data to the cloud. The bus feeds multiple instances of message processor worker roles; the deployment also included cache and notification worker roles. The cache worked with a backend Azure SQL DB.

  • Problem 1: the message processing worker role (retrieving messages from the bus) was synchronous – 1 message processed at a time. They changed it to asynchronous – “give me lots of messages at once”.
  • Problem 2: processing was still one message at a time. They scaled out to process messages in parallel.
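The two fixes above are distinct: batch *retrieval* (don’t pay one round trip to the bus per message) and concurrent *processing* (don’t handle the batch serially). A sketch of both using `asyncio` with an in-memory queue standing in for Service Bus (function names and the batch size are my own assumptions):

```python
import asyncio

async def receive_batch(queue, max_messages=32):
    """Fix 1: pull as many messages as are ready, up to max_messages,
    instead of fetching one per round trip."""
    batch = [await queue.get()]        # block for at least one message
    while len(batch) < max_messages and not queue.empty():
        batch.append(queue.get_nowait())
    return batch

async def process(msg):
    await asyncio.sleep(0)             # stand-in for real per-message work
    return msg * 2

async def pump(queue, results):
    batch = await receive_batch(queue)
    # Fix 2: fan out – process the whole batch concurrently,
    # not one message at a time.
    results.extend(await asyncio.gather(*(process(m) for m in batch)))
```

Scaling out (their actual second fix) is the same idea one level up: run many such pumps across many worker role instances.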

Let me Make You Comfortable

IoT… thermostats that centralize data and provide a nice HVAC customer UI. Data is sent to the cloud service. The initial release failed to support more than 35K connected devices, but they needed 100K; the goal was to get to 150K devices.

Synchronous processing of messages by a web role that wrote to an Azure DB. A queue sent emails to customers via an SMTP relay. Another web role, accessing the same DB, allowed mobile devices to access the system for user admin. Synchronous HTTP processing was the bottleneck.

They changed it so interactive queries stayed synchronous while normal data imports (from the thermostats) became asynchronous. They changed DB processing from single-row to batched multi-row operations, moved hot DB tables from standard Azure SQL to Premium, and converted XML client parameters into DB fields to save parsing CPU.
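The single-row-to-batch change is worth seeing side by side: the original shape pays one statement (and its round trip and per-statement overhead) per thermostat reading, while the batched shape submits the whole set at once. A sketch with SQLite standing in for Azure SQL (table and column names are hypothetical):

```python
import sqlite3

def insert_single_row(conn, readings):
    # Original pattern: one statement per reading.
    for device_id, temp in readings:
        conn.execute("INSERT INTO readings VALUES (?, ?)",
                     (device_id, temp))

def insert_batch(conn, readings):
    # Redesigned pattern: one batched call for the whole set of
    # readings; the driver amortizes statement overhead across rows.
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
```

On a remote database the difference is dominated by round trips, so batching readings per device (or per time window) is usually the first and biggest win.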

The redesign increased capacity and reduced the number of VMs by 75%.
