Understanding Big Data on Azure–Structured, Unstructured, and Streaming

These are my notes from the recording of this Ignite 2017 session, BRK2293.

Speaker: Nishant Thacker, Technical Product Manager – Big Data

This Level-200 overview session is a tour of big data in Azure. It explains why the services were created and what their purpose is. It is a foundation for the rest of the related sessions at Ignite. Interestingly, only about 30% of the audience had done any big data work in the past – I fall into the other 70%.

What is Big Data?

  • Definition: A term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Nishant stresses “complex”.
  • Challenges: Capturing (velocity) data, data storage, data analysis, search, sharing, visualization, querying, updating, and information privacy. So … there are a few challenges.

The Azure Data Landscape

This slide is referred to for quite a while:


Data Ingestion

He starts with the left-top corner:

  • Azure Data Factory
  • Azure Import/Export Service
  • Azure CLI
  • Azure SDK

The first problem we have is data ingestion into the cloud or any system. How do you manage that? Azure can manage ingestion of data.

Azure Data Factory is a scheduling, orchestration, and ingestion service. It allows us to create sophisticated data pipelines from the ingestion of the data through to processing, through to storing, through to making it available to end users to access. It does not have compute power of its own; it taps into other Azure services to deliver any required compute.

The Azure Import/Export service can help bring incremental data on board. You can also use it to bulk load on Azure. If you have terabytes of data to upload, bandwidth might not be enough. You can securely courier data via disk to an Azure region.
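
To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. The 50 TB data size, the link speeds, and the 80% usable-bandwidth figure are all my own assumptions for illustration, not numbers from the session.

```python
def upload_days(data_terabytes: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate how many days an upload takes over a given link."""
    total_bits = data_terabytes * 1e12 * 8                 # decimal terabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency  # assumed usable share of the link
    return total_bits / usable_bits_per_second / 86_400    # 86,400 seconds per day


for mbps in (100, 1_000):
    print(f"50 TB over {mbps} Mbps: about {upload_days(50, mbps):.1f} days")
```

Under these assumptions, 50 TB needs roughly 58 days on a 100 Mbps link and just under 6 days on a 1 Gbps link, which is why shipping disks can be the faster option.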

The Azure CLI is designed for bulk uploads to happen in parallel. The SDKs can be put into your code, so you can generate the data in your application in the cloud, instead of uploading to the cloud.
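
The idea of uploading in parallel can be sketched in a few lines. This is only an illustration of the pattern: `upload_one` is a hypothetical stand-in that copies bytes into a dictionary, not a real Azure CLI or SDK call.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_one(name: str, payload: bytes, store: dict) -> int:
    # Hypothetical stand-in for a real upload call: it just copies into a dict.
    store[name] = payload
    return len(payload)


def upload_all(files: dict, store: dict, workers: int = 4) -> int:
    """Upload every file concurrently and return the total bytes sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = pool.map(lambda item: upload_one(item[0], item[1], store), files.items())
        return sum(sizes)


# Ten fake files of 1 to 10 bytes each.
files = {f"log-{i}.txt": b"x" * (i + 1) for i in range(10)}
cloud = {}
total = upload_all(files, cloud)
print(f"uploaded {len(cloud)} files, {total} bytes")
```

With real network uploads, most of the time is spent waiting on I/O, so a pool of workers keeps several transfers in flight at once instead of sending one file after another.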

Operational Database Services

  • Azure SQL DB
  • Azure Cosmos DB

The managed SQL database offerings cover SQL Server (Azure SQL DB), MySQL, and PostgreSQL.

Cosmos DB is the more interesting one – it’s NoSQL and offers global storage. It also supports 4 programming models: Mongo, Gremlin/Graph, SQL (DocumentDB), and Table. You have flexibility to bring in data in its native form, and data can be accessed in an operational environment. Cosmos DB plugs into other parts of Azure, such as Azure Functions or Spark (HDInsight), which makes it more than just an operational database.

Analytical Data Warehouse

  • Azure SQL Data Warehouse

When you want to do reporting and dashboards from data in operational databases then you will need an analytical data warehouse that aggregates data from many sources.

Traits of Azure SQL Data Warehouse:

  • Can grow, shrink, and pause in seconds – up to 1 Petabyte
  • Full enterprise-class SQL Server – means you can migrate databases and bring your scripts with you. Independent scale of compute and storage in seconds
  • Seamless integration with Power BI, Azure Machine Learning, HDInsight, and Azure Data Factory

NoSQL Data

  • Azure Blob storage
  • Azure Data Lake Store

When your data doesn’t fit into the rows-and-columns structure of a traditional database, you need specialized big data storage, both for capacity and for storing/reading unstructured data.

Unstructured Data Compute Engines

  • Azure Data Lake Analytics
  • Azure HDInsight (Spark / Hadoop): managed clusters of Hadoop and Spark with enterprise-level SLAs with lower TCO than on-premises deployment.

When you get data into big unstructured stores such as Blob or Data Lake, you need specialized compute engines for the complexity and volume of the data. This compute must be capable of scaling out because you cannot wait hours/days/months to analyse the data.

Ingest Streaming Data

  • Azure IoT Hub
  • Azure Event Hubs
  • Kafka on Azure HDInsight

How do you ingest this real-time data as it is generated? You can tap into event generators (e.g. devices) and buffer up data for your processing engines.

Stream Processing Engines

  • Azure Stream Analytics
  • Storm and Spark streaming on Azure HDInsight

These systems allow you to process streaming data on the fly. You have a choice of “easy” or “open source extensibility” with either of these solutions.
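
As an illustration of what processing on the fly can mean, here is a minimal sketch of a tumbling window: events are grouped into fixed, non-overlapping time slots and averaged per device. The event layout (timestamp in seconds, device name, reading) and the 60-second window are assumptions of mine. A real engine would do this incrementally as events arrive, not over a finished list.

```python
from collections import defaultdict


def tumbling_average(events, window_seconds=60):
    """Average readings per device in fixed, non-overlapping time windows."""
    sums = defaultdict(lambda: [0.0, 0])  # (window start, device) -> [total, count]
    for timestamp, device, value in events:
        window_start = timestamp - (timestamp % window_seconds)
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample events: (seconds since start, device, reading)
events = [
    (0, "sensor-a", 20.0),
    (30, "sensor-a", 22.0),
    (65, "sensor-a", 30.0),
    (70, "sensor-b", 5.0),
]
averages = tumbling_average(events)
for (window_start, device), average in sorted(averages.items()):
    print(window_start, device, average)
```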

Reporting and Modelling

  • Azure Analysis Services
  • Power BI

You have cleansed and curated the data, but what do you do with it? Now you want some insights from it. Reporting & modelling is the first level of these insights.

Advanced Analytics

  • Azure Machine Learning
  • ML Server (R)

The basics of reporting and modelling are not new. Now we are getting into advanced analytics. Using data from these AI systems we can predict outcomes or prescribe recommended actions.

Deep Learning

  • Cognitive Services
  • Bot Service

Taking advanced analytics to a further level by using these toolkits.

Tracking Data

  • Azure Search
  • Azure Data Catalog

When you have such a large data estate you need ways to track what you have, and to be able to search it.

The Azure Platform

  • ExpressRoute
  • Azure AD
  • Network Security Groups
  • Azure Key Management Service
  • Operations Management Suite
  • Azure Functions (serverless compute)
  • Visual Studio

You need a platform that delivers these enterprise capabilities well and in a compliant manner.

Big Data Services

Nishant says that the darker-shaded services are the ones usually being talked about when they talk about Big Data:


To understand what all these services are doing as a whole, and why Microsoft has gotten into Big Data, we have to step all the way back. There are 3 high-level trends that are a kind of an industrial revolution, making data a commodity:

  • Cloud
  • Data
  • AI

We are on the cusp of an era where every action produces data.

The Modern Data Estate


There are 2 principles:

  • Data on-premises and
  • Data in the cloud

Few organizations are just 1 or the other; most span both locations. Data warehouses aggregate operational databases. Data Lakes store the data used for AI, and will be used to answer the questions that we don’t even know of today.

We need three capabilities for this AI functionality:

  • The ability to reason over this data from anywhere
  • You have the flexibility to choose – MS (simplicity & ease of use), open source (wider choice), programming models, etc.
  • Security & privacy, e.g. GDPR

Microsoft has offerings for both on-premises and in Azure, spanning MS code and open source, with AI built-in as a feature.

Evolution of the Data Warehouse


There are 3 core scenarios that use Big Data:

  • Modern DW: Modernizing the old concept of a DW to consume data from lots of sources, including complexity (big data)
  • Advanced Analytics: Make predictions from data using Deep Learning (AI)
  • IoT: Get real time insights from data produced by devices

Implementing Big Data & Data Warehousing in Azure

Here is a traditional DW


Data from operational databases are fed into a single DW. Some analysis is done and information is reported/visualized for users.

SQL Server Integration Services, a part of Azure Data Factory, allows you to consume data from your multiple operational assets and aggregate it in a DW.

Azure Analysis Services allows you to build tabular models for your BI needs, and Power BI can be used to report and visualize those models.

If you have existing huge repositories of data that you want to bring into a DW then you can use:

  • Azure CLI
  • Azure Data Factory
  • BCP Command Line Utility
  • SQL Server Integration Services

This traditional model breaks when some of your data is unstructured. For example:


Structured operational data is coming in from Azure SQL DB as before.

Log files and media files are coming into blob storage as unstructured data – the structure of queries is unknown and the capacity is enormous.  That unstructured data breaks your old system but you still need to ingest it because you know that there are insights in it.

Today, you might only know some of the questions that you’d like to ask of the unstructured data. But later on, you might have more queries that you’d like to create. The vast economy of scale of Azure storage makes this feasible.

ExpressRoute will be used to ingest data from an enterprise if:

  • You have security/compliance concerns
  • There is simply too much data for normal Internet connections

Back to the previous unstructured data scenario. If you are curating the data so it is filtered/clean/useful, then you can use Polybase to ingest it into the DW. Normally, that task of cleaning/filtering/curating is too huge for you to do on the fly.


HDInsight can tap into the unstructured blob storage to clean/curate/process it before it is ingested into the DW.

What does HDInsight allow you to do? You forget that the data was structured/unstructured/semi-structured. You forget the complexity of the analytical queries that you want to write. You forget the kinds of questions you would like to ask of the data. HDInsight allows you to add structure to the data using some of its tools. Once the data is structured, you can import it into the DW using Polybase.
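
To show what “adding structure” can look like in the simplest case, here is a small sketch that turns raw log lines into rows with named columns and drops lines that cannot be parsed. The log format and the sample lines are invented for this example, and a real HDInsight job would do this at scale with its own tooling, not with a single Python script.

```python
import re
from typing import Optional

# Assumed log format, invented for this sketch: "<date> <time> <LEVEL> <message>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def parse(line: str) -> Optional[dict]:
    """Turn one raw log line into a row, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


raw = [
    "2017-09-26 10:15:02 ERROR payment service timed out",
    "corrupted ### line",
    "2017-09-26 10:15:07 INFO order 1182 shipped",
]
rows = [row for row in map(parse, raw) if row is not None]
for row in rows:
    print(row)
```

The cleaned rows have a fixed set of columns, which is the shape a DW table expects.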

Another option is to use Azure Functions instead of HDInsight:


This serverless option can suit if the required manipulation of the unstructured data is very simple. This cannot be sophisticated – why re-invent the wheel of HDInsight?

Back to HDInsight:


Analytical dashboards can tap into some of the compute engines directly, e.g. tap into raw data to identify a trend or do ad-hoc analytics using queries/dashboards.

Facilitating Advanced Analytics

So you’ve got a modern DW that aggregates structured and unstructured data. You can write queries to look for information – but we want deeper insights.


The compute engines (HDInsight) enable you to use advanced analytics. Machine Learning can only be as good as the quality and quantity of data that you provide to it – preparing that data is the compute engine’s job. The more data machine learning has to learn from, the more accurate the analysis will be. If the data is clean, then garbage results won’t be produced. To do this with TBs or PBs of data, you will need the scale-out compute engine (HDInsight) – a VM just cannot do this.

Some organizations are so large or so specialized that they need even better engines to work with:


Azure Data Lake Store replaces blob storage for greater scales. Azure Data Lake Analytics replaces HDInsight and offers a developer-friendly T-SQL-like & C# environment. You can also write Python and R models. Azure Data Lake Analytics is serverless – there are no clusters as there are in HDInsight. You can focus on your service instead of being distracted by monitoring.

Note that HDInsight works with interactive queries against streaming data. Azure Data Lake is based on batch jobs.


You have the flexibility of choice for your big data compute engines:


Returning to the HDInsight scenario:


HDInsight, via Spark, can integrate with Cosmos DB. Data can be stored in Cosmos DB for users to consume. Also, data that users are generating and storing in Cosmos DB can be consumed by HDInsight for processing by advanced analytics, with learnings being stored back in Cosmos DB.


He opens an app on an iPhone. It’s a shoe sales app. The service is (in theory) using social media, fashion trends, weather, customer location, and more to make a prediction about what shoes the customer wants. Those shoes are presented to the customer, with the hope that this will simplify the shopping experience and lead to a sale on this app. When you pick a shoe style, the app predicts your favourite colour. If you view a shoe but don’t buy it, the app can automatically entice you with promotional offers – stock levels can be queried to see what kind of promotion is suitable – e.g. try to shift less popular stock by giving you a discount to do an in-store pickup where stock levels are too high and it would cost the company money to ship stock back to the warehouse. The customer might also be tempted to buy some more stuff when in the shop.

He then switches to the dashboard that the marketing manager of the shoe sales company would use. There’s lots of data visualization from the Modern DW, combining structured and unstructured data – the latter can come from social media sentiment, geo locations, etc. This sentiment can be tied to product category sales/profits. Machine learning can use the data to recommend promotional campaigns. In this demo, choosing one of these campaigns triggers a workflow in Dynamics to launch the campaign.

Here’s the solution architecture:


There are 3 data sources:

  • Unstructured data from monitoring the social and app environments – Azure Data Factory
  • Structured data from CRM (I think) – Azure Data Factory
  • Product & customer profile data from Cosmos DB (Service Fabric in front of it servicing the mobile apps).

HDInsight is consuming that data and applying machine learning using R Server. Data is being written back out to:

  • Blob storage
  • Cosmos DB – Spark integration

The DW consumes two data sources:

  • The data produced by HDInsight from blob storage
  • The transactional data from the sales transactions (Azure SQL DB)

Azure Analysis Services then provides the ability to consume the information in the DW for the Marketing Manager.

Enabling Real-Time Processing

This is when we start getting in IoT data, e.g. sensors – another source of unstructured data that can come in big and fast. We need to capture the data, analyse it, derive insights, and potentially do machine learning analysis to take actions on those insights.


Event Hubs can ingest this data and forward it to HDInsight – stream analysis can be done using Spark Streaming or Storm. Data can be analysed by Machine Learning and reported in real-time to users.

So the IoT data is:

  • Fed into HDInsight for structuring
  • Fed into Machine Learning for live reporting
  • Stored in Blob Storage
  • Consumed by the DW using Polybase for BI

There are alternatives to this IoT design.


You should use Azure IoT Hub if you want:

  • Device registration policies
  • Metadata about your devices to be stored

If you have some custom operations to perform, Azure HDInsight (Kafka) can scale up from millions of events per second. It can apply some custom logic that cannot be done by Event Hub or IoT Hub.

We also have flexibility of choice when it comes to processing.


Azure Stream Analytics gives you ease-of-use versus HDInsight. Instead of monitoring the health & performance of compute clusters, you can use Stream Analytics.

The Azure Platform

The platform of Azure wraps this package up:

  • ExpressRoute: Private SLA networking
  • Azure Data Factory: Orchestration of the data processing, not just ingestion.
  • Azure Key Vault: Securely storing secrets
  • Operations Management Suite: Monitoring & alerting

And now that your mind is warped, I’ll leave it there. I thought it was an excellent overview session.

Enable IoT Solutions with Windows 10 IoT Platform

Speakers: Adi Hariharan (Group Marketing Manager Windows IoT) and Jimmy Chen (Senior Consultant PDS Sales – IoT)

This is an introduction to IoT.

Cuts Across Industries

You cannot avoid IoT. It is spreading everywhere in all parts of life. Manufacturing, smart cities, transportation, retail, healthcare, energy, public safety, and agriculture according to the slide. But it’s in real life too … fridges, TVs, ambient devices, etc. IoT is one of the methods of Digital Transformation.

IoT is not new and it is complex, but Microsoft is trying to make it easier. Integrate the technology into the device (intelligent edge), and use the power of cloud to gather data, filter/process it, and make use of it (intelligent cloud).

Why Choose Windows IoT

MS has been in the embedded business for over 20 years. Lots of competition: Linux, Wind River, Ubuntu, Android Things, Raspberry, Red Hat, ARM mbed, RIOT, Tizen. Windows is mature. It can get to market faster, has security built in, and is designed for the intelligent edge.

Solution Journey

  1. Build
  2. Connect
  3. Scale
  4. Operate


Jimmy Chen takes over.

Building IoT devices isn’t that easy. “Building a device is as easy as building an app” – not really, but that is the aspiration. Build on Windows, using familiar tools (Visual Studio) and managing it using familiar tools. Things like speech, touch, and ink are embedded.

Visual Studio: C#, HTML/JS, C++ and more. One Dev Center for pulling back device information. Legacy applications can work on most editions of Windows.

Using Windows IoT embedded, drivers/firmware are easier. You use the system features and APIs to access the hardware, and use background services for long-running tasks. Specialized hardware still requires drivers, but this isn’t that common.

Windows 10 IoT innovations:

  • New SoCs, including Raspberry Pi 3.
  • Azure IoT Hub Device Provisioning
  • Azure IoT Hub Device Management
  • Project “Rome” remote device management
  • New controls and embedded features, standby, on-SOC PWM, NFC and more.
  • Productization resources
  • Turn-key security: Device Guard for IoT, Defender, BitLocker
  • App Services

App Servicing for IoT:

  • Windows Store
  • Install (Windows Store Preinstall Program and MDM)
  • Servicing (Windows Store)

We see a video from a service called Xogo: an app that can turn any Windows 10 device into a digital sign: www.xogo.io


There are two kinds of connections: connecting to the cloud and connecting to the legacy devices that you already have. Windows 10 IoT has all the APIs for connecting to the cloud built in. Every IoT device connects to the same URL in Azure. You have to configure which IoT Hub you need to connect to. The device is manufactured with an ID from the Azure IoT Hub Device Provisioning Service. The device is shipped to the customer and connects to the IoT Hub Device Provisioning Service, which knows which Azure IoT Hub to register the device to. Asymmetric keys are sent to the device, which then uses that information to connect directly to the IoT Hub.
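
The provisioning flow described above can be sketched as a toy simulation. Everything here is a simplification of my own: the class names, the hub names, and the device ID are hypothetical, and a random token stands in for the real key material. It is not the actual Device Provisioning Service API.

```python
import secrets


class IoTHub:
    """Toy hub: remembers which devices are registered and with which key."""

    def __init__(self):
        self.devices = {}  # device ID -> key

    def connect(self, device_id, key):
        return self.devices.get(device_id) == key


class ProvisioningService:
    """Toy stand-in for the provisioning service: maps device IDs to hubs."""

    def __init__(self):
        self.assignments = {}  # device ID -> hub name

    def enroll(self, device_id, hub_name):
        self.assignments[device_id] = hub_name

    def register(self, device_id, hubs):
        hub_name = self.assignments[device_id]  # raises KeyError for unknown devices
        key = secrets.token_hex(16)             # stand-in for real key material
        hubs[hub_name].devices[device_id] = key
        return hub_name, key


hubs = {"hub-europe": IoTHub(), "hub-us": IoTHub()}
dps = ProvisioningService()
dps.enroll("device-001", "hub-europe")            # at manufacturing time
hub_name, key = dps.register("device-001", hubs)  # on first boot at the customer
connected = hubs[hub_name].connect("device-001", key)
print(hub_name, connected)
```

The point of the indirection is that the device only needs to know one fixed endpoint when it leaves the factory. Which hub it ends up on is decided later, at registration time.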

Intelligence at the edge:

  • Modern, familiar UI development
  • Natural user interface and world sensing support
  • Edge compute: not just a sensor but the ability to do some tasks

Vision, speech, and sensor perception open up powerful industry scenarios using:

  • Windows sensor & perception APIs
  • Microsoft Cognitive Services
  • Computer Vision – OpenCV
  • Speech recognition and synthesis APIs
  • Bing Cloud Speech
  • Cortana and natural language understanding


You need cloud to scale to more than just a few devices. App servicing and telemetry via the Windows Store.

Building a trusted device/solution starts with the device:

  • TPM
  • Windows Device Attested Health – Device Health Attestation for IoT Core (public preview) – device uses TPM to measure configuration – reports to the MS Device Health Attestation service, and then you use the report in MDM
  • Secure Boot
  • BitLocker
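
The idea of measuring a configuration and comparing it with a known-good value can be illustrated with a simplified hash chain. This is only loosely modelled on how measured boot folds each component into a running value. The component names are made up, and this is not the actual TPM or Device Health Attestation protocol.

```python
import hashlib


def extend(register: bytes, component: bytes) -> bytes:
    """Fold one component's hash into the running measurement."""
    return hashlib.sha256(register + hashlib.sha256(component).digest()).digest()


def measure(components) -> bytes:
    """Measure a sequence of components, starting from an all-zero register."""
    register = bytes(32)
    for component in components:
        register = extend(register, component)
    return register


# Hypothetical known-good configuration recorded ahead of time.
expected = measure([b"bootloader v1", b"kernel v1", b"policy v1"])

healthy = measure([b"bootloader v1", b"kernel v1", b"policy v1"]) == expected
tampered = measure([b"bootloader v1", b"kernel EVIL", b"policy v1"]) == expected
print("healthy device passes:", healthy)
print("tampered device passes:", tampered)
```

Because every step depends on the previous value, changing or reordering any component changes the final measurement, so a management system only has to compare one value per device.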

Threat resistance:

  • Windows as a Service
  • Device Guard
  • Windows Firewall
  • Windows Defender

Data protection in-motion:

  • X.509/TLS-based handshake and encryption

Cloud Security

  • Encryption at rest
  • AAD
  • Key Vault
  • Policy-based access control
  • IP-based blocking
  • Secure Device Registration (above)
  • Standards-based best practices


  • Device management
  • Device recovery
  • Device-specific repudiation: device wipe or block the device.


This is all about managing the device. Enterprise scale is 100,000 devices apparently – typical MDM might be fine. When you go into millions of devices, you need to use the Azure IoT management solution: IoT Hub.

Why Microsoft IoT?

You can easily:

  • Build devices with Windows IoT Enterprise (smarter devices) and Windows IoT Core (basic devices)
  • Connect to Azure IoT with lots of ready services

You can build all the security, management, AI, etc yourself on another platform, but Windows 10 IoT and Azure have all that ready to use.

Example Customers

FarmBeats is a MS Research project to reduce the cost of farming and to make it smarter. Uses sensors, helium balloons with vision sensors, and drones. Comms based on the unused TV spectrum. They use a heatmap of the farm to plan crop management, using Azure Machine Learning.

Why Should Enterprise Care?

Opportunities to build smarts into all kinds of systems are possible. Businesses can consume and use data with AI-based intelligence, or can optimise existing dumb systems to improve production, reduce times, decrease waste, optimize human effort, etc.