{"id":20722,"date":"2017-10-02T05:38:52","date_gmt":"2017-10-02T04:38:52","guid":{"rendered":"https:\/\/aidanfinn.com\/?p=20722"},"modified":"2017-10-02T05:38:52","modified_gmt":"2017-10-02T04:38:52","slug":"understanding-big-data-on-azure-structured-unstructured-and-streaming","status":"publish","type":"post","link":"https:\/\/aidanfinn.com\/?p=20722","title":{"rendered":"Understanding Big Data on Azure&ndash;Structured, Unstructured, and Streaming"},"content":{"rendered":"<p>These are my notes from the recording of this Ignite 2017 session, BRK2293.<\/p>\n<p>Speaker: Nishant Thacker, Technical Product Manager \u2013 Big Data<\/p>\n<p>This Level-200 <em>overview<\/em> session is a tour of big data in Azure; it explains why the services were created and what their purpose is. It is a foundation for the rest of the related sessions at Ignite. Interestingly, only about 30% of the audience had done any big data work in the past \u2013 I fall into the other 70%.<\/p>\n<h2>What is Big Data?<\/h2>\n<ul>\n<li>Definition: A term for data sets that are so large or <strong><em><u>complex<\/u><\/em><\/strong> that traditional data processing software is inadequate to deal with them. Nishant stresses \u201ccomplex\u201d.<\/li>\n<li>Challenges: Capturing data (velocity), data storage, data analysis, search, sharing, visualization, querying, updating, and information privacy. 
So \u2026 there are a few challenges <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/wlEmoticon-smile.png\" alt=\"Smile\" \/><\/li>\n<\/ul>\n<h2>The Azure Data Landscape<\/h2>\n<p>This slide is referred to for quite a while:<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb.png\" alt=\"image\" width=\"600\" height=\"342\" border=\"0\" \/><\/a><\/p>\n<h2 align=\"left\">Data Ingestion<\/h2>\n<p align=\"left\">He starts with the top-left corner:<\/p>\n<ul>\n<li>\n<div align=\"left\">Azure Data Factory<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Import\/Export Service<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure CLI<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure SDK<\/div>\n<\/li>\n<\/ul>\n<p>The first problem we have is ingesting data into the cloud \u2013 or any system. How do you manage that? Azure can manage the ingestion of data.<\/p>\n<p>Azure Data Factory is a scheduling, orchestration, and ingestion service. It allows us to create sophisticated data pipelines from the ingestion of the data, through processing and storing, to making it available for end users to access. It does not have compute power of its own; it taps into other Azure services to deliver any required compute.<\/p>\n<p>The Azure Import\/Export service can help bring incremental data on board. You can also use it to bulk-load data into Azure. If you have terabytes of data to upload, bandwidth might not be enough. You can securely courier data via disk to an Azure region.<\/p>\n<p>The Azure CLI is designed for bulk uploads to happen in parallel. 
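<\/p>
<p>That parallel pattern can be sketched in a few lines of Python \u2013 upload_one() here is a stand-in, not a real Azure API call:<\/p>
<pre>
```python
# Sketch of parallel bulk upload; upload_one() is a placeholder,
# not a real Azure call.
from concurrent.futures import ThreadPoolExecutor

def upload_one(name):
    # A real tool would transfer the file to cloud storage here.
    return (name, 'uploaded')

def bulk_upload(names, workers=8):
    # Upload many files concurrently instead of one at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(upload_one, names))

results = bulk_upload(['log%d.csv' % i for i in range(100)])
```
<\/pre>
<p>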
The SDKs can be put into your code, so you can generate the data in your application in the cloud instead of uploading it to the cloud.<\/p>\n<h2>Operational Database Services<\/h2>\n<ul>\n<li>Azure SQL DB<\/li>\n<li>Azure Cosmos DB<\/li>\n<\/ul>\n<p>The SQL database offers SQL Server, MySQL, and PostgreSQL.<\/p>\n<p>Cosmos DB is the more interesting one \u2013 it\u2019s NoSQL and offers global storage. It also supports 4 programming models: Mongo, Gremlin\/Graph, SQL (DocumentDB), and Table. You have the flexibility to bring in data in its native form, and data can be accessed in an operational environment. Cosmos DB also plugs into other aspects of Azure, such as Azure Functions and Spark (HDInsight), which make it more than just an operational database.<\/p>\n<h2>Analytical Data Warehouse<\/h2>\n<ul>\n<li>Azure SQL Data Warehouse<\/li>\n<\/ul>\n<p>When you want to do reporting and dashboards from data in operational databases, you will need an analytical data warehouse that aggregates data from many sources.<\/p>\n<p>Traits of Azure SQL Data Warehouse:<\/p>\n<ul>\n<li>Can grow, shrink, and pause in seconds \u2013 up to 1 Petabyte<\/li>\n<li>Full enterprise-class SQL Server \u2013 meaning you can migrate databases and bring your scripts with you.<\/li>\n<li>
Independent scale of compute and storage in seconds<\/li>\n<li>Seamless integration with Power BI, Azure Machine Learning, HDInsight, and Azure Data Factory<\/li>\n<\/ul>\n<h2>NoSQL Data<\/h2>\n<ul>\n<li>Azure Blob storage<\/li>\n<li>Azure Data Lake Store<\/li>\n<\/ul>\n<p>When your data doesn\u2019t fit into the rows-and-columns structure of a traditional database, you need specialized big data stores \u2013 for capacity and for unstructured sorting\/reading.<\/p>\n<h2>Unstructured Data Compute Engines<\/h2>\n<ul>\n<li>Azure Data Lake Analytics<\/li>\n<li>Azure HDInsight (Spark \/ Hadoop): managed clusters of Hadoop and Spark with enterprise-level SLAs and lower TCO than on-premises deployments.<\/li>\n<\/ul>\n<p>When you get data into big unstructured stores such as Blob or Data Lake, you need specialized compute engines for the complexity and volume of the data. This compute must be capable of scaling out because you cannot wait hours\/days\/months to analyse the data.<\/p>\n<h2>Ingest Streaming Data<\/h2>\n<ul>\n<li>Azure IoT Hub<\/li>\n<li>Azure Event Hubs<\/li>\n<li>Kafka on Azure HDInsight<\/li>\n<\/ul>\n<p>How do you ingest this real-time data as it is generated? You can tap into event generators (e.g. devices) and buffer up data for your processing engines.<\/p>\n<h2>Stream Processing Engines<\/h2>\n<ul>\n<li>Azure Stream Analytics<\/li>\n<li>Storm and Spark Streaming on Azure HDInsight<\/li>\n<\/ul>\n<p>These systems allow you to process streaming data on the fly. You have a choice between \u201ceasy\u201d and \u201copen source extensibility\u201d with these solutions.<\/p>\n<h2>Reporting and Modelling<\/h2>\n<ul>\n<li>Azure Analysis Services<\/li>\n<li>Power BI<\/li>\n<\/ul>\n<p>You have cleansed and curated the data, but what do you do with it? Now you want some insights from it. 
Reporting &amp; modelling is the first level of these insights.<\/p>\n<h2>Advanced Analytics<\/h2>\n<ul>\n<li>Azure Machine Learning<\/li>\n<li>ML Server (R)<\/li>\n<\/ul>\n<p>The basics of reporting and modelling are not new. Now we are getting into advanced analytics. Using data from these AI systems, we can predict outcomes or prescribe recommended actions.<\/p>\n<h2 align=\"left\">Deep Learning<\/h2>\n<ul>\n<li>\n<div align=\"left\">Cognitive Services<\/div>\n<\/li>\n<li>\n<div align=\"left\">Bot Service<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">These toolkits take advanced analytics a level further.<\/p>\n<h2 align=\"left\">Tracking Data<\/h2>\n<ul>\n<li>\n<div align=\"left\">Azure Search<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Data Catalog<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">When you have such a large data estate, you need ways to track what you have, and to be able to search it.<\/p>\n<h2 align=\"left\">The Azure Platform<\/h2>\n<ul>\n<li>\n<div align=\"left\">ExpressRoute<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure AD<\/div>\n<\/li>\n<li>\n<div align=\"left\">Network Security Groups<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Key Management Service<\/div>\n<\/li>\n<li>\n<div align=\"left\">Operations Management Suite<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Functions (serverless compute)<\/div>\n<\/li>\n<li>\n<div align=\"left\">Visual Studio<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">You need a platform that delivers these enterprise capabilities in a compliant manner.<\/p>\n<h2>Big Data Services<\/h2>\n<p>Nishant says that the darker-shaded services are the ones usually being talked about when people talk about Big Data:<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-1.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: none;\" title=\"image\" 
src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-1.png\" alt=\"image\" width=\"600\" height=\"385\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">To understand what all these services are doing as a whole, and why Microsoft has gotten into Big Data, we have to step all the way back. There are 3 high-level trends that are a kind of industrial revolution, making data a commodity:<\/p>\n<ul>\n<li>\n<div align=\"left\">Cloud<\/div>\n<\/li>\n<li>\n<div align=\"left\">Data<\/div>\n<\/li>\n<li>\n<div align=\"left\">AI<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">We are on the cusp of an era where every action produces data.<\/p>\n<h2 align=\"left\">The Modern Data Estate<\/h2>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-2.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-2.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">There are 2 principles:<\/p>\n<ul>\n<li>\n<div align=\"left\">Data on-premises <em>and<\/em><\/div>\n<\/li>\n<li>\n<div align=\"left\">Data in the cloud<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">Few organizations are just one or the other; most span both locations. Data warehouses aggregate operational databases. 
Data Lakes store the data used for AI, and will be used to answer the questions that we don\u2019t even know to ask today.<\/p>\n<p align=\"left\">We need three capabilities for this AI functionality:<\/p>\n<ul>\n<li>\n<div align=\"left\">The ability to reason over this data from anywhere<\/div>\n<\/li>\n<li>\n<div align=\"left\">The flexibility to choose \u2013 MS (simplicity &amp; ease of use), open source (wider choice), programming models, etc.<\/div>\n<\/li>\n<li>\n<div align=\"left\">Security &amp; privacy, e.g. GDPR<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">Microsoft has offerings both on-premises and in Azure, spanning MS code and open source, with AI built in as a feature.<\/p>\n<h2 align=\"left\">Evolution of the Data Warehouse<\/h2>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-3.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-3.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">There are 3 core scenarios that use Big Data:<\/p>\n<ul>\n<li>\n<div align=\"left\">Modern DW: Modernizing the old concept of a DW to consume data from lots of sources, including complex (big) data<\/div>\n<\/li>\n<li>\n<div align=\"left\">Advanced Analytics: Make predictions from data using Deep Learning (AI)<\/div>\n<\/li>\n<li>\n<div align=\"left\">IoT: Get real-time insights from data produced by devices<\/div>\n<\/li>\n<\/ul>\n<h2 align=\"left\">Implementing Big Data &amp; Data Warehousing in Azure<\/h2>\n<p align=\"left\">Here is a traditional DW:<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-4.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: 
none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-4.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p>Data from operational databases is fed into a single DW. Some analysis is done, and information is reported\/visualized for users.<\/p>\n<p>SQL Server Integration Services, a part of the Azure Data Factory, allows you to consume data from your multiple operational assets and aggregate them in a DW.<\/p>\n<p>Azure Analysis Services allows you to build tabular models for your BI needs, and Power BI can be used to report on and visualize those models.<\/p>\n<p>If you have existing huge repositories of data that you want to bring into a DW, then you can use:<\/p>\n<ul>\n<li>Azure CLI<\/li>\n<li>Azure Data Factory<\/li>\n<li>BCP Command Line Utility<\/li>\n<li>SQL Server Integration Services<\/li>\n<\/ul>\n<p>This traditional model breaks when some of your data is unstructured. For example:<\/p>\n<p><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-5.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-5.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Structured operational data is coming in from Azure SQL DB as before.<\/p>\n<p align=\"left\">Log files and media files are coming into blob storage as unstructured data &#8211; the structure of queries is unknown and the capacity is enormous.\u00a0 That unstructured data breaks your old system, but you still need to ingest it because you know that there are insights in it.<\/p>\n<p align=\"left\">Today, you might only know some questions that you\u2019d like to ask of the unstructured data. But later on, you might have more queries that you\u2019d like to create. 
The vast economies of scale of Azure storage make this feasible.<\/p>\n<p align=\"left\">ExpressRoute will be used to ingest data from an enterprise if:<\/p>\n<ul>\n<li>\n<div align=\"left\">You have security\/compliance concerns<\/div>\n<\/li>\n<li>\n<div align=\"left\">There is simply too much data for normal Internet connections<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">Back to the previous unstructured data scenario. If you are curating the data so it is filtered\/clean\/useful, then you can use Polybase to ingest it into the DW. Normally, that task of cleaning\/filtering\/curating is too huge for you to do on the fly.<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-6.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-6.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">HDInsight can tap into the unstructured blob storage to clean\/curate\/process it before it is ingested into the DW.<\/p>\n<p align=\"left\">What does HDInsight allow you to do? You forget whether the data was structured\/unstructured\/semi-structured. You forget the complexity of the analytical queries that you want to write. You forget the kinds of questions you would like to ask of the data. HDInsight allows you to add structure to the data using some of its tools. 
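<\/p>
<p align=\"left\">A toy example of what \u201cadding structure\u201d means \u2013 turning raw log lines into rows a warehouse could ingest (the log format here is invented for illustration):<\/p>
<pre>
```python
# Turn raw, unstructured log lines into structured rows.
# The log format is invented for illustration.
raw = [
    '2017-10-02 05:38:52 GET /shoes 200',
    '2017-10-02 05:39:10 POST /cart 201',
]

def structure(line):
    # Split a free-text line into named, typed fields.
    date, time, verb, path, status = line.split()
    return {'ts': date + ' ' + time, 'verb': verb,
            'path': path, 'status': int(status)}

rows = [structure(line) for line in raw]
```
<\/pre>
<p align=\"left\">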
Once the data is structured, you can import it into the DW using Polybase.<\/p>\n<p align=\"left\">Another option is to use Azure Functions instead of HDInsight:<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-7.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-7.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">This serverless option suits scenarios where the required manipulation of the unstructured data is very simple. It cannot be sophisticated \u2013 why re-invent the wheel of HDInsight?<\/p>\n<p align=\"left\">Back to HDInsight:<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-8.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-8.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Analytical dashboards can tap into some of the compute engines directly, e.g. tap into raw data to identify a trend or do ad-hoc analytics using queries\/dashboards.<\/p>\n<h2 align=\"left\">Facilitating Advanced Analytics<\/h2>\n<p align=\"left\">So you\u2019ve got a modern DW that aggregates structured and unstructured data. 
You can write queries to look for information \u2013 but we want deeper insights.<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-9.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-9.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">The compute engines (HDInsight) enable you to use advanced analytics. Machine Learning can only be as good as the quality and quantity of data that you provide to it \u2013 the compute engine\u2019s job. The more data machine learning has to learn from, the more accurate the analysis will be. If the data is clean, then garbage results won\u2019t be produced. To do this with TBs or PBs of data, you will need the scale-out compute engine (HDInsight) \u2013 a single VM just cannot do this.<\/p>\n<p align=\"left\">Some organizations are so large or so specialized that they need even better engines to work with:<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-10.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-10.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Azure Data Lake Store replaces blob storage at greater scales. Azure Data Lake Analytics replaces HDInsight and offers a developer-friendly T-SQL-like &amp; C# environment. You can also write Python and R models. Azure Data Lake Analytics is serverless \u2013 there are no clusters as there are in HDInsight. You can focus on your service instead of being distracted by monitoring.<\/p>\n<p align=\"left\">Note that HDInsight works with interactive queries against streaming data. 
Azure Data Lake is based on batch jobs.<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-11.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-11.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">You have the flexibility of choice for your big data compute engines:<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-12.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-12.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Returning to the HDInsight scenario:<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-13.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-13.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">HDInsight, via Spark, can integrate with Cosmos DB. Data can be stored in Cosmos DB for users to consume. Also, data that users are generating and storing in Cosmos DB can be consumed by HDInsight for processing by advanced analytics, with learnings being stored back in Cosmos DB.<\/p>\n<h2 align=\"left\">Demo<\/h2>\n<p align=\"left\">He opens an app on an iPhone. It\u2019s a shoe sales app. 
The service is (in theory) using social media, fashion trends, weather, customer location, and more to make a prediction about what shoes the customer wants. Those shoes are presented to the customer, with the hope that this will simplify the shopping experience and lead to a sale on this app. When you pick a shoe style, the app predicts your favourite colour. If you view a shoe, but don\u2019t buy it, the app can automatically entice you with promotional offers. Stock levels can be queried to see what kind of promotion is suitable \u2013 e.g. trying to shift less popular stock by giving you a discount for an in-store pickup where stock levels are too high and it would cost the company money to ship stock back to the warehouse. The customer might also be tempted to buy some more stuff when in the shop.<\/p>\n<p align=\"left\">He then switches to the dashboard that the marketing manager of the shoe sales company would use. There\u2019s lots of data visualization from the Modern DW, combining structured and unstructured data \u2013 the latter can come from social media sentiment, geo locations, etc. This sentiment can be tied to product category sales\/profits. Machine learning can use the data to recommend promotional campaigns. 
In this demo, choosing one of these campaigns triggers a workflow in Dynamics to launch the campaign.<\/p>\n<p align=\"left\">Here\u2019s the solution architecture:<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-14.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-14.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">There are 3 data sources:<\/p>\n<ul>\n<li>\n<div align=\"left\">Unstructured data from monitoring the social and app environments \u2013 Azure Data Factory<\/div>\n<\/li>\n<li>\n<div align=\"left\">Structured data from CRM (I think) \u2013 Azure Data Factory<\/div>\n<\/li>\n<li>\n<div align=\"left\">Product &amp; customer profile data from Cosmos DB (Service Fabric in front of it servicing the mobile apps).<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">HDInsight is consuming that data and applying machine learning using R Server. Data is being written back out to:<\/p>\n<ul>\n<li>\n<div align=\"left\">Blob storage<\/div>\n<\/li>\n<li>\n<div align=\"left\">Cosmos DB \u2013 Spark integration<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">The DW consumes two data sources:<\/p>\n<ul>\n<li>\n<div align=\"left\">The data produced by HDInsight from blob storage<\/div>\n<\/li>\n<li>\n<div align=\"left\">The transactional data from the sales transactions (Azure SQL DB)<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">Azure Analysis Services then provides the ability to consume the information in the DW for the Marketing Manager.<\/p>\n<h2 align=\"left\">Enabling Real-Time Processing<\/h2>\n<p align=\"left\">This is when we start getting IoT data in, e.g. from sensors \u2013 another source of unstructured data that can come in big and fast. 
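<\/p>
<p align=\"left\">The kind of on-the-fly computation a stream processor performs can be sketched as a tumbling-window average \u2013 the sensor readings here are invented for illustration:<\/p>
<pre>
```python
# Tumbling-window average over a simulated sensor stream.
# Each event is (timestamp_seconds, reading); values are invented.
events = [(0, 20.0), (1, 21.0), (5, 30.0), (6, 32.0), (11, 25.0)]

def tumbling_avg(stream, width=5):
    # Group events into fixed, non-overlapping time windows,
    # then average the readings within each window.
    windows = {}
    for ts, value in stream:
        windows.setdefault(ts // width, []).append(value)
    return {w: sum(vs) / len(vs) for w, vs in sorted(windows.items())}

print(tumbling_avg(events))  # {0: 20.5, 1: 31.0, 2: 25.0}
```
<\/pre>
<p align=\"left\">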
We need to capture the data, analyse it, derive insights, and potentially do machine learning analysis to take actions on those insights.<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-15.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-15.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Event Hubs can ingest this data and forward it to HDInsight \u2013 stream analysis can be done using Spark Streaming or Storm. Data can be analysed by Machine Learning and reported in real time to users.<\/p>\n<p align=\"left\">So the IoT data is:<\/p>\n<ul>\n<li>\n<div align=\"left\">Fed into HDInsight for structuring<\/div>\n<\/li>\n<li>\n<div align=\"left\">Fed into Machine Learning for live reporting<\/div>\n<\/li>\n<li>\n<div align=\"left\">Stored in Blob Storage.<\/div>\n<\/li>\n<li>\n<div align=\"left\">Consumed by the DW using Polybase for BI<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">There are alternatives to this IoT design.<\/p>\n<p align=\"center\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-16.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; display: inline; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-16.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">You should use Azure IoT Hub if you want:<\/p>\n<ul>\n<li>\n<div align=\"left\">Device registration policies<\/div>\n<\/li>\n<li>\n<div align=\"left\">Metadata about your devices to be stored<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">If you have some custom operations to perform, Azure HDInsight (Kafka) can scale to millions of events 
per second. It can apply some custom logic that cannot be done by Event Hubs or IoT Hub.<\/p>\n<p align=\"left\">We also have flexibility of choice when it comes to processing.<\/p>\n<p align=\"left\"><a href=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image-17.png\"><img loading=\"lazy\" decoding=\"async\" style=\"border: 0px currentcolor; margin-right: auto; margin-left: auto; float: none; display: block; background-image: none;\" title=\"image\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/image_thumb-17.png\" alt=\"image\" width=\"600\" height=\"400\" border=\"0\" \/><\/a><\/p>\n<p align=\"left\">Azure Stream Analytics gives you ease of use versus HDInsight. Instead of monitoring the health &amp; performance of compute clusters, you can use Stream Analytics.<\/p>\n<h2 align=\"left\">The Azure Platform<\/h2>\n<p align=\"left\">The Azure platform wraps this package up:<\/p>\n<ul>\n<li>\n<div align=\"left\">ExpressRoute: Private SLA networking<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Data Factory: Orchestration of the data processing, not just ingestion.<\/div>\n<\/li>\n<li>\n<div align=\"left\">Azure Key Vault: Securely storing secrets<\/div>\n<\/li>\n<li>\n<div align=\"left\">Operations Management Suite: Monitoring &amp; alerting<\/div>\n<\/li>\n<\/ul>\n<p align=\"left\">And now that your mind is warped, I\u2019ll leave it there <img decoding=\"async\" class=\"wlEmoticon wlEmoticon-smile\" src=\"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/wlEmoticon-smile.png\" alt=\"Smile\" \/> I thought it was an excellent overview session.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>These are my notes from the recording of this Ignite 2017 session, BRK2293. Speaker: Nishant Thacker, Technical Product Manager \u2013 Big Data This Level-200 overview session is a tour of big data in Azure; it explains why the services were created and what their purpose is. 
It is a foundation for the rest of the &hellip; <a href=\"https:\/\/aidanfinn.com\/?p=20722\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Understanding Big Data on Azure&ndash;Structured, Unstructured, and Streaming&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":20724,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[14],"tags":[170,224,218,222,176,177,219,203,201,223,221,220],"class_list":["post-20722","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-eventnotes","tag-azure","tag-data-warehouse","tag-dig-data","tag-event-hub","tag-eventnotes","tag-events","tag-hdinsight","tag-ignite","tag-iot","tag-iot-hub","tag-machine-learning","tag-stream-analytics"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/aidanfinn.com\/wp-content\/uploads\/2017\/10\/Azure-HDInsight.png","amp_enabled":true,"_links":{"self":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts\/20722","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=20722"}],"version-history":[{"count":2,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/posts\/20722\/revisions"}],"predecessor-version":[{"id":20725,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/post
s\/20722\/revisions\/20725"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=\/wp\/v2\/media\/20724"}],"wp:attachment":[{"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=20722"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=20722"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aidanfinn.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=20722"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}