Microsoft Azure Big Data Analytics
TRANSCRIPT
Big Data Analytics in the Cloud: Microsoft Azure
Cortana Intelligence Suite
Mark Kromer, Microsoft Azure Cloud Data Architect
@kromerbigdata / @mssqldude
What is Big Data Analytics?
Tech Target: "… the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information."
Techopedia: "… the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions."
Requires lots of data wrangling and Data Engineers
Requires Data Scientists to uncover patterns from complex raw data
Requires Business Analysts to provide business value from multiple data sources
Requires additional tools and infrastructure not provided by traditional database and BI technologies
Why Cloud for Big Data Analytics?
• Quick and easy to stand up new, large big data architectures
• Elastic scale
• Metered pricing
• Quickly evolve architectures for rapidly changing landscapes
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite: Azure Data Platform at a glance
Cortana Intelligence Suite
• Action: People, Automated Systems, Apps (Web, Mobile, Bots)
• Intelligence: Cortana, Bot Framework, Cognitive Services
• Dashboards & Visualizations: Power BI
• Information Management: Event Hubs, Data Catalog, Data Factory
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Data Sources: Apps, Sensors and devices, Data
Microsoft Azure
What it is: Microsoft's Cloud Platform, including IaaS, PaaS, and SaaS.
When to use it:
• Storage and Data
• Networking
• Security
• Services
• Virtual Machines
• On-demand resources and services
Azure Data Factory
What it is: A pipeline system to move data in, perform activities on data, move data around, and move data out.
When to use it:
• Create solutions using multiple tools as a single process
• Orchestrate and schedule processes
• Monitor and manage pipelines
• Call and re-train Azure ML models
ADF Components
• Activity: a processing step (a Hadoop job, custom code, an ML model, etc.)
• Data Set: a collection of files, a DB table, etc.
• Pipeline: a logical group of activities
ADF Logical Flow
• Ingest → Transform & Analyze → Publish
Example: Churn
Call log files in Azure Blob Storage and a customer table in an on-premises data mart are ingested, then moved, transformed, and combined, and the result is published to a Customer Churn Table in an Azure DB identifying customers likely to churn, ready to act on (visualize).
Simple ADF:
• Business goal: transform and analyze web logs each month
• Design process: transform raw weblogs using a Hive query, storing the results in Blob Storage
• Flow: web logs are loaded to Blob Storage; an HDInsight Hive query transforms the log entries; the files are then ready for analysis and use in Azure ML
PowerShell ADF Example
1. Add-AzureAccount and enter the user name and password.
2. Get-AzureSubscription to view all the subscriptions for this account.
3. Select-AzureSubscription to select the subscription that you want to work with.
4. Switch-AzureMode AzureResourceManager
5. New-AzureResourceGroup -Name ADFTutorialResourceGroup -Location "West US"
6. New-AzureDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name DataFactory(your alias)Pipeline -Location "West US"
Using Visual Studio
• Use in mature dev environments
• Use when integrated into a larger development process
SQL Data Warehouse
What it is: A scaling data warehouse service in the cloud (MPP SQL Server in the cloud; elastic-scale data warehousing).
When to use it:
• When you need a large-data BI solution in the cloud
• When you need pause-able, scale-out compute
Elastic scale & performance
• Real-time elasticity: resize in under 1 minute
• On-demand compute: expand or reduce as needed
• Pause the data warehouse to save on compute costs (e.g., during non-business hours)
• Storage can be as big or small as required
• Users can execute niche workloads without re-scanning data
Elastic scale & performance: Scale
• Logical overview: a Control node sits in front of the Compute nodes, which sit over Storage
• Distributed queries: a query arrives at the Control node, is distributed across the Compute nodes against Storage, and the result is returned
Simple Example:
SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales];
The Control node sends this same query to every Compute distribution; each one computes a partial count over its own slice of the table.
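To make the fan-out concrete, the single COUNT_BIG above can be read as a two-phase aggregate. The rewrite below is only a conceptual sketch of what the MPP engine does, not the literal plan it generates:

```sql
-- Conceptual sketch only (not the actual generated plan):
-- each compute distribution counts the rows in its slice of the table,
-- and the Control node sums the partial counts into the final answer.
SELECT SUM(partial_count) AS total_count
FROM (
    SELECT COUNT_BIG(*) AS partial_count   -- runs once per distribution
    FROM dbo.[FactInternetSales]
) AS partials;
```

This is why the repeated query fans out cheaply: each partial count touches only local data, and only tiny partial results travel back to the Control node.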
Data Lake
What it is: Data storage (WebHDFS) and distributed data-processing engines (Hive, Spark, HBase, Storm, U-SQL).
When to use it:
• Low-cost, high-throughput data store
• Non-relational data
• Larger storage limits than Blobs
The Data Lake approach
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop and ADLA: interactive queries, batch queries, machine learning, data warehouse, real-time analytics
(Diagram: devices and other sources feed WebHDFS storage in the Data Lake Store; YARN runs U-SQL via ADL Analytics and Hive via ADL HDInsight over the shared store.)
Azure Data Lake (Store, HDInsight, Analytics)
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
Optimized for analytic workload PERFORMANCE
ENTERPRISE GRADE authentication, access control, audit, encryption at rest
Azure Data Lake Store: a hyperscale repository for big data analytics workloads
Introducing ADLS
• No fixed limits on: amount of data stored; how long data can be stored; number of files; size of the individual files; ingestion/egress throughput
• Seamlessly scales from a few KBs to several PBs
No limits to scale. No limits to storage.
• Each file in ADL Store is sliced into blocks
• Blocks are distributed across multiple data nodes in the backend storage system
• With a sufficient number of backend storage data nodes, files of any size can be stored
• Backend storage runs in the Azure cloud, which has virtually unlimited resources
• Metadata is stored about each file; no limit to metadata either
(Diagram: an Azure Data Lake Store file is split into Block 1, Block 2, …, each placed on a backend-storage data node.)
Massive throughput
• Through read parallelism, ADL Store provides massive throughput
• Each read operation on an ADL Store file results in multiple read operations executed in parallel against the backend storage data nodes
(Diagram: one read of a file fans out in parallel across its blocks on the data nodes.)
Enterprise-grade security
• Enterprise-grade security permits even sensitive data to be stored securely
• Regulatory compliance can be enforced
• Integrates with Azure Active Directory for authentication
• Data is encrypted at rest and in flight
• POSIX-style permissions on files and directories
• Audit logs for all operations
Enterprise-grade availability and reliability
• Azure maintains 3 replicas of each data object per region, across three fault and upgrade domains
• Each create or append operation on a replica is replicated to the other two
• Writes are committed to the application only after all replicas are successfully updated
• Read operations can go against any replica
• Provides read-after-write consistency
Data is never lost or unavailable, even under failures.
(Diagram: a write goes to Replica 1 and is replicated to Replicas 2 and 3 across fault/upgrade domains before the commit.)
Enterprise-grade
Limitless scale
Productivity from day one
Easy and powerful data preparation
All data
Azure Data Lake Analytics
Developing big data apps
• Author, debug, & optimize big data apps in Visual Studio
• Multiple languages: U-SQL, Hive, & Pig
• Seamlessly integrate .NET
Work across all cloud data
Azure Data Lake Analytics works with Azure SQL DW, Azure SQL DB, Azure Storage Blobs, Azure Data Lake Store, and SQL DB in an Azure VM.
Simplified management and administration
• Web-based management in the Azure Portal
• Automate tasks using PowerShell
• Role-based access control with Azure AD
• Monitor service operations and activity
What is U-SQL?
A hyper-scalable, highly extensible language for preparing, transforming, and analyzing all data. It allows users to focus on the what, not the how, of business problems. Built on familiar languages (SQL and C#) and supported by a fully integrated development environment. Built for data developers & scientists.
U-SQL language philosophy
Declarative query and transformation language:
• Uses SQL's SELECT FROM WHERE with GROUP BY/aggregation, joins, and SQL analytics functions
• Optimizable, scalable
Operates on unstructured & structured data:
• Schema-on-read over files
• Relational metadata objects (e.g., database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language is C#
• User-defined functions (U-SQL and C#)
• User-defined types (U-SQL/C#) (future)
• User-defined aggregators (C#)
• User-defined operators (UDOs) (C#)
• U-SQL provides the parallelization and scale-out framework for user code: EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER
Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;

CREATE TABLE T (cid int, first_order DateTime, last_order DateTime, order_count int, order_amount float);

@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

@j = SELECT c.cid,
            MIN(o.odate) AS firstorder,
            MAX(o.odate) AS lastorder,
            COUNT(o.oid) AS ordercnt,
            SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model
• Execution plan that is optimized out of the box and without user intervention
• Per-job and user-driven parallelization
• Detailed visibility into execution steps, for debugging
• Heat-map functionality to identify performance bottlenecks
"Unstructured" Files
• Schema-on-read
• Write to file
• Built-in and custom Extractors and Outputters
• ADL Storage and Azure Blob Storage
EXTRACT expression:
@s = EXTRACT a string, b int
     FROM "filepath/file.csv"
     USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text, with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT expression:
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs:
• Relative URI to the default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
  • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
  • WASB: "wasb://container@account/filepath/file.csv"
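Combining the pieces above, a sketch of the same EXTRACT/OUTPUT pair aimed at absolute URIs; the account and container names are placeholders, not real resources:

```sql
// "mycontainer" and "myaccount" below are hypothetical names.
@s = EXTRACT a string,
             b int
     FROM "wasb://mycontainer@myaccount/filepath/file.csv"   // Azure Blob Storage
     USING Extractors.Csv(encoding: Encoding.Unicode);

// Write the rows back out via an absolute adl:// URI to the ADLS account.
OUTPUT @s
TO "adl://myaccount.azuredatalakestore.net/output/file.csv"
USING Outputters.Csv();
```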
Managing Assemblies
• Create, reference, enumerate, and drop assemblies
• Visual Studio makes registration easy!
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .NET Framework assemblies:
  • Always-accessible system namespaces:
    • U-SQL specific (e.g., for SQL.MAP)
    • All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
  • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating assemblies:
  • PowerShell command
  • U-SQL Studio Server Explorer
• DROP ASSEMBLY db.assembly;
USING clause: 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.
Allows shortening and disambiguating C# namespace and class names.
Examples:
DECLARE @input string = "somejsonfile.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;
@data0 = EXTRACT IPAddresses string FROM @input USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];
@data1 = EXTRACT IPAddresses string FROM @input USING new json("Devices[*]");
File Sets
• Simple patterns and virtual columns
• Only on EXTRACT for now (on OUTPUT by end of year)
Simple pattern language on filename and path:
DECLARE @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns, date and suffix
• Wildcards the filename
• Limits on the number of files (the current limit of 800 is being raised to 3,000 in the next refresh)
Virtual columns:
EXTRACT
    name string,
    suffix string, // virtual column
    date DateTime  // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query predicates to get partition elimination (otherwise you will get a warning)
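Putting the pattern and virtual columns together, a sketch of a query whose predicate on the date virtual column enables partition elimination; the paths and schema are illustrative:

```sql
DECLARE @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";

@logs =
    EXTRACT name string,
            suffix string,  // virtual column bound from the pattern
            date DateTime   // virtual column bound from the pattern
    FROM @pattern
    USING Extractors.Csv();

// The predicate on the virtual column lets the compiler read only the
// files under /input/2016/07/... rather than the whole file set.
@july =
    SELECT name, suffix
    FROM @logs
    WHERE date >= new DateTime(2016, 7, 1) && date < new DateTime(2016, 8, 1);

OUTPUT @july TO "/output/july.csv" USING Outputters.Csv();
```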
U-SQL Catalog
• Naming, discovery, sharing, securing
Naming:
• Default database and schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in the ADL Storage /catalog folder
Discovery:
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
Sharing:
• Within an Azure Data Lake Analytics account
Securing:
• Secured with AAD principals at the catalog and database level
VIEWs and TVFs
• Views for simple cases; TVFs for parameterization and most cases
Views:
CREATE VIEW V AS EXTRACT …;
CREATE VIEW V AS SELECT …;
• Cannot contain user-defined objects (e.g., UDFs or UDOs)!
• Will be inlined
Table-Valued Functions (TVFs):
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user code (needs assembly reference)
• Will always be inlined
• Infers the schema or checks against the specified return schema
Procedures
CREATE PROCEDURE P (@arg string = "default")
AS BEGIN
    …;
    OUTPUT @res TO …;
    INSERT INTO T …;
END;
• Provides parameterization
• No result, but writes into a file or table
• Can contain multiple statements
• Can contain user code (needs assembly reference)
• Will always be inlined
• Can contain DDL (but no CREATE/DROP FUNCTION or PROCEDURE)
• Allows encapsulation of U-SQL scripts
Tables
• CREATE TABLE and CREATE TABLE AS SELECT
CREATE TABLE T (
    col1 int,
    col2 string,
    col3 SQL.MAP<string,string>,
    INDEX idx CLUSTERED (col2 ASC)
    PARTITION BY (col1)
    DISTRIBUTED BY HASH (col2)
);
• Structured data, built-in data types only (no UDTs)
• Clustered index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified): HASH, DIRECT HASH, RANGE, ROUND ROBIN
• Addressable partitions (optional)
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infers the schema from the query
• Still requires the index and distribution (does not support partitioning)
When to use Tables
Benefits of table clustering and distribution:
• Faster lookup of data provided by distribution and clustering when the right distribution/cluster is chosen
• Data distribution provides better localized scale-out
• Used for filters, joins, and grouping
Benefits of table partitioning:
• Provides data life-cycle management ("expire" old partitions)
• Partial re-computation of data at the partition level
• Query predicates can provide partition elimination
Do not use when:
• No filters, joins, or grouping
• No reuse of the data for future queries
Evolving Tables
• ALTER TABLE ADD/DROP COLUMN
ALTER TABLE T ADD COLUMN eventName string;
ALTER TABLE T DROP COLUMN col3;
ALTER TABLE T ADD COLUMN result string, clientId string, payload int?;
ALTER TABLE T DROP COLUMN clientId, result;
• Metadata-only operation
• Existing rows will get:
  • Non-nullable types: the C# data type's default value (e.g., int will be 0)
  • Nullable types: null
U-SQL Analytics
Windowing expression grammar:
Window_Function_Call 'OVER' '(' [Over_Partition_By_Clause] [Order_By_Clause] [Row_Clause] ')'.
Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call.
• Windowing aggregate functions: ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
• Analytic functions: CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG
• Ranking functions: DENSE_RANK, NTILE, RANK, ROW_NUMBER
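For illustration, a small self-contained sketch that exercises one function from each class over an inline rowset (the @sales data is made up):

```sql
// Inline rowset standing in for real data.
@sales =
    SELECT * FROM (VALUES
        ("West", "Ann", 100.0),
        ("West", "Bob",  80.0),
        ("East", "Cat", 120.0)
    ) AS S(Region, Customer, Amount);

@ranked =
    SELECT Region,
           Customer,
           Amount,
           SUM(Amount) OVER (PARTITION BY Region) AS RegionTotal,                         // windowing aggregate
           LAG(Amount) OVER (PARTITION BY Region ORDER BY Amount) AS PrevAmount,          // analytic function
           ROW_NUMBER() OVER (PARTITION BY Region ORDER BY Amount DESC) AS RankInRegion   // ranking function
    FROM @sales;

OUTPUT @ranked TO "/output/ranked.csv" USING Outputters.Csv();
```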
Expression-flow Programming Style
• Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model
• Execution plan that is optimized out of the box and without user intervention
• Per-job and user-driven level of parallelization
• Detailed visibility into execution steps, for debugging
• Heat-map-like functionality to identify performance bottlenecks
Visual Studio integration (plug-in)
What can you do with Visual Studio?
• Author U-SQL scripts (with C# code)
• Create metadata objects
• Submit and cancel U-SQL jobs
• Debug U-SQL and C# code
• Visualize and replay the progress of a job
• Fine-tune query performance
• Visualize the physical plan of a U-SQL query
• Browse the metadata catalog
Authoring U-SQL queries
Visual Studio fully supports authoring U-SQL scripts. While editing, it provides:
• IntelliSense
• Syntax color coding
• Syntax checking
• Contextual menu
Authoring with a code-behind file
C# code to extend U-SQL (e.g., a custom processor) can be authored and used directly in U-SQL Studio, without first having to create and register an external assembly.
Submitting a U-SQL job
Jobs can be submitted directly from Visual Studio in two ways. You have to be logged into Azure and have to specify the target Azure Data Lake account.
Concepts: jobs, stages, and vertexes
• Each job is broken into 'n' vertexes
• Each vertex is some work that needs to be done
• Vertexes are organized into stages: vertexes in each stage do the same work on the same data, and a vertex in one stage may depend on a vertex in an earlier stage
• Stages themselves are organized into an acyclic graph
(Diagram: an input flows through 6 stages and 8 vertexes to the outputs.)
Job execution graph
After a job is submitted, the progress of its execution through the different stages is shown and updated continuously. Important stats about the job are also displayed and updated continuously.
Job diagnostics
Diagnostic information is shown to help with debugging and performance issues.
Metadata objects
ADL Analytics creates and stores a set of metadata objects in a catalog maintained by a metadata service. Tables and TVFs are created by DDL statements (CREATE TABLE …). Metadata objects can also be created directly through the Server Explorer.
An Azure Data Lake Analytics account contains:
• Databases: tables, table-valued functions, jobs, schemas
• Linked storage
Metadata catalog
The metadata catalog can be browsed with the Visual Studio Server Explorer, which lets you:
1. Create new tables, schemas, and databases
2. Register assemblies
HDInsight: Cloud Managed Hadoop
What it is: Microsoft's implementation of Apache Hadoop (as a service) that uses Blobs for persistent storage.
When to use it:
• When you need to process large-scale data (PB+)
• When you want to use Hadoop or Spark as a service
• When you want to compute data and retire the servers, but retain the results
• When your team is familiar with the Hadoop zoo
Hadoop and HDInsight
Using the Hadoop Ecosystem to process and query data
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite: HDInsight Tools for Visual Studio
Deploying HDInsight Clusters
• Cluster types: Hadoop, Spark, HBase, and Storm
  • Hadoop clusters: for query and analysis workloads
  • HBase clusters: for NoSQL workloads
  • Spark clusters: for in-memory processing, interactive queries, stream, and machine learning workloads
• Operating system: Windows or Linux
• Can be deployed from the Azure portal, the Azure Command Line Interface (CLI), Azure PowerShell, or Visual Studio
• A UI dashboard is provided for the cluster through Ambari
• Remote access through SSH, REST API, ODBC, JDBC
• Remote Desktop (RDP) access for Windows clusters
Azure ML
What it is: A multi-platform environment and engine to create and deploy machine learning models and APIs.
When to use it:
• When you need to create predictive analytics
• When you need to share data science experiments across teams
• When you need to create callable APIs for ML functions
• When you also have R and Python experience on your data science team
The Azure ML Environment
Development environment:
• Creating experiments
• Sharing a workspace
Deployment environment:
• Publishing the model
• Using the API
• Consuming in various tools
Creating an Experiment
Create a workspace → get/prepare data → build/edit the experiment → create/update the model → evaluate model results → deploy the model → consume the model.
Basic Azure ML Elements
Import Data → Preprocess → Split Data → Algorithm + Train Model → Score Model.
Power BI
What it is: Interactive report and visualization creation for computing and mobile platforms.
When to use it:
• When you need to create and view interactive reports that combine multiple datasets
• When you need to embed reporting into an application
• When you need customizable visualizations
• When you need to create shared datasets, reports, and dashboards that you publish to your team
Microsoft Azure Big Data Analytics
Cortana Intelligence Suite: Common architectural patterns
Big Data Analytics: Data Flow
• Data: business apps, custom apps, sensors and devices
• Ingestion: bulk ingestion and event ingestion into Azure Data Lake Store
• Preparation, analytics, and machine learning: HDInsight, Data Lake Analytics
• Discovery: Azure Data Catalog
• Visualization: Power BI
• Intelligence and action: people
Event Ingestion Patterns
• Events flow from business apps, custom apps, and sensors and devices
• Event collection: Azure Event Hubs, Kafka
• Stream processing: Azure Stream Analytics, Spark Streaming
• Raw events and transformed data land in Azure Data Lake Store
• Real-time dashboards: Power BI
Bulk Ingestion and Preparation
• Raw data from business apps, custom apps, and sensors and devices is bulk-loaded by Azure Data Factory into Azure Data Lake Store
• Data preparation produces prepared data, both structured and unstructured
• Batch analytics: Azure SQL DW
• Interactive analytics: Spark on HDInsight
• Discovery: Azure Data Catalog
• Presentation: Power BI, notebooks
Big Data Lambda Architecture
• Event & data producers: applications, web and social, devices, sensors; connected through cloud gateways (web APIs) and field gateways
• Event collection / queuing system: Event Hubs, Kafka/RabbitMQ/ActiveMQ
• Stream processing: Storm / Stream Analytics, with Azure ML
• Data transformation: Hive / U-SQL, Pig, Data Factory
• Data storage: DocumentDB, MongoDB, SQL Azure, ADW, HBase, Blob Storage
• Presentation and action: Azure Search, data analytics (Excel, Power BI, Looker, Tableau), web/thick-client dashboards, live dashboards, devices to take action
Get started today!
http://aka.ms/cisolutions
Cortana Intelligence Solutions
• Try
• Deploy
• Customize (instructions and next steps)