TRANSCRIPT
Azure Data Lake Event – Big data solutions on Microsoft Azure using Azure Data Lake
Regensdorf, 02.03.2018
Agenda
Azure Data Lake Analytics
1. Welcome
(Willfried Färber – Trivadis)
2. History and current developments of Azure Data Lake
(Michael Rys – Microsoft)
3. Code session, one or two examples
(Marco Amhof – Trivadis)
4. Azure Data Lake Services – fast, flexible and at your fingertips
(Patrik Borosch – Microsoft)
5. Summary, outlook, Q&A
(Michael Rys – Microsoft)
6. Lunch
[Map of Trivadis locations: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich]
With over 600 IT and subject-matter experts on site with you.
© Trivadis – The company
14 Trivadis locations with
over 600 employees.
Over 200 service level agreements.
More than 4,000 training participants.
Research and development budget:
CHF 5.0 million.
Financially independent and
sustainably profitable.
Experience from more than 1,900 projects
per year for over 800 customers.
Big Data
Data that is too large or complex for analysis in traditional relational databases
Typified by the “3 V’s”:
Volume – Huge amounts of data to process
Variety – A mixture of structured and unstructured data
Velocity – New data generated extremely frequently
Examples: web server click-streams, sensor and IoT processing, social media sentiment analysis
Big Data Processing
▪ Batch Processing – filter, cleanse, and shape data for analysis
▪ Real-Time Processing – capture, filter, and aggregate streams of data for low-latency querying
▪ Predictive Analytics – apply statistical algorithms for classification, regression, clustering, and prediction
[Architecture diagram – from data sources to action:]
▪ Data Sources: Apps, Sensors and devices, Data
▪ Information Management: Event Hubs, Data Catalog, Data Factory
▪ Big Data Stores: SQL Data Warehouse, Data Lake Store
▪ Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
▪ Intelligence: Cognitive Services, Bot Framework, Cortana
▪ Dashboards & Visualizations: Power BI
▪ Action: People, Automated Systems, Apps (Web, Mobile, Bots)
U-SQL – A Language that Makes Big Data Processing Easy
Requirements and characteristics of Big Data analytics
▪ Process any type and any size of data
▪ BotNet attack patterns
▪ Security logs
▪ Extract features from images and videos (machine learning)
▪ The language enables you to work on any data
▪ Use custom code to easily express your complex and often proprietary business logic
▪ User-defined functions
▪ Custom input and output formats
▪ Scale efficiently to any size of data without you focusing on scale-out topologies, plumbing code, or
limitations of a specific distributed infrastructure
U-SQL Origins
▪ SCOPE – Microsoft’s internal Big Data language
▪ COSMOS – Microsoft’s internal Big Data analysis platform
▪ SQL and C# integration model
▪ Optimization and Scaling model
▪ Runs 100’000s of jobs daily
▪ Hive
▪ Complex data types (Maps, Arrays)
▪ Data format alignment for text files
▪ T-SQL/ANSI SQL
▪ Many of the SQL capabilities (windowing functions, meta data model etc.)
U-SQL Features
▪ Operating over set of files with patterns
▪ Using (Partitioned) Tables
▪ Federated Queries against Azure SQL DB
▪ Encapsulating your U-SQL code with Views, Table-Valued Functions, and
Procedures
▪ SQL Windowing Functions
▪ Programming with C# User-defined Operators (custom extractors, processors)
▪ Complex Types (MAP, ARRAY)
▪ Using U-SQL in data processing pipelines
▪ U-SQL in a lambda architecture for IOT analytics
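Operating over a set of files with a pattern can be sketched as follows; the paths and column names are illustrative, and the `{date}` parts of the path become a virtual column populated from each matching file name:

```sql
// Read all daily log files at once; {date} is a virtual column
// filled in from the matching portion of each file's path.
@logs =
    EXTRACT user   string,
            action string,
            date   DateTime
    FROM "/input/logs/{date:yyyy}/{date:MM}/{date:dd}/log.csv"
    USING Extractors.Csv();

// A predicate on the virtual column also prunes the file set:
// only partitions that can match are actually read.
@recent =
    SELECT user, action
    FROM @logs
    WHERE date >= DateTime.Parse("2018-01-01");

OUTPUT @recent TO "/output/recent.csv" USING Outputters.Csv();
```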
Query the data where it lives
▪ Avoid moving large amounts of data across the network between stores
▪ Single view of data irrespective of physical location
▪ Minimize data proliferation issues caused by maintaining multiple copies
▪ Single query language for all data
▪ Each data store maintains its own sovereignty
▪ Design choices based on the need
▪ Push SQL expressions to remote SQL sources
▪ Projections
▪ Filters
▪ Joins
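A hedged sketch of a federated query against Azure SQL DB, following the U-SQL catalog model (the data source, credential, and table names are illustrative):

```sql
// One-time setup: register the remote Azure SQL DB as a data
// source in the U-SQL catalog (a credential object must already
// exist in the database).
CREATE DATA SOURCE IF NOT EXISTS MySqlDbSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyDb.MySqlDbCred,
    REMOTABLE_TYPES = (bool, short, int, long, decimal, float, double, string, DateTime)
);

// Query the remote table where it lives; projections and filters
// like these can be pushed down to the remote SQL source.
@orders =
    SELECT OrderId, Amount
    FROM EXTERNAL MySqlDbSource LOCATION "dbo.Orders"
    WHERE Amount > 100;
```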
U-SQL = SQL + C#
▪ Unifies the ease of use of SQL with the
expressive power of C#
▪ Get benefits of both…
▪ Unstructured and structured data processing
▪ Declarative SQL and custom imperative Code
▪ Local and remote Queries
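The SQL + C# combination can be sketched in a few lines (file and column names are illustrative); the C# string expressions run inline in the declarative SELECT:

```sql
@t =
    EXTRACT author string,
            tweet  string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// Declarative SQL clause, imperative C# expressions: .NET string
// methods and the C# conditional operator, used directly in the
// projection list.
@res =
    SELECT author.ToLowerInvariant() AS author,
           tweet.Length AS tweetLength,
           (tweet.Contains("#") ? "tagged" : "plain") AS kind
    FROM @t;

OUTPUT @res TO "/output/tweets-shaped.csv" USING Outputters.Csv();
```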
U-SQL Language Overview
U-SQL Fundamentals
All the familiar SQL clauses
▪ SELECT | FROM | WHERE
GROUP BY | JOIN | OVER
▪ Operate on unstructured and
structured data
▪ Relational metadata objects
▪ Federated Queries against Azure
SQL DB and Azure SQL DWH
▪ SQL Windowing Functions
▪ EXCEPT / INTERSECT / UNION
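The windowing functions work as in T-SQL; a minimal sketch over an illustrative search log (schema and paths are assumptions):

```sql
@searchlog =
    EXTRACT UserId   int,
            Duration int,
            Query    string
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Rank each user's queries by duration with a SQL windowing
// function: the window is partitioned per user.
@ranked =
    SELECT UserId,
           Query,
           RANK() OVER (PARTITION BY UserId ORDER BY Duration DESC) AS rnk
    FROM @searchlog;

OUTPUT @ranked TO "/output/ranked.csv" USING Outputters.Csv();
```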
.NET integration and extensibility
U-SQL expressions are full C# expressions
Reuse .NET code in your own assemblies
Use C# to define your own
▪ Types
▪ Functions
▪ Joins
▪ Aggregators
▪ I/O (Extractors, Outputters)
U-SQL extensibility
▪ Extend U-SQL with C#
(.NET)
▪ Extensions require .NET
assemblies to be
registered with a database
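Registering and referencing an assembly can be sketched like this (database, assembly, and method names are illustrative):

```sql
// One-time registration: store the .NET assembly in a U-SQL
// database so scripts can reference it.
USE DATABASE MyDb;
CREATE ASSEMBLY IF NOT EXISTS MyHelpers FROM "/assemblies/MyHelpers.dll";
```

A later script then pulls it in with `REFERENCE ASSEMBLY MyDb.MyHelpers;` and can call its public static methods (e.g. `MyHelpers.Text.Normalize(tweet)`) inside U-SQL expressions.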
Why use U-SQL
U-SQL makes Big Data processing easy because it:
▪ Unifies declarative nature of SQL with the imperative power of C#
▪ Unifies querying structured, semi-structured and unstructured data
▪ Unifies local and remote queries
▪ Distributed query support over all data
▪ Increases productivity and agility from Day 1 for YOU!
Azure Data Lake in Visual Studio
Prerequisites
▪ Visual Studio 2017 (under data storage and processing workload), Visual Studio 2015 update 3,
Visual Studio 2013 update 4, or Visual Studio 2012
Enterprise (Ultimate/Premium), Professional, Community editions are supported; Express edition
is not supported
▪ Microsoft Azure SDK for .NET version 2.7.1 or above. Install it using the Web platform installer.
▪ Data Lake Tools for Visual Studio
▪ Once Data Lake Tools for Visual Studio is installed, you will see a "Data Lake Analytics" node
in Server Explorer under the "Azure" node (Open Server Explorer by pressing Ctrl+Alt+S).
▪ Data Lake Analytics account and sample data
The Data Lake Tools do not support creating Data Lake Analytics accounts. Create an account
using the Azure portal, Azure PowerShell, .NET SDK or Azure CLI. For your convenience, a
PowerShell script for creating a Data Lake Analytics service and uploading the source data file
can be found in Appx-A PowerShell sample for preparing the tutorial.
Documentation
U-SQL Script
@t = EXTRACT date string
, time string
, author string
, tweet string
FROM "/input/MyTwitterHistory.csv"
USING Extractors.Csv();
@res = SELECT author
, COUNT(*) AS tweetcount
FROM @t
GROUP BY author;
OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();
This U-SQL script
▪ extracts the source data file
using Extractors.Csv()
▪ transforms (aggregates) the
data using SQL
▪ creates a CSV file using
Outputters.Csv()
Read the input, write it directly to output (just a simple copy)
▪ Apply schema on read
▪ From a file in a Data Lake
▪ Easy delimited text handling
▪ Write out the rowset
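The simple copy described above can be sketched as a complete script (file names and schema are illustrative):

```sql
// Schema on read: the file has no schema of its own; EXTRACT
// imposes one as the rows are read from the Data Lake.
@searchlog =
    EXTRACT UserId   int,
            Start    DateTime,
            Query    string
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset straight back out: a simple copy,
// re-encoded as delimited text.
OUTPUT @searchlog
TO "/output/SearchLog-copy.csv"
USING Outputters.Csv();
```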
Extract – Transform - Persist
▪ Retrieve data from stored locations in rowset format
▪ Stored locations can be files that will be schematized on read with EXTRACT
expressions
▪ Stored locations can be U-SQL tables that are stored in a schematized format
▪ Or can be tables provided by other data sources such as an Azure SQL database
▪ Transform the rowset(s)
▪ Several transformations over the rowsets can be composed in a data flow format
▪ Store the transformed rowset data
▪ Store it in a file with an OUTPUT statement, or
▪ Store it in a U-SQL table with an INSERT statement
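The persist-into-a-table variant can be sketched as follows; the table name, distribution scheme, and input file are assumptions (U-SQL tables require a clustered index and a distribution):

```sql
// One-time: create a schematized U-SQL table to persist into.
CREATE TABLE IF NOT EXISTS dbo.TweetCounts
(
    author     string,
    tweetcount long,
    INDEX idx CLUSTERED (author) DISTRIBUTED BY HASH (author)
);

@t =
    EXTRACT author string,
            tweet  string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// Store the transformed rowset in the table instead of a file.
INSERT INTO dbo.TweetCounts
SELECT author, COUNT(*) AS tweetcount
FROM @t
GROUP BY author;
```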
Run the U-SQL job
Submit Job
▪ (local) to run script locally
▪ Data Lake Analytics account
to run script in the cloud
Solution Explorer
▪ Right-click Script.usql and
click Submit Script
ADLA - Jobs
HDInsight ("Cluster Service")
▪ Create a cluster of N nodes
▪ Pay as long as the cluster exists and
is up and running
▪ Delete the cluster when done
Analytics ("Job Service")
▪ Submit a job (a U-SQL script) and
reserve N nodes of parallelism per job run
▪ Pay as long as the job is running
(1 AU = CHF 1.807 / hour)
▪ Nodes go away when the job finishes
Benefits of ADLA Job Service
▪ Pay for what you use
▪ Easier
▪ No need to fetch logs on the cluster
▪ No tuning needed of cluster
▪ Job History / Job Replay
▪ Performance analysis
▪ Vertex Debugging
▪ Built-in Job Monitoring
▪ Built-in Auditing
U-SQL Job Workflow
[Diagram:]
▪ Job submission: Job Front End → Job Scheduler → Compiler Service → Job Queue
▪ Job execution: Job Manager → YARN → U-SQL Runtime (vertex execution)
▪ U-SQL Catalog
Demo oeVCH
EinwohnerCH.csv (source: BFS)
HaltestellenCH.csv (source: BAV)
Azure SQL DW
ADLA – Cognitive Services
Imaging
– Detect faces
– Detect emotion
– Detect objects (tagging)
– OCR (optical character recognition)
Text
– Key Phrase Extraction
– Sentiment Analysis
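An image-tagging sketch following the Microsoft U-SQL cognitive samples; the cognitive extensions must first be installed in the ADLA account, and the exact assembly, operator, and output column names vary by extension version, so treat these as assumptions:

```sql
// Cognitive extension assemblies (names as in the samples).
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Read the raw image bytes; {FileName} is a virtual column from
// the file path.
@images =
    EXTRACT FileName string,
            ImgData  byte[]
    FROM @"/images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Detect and tag objects in each image with a built-in
// user-defined processor.
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags SQL.MAP<string, float?>
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();
```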
Demo
images.csv
images_with_food.csv
tags_aggregated.csv