TRANSCRIPT
Azure Data Lake Event – Big data solutions on Microsoft Azure using Azure Data Lake
Regensdorf, 02.03.2018
Agenda
Azure Data Lake Analytics
1. Welcome
(Willfried Färber – Trivadis)
2. History and current developments of Azure Data Lake
(Michael Rys – Microsoft)
3. Code session, one or two examples
(Marco Amhof – Trivadis)
4. Azure Data Lake Services – fast, flexible and at your fingertips
(Patrik Borosch – Microsoft)
5. Summary, outlook, Q&A
(Michael Rys – Microsoft)
6. Lunch
[Map of Trivadis locations: Basel, Bern, Brugg, Copenhagen, Düsseldorf, Frankfurt, Freiburg, Geneva, Hamburg, Lausanne, Munich, Stuttgart, Vienna, Zurich]
With over 600 IT and subject-matter experts on site with you.
© Trivadis – The company
14 Trivadis locations with
over 600 employees.
Over 200 service level agreements.
More than 4,000 training participants.
Research and development budget:
CHF 5.0 million.
Financially independent and
sustainably profitable.
Experience from more than 1,900 projects
per year for over 800 customers.
Big Data
Data that is too large or complex for analysis in traditional relational databases
Typified by the “3 V’s”:
Volume – Huge amounts of data to process
Variety – A mixture of structured and unstructured data
Velocity – New data generated extremely frequently
Examples: web server click-streams, sensor and IoT processing, social media sentiment analysis
Big Data Processing
▪ Batch Processing – filter, cleanse, and shape data for analysis
▪ Real-Time Processing – capture, filter, and aggregate streams of data for low-latency querying
▪ Predictive Analytics – apply statistical algorithms for classification, regression, clustering, and prediction
[Architecture diagram – from data sources to action:]
▪ Data Sources: Apps, Sensors and devices, Data
▪ Information Management: Event Hubs, Data Catalog, Data Factory
▪ Big Data Stores: SQL Data Warehouse, Data Lake Store
▪ Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
▪ Intelligence: Cognitive Services, Bot Framework, Cortana
▪ Dashboards & Visualizations: Power BI
▪ Action: People, Automated Systems, Apps (Web, Mobile, Bots)
U-SQL – A Language that Makes Big Data Processing Easy
Requirements and characteristics of Big Data analytics
▪ Process any type and any size of data
▪ BotNet attack patterns
▪ Security logs
▪ Extract features from images and videos (machine learning)
▪ The language enables you to work on any data
▪ Use custom code to easily express your complex and often proprietary business logic
▪ User-defined functions
▪ Custom input and output formats
▪ Scale efficiently to any size of data without you focusing on scale-out topologies, plumbing code, or
limitations of a specific distributed infrastructure
U-SQL Origins
▪ SCOPE – Microsoft’s internal Big Data language
▪ COSMOS – Microsoft’s internal Big Data analysis platform
▪ SQL and C# integration model
▪ Optimization and Scaling model
▪ Runs 100’000s of jobs daily
▪ Hive
▪ Complex data types (Maps, Arrays)
▪ Data format alignment for text files
▪ T-SQL/ANSI SQL
▪ Many of the SQL capabilities (windowing functions, meta data model etc.)
U-SQL Features
▪ Operating over set of files with patterns
▪ Using (Partitioned) Tables
▪ Federated Queries against Azure SQL DB
▪ Encapsulating your U-SQL code with Views, Table-Valued Functions, and
Procedures
▪ SQL Windowing Functions
▪ Programming with C# User-defined Operators (custom extractors, processors)
▪ Complex Types (MAP, ARRAY)
▪ Using U-SQL in data processing pipelines
▪ U-SQL in a lambda architecture for IOT analytics
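Operating over a set of files with a pattern can be sketched as follows; the paths and column names are illustrative, and the `{date}` parts of the path become a virtual column populated from each matching file name:

```sql
// Read all daily log files at once; {date} is a virtual column
// filled in from the matching portion of each file's path.
@logs =
    EXTRACT user   string,
            action string,
            date   DateTime
    FROM "/input/logs/{date:yyyy}/{date:MM}/{date:dd}/log.csv"
    USING Extractors.Csv();

// A predicate on the virtual column also prunes the file set:
// only partitions that can match are actually read.
@recent =
    SELECT user, action
    FROM @logs
    WHERE date >= DateTime.Parse("2018-01-01");

OUTPUT @recent TO "/output/recent.csv" USING Outputters.Csv();
```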
Query the data where it lives
▪ Avoid moving large amounts of data across the network between stores
▪ Single view of data irrespective of physical location
▪ Minimize data proliferation issues caused by maintaining multiple copies
▪ Single query language for all data
▪ Each data store maintains its own sovereignty
▪ Design choices based on the need
▪ Push SQL expressions to remote SQL sources
▪ Projections
▪ Filters
▪ Joins
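A hedged sketch of a federated query against Azure SQL DB, following the U-SQL catalog model (the data source, credential, and table names are illustrative):

```sql
// One-time setup: register the remote Azure SQL DB as a data
// source in the U-SQL catalog (a credential object must already
// exist in the database).
CREATE DATA SOURCE IF NOT EXISTS MySqlDbSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyDb.MySqlDbCred,
    REMOTABLE_TYPES = (bool, short, int, long, decimal, float, double, string, DateTime)
);

// Query the remote table where it lives; projections and filters
// like these can be pushed down to the remote SQL source.
@orders =
    SELECT OrderId, Amount
    FROM EXTERNAL MySqlDbSource LOCATION "dbo.Orders"
    WHERE Amount > 100;
```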
U-SQL = SQL + C#
▪ Unifies the ease of use of SQL with the
expressive power of C#
▪ Get benefits of both…
▪ Unstructured and structured data processing
▪ Declarative SQL and custom imperative Code
▪ Local and remote Queries
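The SQL + C# combination can be sketched in a few lines (file and column names are illustrative); the C# string expressions run inline in the declarative SELECT:

```sql
@t =
    EXTRACT author string,
            tweet  string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// Declarative SQL clause, imperative C# expressions: .NET string
// methods and the C# conditional operator, used directly in the
// projection list.
@res =
    SELECT author.ToLowerInvariant() AS author,
           tweet.Length AS tweetLength,
           (tweet.Contains("#") ? "tagged" : "plain") AS kind
    FROM @t;

OUTPUT @res TO "/output/tweets-shaped.csv" USING Outputters.Csv();
```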
U-SQL Language Overview
U-SQL Fundamentals
All the familiar SQL clauses
▪ SELECT | FROM | WHERE
GROUP BY | JOIN | OVER
▪ Operate on unstructured and
structured data
▪ Relational metadata objects
▪ Federated Queries against Azure
SQL DB and Azure SQL DWH
▪ SQL Windowing Functions
▪ EXCEPT / INTERSECT / UNION
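The windowing functions work as in T-SQL; a minimal sketch over an illustrative search log (schema and paths are assumptions):

```sql
@searchlog =
    EXTRACT UserId   int,
            Duration int,
            Query    string
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Rank each user's queries by duration with a SQL windowing
// function: the window is partitioned per user.
@ranked =
    SELECT UserId,
           Query,
           RANK() OVER (PARTITION BY UserId ORDER BY Duration DESC) AS rnk
    FROM @searchlog;

OUTPUT @ranked TO "/output/ranked.csv" USING Outputters.Csv();
```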
.NET integration and extensibility
U-SQL expressions are full C# expressions
Reuse .NET code in your own assemblies
Use C# to define your own
▪ Types
▪ Functions
▪ Joins
▪ Aggregators
▪ I/O (Extractors, Outputters)
U-SQL extensibility
▪ Extend U-SQL with C#
(.NET)
▪ Extensions require .NET
assemblies to be
registered with a database
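Registering and referencing an assembly can be sketched like this (database, assembly, and method names are illustrative):

```sql
// One-time registration: store the .NET assembly in a U-SQL
// database so scripts can reference it.
USE DATABASE MyDb;
CREATE ASSEMBLY IF NOT EXISTS MyHelpers FROM "/assemblies/MyHelpers.dll";
```

A later script then pulls it in with `REFERENCE ASSEMBLY MyDb.MyHelpers;` and can call its public static methods (e.g. `MyHelpers.Text.Normalize(tweet)`) inside U-SQL expressions.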
Why use U-SQL
U-SQL makes Big Data processing easy because it:
▪ Unifies declarative nature of SQL with the imperative power of C#
▪ Unifies querying structured, semi-structured and unstructured data
▪ Unifies local and remote queries
▪ Distributed query support over all data
▪ Increases productivity and agility from Day 1 for YOU!
Azure Data Lake in Visual Studio
Prerequisites
▪ Visual Studio 2017 (under data storage and processing workload), Visual Studio 2015 update 3,
Visual Studio 2013 update 4, or Visual Studio 2012
Enterprise (Ultimate/Premium), Professional, Community editions are supported; Express edition
is not supported
▪ Microsoft Azure SDK for .NET version 2.7.1 or above. Install it using the Web platform installer.
▪ Data Lake Tools for Visual Studio
▪ Once Data Lake Tools for Visual Studio is installed, you will see a "Data Lake Analytics" node
in Server Explorer under the "Azure" node (Open Server Explorer by pressing Ctrl+Alt+S).
▪ Data Lake Analytics account and sample data
The Data Lake Tools do not support creating Data Lake Analytics accounts. Create an account
using the Azure portal, Azure PowerShell, .NET SDK or Azure CLI. For your convenience, a
PowerShell script for creating a Data Lake Analytics service and uploading the source data file
can be found in Appx-A PowerShell sample for preparing the tutorial.
Documentation
U-SQL Script
@t = EXTRACT date string
, time string
, author string
, tweet string
FROM "/input/MyTwitterHistory.csv"
USING Extractors.Csv();
@res = SELECT author
, COUNT(*) AS tweetcount
FROM @t
GROUP BY author;
OUTPUT @res TO "/output/MyTwitterAnalysis.csv"
ORDER BY tweetcount DESC
USING Outputters.Csv();
This U-SQL script
▪ extracts the source data file
using Extractors.Csv()
▪ transforms (aggregates) the
data using SQL
▪ creates a CSV file using
Outputters.Csv()
Read the input, write it directly to output (just a simple copy)
▪ Apply schema on read
▪ From a file in a Data Lake
▪ Easy delimited text handling
▪ Write out the rowset
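The simple copy described above can be sketched as a complete script (file names and schema are illustrative):

```sql
// Schema on read: the file has no schema of its own; EXTRACT
// imposes one as the rows are read from the Data Lake.
@searchlog =
    EXTRACT UserId   int,
            Start    DateTime,
            Query    string
    FROM "/input/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset straight back out: a simple copy,
// re-encoded as delimited text.
OUTPUT @searchlog
TO "/output/SearchLog-copy.csv"
USING Outputters.Csv();
```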
Extract – Transform - Persist
▪ Retrieve data from stored locations in rowset format
▪ Stored locations can be files that will be schematized on read with EXTRACT
expressions
▪ Stored locations can be U-SQL tables that are stored in a schematized format
▪ Or can be tables provided by other data sources such as an Azure SQL database
▪ Transform the rowset(s)
▪ Several transformations over the rowsets can be composed in a data flow format
▪ Store the transformed rowset data
▪ Store it in a file with an OUTPUT statement, or
▪ Store it in a U-SQL table with an INSERT statement
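The persist-into-a-table variant can be sketched as follows; the table name, distribution scheme, and input file are assumptions (U-SQL tables require a clustered index and a distribution):

```sql
// One-time: create a schematized U-SQL table to persist into.
CREATE TABLE IF NOT EXISTS dbo.TweetCounts
(
    author     string,
    tweetcount long,
    INDEX idx CLUSTERED (author) DISTRIBUTED BY HASH (author)
);

@t =
    EXTRACT author string,
            tweet  string
    FROM "/input/MyTwitterHistory.csv"
    USING Extractors.Csv();

// Store the transformed rowset in the table instead of a file.
INSERT INTO dbo.TweetCounts
SELECT author, COUNT(*) AS tweetcount
FROM @t
GROUP BY author;
```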
Run the U-SQL job
Submit Job
▪ (local) to run script locally
▪ Data Lake Analytics account
to run script in the cloud
Solution Explorer
▪ Right-click Script.usql and
click Submit Script
ADLA - Jobs
HDInsight ("Cluster Service")
▪ Create a cluster of N nodes
▪ Pay as long as the cluster exists and
is up and running
▪ Delete the cluster when done
Analytics ("Job Service")
▪ Submit a job (a U-SQL script) and
reserve N nodes of parallelism per job run
▪ Pay as long as the job is running
(1 AU = CHF 1.807 / hour)
▪ Nodes go away when the job finishes
Benefits of ADLA Job Service
▪ Pay for what you use
▪ Easier
▪ No need to fetch logs on the cluster
▪ No tuning needed of cluster
▪ Job History / Job Replay
▪ Performance analysis
▪ Vertex Debugging
▪ Built-in Job Monitoring
▪ Built-in Auditing
U-SQL Job Workflow
[Diagram:]
▪ Job submission: Job Front End → Job Scheduler → Compiler Service → Job Queue
▪ Job execution: Job Manager → YARN → U-SQL Runtime (vertex execution)
▪ U-SQL Catalog
Demo oeVCH
EinwohnerCH.csv (source: BFS)
HaltestellenCH.csv (source: BAV)
Azure SQL DW
ADLA – Cognitive Services
Imaging
– Detect faces
– Detect emotion
– Detect objects (tagging)
– OCR (optical character recognition)
Text
– Key Phrase Extraction
– Sentiment Analysis
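An image-tagging sketch following the Microsoft U-SQL cognitive samples; the cognitive extensions must first be installed in the ADLA account, and the exact assembly, operator, and output column names vary by extension version, so treat these as assumptions:

```sql
// Cognitive extension assemblies (names as in the samples).
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY ImageTagging;

// Read the raw image bytes; {FileName} is a virtual column from
// the file path.
@images =
    EXTRACT FileName string,
            ImgData  byte[]
    FROM @"/images/{FileName}.jpg"
    USING new Cognition.Vision.ImageExtractor();

// Detect and tag objects in each image with a built-in
// user-defined processor.
@tags =
    PROCESS @images
    PRODUCE FileName,
            NumObjects int,
            Tags SQL.MAP<string, float?>
    READONLY FileName
    USING new Cognition.Vision.ImageTagger();
```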
Demo
images.csv
images_with_food.csv
tags_aggregated.csv