stéphane fréchette - samedi sql - introduction to hdinsight

Introduction to HDInsight Stéphane Fréchette Saturday February 7, 2015

Upload: msdevmtl

Post on 30-Jul-2015


Introduction to HDInsight

Stéphane Fréchette
Saturday February 7, 2015

Who am I?

My name is Stéphane Fréchette

SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data |NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau

I have a passion for architecting, designing and building solutions that matter.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: [email protected]

Topics

• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A

“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…” - Wikipedia

What is Big Data?

[Figure: data sources plotted by Volume against Velocity / Variety; additional labels: Many Options, Variability]

Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Storage cost per GB:
1980: $190,000
1990: $9,000
2000: $15
2010: $0.07

What is Big Data?

Common Scenarios

• Clickstream Analysis
• Sensor/Machine
• Time and Place
• Server Logs
• Sentiment
• Text

What is Big Data?

Hadoop

• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• Designed to scale up from single servers to thousands of machines, each offering local computation and storage

                TRADITIONAL RDBMS          HADOOP
Data Size       Gigabytes (Terabytes)      Petabytes (Exabytes)
Access          Interactive and Batch      Batch
Updates         Read / Write many times    Write once, Read many times
Structure       Static Schema              Dynamic Schema
Integrity       High (ACID)                Low
Scaling         Nonlinear                  Linear
DBA Ratio       1:40                       1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Hadoop

HDFS

• Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

HDFS ≠ Database
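To make the storage idea concrete, here is a toy sketch of how a file could be cut into blocks and each block copied to several servers. It is not how HDFS itself decides placement (real placement is rack-aware and managed by the NameNode); the function name placeBlocks, the server names, and the round-robin rule are assumptions made for illustration. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults, assumed here.

```javascript
// Toy model of HDFS-style storage: split a file into fixed-size blocks and
// place each block's replicas on distinct servers. Illustrative only.
var BLOCK_SIZE_MB = 128;   // assumed block size (Hadoop 2.x default)
var REPLICATION = 3;       // assumed replication factor (HDFS default)

// Hypothetical helper: returns one entry per block with the servers holding it.
function placeBlocks(fileSizeMb, servers) {
    if (servers.length < REPLICATION) {
        throw new Error("need at least " + REPLICATION + " servers");
    }
    var blockCount = Math.ceil(fileSizeMb / BLOCK_SIZE_MB);
    var placement = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < REPLICATION; r++) {
            // Round-robin over consecutive servers, so the replicas of one
            // block never land on the same server.
            replicas.push(servers[(b + r) % servers.length]);
        }
        placement.push({ block: b, replicas: replicas });
    }
    return placement;
}

// A 300 MB file on four servers needs three blocks.
var layout = placeBlocks(300, ["server1", "server2", "server3", "server4"]);
layout.forEach(function (entry) {
    console.log("block " + entry.block + " -> " + entry.replicas.join(", "));
});
```

Losing any single server in this sketch still leaves two copies of every block, which is the property that lets the cluster run on commodity hardware.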

MapReduce

• MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Processing functions:
- Mapping
- Reducing

First, store the data

[Diagram: Files distributed across four Servers]

How does it work?

Second, take the processing to the data…

// MapReduce word count in JavaScript

// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: add up the counts emitted for one word and emit (word, total).
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
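To try the word-count logic without a cluster, the following sketch runs the same map and reduce functions through a small in-memory driver. The driver (runLocally), its context objects, and the iterator are hypothetical stand-ins for what the Hadoop runtime would provide; they only mimic the write, hasNext, and next calls used on the slide.

```javascript
// Same word-count logic as the slide's map and reduce functions.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver (not part of Hadoop): map every line, group the
// emitted values by key (the "shuffle"), then reduce each group.
function runLocally(lines) {
    var groups = Object.create(null);
    var mapContext = {
        write: function (k, v) {
            if (!groups[k]) { groups[k] = []; }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    var results = {};
    var reduceContext = { write: function (k, v) { results[k] = v; } };
    Object.keys(groups).sort().forEach(function (k) {
        var vals = groups[k];
        var i = 0;
        // Minimal iterator exposing the hasNext/next interface reduce expects.
        var iterator = {
            hasNext: function () { return i < vals.length; },
            next: function () { return vals[i++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return results;
}

var counts = runLocally(["Big data is big", "Hadoop is for big data"]);
console.log(counts);
```

On a real cluster the map calls run on the servers that hold the input blocks, and only the grouped intermediate pairs travel over the network to the reducers.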

[Diagram: Code, Runtime, and four Servers]

How does it work?

[Diagram: Hadoop ecosystem components]

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / Sqoop / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Pipeline / workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Azure Storage Vault (ASV)
• APS | Polybase
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing

Legend:
Red = Core Hadoop
Blue = Data processing
Purple = Microsoft integration points and value adds
Orange = Data Movement
Green = Packages

Hadoop Ecosystem

HDInsight

• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service

Storage: two choices

• Azure Storage (Blob)
• File System
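When the cluster uses Azure Blob storage, files are typically addressed with a wasb:// URI rather than a local HDFS path. The sketch below only assembles such a URI as a string. The helper name wasbUri and the container, account, and path values are hypothetical, and the URI layout shown is an assumption to verify against the HDInsight documentation for your cluster version.

```javascript
// Build a wasb:// (or wasbs:// for SSL) URI that points at a blob.
// Assumed layout: wasb://<container>@<account>.blob.core.windows.net/<path>
function wasbUri(container, account, path, useSsl) {
    var scheme = useSsl ? "wasbs" : "wasb";
    var cleanPath = path.replace(/^\/+/, "");   // drop leading slashes
    return scheme + "://" + container + "@" + account +
        ".blob.core.windows.net/" + cleanPath;
}

// Hypothetical container, storage account, and file path.
var uri = wasbUri("mycontainer", "mystorageaccount", "/example/data/sample.log", false);
console.log(uri);
```

Because the data lives in the storage account and not on the cluster nodes, it can outlive the cluster, which suits the on-demand model described above.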

Demo [Spinning up an HDInsight cluster ;-)]

Now what?

Working with your HDInsight cluster - running jobs, import/export data, viewing and consuming data…

• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others

What is Hive?

• A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools

http://hive.apache.org
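One way to submit HiveQL from code is through the WebHCat (Templeton) REST interface. The sketch below only builds the request and does not send it. The cluster name, user name, table (words), column (word), and output directory are hypothetical, the helper buildHiveRequest is not part of any SDK, and the endpoint path and form fields (user.name, execute, statusdir) are assumptions to check against the WebHCat documentation. Authentication is left out.

```javascript
// Build (but do not send) a WebHCat request that would run a HiveQL statement.
// Assumed endpoint: https://<cluster>.azurehdinsight.net/templeton/v1/hive
function buildHiveRequest(clusterName, userName, hiveQl, statusDir) {
    var body = new URLSearchParams();
    body.append("user.name", userName);   // user to run the job as
    body.append("execute", hiveQl);       // HiveQL statement to run
    body.append("statusdir", statusDir);  // where job status and output are written
    return {
        method: "POST",
        url: "https://" + clusterName + ".azurehdinsight.net/templeton/v1/hive",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: body.toString()
    };
}

// Hypothetical query: top ten words from a table named "words".
var request = buildHiveRequest(
    "mycluster",
    "admin",
    "SELECT word, COUNT(*) AS total FROM words GROUP BY word ORDER BY total DESC LIMIT 10;",
    "/example/hive-output"
);
console.log(request.method + " " + request.url);
console.log(request.body);
```

The query reads like SQL, but Hive compiles it into jobs that run across the cluster, so results are returned in batch and are not interactive.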

What is Pig?

• Write complex MapReduce jobs using a simple scripting language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly

http://pig.apache.org

What is Sqoop?

• Command-line interface application to transfer bulk data between Hadoop and relational datastores

http://sqoop.apache.org

Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]

[Diagram: Hadoop, Data Analytics]

Data Flow

Demo [Self-Service BI with Hive and Excel…]

Capabilities

• Machine Learning
• Graph Processing
• Distributed Compute
• Extract Load Transform
• Predictive Analysis

Data Knowledge Action

Summary

Resources

• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F

What Questions Do You Have?

Thank you for attending this session