HDInsight on Azure and Map-Reduce Richard Conway Windows Azure MVP Elastacloud Limited

Upload: baldwin-mosley

Post on 27-Dec-2015


Page 1

HDInsight on Azure and Map-Reduce

Richard Conway, Windows Azure MVP, Elastacloud Limited

Page 2

Agenda

Introduction

Big Data with HDInsight

Page 3

Introduction

Page 4

Solving problems through distribution

Some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines.

These 24 machines require orchestration; jobs are to be divided into tasks and tasks are distributed across a cluster.

Software systems are required to facilitate the distribution; examples are Hadoop and HPC Server.
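The "24 hours on 1 machine, 1 hour on 24 machines" figure is the ideal case of linear scaling. A minimal sketch of that arithmetic follows; the function names and the fixed orchestration overhead are assumptions for illustration, not figures from the talk.

```javascript
// Ideal (linear) scaling: the same total work spread evenly over more machines.
// Assumes the work divides perfectly and orchestration costs nothing.
function idealWallClockHours(totalMachineHours, machines) {
  if (machines < 1) {
    throw new RangeError("machines must be >= 1");
  }
  return totalMachineHours / machines;
}

// Hypothetical refinement: add a fixed orchestration cost per job.
function wallClockHoursWithOverhead(totalMachineHours, machines, overheadHours) {
  return idealWallClockHours(totalMachineHours, machines) + overheadHours;
}

console.log(idealWallClockHours(24, 1));  // 24
console.log(idealWallClockHours(24, 24)); // 1
```

In practice the orchestration mentioned above adds some cost, which is why the second function exists; the size of that cost depends on the cluster and the job.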

We will now provision a Hadoop cluster on Windows Azure.

Page 5

Big Data vs Big Compute

Page 6

Compute Bound: HPC Server, Open MPI

IO Bound: Hadoop

Page 7

All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
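The job-to-tasks idea can be sketched in a few lines. This is an in-process simulation only: the round-robin assignment and all function names are assumptions chosen for illustration, and a real cluster would run each worker's tasks on a separate machine.

```javascript
// Split a job (an array of work items) into tasks of at most `taskSize` items.
function splitIntoTasks(items, taskSize) {
  const tasks = [];
  for (let i = 0; i < items.length; i += taskSize) {
    tasks.push(items.slice(i, i + taskSize));
  }
  return tasks;
}

// Assign tasks to workers round-robin (hypothetical scheduling policy).
function assignRoundRobin(tasks, workerCount) {
  const assignments = Array.from({ length: workerCount }, () => []);
  tasks.forEach((task, index) => assignments[index % workerCount].push(task));
  return assignments;
}

// Each worker handles its tasks independently; partial results are combined at the end.
// Here the workers run one after another; only the division of work is illustrated.
function runJob(items, taskSize, workerCount, processTask, combine) {
  const assignments = assignRoundRobin(splitIntoTasks(items, taskSize), workerCount);
  const partials = assignments.flatMap((tasks) => tasks.map(processTask));
  return combine(partials);
}

const numbers = Array.from({ length: 10 }, (_, i) => i + 1); // 1..10
const sumTask = (task) => task.reduce((a, b) => a + b, 0);
const total = runJob(numbers, 3, 2, sumTask, (partials) => partials.reduce((a, b) => a + b, 0));
console.log(total); // 55
```

The job here is "sum the numbers 1 to 10"; it becomes four tasks spread over two workers, and the combined result matches the single-machine answer.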

Page 8

Hadoop

Name Node Name Node

Data Nodes

HPC

Head Node Broker Node

Worker Nodes

Page 9

Understanding Big Data

Page 10

KEY TRENDS

Cheap Storage: $100 gets you 3 million times more storage in 30 years

Inexpensive Computing: 1980: 10 MIPS/$; 2005: 10M MIPS/$

Device Explosion: >5.5 billion (70+% of global population)

Social Networks: >2 billion users

Ubiquitous Connection: web traffic 2010: 130 exabytes (10^18); 2015: 1.6 zettabytes (10^21)

Sensor Networks: >10 billion

Page 11

What is Big Data?

Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)

Velocity - Variety - Variability

Storage/GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07

Page 12

Big Data, BIG OPPORTUNITY

Big Data is a top priority for institutions

49% of CEOs and CIOs are planning big data projects

Software Growth (billions $): 2012: 1.8; 2013: 2.5; 2014: 3.4; 2015: 4.6; 34% compound annual growth rate [2]

Services Growth (billions $): 2012: 2.7; 2013: 3.9; 2014: 5.1; 2015: 6.5; 39% compound annual growth rate [2]

1. McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012

2. IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012

Page 13

Devices: Internet and Internet of things

Internet of things: invisible devices; trillions of networked nodes; low-bandwidth last-mile connection (100 kbit/sec); mostly addressed by local schemes; machine-centric, sensing focus; trillions of computer-enabled devices which are part of the IoT

Internet: laptops / tablets / smartphones; billions of networked devices; high-bandwidth access (cable: 10 Mbps+, fiber: 50-100 Mbps); global addressing; user-centric, communication focus; 6+ billion people, 1.5 billion use the net; US: 4.3 devices per adult

Page 14

Big Data Scenarios

Page 15

Short History of Hadoop

Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale

Hadoop started as a part of the Nutch project

In Jan 2006 Doug Cutting started working on Hadoop at Yahoo

Factored out of Nutch in Feb 2006

First release of Apache Hadoop in September 2007

Jan 2008: Hadoop became a top-level Apache project

Page 16

Hadoop Distributed Architecture

Page 17

FIRST, STORE THE DATA

MapReduce: Move Code to the Data

[Diagram: files stored across four servers]

Page 18

SECOND, TAKE THE PROCESSING TO THE DATA

So How Does It Work?

// Map Reduce function in JavaScript

var map = function (key, value, context) {
    // Split the input line on non-letters and emit (word, 1) for every word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum the counts emitted for this word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

[Diagram: the runtime ships the code to each of four servers]
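To see what the runtime does with these two functions, the sketch below runs them against a small in-memory stand-in. The `runLocally` helper, its `context` objects and the `hasNext`/`next` iterator are assumptions that imitate the shape the slide's code expects; they are not the HDInsight runtime. The slide's `map` and `reduce` are repeated (with balanced braces) so the block runs on its own.

```javascript
// map and reduce as on the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical in-memory stand-in for the runtime: map every line,
// group the emitted values by key, then call reduce once per key.
function runLocally(lines) {
  var groups = new Map(); // word -> list of emitted values
  var mapContext = {
    write: function (k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    }
  };
  lines.forEach(function (line, offset) {
    map(offset, line, mapContext);
  });

  var output = {};
  var reduceContext = { write: function (k, v) { output[k] = v; } };
  Array.from(groups.keys()).sort().forEach(function (k) {
    var values = groups.get(k);
    var position = 0;
    var iterator = {
      hasNext: function () { return position < values.length; },
      next: function () { return values[position++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}

var counts = runLocally(["the quick brown fox", "The lazy dog and the fox"]);
console.log(counts.the, counts.fox); // 3 2
```

On a cluster the map calls run on the servers that hold the data and the grouping happens across the network; the per-key contract seen by `reduce` is the same.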

Page 19

Traditional RDBMS vs. NoSQL

                 TRADITIONAL RDBMS            HADOOP
Data Size        Gigabytes (Terabytes)        Petabytes (Exabytes)
Access           Interactive and Batch        Batch
Updates          Read / Write many times      Write once, Read many times
Structure        Static Schema                Dynamic Schema
Integrity        High (ACID)                  Low
Scaling          Nonlinear                    Linear
DBA Ratio        1:40                         1:3000

Reference: Tom White’s Hadoop: The Definitive Guide

Page 20

Windows Azure HDInsight Service

Page 21

Creating an HDInsight Cluster Demo

Page 22

MICROSOFT CONFIDENTIAL – INTERNAL ONLY

HDINSIGHT / HADOOP Eco-System

Distributed Storage (HDFS)

Distributed Processing (MapReduce)

Query (Hive)

Scripting (Pig)

NoSQL Database (HBase)

Metadata (HCatalog)

Data Integration (ODBC / SQOOP / REST)

Relational (SQL Server)

Machine Learning (Mahout)

Graph (Pegasus)

Stats processing (RHadoop)

Event Pipeline (Flume)

Pipeline / workflow (Oozie)

Active Directory (Security)

Monitoring & Deployment (System Center)

C#, F#, .NET

JavaScript

Azure Storage Vault (ASV)

PDW Polybase

Business Intelligence (Excel, Power View, SSAS)

World's Data (Azure Data Marketplace)

Event Driven Processing

Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages

Page 23

Storing Data with HDInsight

Page 24

HDFS on Azure: Tale of two File Systems

DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node

HDFS API

Azure Storage (ASV): Azure Blob Storage, with Front end, Partition Layer and Stream Layer

Page 25

Azure Storage (ASV)

• Default file system for HDInsight Service

• Provides sharable, persistent, highly-scalable storage with high availability (Azure Blob Store)

• Azure storage itself does not provide compute

• Fast access from compute nodes to data in same data center

• Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>

• Requires storage key in core-site.xml:

<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
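As a rough illustration of the addressing scheme and the key property, the sketch below builds both strings. The helper names, the sample container and account names, and the `//` after the scheme (assumed from standard URI syntax) are illustrative assumptions; the property name simply follows the pattern shown on the slide.

```javascript
// Build an ASV address of the form shown on the slide.
// `secure` selects the asvs scheme instead of asv.
function asvUri(container, account, path, secure) {
  const scheme = secure ? "asvs" : "asv";
  const cleanPath = String(path).replace(/^\/+/, ""); // avoid a doubled slash
  return `${scheme}://${container}@${account}.blob.core.windows.net/${cleanPath}`;
}

// Render the core-site.xml property, substituting the account name into the
// slide's key pattern. No XML escaping is done; values are assumed to be plain text.
function storageKeyProperty(accountName, keyValue) {
  return [
    "<property>",
    `  <name>fs.azure.account.key.${accountName}</name>`,
    `  <value>${keyValue}</value>`,
    "</property>",
  ].join("\n");
}

// Hypothetical container, account and path.
console.log(asvUri("logs", "exampleaccount", "/AllSessions/part-0.gz", false));
console.log(storageKeyProperty("exampleaccount", "enterthekeyvaluehere"));
```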

Page 26

Map Reduce

Examples in C#

Page 27

Map/Reduce

Map/Reduce is a programming model for efficient distributed computing

Input > Map > Shuffle & Sort > Reduce > Output

Efficiency from streaming through data, reducing seeks

A good fit for a lot of applications: log processing, web index building, data mining and machine learning
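The Input > Map > Shuffle & Sort > Reduce > Output pipeline can be written out stage by stage. The sketch below does so in memory for a small log-processing case; the log line format and every function name are assumptions for illustration, and a real cluster spreads each stage across machines.

```javascript
// Map: each input record yields zero or more [key, value] pairs.
function mapPhase(records, mapper) {
  return records.flatMap((record) => mapper(record));
}

// Shuffle & Sort: group values by key, then order the groups by key.
function shuffleAndSort(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Reduce: collapse each key's values into a single output record.
function reducePhase(grouped, reducer) {
  return grouped.map(([key, values]) => [key, reducer(key, values)]);
}

// Input: hypothetical web log lines in the form "METHOD PATH STATUS".
const input = [
  "GET /index.html 200",
  "GET /missing 404",
  "POST /login 200",
  "GET /about 200",
];

// Count requests per HTTP status code.
const statusMapper = (line) => [[line.split(" ")[2], 1]];
const countReducer = (_key, values) => values.reduce((a, b) => a + b, 0);

const output = reducePhase(shuffleAndSort(mapPhase(input, statusMapper)), countReducer);
console.log(JSON.stringify(output)); // [["200",3],["404",1]]
```

Each stage only needs a sequential pass over its input, which is where the "streaming through data, reducing seeks" efficiency comes from.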

Page 30

Hadoop SDK

C# integration

Remote Data & Jobs

Hive in C#

Serialization

Page 31

http://hadoopsdk.codeplex.com

Page 32

Jobs

using Microsoft.Hadoop.MapReduce;

// Ties the mapper and reducer together and declares where the job reads and writes.
public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}

Page 33

Mapper

using Microsoft.Hadoop.MapReduce;

// Emits ("FR", "1") for every input line that belongs to a French session.
public class FrenchSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}

Page 34

Reducer

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Counts the values emitted for each key; the count is emitted as a string.
public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
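The data flow of the C# job, mapper and reducer can be traced in a few lines of JavaScript. This is a dry run of the logic only: the sample session lines are hypothetical (the real input format is not shown on the slides), and the `counters` and `emitted` variables merely stand in for the SDK's context object.

```javascript
// Hypothetical session log lines.
const sessions = [
  "SessionId=1;Country=France;Duration=30",
  "SessionId=2;Country=Germany;Duration=12",
  "SessionId=3;Country=France;Duration=45",
];

const counters = {}; // stands in for context.IncrementCounter
const emitted = [];  // stands in for context.EmitKeyValue in the mapper

// Mirrors FrenchSessionsMapper.Map: keep only French sessions.
function mapLine(inputLine) {
  if (inputLine.includes("Country=France")) {
    counters.FrenchSession = (counters.FrenchSession || 0) + 1;
    emitted.push(["FR", "1"]);
  }
}

// Mirrors SessionsReducer.Reduce: counts the values instead of parsing them.
function reduceKey(key, values) {
  return [key, String(values.length)];
}

sessions.forEach(mapLine);
const frValues = emitted.filter(([key]) => key === "FR").map(([, value]) => value);
const result = reduceKey("FR", frValues);
console.log(result, counters); // [ 'FR', '2' ] { FrenchSession: 2 }
```

Because the reducer counts values rather than summing them, the mapper could emit any placeholder value; "1" is just the conventional choice.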

Page 35

Navigating the HDInsight portal Demo

Page 36

C# and Map/Reduce Demo

Page 37

https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf

Page 38

Questions?