software systems development map-reduce, hadoop, hbase

Upload: randolph-walsh

Post on 27-Dec-2015

Page 1: SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase

SOFTWARE SYSTEMS DEVELOPMENT

MAP-REDUCE , Hadoop, HBase

The problem

Batch (offline) processing of huge data sets on commodity hardware

Linear scalability

Infrastructure is needed to handle all the mechanics, so that the developer can focus on the processing logic and algorithms

Data Sets

The New York Stock Exchange: 1 Terabyte of data per day

Facebook: 100 billion photos, 1 petabyte (1,000 terabytes)

Internet Archive: 2 petabytes of data, growing by 20 terabytes per month

The data cannot fit on a single node; a distributed file system is needed to hold it

Batch processing

Single write/append, multiple reads
  e.g. analyzing log files for the most frequent URL

Each data entry is self-contained
  At each step, each data entry can be treated individually
  After the aggregation, each aggregated data set can be treated individually

Grid Computing

Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)

Works well for computation-intensive tasks, but has problems with huge data sets because the network becomes a bottleneck

Programming paradigm: low-level Message Passing Interface (MPI)

Hadoop

Open-source implementation of two key ideas
  HDFS: Hadoop Distributed File System
  Map-Reduce: programming model

Built on ideas from Google's infrastructure (GFS and Map-Reduce papers, published in 2003 and 2004)

Java/Python/C interfaces; several projects are built on top of it

Approach

A limited but simple model that fits a broad range of applications

Communications, redundancy, and scheduling are handled in the infrastructure

Move computation to data instead of moving data to computation

Who is using Hadoop?

Distributed File System (HDFS)

Files are split into large blocks (128 MB or 64 MB)
  Compare with a typical file system block of 512 bytes

Blocks are replicated among Data Nodes (DN)
  3 copies by default

The Name Node (NN) keeps track of files and their pieces
  Single master node

Stream-based I/O
  Sequential access
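The block and replica bookkeeping described on this slide can be sketched as follows. This is a minimal illustration, not HDFS code: the function names, the data node names, and the round-robin placement are assumptions made for the example (real HDFS placement also considers node distance).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the block sizes quoted on the slide
REPLICATION = 3                 # default number of copies


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: block index -> list of distinct data nodes.

    This only illustrates the kind of map the Name Node keeps; it is not
    the placement policy HDFS actually uses.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + i) % len(data_nodes)]
                for i in range(replication)]
        for block in range(num_blocks)
    }


# A hypothetical 300 MB file stored on four hypothetical data nodes.
blocks = split_into_blocks(300 * 1024 * 1024)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for index, (offset, length) in enumerate(blocks):
    print(index, offset, length, block_map[index])
```

The last block is shorter than the others because the file size is not a multiple of the block size.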

HDFS: File Read

HDFS: File Write

HDFS: Data Node Distance

Map Reduce

A Programming Model

Decompose a processing job into Map and Reduce stages

The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

Map-Reduce Model

MAP function

Map each data entry into a <key, value> pair

Examples
  Map each log file entry into <URL, 1>
  Map each day's stock trading record into <STOCK, Price>
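A minimal sketch of the two example map functions in plain Python (not the Hadoop API). The record layouts are assumptions made for the illustration: the log format puts the URL in the last whitespace-separated field, and the trading record is a `symbol,price` line.

```python
def map_log_entry(line):
    """Map one access-log line to a (URL, 1) pair.

    Assumes a hypothetical log format whose last field is the URL.
    """
    url = line.split()[-1]
    return (url, 1)


def map_trade_record(line):
    """Map one daily trading record 'symbol,price' to a (STOCK, price) pair."""
    symbol, price = line.strip().split(",")
    return (symbol, float(price))


# Hypothetical sample input.
log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.3 GET /index.html",
]
pairs = [map_log_entry(line) for line in log_lines]
print(pairs)
```

Each call looks at a single entry only, which is what lets the framework run the map stage on many nodes in parallel.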

Hadoop: Shuffle/Merge phase

Hadoop merges (shuffles) the output of the MAP stage into <key, value1, value2, value3>

Examples
  <URL, 1, 1, 1, 1, 1, 1>
  <STOCK, Price on day 1, Price on day 2, ...>
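The grouping performed in this phase can be imitated in memory with a dictionary. This is only a sketch of the idea: in Hadoop the shuffle is distributed across nodes, and the function name and sample data here are hypothetical.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, returning {key: [values]} ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


# Hypothetical output of the MAP stage for a small log file.
map_output = [("/a", 1), ("/b", 1), ("/a", 1), ("/a", 1)]
grouped = shuffle(map_output)
print(grouped)
```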

Reduce function

Reduce the entries produced by the Hadoop merge phase into a <key, value> pair

Examples
  Reduce <URL, 1, 1, 1> into <URL, 3>
  Reduce <Stock, 3, 2, 10> into <Stock, 10>
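The two examples correspond to a sum and a maximum over the grouped values. A minimal sketch in plain Python, with hypothetical function names:

```python
def reduce_count(key, values):
    """Sum the values for a key, e.g. the number of hits for a URL."""
    return (key, sum(values))


def reduce_max(key, values):
    """Keep the largest value for a key, e.g. the highest price of a stock."""
    return (key, max(values))


print(reduce_count("URL", [1, 1, 1]))
print(reduce_max("Stock", [3, 2, 10]))
```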

Map-Reduce Flow

Hadoop Infrastructure

Replicate/distribute data among the nodes
  Input
  Output
  Map/Shuffle output

Schedule processing
  Partition data
  Assign processing nodes (PN)
  Move code to the PN (e.g. send Map/Reduce code)
  Manage failures (block CRC, rerun Map/Reduce if necessary)
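One way to picture the "block CRC" failure handling is a reader that verifies a checksum and falls back to another replica when a copy is corrupted. The sketch below uses Python's `zlib.crc32` as a stand-in; the slide does not say which checksum variant or granularity Hadoop uses, and the node names and data are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC-32 of a block; a stand-in for whatever checksum the file system stores."""
    return zlib.crc32(data)


def read_block(replicas, expected_crc):
    """Return (node, data) for the first replica whose CRC matches.

    Corrupted replicas are skipped, mirroring the idea that a failed
    read is retried against another copy.
    """
    for node, data in replicas:
        if checksum(data) == expected_crc:
            return node, data
    raise IOError("all replicas are corrupted")


good = b"IBM,100.0,104.5\n"
expected = checksum(good)
# The copy on dn1 has one corrupted byte (letter O instead of digit 0).
replicas = [("dn1", b"IBM,100.0,1O4.5\n"), ("dn2", good), ("dn3", good)]
node, data = read_block(replicas, expected)
print(node)
```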

Example: Trading Data Processing

Input: historical stock data
  Records are in a CSV (comma-separated values) text file
  Each line: stock_symbol, low_price, high_price
  1987-2009 data for all stocks, one record per stock per day

Output: maximum interday delta for each stock
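The whole job can be simulated in memory as a sketch. It is plain Python rather than Hadoop code, and it rests on two assumptions: the delta of a record is taken to be high_price minus low_price, and the sample rows are invented for the illustration.

```python
import csv
from collections import defaultdict
from io import StringIO


def map_record(row):
    """Map one CSV row to (stock_symbol, high_price - low_price)."""
    symbol, low, high = row
    return (symbol.strip(), float(high) - float(low))


def reduce_max_delta(symbol, deltas):
    """Keep the largest delta seen for one stock."""
    return (symbol, max(deltas))


def run_job(csv_text):
    """Run map, shuffle (group by key), and reduce over CSV text."""
    grouped = defaultdict(list)
    for row in csv.reader(StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        key, value = map_record(row)
        grouped[key].append(value)
    return dict(reduce_max_delta(k, v) for k, v in sorted(grouped.items()))


# Hypothetical sample data: stock_symbol, low_price, high_price
sample = """IBM,100.0,104.5
AAPL,20.0,21.0
IBM,101.0,103.0
AAPL,19.5,22.5
"""
result = run_job(sample)
print(result)
```

In a real job the grouping step would be carried out by Hadoop between the map and reduce stages; only `map_record` and `reduce_max_delta` correspond to code the developer supplies.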

Map Function: Part I

Map Function: Part II

Reduce Function

Running the Job : Part I

Running the Job: Part II

Inside Hadoop

Datastore: HBASE

Distributed Column-Oriented database on top of HDFS

Modeled after Google’s BigTable data store

Random reads/writes on top of sequential, stream-oriented HDFS

Billions of Rows * Millions of Columns * Thousands of Versions

HBASE: Logical View

Row Key          Time Stamp   Column “contents”   Column Family “anchor” (referred by/to)   Column “mime”
“com.cnn.www”    T9                               cnnsi.com    cnn.com/1
                 T8                               my.look.ca   cnn.com/2
                 T6           “<html>..”                                                     text/html
                 T5           “<html>..”
                 T3           “<html>..”

Physical View

Row Key        Time Stamp   Column: contents
com.cnn.www    T6           “<html>..”
               T5           “<html>..”
               T3           “<html>..”

Row Key        Time Stamp   Column Family: anchor
com.cnn.www    T9           cnnsi.com    cnn.com/1
               T8           my.look.ca   cnn.com/2

Row Key        Time Stamp   Column: mime
com.cnn.www    T6           text/html
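The step from the logical view to the physical view amounts to grouping cells by column family and storing only the cells that exist. A minimal sketch, assuming each anchor entry is a qualifier (cnnsi.com) with a value (cnn.com/1), integer timestamps in place of T3..T9, and my.look.ca at T8; the function name is hypothetical.

```python
from collections import defaultdict

# Logical cells: (row key, column family, qualifier, timestamp, value)
cells = [
    ("com.cnn.www", "anchor", "cnnsi.com", 9, "cnn.com/1"),
    ("com.cnn.www", "anchor", "my.look.ca", 8, "cnn.com/2"),
    ("com.cnn.www", "contents", "", 6, "<html>.."),
    ("com.cnn.www", "mime", "", 6, "text/html"),
    ("com.cnn.www", "contents", "", 5, "<html>.."),
    ("com.cnn.www", "contents", "", 3, "<html>.."),
]


def to_physical(cells):
    """Group cells into one store per column family.

    Within a store, cells are ordered by row key, then qualifier,
    then newest timestamp first. Empty cells are simply absent.
    """
    stores = defaultdict(list)
    for row, family, qualifier, timestamp, value in cells:
        stores[family].append((row, qualifier, timestamp, value))
    for family in stores:
        stores[family].sort(key=lambda cell: (cell[0], cell[1], -cell[2]))
    return dict(stores)


physical = to_physical(cells)
for family in sorted(physical):
    print(family, physical[family])
```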

HBASE: Region Servers

Tables are split into horizontal regions
  Each region comprises a subset of rows

HDFS: NameNode, DataNode

MapReduce: JobTracker, TaskTracker

HBase: Master Server, Region Server
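Because each region covers a contiguous range of row keys, finding the region for a row reduces to a search over the region start keys. The sketch below assumes hypothetical start keys and region server names; it is not how an HBase client is implemented.

```python
import bisect

# Hypothetical regions: region i covers keys from REGION_START_KEYS[i]
# up to (but not including) the next start key.
REGION_START_KEYS = ["", "com.f", "com.p", "org."]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region(row_key):
    """Return (region index, region server) for the region holding row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


print(locate_region("com.cnn.www"))
print(locate_region("com.google.www"))
print(locate_region("org.apache"))
```

One region server can hold several regions, which is why `rs1` appears twice in the hypothetical assignment.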

HBASE Architecture

HBASE vs RDBMS

HBase tables are similar to RDBMS tables, with some differences
  Rows are sorted by row key
  Only cells are versioned
  Columns can be added on the fly by the client, as long as the column family they belong to already exists
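These three differences can be demonstrated with a toy in-memory table. The class below is a hypothetical illustration, not the HBase client API; the `family:qualifier` column naming and integer timestamps are assumptions made for the sketch.

```python
class MiniTable:
    """Toy table with sorted rows, versioned cells, and fixed column families."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> {"family:qualifier": {timestamp: value}}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; the column family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before timestamp (newest overall if None)."""
        versions = self.rows[row][column]
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)]

    def scan(self):
        """Return row keys in sorted order."""
        return sorted(self.rows)


table = MiniTable(["contents", "anchor"])
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 6, "<html>v6")
# A new column inside an existing family needs no schema change.
table.put("com.bbc.www", "anchor:cnnsi.com", 9, "cnn.com/1")
print(table.scan())
print(table.get("com.cnn.www", "contents:"))
print(table.get("com.cnn.www", "contents:", timestamp=5))
```

Writing to a column such as `mime:` raises `KeyError` in this sketch, because the `mime` family was not declared when the table was created.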