
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke

Principles of Data Management

Lecture #16 (MapReduce & DFS for Big Data)

Instructor: Mike Carey


Today’s News Bulletin

❖ Project dates
  ▪ Query execution layer is due on 3/17
❖ Upcoming lectures
  ▪ Today: MapReduce (and distributed file systems)
  ▪ Next week: Wrap-up & review, in-class endterm
❖ Other upcoming events
  ▪ The long-lost midterms will appear on Tuesday!
❖ Class participation opportunities
  ▪ Teaching evaluations (after Tuesday :-)
  ▪ End-of-term opinion survey (watch for it!)


Motivation

❖ Google needed to process web-scale data
  ▪ Data much larger than what fits on one machine
  ▪ Needed parallel processing to get results in a reasonable time
  ▪ Wanted to use cheap commodity machines to do the job
❖ Credits: Some of the following slide content is excerpted from the Google OSDI 2004 talk where MapReduce was publicly born and/or Google’s SOSP 2003 talk on the underlying DFS.


Requirements

❖ Solution must
  ▪ Scale to 1000s of compute nodes
  ▪ Automatically handle faults
  ▪ Provide monitoring of jobs
  ▪ Be easy for programmers to use


MapReduce Programming model

❖ Input and output are sets of key/value pairs
❖ Programmer provides two functions
  ▪ map(K1, V1) -> list(K2, V2)
    • Produces a list of intermediate key/value pairs for each input key/value pair
  ▪ reduce(K2, list(V2)) -> list(K3, V3)
    • Produces a list of result values for all intermediate values that are associated with the same intermediate key
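The two signatures above can be exercised with a tiny single-process driver. This is only a minimal sketch under stated assumptions, not how Google's or Hadoop's runtime is built: `run_mapreduce`, `max_map`, `max_reduce`, and the sample readings are hypothetical names and data introduced purely for illustration.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each (k1, v1), group values by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):      # map: (K1, V1) -> list(K2, V2)
            groups[k2].append(v2)
    results = []
    for k2 in sorted(groups):              # one reduce call per intermediate key
        results.extend(reduce_fn(k2, groups[k2]))
    return results


# Hypothetical job: maximum reading per city.
def max_map(station, reading):
    city, temperature = reading
    yield city, temperature


def max_reduce(city, temperatures):
    yield city, max(temperatures)


readings = [("s1", ("Irvine", 21)), ("s2", ("Irvine", 25)), ("s3", ("Oslo", 4))]
result = run_mapreduce(readings, max_map, max_reduce)
print(result)
```

The programmer writes only `max_map` and `max_reduce`; grouping by intermediate key is the framework's job, which is the division of labor the model is built around.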


MapReduce Pipeline

[Figure: Read from DFS → Map → Shuffle → Reduce → Write to DFS]


MapReduce in Action

Map: (k1, v1) → list(k2, v2)
  • Processes one input key/value pair
  • Produces a set of intermediate key/value pairs

Reduce: (k2, list(v2)) → list(k3, v3)
  • Combines intermediate values for one particular key
  • Produces a set of merged output values (usually one)


MapReduce Architecture

[Figure labels: MapReduce (×4), Distributed File System, Network, MapReduce Job Tracker]


Software Components

❖ Job Tracker (Master)
  ▪ Maintains cluster membership of workers
  ▪ Accepts MR jobs from clients and dispatches tasks to workers
  ▪ Monitors workers’ progress
  ▪ Restarts tasks in the event of failure
❖ Task Tracker (Worker)
  ▪ Provides an environment to run a task
  ▪ Maintains and serves intermediate files between Map and Reduce phases


MapReduce Parallelism

Hash Partitioning
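Hash partitioning decides which reduce task receives each intermediate key, so that all values for one key meet at the same reducer. The sketch below assumes a partitioner of the form hash(key) mod R; `partition` and `shuffle` are hypothetical helper names, and `zlib.crc32` is used only because Python's built-in `hash()` on strings varies between runs.

```python
import zlib


def partition(key, num_reducers):
    """Choose a reduce task for a key: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route every (key, value) pair to the bucket of its reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("this", 1), ("is", 1), ("a", 1), ("line", 1), ("this", 1)]
buckets = shuffle(pairs, 2)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, bucket)
```

Because the partition depends only on the key, every map task makes the same routing decision independently, with no coordination between workers.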


Example 1: Count Word Occurrences

❖ Input: Set of (Document name, Document contents)
❖ Output: Set of (Word, Count(Word))
❖ map(k1, v1):
    for each word w in v1
      emit(w, 1)
❖ reduce(k2, v2_list):
    int result = 0;
    for each v in v2_list
      result += v;
    emit(k2, result)
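The pseudocode translates almost line for line into runnable Python. This is a single-process sketch: the grouping loop inside `word_count` stands in for the shuffle, the document names `d1` to `d4` are made up, and the four text lines are the sample lines used in the lecture's illustration.

```python
from collections import defaultdict


def wc_map(doc_name, contents):
    """map(k1, v1): emit (w, 1) for each word w in the document contents."""
    for word in contents.split():
        yield word, 1


def wc_reduce(word, counts):
    """reduce(k2, v2_list): sum the list of ones for a single word."""
    result = 0
    for v in counts:
        result += v
    yield word, result


def word_count(documents):
    groups = defaultdict(list)             # stand-in for the shuffle phase
    for name, contents in documents:
        for word, one in wc_map(name, contents):
            groups[word].append(one)
    return {w: c for word in groups for w, c in wc_reduce(word, groups[word])}


docs = [
    ("d1", "this is a line"),
    ("d2", "this is another line"),
    ("d3", "another line"),
    ("d4", "yet another line"),
]
counts = word_count(docs)
print(counts)
```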


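Example 1: Count Word Occurrences

[Figure: word-count data flow through two Map tasks and two Reduce tasks]

Input lines → Map output
  Map 1: "this is a line", "this is another line" → this, 1; is, 1; a, 1; line, 1; this, 1; is, 1; another, 1; line, 1
  Map 2: "another line", "yet another line" → another, 1; line, 1; yet, 1; another, 1; line, 1
Shuffle (Map output partitioned and sorted by key)
  Map 1 → Reduce 1: a, 1; another, 1; is, 1; is, 1
  Map 1 → Reduce 2: line, 1; line, 1; this, 1; this, 1
  Map 2 → Reduce 1: another, 1; another, 1
  Map 2 → Reduce 2: line, 1; line, 1; yet, 1
Reduce output
  Reduce 1: a, 1; another, 3; is, 2
  Reduce 2: line, 4; this, 2; yet, 1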


(Picture borrowed from Shiv Babu @ Duke University)

MapReduce Pipeline Revisited


Example 2: Equijoins

❖ Input: Rows of Relation R, Rows of Relation S
❖ Output: R join S on R.x = S.y
❖ map(k1, v1):
    if (input == R)
      emit(v1.x, ["R", v1])
    else
      emit(v1.y, ["S", v1])
❖ reduce(k2, v2_list):
    for r in v2_list where r[1] == "R"
      for s in v2_list where s[1] == "S"
        emit(1, result(r[2], s[2]))
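The same reduce-side join can be sketched in Python. Everything here beyond the slide's logic is an assumption made for illustration: rows are modeled as dictionaries, the relation name is passed to the map function explicitly, and the output row format (column names prefixed with `R.` or `S.`) is invented.

```python
from collections import defaultdict


def join_map(relation_name, row):
    """Tag each row with its source relation and key it by the join column."""
    if relation_name == "R":
        yield row["x"], ("R", row)
    else:
        yield row["y"], ("S", row)


def join_reduce(key, tagged_rows):
    """Pair every R row with every S row that shares this join key."""
    r_rows = [row for tag, row in tagged_rows if tag == "R"]
    s_rows = [row for tag, row in tagged_rows if tag == "S"]
    for r in r_rows:
        for s in s_rows:
            merged = {f"R.{k}": v for k, v in r.items()}
            merged.update({f"S.{k}": v for k, v in s.items()})
            yield merged


def equijoin(r_rows, s_rows):
    groups = defaultdict(list)             # stand-in for the shuffle phase
    for name, rows in (("R", r_rows), ("S", s_rows)):
        for row in rows:
            for key, tagged in join_map(name, row):
                groups[key].append(tagged)
    out = []
    for key in sorted(groups):
        out.extend(join_reduce(key, groups[key]))
    return out


R = [{"x": 1, "a": "r1"}, {"x": 2, "a": "r2"}]
S = [{"y": 1, "b": "s1"}, {"y": 1, "b": "s2"}, {"y": 3, "b": "s3"}]
joined = equijoin(R, S)
for row in joined:
    print(row)
```

Keys that appear in only one relation (2 and 3 in this sample) reach a reducer but produce no output, which is the inner-join behavior the pseudocode describes.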


Other Examples

❖ Distributed grep
❖ Inverted index construction
❖ Machine learning
❖ Distributed sort
❖ Fuzzy join
❖ …
❖ Or: A Pig script or a Hive query (which are then auto-converted to a Hadoop MapReduce job series under the covers) – e.g., at Netflix


Fault Tolerant Evaluation

❖ Task fault tolerance is achieved through re-execution (roll forward, not back!)
❖ All consumers consume data only after it has been completely generated by the producer
  ▪ This is an important property to isolate faults to one task
❖ Task completion committed through Master
❖ Cannot handle master failure
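Re-execution works because a task's partial output is never visible to consumers, so a failed attempt can simply be thrown away and rerun. The sketch below illustrates that idea only; `run_with_reexecution`, `flaky_task`, and the simulated crash are hypothetical, and a real master would reschedule the task on a different worker.

```python
def run_with_reexecution(task, max_attempts=3):
    """Rerun a deterministic task from scratch until one attempt completes."""
    for attempt in range(1, max_attempts + 1):
        try:
            output = task(attempt)         # output stays private while running
        except RuntimeError:
            continue                       # discard partial work, roll forward
        return attempt, output             # only complete output is committed
    raise RuntimeError("task failed on every attempt")


def flaky_task(attempt):
    """Hypothetical task that crashes partway through its first attempt."""
    partial = []
    for i in range(5):
        if attempt == 1 and i == 3:
            raise RuntimeError("simulated worker crash")
        partial.append(i * i)
    return partial


attempts_used, committed = run_with_reexecution(flaky_task)
print(attempts_used, committed)
```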


Task granularity and pipelining

❖ Fine granularity tasks
  ▪ Many more map tasks than machines
    • Minimizes time for fault recovery
    • Pipelines shuffling with map execution
    • Better load balancing


Optimization: Combiners

❖ Sometimes partial aggregation is possible on the Map side
❖ May cut down the amount of data needing to be transferred to the reducer (significantly in some cases, like grouped aggregation in Hive)
❖ combine(K2, list(V2)) -> K2, list(V2)
❖ For Word Occurrence Count example, Combine == Reduce (Q: Why?)
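A small experiment makes the saving concrete. This sketch assumes two map tasks that each process two of the lecture's sample lines (the split is an assumption), and it counts how many key/value pairs would cross the network with and without a combiner; the helper names are hypothetical.

```python
from collections import Counter


def map_words(line):
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate one map task's output: one (word, partial count) per word."""
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    return list(partial.items())


def reduce_counts(pairs):
    total = Counter()
    for word, n in pairs:
        total[word] += n
    return dict(total)


map_task_inputs = [
    ["this is a line", "this is another line"],   # assumed input of map task 1
    ["another line", "yet another line"],         # assumed input of map task 2
]
raw = [[p for line in task for p in map_words(line)] for task in map_task_inputs]
combined = [combine(task_output) for task_output in raw]

shuffled_without = sum(len(t) for t in raw)
shuffled_with = sum(len(t) for t in combined)
final_without = reduce_counts([p for t in raw for p in t])
final_with = reduce_counts([p for t in combined for p in t])
print(shuffled_without, shuffled_with, final_with == final_without)
```

On this sample the combiner cuts the shuffled pairs from 13 to 8 while the final counts stay identical. Summation can be applied to partial groups in any order, which is a useful hint for the question posed above.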


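Example 1: Word Count Revisited (With Combiners)

[Figure: word-count data flow with a combiner applied to each Map task's output]

Input lines → Map output
  Map 1: "this is a line", "this is another line" → this, 1; is, 1; a, 1; line, 1; this, 1; is, 1; another, 1; line, 1
  Map 2: "another line", "yet another line" → another, 1; line, 1; yet, 1; another, 1; line, 1
Intermediate results (after combining)
  Map 1 → Reduce 1: a, 1; another, 1; is, 2
  Map 1 → Reduce 2: line, 2; this, 2
  Map 2 → Reduce 1: another, 2
  Map 2 → Reduce 2: line, 2; yet, 1
Reduce output
  Reduce 1: a, 1; another, 3; is, 2
  Reduce 2: line, 4; this, 2; yet, 1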


Optimization: Redundant Execution

❖ Slow workers lengthen completion time
❖ Slowness happens because of
  ▪ Other jobs consuming resources
  ▪ Bad disks/network etc.
❖ Solution: Near the end of the job, spawn extra copies of long-running tasks
  ▪ Whichever copy finishes first wins
  ▪ Kill the rest
❖ In Hadoop this is called “speculative execution”
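The "first copy wins, kill the rest" rule can be imitated with threads. This is a rough single-machine sketch, not Hadoop's scheduler: `task_copy` and `run_with_backup` are hypothetical, the work is simulated by waiting, and since Python threads cannot be killed, a shared `threading.Event` asks the losing copies to stop.

```python
import concurrent.futures as cf
import threading


def task_copy(copy_name, seconds, cancelled):
    """Simulate `seconds` of work; give up early if another copy already won."""
    if cancelled.wait(timeout=seconds):    # True means the event was set first
        return None
    return copy_name


def run_with_backup(copies):
    """Run all copies of one task concurrently and keep the first result."""
    cancelled = threading.Event()
    with cf.ThreadPoolExecutor(max_workers=len(copies)) as pool:
        futures = [pool.submit(task_copy, name, secs, cancelled)
                   for name, secs in copies]
        done, _pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        cancelled.set()                    # tell the remaining copies to stop
        return next(iter(done)).result()


winner = run_with_backup([("original on slow worker", 2.0), ("backup copy", 0.05)])
print(winner)
```

The job finishes as soon as the fast backup does, instead of waiting for the straggler.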


Optimization: Locality

❖ Task scheduling policy
  ▪ Ask DFS (next topic!) for locations of replicas of input file blocks
  ▪ Map tasks scheduled so that input blocks are machine local or rack local
❖ Effect: Tasks read data at local disk speeds
❖ Without this, rack switches limit data rate
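The scheduling preference can be written as a three-step fallback. The sketch assumes the scheduler knows which hosts hold replicas of the input block and which rack each host is in; `pick_worker`, the host names, and the rack map are all hypothetical.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Prefer a machine-local worker, then a rack-local one, then any idle one."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker, "machine-local"
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack-local"
    return idle_workers[0], "remote"


rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackC"}
replicas = ["n1", "n3"]                    # hosts holding the input block

print(pick_worker(replicas, ["n4", "n2", "n3"], rack_of))
print(pick_worker(replicas, ["n4", "n2"], rack_of))
print(pick_worker(replicas, ["n4"], rack_of))
```

Only the last case forces the input block to cross a rack switch.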


Distributed (Big!) Filesystem

❖ Used as the “store” for MapReduce data
❖ MapReduce reads its input from DFS and writes its output to DFS
❖ Provides a “shared disk” view to applications using local storage on shared-nothing hardware
❖ Provides redundancy by replication to protect from node/disk failures


DFS Architecture

Taken from Ghemawat’s SOSP ’03 paper (The Google File System)

• Single Master (with backups) that tracks DFS file name to chunk mapping
• Several chunk servers that store chunks on local disks
• Chunk size ~ 64 MB or larger
• Chunks are replicated
• Master only used for chunk lookups – does not participate in transfer of data
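The master's role in a read reduces to a metadata lookup: translate a file name and byte offset into a chunk and the servers that hold it. The sketch below assumes a fixed 64 MB chunk size; the `Master` class, the file path, and the chunk and server names are hypothetical and do not reflect the real GFS or HDFS interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024              # 64 MB, as on the slide


class Master:
    """Holds metadata only; file data never passes through the master."""

    def __init__(self):
        self.chunks_of = {}                # file name -> ordered chunk ids
        self.locations = {}                # chunk id -> chunk servers with a replica

    def lookup(self, file_name, byte_offset):
        chunk_index = byte_offset // CHUNK_SIZE
        chunk_id = self.chunks_of[file_name][chunk_index]
        return chunk_id, self.locations[chunk_id]


master = Master()
master.chunks_of["/logs/web.log"] = ["c0", "c1", "c2"]
master.locations = {
    "c0": ["cs1", "cs2", "cs3"],
    "c1": ["cs2", "cs4", "cs5"],
    "c2": ["cs1", "cs3", "cs5"],
}

# A client wanting byte 100 MB asks the master once, then reads the data
# directly from one of the returned chunk servers.
chunk_id, servers = master.lookup("/logs/web.log", 100 * 1024 * 1024)
print(chunk_id, servers)
```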


Chunk Replication

❖ Several replicas of each chunk
  ▪ Replicas usually spread across racks and data centers to maximize availability
  ▪ 3 replicas common (local, same rack, different rack)
❖ Master tracks location of each replica of a chunk
❖ When chunk failure is detected, master automatically rebuilds new replica to maintain replication level
❖ Automatically picks chunk servers for new replicas based on utilization
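Re-replication after a failure can be sketched as "drop the dead server, then top each chunk back up using the least-utilized servers". This is a simplified illustration with hypothetical names and numbers: it ignores rack placement and does not update utilization as new replicas are assigned.

```python
def re_replicate(chunk_locations, utilization, failed_server, target=3):
    """Return new replica lists that restore `target` replicas per chunk."""
    repaired = {}
    for chunk, servers in chunk_locations.items():
        live = [s for s in servers if s != failed_server]
        # Candidate servers: alive, not already holding this chunk,
        # ordered from least to most utilized (name breaks ties).
        candidates = sorted(
            (s for s in utilization if s != failed_server and s not in live),
            key=lambda s: (utilization[s], s),
        )
        while len(live) < target and candidates:
            live.append(candidates.pop(0))
        repaired[chunk] = live
    return repaired


locations = {"c0": ["cs1", "cs2", "cs3"], "c1": ["cs2", "cs3", "cs4"]}
utilization = {"cs1": 0.7, "cs2": 0.5, "cs3": 0.9, "cs4": 0.2, "cs5": 0.4}

repaired = re_replicate(locations, utilization, failed_server="cs2")
print(repaired)
```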


MapReduce & DFS: Summary

❖ Google laid a foundation for a new flurry of large-scale data storage and processing with their MR and DFS work in the early 2000s
❖ Apache open source versions soon sprang up outside of Google: Hadoop MapReduce & HDFS
❖ Today, Big Data use cases are addressed with a mix of parallel RDBMS technologies as well as more “flexible” Hadoop-based technologies
❖ So where are we now…?


(Pig)

Today’s Tangled World




Additional Reading

❖ Original MapReduce Paper (*** MUST READ! ***)
  ▪ “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat in OSDI ’04
❖ Original DFS Paper
  ▪ “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung in SOSP ’03
❖ MapReduce vs. Parallel DBMS Papers in CACM (Jan. 2010)
  ▪ “MapReduce and Parallel DBMSs: Friends or Foes?” by Michael Stonebraker, Daniel Abadi, David DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin
  ▪ “MapReduce: A Flexible Data Processing Tool” by Jeffrey Dean and Sanjay Ghemawat
❖ EDBT “Ogres & Onions Keynote” Paper
  ▪ “Inside "Big Data Management": Ogres, Onions, or Parfaits?” by Vinayak Borkar, Michael J. Carey, and Chen Li in EDBT '12 (or watch the movie)
