Data Management Using MapReduce
M. Tamer Ozsu
University of Waterloo
CS742-Distributed & Parallel DBMS M. Tamer Ozsu 1 / 24
Basics
- For data analysis of very large data sets
  - Highly dynamic, irregular, schemaless, etc.
  - SQL too heavy
- "Embarrassingly parallel problems"
- New, simple parallel programming model
- Data structured as (key, value) pairs
  - E.g., (doc-id, content), (word, count), etc.
- Functional programming style with two functions to be given:
  - Map(k1, v1) → list(k2, v2)
  - Reduce(k2, list(v2)) → list(v3)
- Implemented on a distributed file system (e.g., Google File System) on very large clusters
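The two-function contract above can be illustrated with the canonical word-count job. The sketch below is mine, not part of any MapReduce library: `run_mapreduce` is a sequential stand-in for the distributed map → group-by-key → reduce pipeline.

```python
from collections import defaultdict

def map_fn(doc_id, content):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word.
    return [(word, 1) for word in content.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the partial counts for one word.
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Sequential simulation of the map -> group-by-key -> reduce pipeline.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    out = []
    for k2, values in groups.items():
        out.extend(reduce_fn(k2, values))
    return sorted(out)

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn)
# result == [("a", 2), ("b", 2), ("c", 1)]
```

In the real system the grouping step is the distributed shuffle; here it is a single in-memory dictionary.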
Map Function
- User-defined function
  - Processes input key/value pairs
  - Produces a set of intermediate key/value pairs
- Map function I/O
  - Input: reads a chunk from the distributed file system (DFS)
  - Output: writes to an intermediate file on local disk
- MapReduce library
  - Executes the map function
  - Groups together all intermediate values with the same key (i.e., generates a set of lists)
  - Passes these lists to the reduce functions
- Effect of the map function
  - Processes and partitions the input data
  - Builds a distributed map (transparent to the user)
  - Similar to the "group by" operation in SQL
Reduce Function
- User-defined function
  - Accepts one intermediate key and a set of values for that key (i.e., a list)
  - Merges these values together to form a (possibly) smaller set
  - Typically, zero or one output value is generated per invocation
- Reduce function I/O
  - Input: read from intermediate files using remote reads on the local files of the corresponding mapper nodes
  - Output: each reducer writes its output as a file back to DFS
- Effect of the reduce function
  - Similar to the aggregation operation in SQL
MapReduce Processing
[Figure: the input data set flows through multiple Map tasks, each emitting (k, v) pairs; the pairs are grouped by key, producing (k, list(v)) lists such as (k1, (v, v, v)) that are fed to Reduce tasks, whose results form the output data set.]
Example 1
Assume you are reading the monthly average temperatures for each of the 12 months of a year for a number of cities, i.e., each input is the name of the city (key) and the average monthly temperature. Compute the average annual temperature for each city.

- Map:
  - Input: ⟨City, Month, MonthAvgTemp⟩
  1. Create key/value pairs
  - Output: ⟨City, MonthAvgTemp⟩
- Reduce:
  - Input: ⟨City, MonthAvgTemp⟩
  1. Sort to get ⟨City, list(MonthAvgTemp)⟩ (i.e., combine the monthly average temperatures for a given city into a list)
  2. Compute the average over list(MonthAvgTemp)
  - Output: ⟨City, AnnualAvgTemp⟩
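Example 1 can be sketched in Python as follows; the city names and temperatures are invented, and the grouping dictionary is a sequential stand-in for the shuffle.

```python
from collections import defaultdict

def map_fn(record):
    # Input <City, Month, MonthAvgTemp>; output <City, MonthAvgTemp>.
    city, month, month_avg = record
    return (city, month_avg)

def reduce_fn(city, temps):
    # Average over list(MonthAvgTemp) -> <City, AnnualAvgTemp>.
    return (city, sum(temps) / len(temps))

# Invented sample records (only two months per city for brevity).
records = [("Waterloo", "Jan", -7.0), ("Waterloo", "Jul", 20.0),
           ("Lisbon", "Jan", 11.0), ("Lisbon", "Jul", 23.0)]

groups = defaultdict(list)          # plays the role of the shuffle/grouping
for rec in records:
    city, temp = map_fn(rec)
    groups[city].append(temp)

annual = dict(reduce_fn(c, ts) for c, ts in groups.items())
# annual == {"Waterloo": 6.5, "Lisbon": 17.0}
```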
Example 2
- Consider EMP(ENAME, TITLE, CITY)
- Query:

      SELECT CITY, COUNT(*)
      FROM EMP
      WHERE ENAME LIKE "%Smith"
      GROUP BY CITY

- Map:

      Input: (TID, emp), Output: (CITY, 1)
      if emp.ENAME like "%Smith" return (CITY, 1)

- Reduce:

      Input: (CITY, list(1)), Output: (CITY, SUM(list(1)))
      return (CITY, SUM(list(1)))
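A Python sketch of the same query; the sample EMP tuples are invented, and `endswith` stands in for the SQL `LIKE "%Smith"` predicate.

```python
from collections import defaultdict

# Invented sample relation: TID -> (ENAME, TITLE, CITY).
EMP = [(1, ("Joe Smith", "Engineer", "Toronto")),
       (2, ("Sue Jones", "Manager", "Toronto")),
       (3, ("Al Smith", "Analyst", "Waterloo"))]

def map_fn(tid, emp):
    ename, title, city = emp
    if ename.endswith("Smith"):       # WHERE ENAME LIKE "%Smith"
        return [(city, 1)]
    return []

def reduce_fn(city, ones):
    return (city, sum(ones))          # COUNT(*) per city

groups = defaultdict(list)            # grouping step (shuffle stand-in)
for tid, emp in EMP:
    for city, one in map_fn(tid, emp):
        groups[city].append(one)

counts = dict(reduce_fn(c, v) for c, v in groups.items())
# counts == {"Toronto": 1, "Waterloo": 1}
```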
MapReduce Architecture
[Figure: the Master node runs the Scheduler and coordinates Worker nodes. Each map-side worker runs a Map Process composed of Input, Map, Combine, and Partition modules; each reduce-side worker runs a Reduce Process composed of Group, Reduce, and Output modules.]
Execution Flow with Architecture
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names specified by the user). Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call or use them from another distributed application that is able to deal with input that is partitioned into multiple files.

3.2 Master Data Structures
The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed) and the identity of the worker machine (for nonidle tasks).

The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

3.3 Fault Tolerance
Since the MapReduce library is designed to help process very large amounts of data using hundreds or thousands of machines, the library must tolerate machine failures gracefully.

Handling Worker Failures
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.

MapReduce is resilient to large-scale worker failures. For example, during one MapReduce operation, network maintenance on a running cluster was causing groups of 80 machines at a time to become unreachable for several minutes. The MapReduce master simply re-executed the work done by the unreachable worker machines and continued to make forward progress, eventually completing the MapReduce operation.

Semantics in the Presence of Failures
When the user-supplied map and reduce operators are deterministic functions of their input values, our distributed implementation produces the same output as would have been produced by a nonfaulting sequential execution of the entire program.
[Fig. 1. Execution overview: (1) the user program forks the master and the workers; (2) the master assigns map and reduce tasks; (3) map workers read the input splits (split 0 through split 4); (4) map workers write intermediate files to local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the output files. The data thus flow from the input files through the map phase and the intermediate files (on local disks) to the reduce phase and the output files.]

From: J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, Vol. 51, No. 1, January 2008, p. 109.
Characteristics
- Flexibility
  - User can write any map and reduce function code
  - No need to know how to parallelize
- Scalability
  - Elastic scalability
  - Automatic load balancing
- Efficiency
  - Simple: parallel scan
  - No database loading
- Fault tolerance
  - Worker failure
    - Master pings workers periodically; assumes failure if no response
    - Tasks (both map and reduce) on failed workers are scheduled on a different worker node
  - Master failure
    - Checkpoints of master state
    - Recovery restarts from the last checkpoint; otherwise progress can halt
  - Replication on the distributed data store
Hadoop
- Most popular MapReduce implementation, developed by Yahoo!
- Two components
  - Processing engine
  - HDFS: Hadoop Distributed File System (other storage systems are possible)
  - Can be deployed on the same machine or on different machines
- Processes
  - Job tracker: hosted on the master node; implements the scheduler
  - Task tracker: hosted on the worker nodes; accepts tasks from the job tracker and executes them
- HDFS
  - Name node: stores how data are partitioned, monitors the status of data nodes, and holds the data dictionary
  - Data node: stores and manages the data chunks assigned to it

[Figure: in the MapReduce layer, each worker node (Worker 1 through Worker n) runs a Task Tracker while the Job Tracker runs on the master; in the HDFS layer, each worker runs a Data Node while the Name Node runs on the master.]
Hadoop UDF Functions

Map phase:
- InputFormat::getSplit: partitions the input data into splits; each split is processed by one mapper and may consist of several chunks.
- RecordReader::next: defines how a split is divided into items; each item is a key/value pair used as input for the map function.
- Mapper::map: users customize the map function to process the input data; the input data are transformed into intermediate key/value pairs.
- WritableComparable::compareTo: the comparison function for the key/value pairs.
- Job::setCombinerClass: specifies how the key/value pairs are aggregated locally.

Shuffle phase:
- Job::setPartitionerClass: specifies how the intermediate key/value pairs are shuffled to different reducers.

Reduce phase:
- Job::setGroupingComparatorClass: specifies how the key/value pairs are grouped in the reduce phase.
- Reducer::reduce: users write their own reduce functions to perform the corresponding jobs.
MapReduce Implementations
Name      | Language               | File System               | Index | Master Server                     | Multiple Job Support
----------|------------------------|---------------------------|-------|-----------------------------------|---------------------
Hadoop    | Java                   | HDFS                      | No    | Name Node and Job Tracker         | Yes
Disco     | Python and Erlang      | Distributed Index         | No    | Disco Server                      | No
Skynet    | Ruby                   | MySQL or Unix File System | No    | Any node in the cluster           | No
FileMap   | Shell and Perl scripts | Unix File System          | No    | Any node in the cluster           | No
Twister   | Java                   | Unix File System          | No    | One master node in broker network | Yes
Cascading | Java                   | HDFS                      | No    | Name Node and Job Tracker         | Yes
MapReduce Languages
                   | Hive                                | Pig
-------------------|-------------------------------------|--------------
Language           | Declarative, SQL-like               | Dataflow
Data model         | Nested                              | Nested
UDF                | Supported                           | Supported
Data partition     | Supported                           | Not supported
Interface          | Command line, web, JDBC/ODBC server | Command line
Query optimization | Rule based                          | Rule based
Metastore          | Supported                           | Not supported
MapReduce Implementations of Database Operators
- Select and project can easily be implemented in the map function
- Aggregation is not difficult (see next slide)
- Join requires more work

MapReduce join implementations:
- θ-join
  - Equi-join
    - Repartition join
    - Semi-join
    - Map-only join
      - Broadcast join
      - Partition join
  - Similarity join
- Multi-way join
  - Multiple MapReduce jobs
  - Replicated join
Aggregation
[Figure: in the map phase, two mappers scan partitions of R ({R1, R2} and {R3, R4}) and extract the aggregation attribute Aid as the key; the (Aid, value) pairs are partitioned by Aid (round robin); in the reduce phase, each reducer groups the tuples with the same Aid and applies the aggregation function f to them, e.g., Reducer 1 outputs (1, f(R1, R3)) and Reducer 2 outputs (2, f(R2, R4)).]
θ-Join
Baseline implementation of R(A, B) ⋈ S(B, C)

1. Partition R and assign each partition to mappers
2. Each mapper takes ⟨a, b⟩ tuples and converts them to a list of key/value pairs of the form (b, ⟨a, R⟩)
3. Each reducer pulls the pairs with the same key
4. Each reducer joins tuples of R with tuples of S
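These steps can be simulated sequentially in Python. The relation contents below are invented, and one dictionary plays the role of the shuffle that routes all pairs with the same join key to the same reducer.

```python
from collections import defaultdict

R = [("a1", 1), ("a2", 2)]               # R(A, B) tuples <a, b>
S = [(1, "c1"), (1, "c2"), (3, "c3")]    # S(B, C) tuples <b, c>

# Map side: key every tuple on B and tag it with its origin relation;
# the dictionary stands in for the shuffle (step 3).
groups = defaultdict(lambda: {"R": [], "S": []})
for a, b in R:
    groups[b]["R"].append(a)
for b, c in S:
    groups[b]["S"].append(c)

# Reduce side (step 4): join the R- and S-tuples sharing each key.
joined = sorted((a, b, c)
                for b, g in groups.items()
                for a in g["R"] for c in g["S"])
# joined == [("a1", 1, "c1"), ("a1", 1, "c2")]
```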
θ-Join
- If θ is = (i.e., equijoin)
  - Repartition join
  - Semijoin-based join
  - Map-only join
- If θ is ≠

[Figure: each mapper randomly assigns a bucket ID (Bid) to its R- or S-tuples and tags each tuple with its origin; the tuples are partitioned by Bid (round robin); each reducer groups the tuples by Bid, aggregates them by origin, and performs a local θ-join (here θ is ≠), e.g., Reducer 1 joins {(1, R1), (4, R4)} with {(3, S3), (4, S4)} to produce (R1, S3), (R1, S4), (R4, S3).]
Repartition Join
[Figure: each mapper tags its tuples with their origin ('R' or 'S'), keeping the join attribute as the key; the tagged pairs are partitioned by key; each reducer groups the pairs by key and performs a local join, e.g., Reducer 1 receives keys 1 and 3 and outputs (R1, S1) and (R3, S3), while Reducer 2 receives keys 2 and 4 and outputs (R2, S2) and (R4, S4).]
Semijoin-based Join
[Figure: Job 1 (a full MapReduce job) extracts the distinct join keys of R, e.g., {1, 3, 4}; Job 2 (map-only) broadcasts these keys to all the splits of S and filters S against them, keeping {S1, S3, S4}; Job 3 (map-only) broadcasts the filtered S to all the splits of R and locally joins R with S in each mapper.]
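The three-job plan above can be condensed into a sequential sketch. The keys and tuples are invented, the relations are stored as dictionaries (so duplicate join keys are not modeled), and each line corresponds to one job.

```python
R = {1: "R1", 3: "R2", 4: "R4"}           # join key -> R-tuple
S = {1: "S1", 2: "S2", 3: "S3", 4: "S4"}  # join key -> S-tuple

# Job 1 (full MapReduce job): extract the distinct join keys of R.
r_keys = set(R)

# Job 2 (map-only): broadcast R's keys to the splits of S and filter S.
s_reduced = {k: v for k, v in S.items() if k in r_keys}

# Job 3 (map-only): broadcast the filtered S to the splits of R and join.
result = sorted((R[k], s_reduced[k]) for k in s_reduced)
# result == [("R1", "S1"), ("R2", "S3"), ("R4", "S4")]
```

The point of job 2 is that only the S-tuples that can possibly match (the semijoin S ⋉ R) are shipped in job 3.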
Map-only Join
- Broadcast join: if the inner relation is much smaller than the outer relation (inner ≪ outer), no shuffling is needed
  - Map phase similar to the third job of the semijoin-based join
  - Each mapper loads the full inner table to build an in-memory hash table, then scans the outer relation
- Trojan join: relations are already co-partitioned on the join key, so all matching tuples of both relations are co-located on the same node
  - Co-partitioning is implemented by one job
  - The scheduler loads co-partitioned data chunks into the same mapper to perform a local join

[Figure: a co-partitioned split consists of co-groups, each holding a header HR with DataR and a header HS with DataS, followed by a footer.]
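A sketch of what one mapper does in a broadcast join (the data are invented): the small inner relation S is shipped to every mapper, which builds an in-memory hash table on the join key and streams its split of the large outer relation R against it, with no shuffle at all.

```python
from collections import defaultdict

S = [(1, "s1"), (2, "s2")]                   # small inner relation, broadcast
R_split = [(1, "r1"), (2, "r2"), (3, "r3")]  # one mapper's split of outer R

# Build the in-memory hash table on the join key, once per mapper.
hash_s = defaultdict(list)
for k, v in S:
    hash_s[k].append(v)

# Stream the outer split against the hash table; unmatched keys drop out.
joined = sorted((rv, sv) for k, rv in R_split for sv in hash_s.get(k, []))
# joined == [("r1", "s1"), ("r2", "s2")]
```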
Multiway Join
- Multiple MapReduce jobs
  - R ⋈ S ⋈ T → (R ⋈ S) ⋈ T
  - Each join is implemented in one MapReduce job
  - Join ordering problem
- Replicated join
  - Single MapReduce job
Replicated Join
[Figure: a single job joins three relations: R keyed on Rid (values C1, C2), S keyed on Sid (values O1, O2), and T keyed on (Rid, Sid) (values L1, L2, L3). Mappers generate composite keys and tag origins: R-tuples get key (Rid, null), S-tuples get key (null, Sid), and T-tuples get key (Rid, Sid). Tuples whose key has a null component are shuffled to every reducer that matches on the non-null component, i.e., replicated to multiple reducers if necessary. Each reducer then joins the three tables locally; e.g., Reducer 1 holds (1, null) 'C' C1, (1, 1) 'L' L1, and (null, 1) 'O' O1 and outputs (C1, L1, O1), while Reducer 3, holding only C2 and O1 with no matching T-tuple, produces no output.]
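The replication scheme in the figure can be sketched as follows (data invented, variable names mine): reducers form a grid over (Rid, Sid); each T-tuple lands in exactly one cell, while each C-value is replicated across a row and each O-value across a column, so every cell can join locally.

```python
from collections import defaultdict

C = {1: "C1", 2: "C2"}                          # Rid -> value (relation R)
O = {1: "O1", 2: "O2"}                          # Sid -> value (relation S)
L = [(1, 1, "L1"), (1, 2, "L2"), (2, 2, "L3")]  # (Rid, Sid, value) (relation T)

cells = defaultdict(dict)                        # one cell per reducer (Rid, Sid)
for rid, sid, v in L:
    cells[(rid, sid)]["L"] = v                   # key (Rid, Sid): exactly one cell
for rid in C:
    for sid in O:
        cells[(rid, sid)]["C"] = C[rid]          # key (Rid, null): replicate across Sid
        cells[(rid, sid)]["O"] = O[sid]          # key (null, Sid): replicate across Rid

# Each reducer joins locally; cells with no L-tuple produce nothing.
result = sorted(tuple(cell[t] for t in "CLO")
                for cell in cells.values() if "L" in cell)
# result == [("C1", "L1", "O1"), ("C1", "L2", "O2"), ("C2", "L3", "O2")]
```

This trades replication (shuffle volume) for doing the whole three-way join in a single job instead of a sequence of two-way joins.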
DBMS on MapReduce
HadoopDB:
- Language: SQL-like
- Storage: row store
- Data compression: no
- Data partition: horizontally partitioned
- Indexing: local index in each database instance
- Query optimization: rule-based optimization plus local optimization by PostgreSQL

Llama:
- Language: simple interface
- Storage: column store
- Data compression: yes
- Data partition: vertically partitioned
- Indexing: no index
- Query optimization: column-based optimization, late materialization, and processing of multiway joins in one job

Cheetah:
- Language: SQL
- Storage: hybrid store
- Data compression: yes
- Data partition: horizontally partitioned at the chunk level
- Indexing: local index for each data chunk
- Query optimization: multi-query optimization, materialized views