TRANSCRIPT
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (Yahoo!), Ali Dasdan (Yahoo!), Ruey-Lung Hsiao (UCLA), D. Stott Parker (UCLA)
SIGMOD 2007 (Industrial)
Presented by Kisung Kim
2010. 7. 14
Contents
– Introduction
– Map-Reduce
– Map-Reduce-Merge
– Applications to Relational Data Processing
– Case Study
– Conclusion
Introduction
New challenges of data processing
– A vast amount of data collected from the entire WWW
Solutions of search engine companies
– Customized parallel data processing systems
– Use large clusters of shared-nothing commodity nodes
– Ex) Google's GFS, BigTable, and MapReduce; Ask.com's Neptune; Microsoft's Dryad; Yahoo!'s Hadoop
Introduction
Properties of data-intensive systems
– Simple: adopt only a selected subset of database principles
– Sufficiently generic and effective
– Parallel data processing systems deployed on large clusters of shared-nothing commodity nodes
– Refactoring of data processing into two primitives: a map function and a reduce function
Map-Reduce allows users not to worry about the nuisance details of:
– Coordinating parallel sub-tasks
– Maintaining distributed file storage
This abstraction can greatly increase user productivity
Introduction
The Map-Reduce framework is best at handling homogeneous datasets
– Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
Extending Map-Reduce to process heterogeneous datasets simultaneously
– Processing data relationships is ubiquitous
– A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
– Include relational algebra in the subset of the database principles
Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge
Map-Reduce
The input dataset is stored in GFS
Mapper
– Read splits of the input dataset
– Apply the map function to the input records
– Produce intermediate key/value sets
– Partition the intermediate sets into as many sets as there are reducers
Reducer
– Read their part of the intermediate sets from the mappers
– Apply the reduce function to the values of the same key
– Output final results
[Figure: input splits feed mappers, whose intermediate sets are consumed by reducers that emit the final results]
Signatures of the Map and Reduce functions:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]
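The map and reduce signatures above can be illustrated with a minimal single-process sketch. This is not the distributed framework itself, just a toy simulation of the dataflow (map, shuffle by key, reduce), shown here with the classic word-count example:

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Toy single-process simulation of the Map-Reduce dataflow:
    map over input records, group intermediate pairs by key
    (the 'shuffle'), then reduce each key's value list."""
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):    # map: (k1, v1) -> [(k2, v2)]
            intermediate[k2].append(v2)
    return {k2: reduce_fn(k2, vs)        # reduce: (k2, [v2]) -> [v3]
            for k2, vs in intermediate.items()}

# Example: word count.
docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
counts = run_map_reduce(
    docs,
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda _, ones: sum(ones),
)
print(counts)  # {'map': 2, 'reduce': 2, 'merge': 1}
```

In the real framework, the `intermediate` grouping is done by the partition-and-shuffle machinery across machines; here it is a single in-memory dictionary.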
Join using Map-Reduce: Use a homogenization procedure
– Apply one map/reduce task on each dataset
– Insert a data-source tag into every value
– Extract a key attribute common to all heterogeneous datasets
– The transformed datasets now have two common attributes: key and data-source
Problems
– Takes lots of extra disk space and incurs excessive map-reduce communication
– Limited only to queries that can be rendered as equi-joins
Join using Map-Reduce: Homogenization

Dataset 1:
Key | Value
10  | 1, "Value1"
85  | 1, "Value2"
320 | 1, "Value3"

Dataset 2:
Key | Value
10  | 2, "Value4"
54  | 2, "Value5"
320 | 2, "Value6"

[Figure: each dataset passes through its own map/reduce task; a final reduce collects records with the same key]
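The homogenization procedure above can be sketched as two passes: a map-side pass that tags each record with its data source and extracts the common key, and a reduce-side pass that groups by key and pairs records from different sources. The function names and the key-extraction lambda are illustrative, not from the paper:

```python
from collections import defaultdict

def homogenize(dataset, tag, key_fn):
    """Map pass: tag every record with its data source and
    extract the common join key."""
    return [(key_fn(rec), (tag, rec)) for rec in dataset]

def join_by_key(tagged):
    """Reduce pass: group tagged records by key and pair up
    records that came from different sources (an equi-join)."""
    groups = defaultdict(list)
    for key, tv in tagged:
        groups[key].append(tv)
    out = []
    for key, tvs in groups.items():
        left  = [r for t, r in tvs if t == 1]
        right = [r for t, r in tvs if t == 2]
        out.extend((key, l, r) for l in left for r in right)
    return out

# The two datasets from the slide above.
ds1 = [(10, "Value1"), (85, "Value2"), (320, "Value3")]
ds2 = [(10, "Value4"), (54, "Value5"), (320, "Value6")]
tagged = homogenize(ds1, 1, lambda r: r[0]) + homogenize(ds2, 2, lambda r: r[0])
result = join_by_key(tagged)
print(result)
# [(10, (10, 'Value1'), (10, 'Value4')), (320, (320, 'Value3'), (320, 'Value6'))]
```

Note how every record is rewritten and shuffled even when it has no join partner (keys 85 and 54) — the extra disk space and communication cost the slide criticizes.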
Map-Reduce-Merge
Signatures
– α, β, γ represent dataset lineages
– The reduce function produces a key/value list instead of just values
– The merge function reads data from both lineages
These three primitives can be used to implement the parallel versions of several join algorithms

map: (k1, v1)α → [(k2, v2)]α
reduce: (k2, [v2])α → (k2, [v3])α
merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
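The paper's merge primitive, merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ, takes reduced outputs from two lineages and applies a user-defined merge function to pairs of them. A minimal sketch, with an illustrative equi-join merge function (names are assumptions, not the paper's API):

```python
def merge(reduced_a, reduced_b, merge_fn):
    """Toy merge primitive: pair up reduced outputs from two
    lineages and apply a user merge function, following
    merge: ((k2, [v3])a, (k3, [v4])b) -> [(k4, v5)]c."""
    out = []
    for k2, vs_a in reduced_a:
        for k3, vs_b in reduced_b:
            out.extend(merge_fn(k2, vs_a, k3, vs_b))
    return out

# An equi-join merge function: emit pairs only on matching keys.
def equi_merge(k2, vs_a, k3, vs_b):
    return [(k2, (a, b)) for a in vs_a for b in vs_b] if k2 == k3 else []

a = [(10, ["Value1"]), (320, ["Value3"])]   # reduced output, lineage alpha
b = [(10, ["Value4"]), (320, ["Value6"])]   # reduced output, lineage beta
merged = merge(a, b, equi_merge)
print(merged)  # [(10, ('Value1', 'Value4')), (320, ('Value3', 'Value6'))]
```

A real merger would not do this all-pairs scan; the iterator and partition-selector modules described next let it restrict which pairs are ever examined.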
Merge Modules
Merge function
– Processes two pairs of key/values
Processor function
– Processes data from one source only
– Users can define two processor functions
Partition selector
– Determines from which reducers this merger retrieves its input data, based on the merger number
Configurable iterator
– A merger has two logical iterators
– Controls their relative movement against each other
Merge Modules
[Figure: a merger consists of a partition selector, two processor functions, configurable iterators, and a merge function; it reads the outputs of the reducers for the 1st dataset and the reducers for the 2nd dataset]
Applications to Relational Data Processing
Map-Reduce-Merge can be used to implement primitive and some derived relational operators:
– Projection
– Aggregation
– Generalized selection
– Joins
– Set union
– Set intersection
– Set difference
– Cartesian product
– Rename
Map-Reduce-Merge is relationally complete, while being load-balanced, scalable, and parallel
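As one example of how a set operator maps onto a merge pass: set difference can be computed by a merger whose two iterators walk key-sorted reducer outputs, emitting keys present in one lineage but not the other. A minimal sketch (the function name is an assumption):

```python
def set_difference(sorted_a, sorted_b):
    """Sketch of set difference as a merge pass: both inputs are
    key-sorted reducer outputs; emit the keys in A not in B."""
    out, j = [], 0
    for k in sorted_a:
        # Advance the B iterator until it reaches or passes k.
        while j < len(sorted_b) and sorted_b[j] < k:
            j += 1
        if j == len(sorted_b) or sorted_b[j] != k:
            out.append(k)
    return out

diff = set_difference([10, 54, 85, 320], [10, 320])
print(diff)  # [54, 85]
```

Set union and intersection follow the same pattern with different emit conditions, which is why the slide can claim relational completeness from these few primitives.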
Example: Hash Join
[Figure: two datasets each pass through splits, mappers, and reducers; mergers consume the two sets of reducer outputs]
– Use a hash partitioner
– Each reducer reads from every mapper for one designated partition
– Each merger reads from the two sets of reducer outputs that share the same hashing buckets
– One set is used as the build set and the other as the probe set
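The hash-join scheme above can be sketched as follows: a shared hash partitioner routes equal keys from both datasets to the same bucket, and each merger builds a hash table from one bucket and probes it with the matching bucket of the other dataset. Function names and the bucket count are illustrative:

```python
from collections import defaultdict

def hash_partition(pairs, n_parts):
    """Hash partitioner shared by both datasets: records with the
    same key always land in the same bucket number, so a merger
    only needs the two buckets with its own number."""
    parts = [defaultdict(list) for _ in range(n_parts)]
    for k, v in pairs:
        parts[hash(k) % n_parts][k].append(v)
    return parts

def hash_join_merger(build_part, probe_part):
    """One merger: treat one bucket as the build set (already an
    in-memory hash table here) and probe it with the other."""
    return [(k, b, p)
            for k, ps in probe_part.items()
            for b in build_part.get(k, [])
            for p in ps]

ds1 = [(10, "Value1"), (85, "Value2"), (320, "Value3")]
ds2 = [(10, "Value4"), (54, "Value5"), (320, "Value6")]
parts1 = hash_partition(ds1, 3)
parts2 = hash_partition(ds2, 3)
joined = [row for i in range(3)
          for row in hash_join_merger(parts1[i], parts2[i])]
print(sorted(joined))  # [(10, 'Value1', 'Value4'), (320, 'Value3', 'Value6')]
```

Because the same partitioner is applied to both datasets, no merger ever needs to see buckets with different numbers — this is what makes the join embarrassingly parallel across mergers.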
Case Study: TPC-H Query 2
Involves 5 tables, 1 nested query, 1 aggregate and group-by clause, and 1 order-by clause
Map-Reduce-Merge workflow
– Initially: 13 passes of Map-Reduce-Merge, with 10 mappers, 10 reducers, and 4 mergers
– After combining phases: 6 passes of Map-Reduce-Merge, with 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer
Conclusion
Map-Reduce-Merge programming model
– Retains many of Map-Reduce's great features
– Adds relational algebra to the list of database principles it upholds
– Contains several configurable components that enable many data-processing patterns
Next step
– Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow
– This work can readily reuse well-studied RDBMS techniques