![Page 1: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data](https://reader036.vdocument.in/reader036/viewer/2022062718/56649eb35503460f94bb9e64/html5/thumbnails/1.jpg)
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
H. Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S. Parker (UCLA)
Shimin Chen
Big Data Reading Group Presentation
Motivation
- Map-Reduce framework
  - Compared to a relational DBMS, it is "simplified" for data processing in search engines
- Problem: joining multiple heterogeneous datasets
  - Does not quite fit into map-reduce
  - Ad-hoc solution: run map-reduce on one dataset while reading data from the other dataset on the fly
Contribution
Goal: support relational algebra primitives without sacrificing existing generality and simplicity
Proposal: map-reduce-merge
Outline
- Introduction
- Map-Reduce
- Map-Reduce-Merge
- Applications to Relational Data Processing
- Optimizations and Enhancements
- Case Studies
- Conclusion
Let’s Refresh Our Memory
Functional programming model
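As a refresher, the functional model can be sketched in a few lines of Python. This is a single-process stand-in, not the distributed framework: `run_map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for illustration, and the word-count job is only an example.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for a map-reduce job (no distribution, no fault tolerance)."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):   # map: (k1, v1) -> [(k2, v2)]
            groups[k2].append(v2)           # shuffle: group values by k2
    # reduce: (k2, [v2]) -> [v3]
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}

def word_mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in text.split()]

def count_reducer(word, counts):
    # Sum the partial counts for one word.
    return [sum(counts)]

docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
result = run_map_reduce(docs, word_mapper, count_reducer)
print(result)
```

The user supplies only the mapper and reducer; grouping by key between the two phases is the framework's job.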
Comments
- Low-cost, unreliable commodity hardware
  - Failures often occur during map/reduce tasks
  - Coordinator re-runs the failed mapper or reducer
- Homogenization: for equi-join
  - Transform each dataset into (join key, payload)
  - Then apply map-reduce to merge entries from different datasets
  - Problems: works only for equi-joins; may take lots of extra disk space and incur excessive communication
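A minimal sketch of homogenization for an equi-join, assuming two small in-memory datasets. The names `homogenize` and `reduce_side_join`, the `"left"`/`"right"` source tags, and the sample records are all illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def homogenize(records, source, key_of):
    """Rewrite each record as (join key, (source tag, payload))."""
    return [(key_of(r), (source, r)) for r in records]

def reduce_side_join(tagged):
    """Group homogenized entries by join key, then pair left and right payloads."""
    groups = defaultdict(lambda: {"left": [], "right": []})
    for key, (source, payload) in tagged:
        groups[key][source].append(payload)
    joined = []
    for key in sorted(groups):
        for left in groups[key]["left"]:
            for right in groups[key]["right"]:
                joined.append((key, left, right))
    return joined

employees = [("e1", "d1"), ("e2", "d2"), ("e3", "d1")]   # (emp-id, dept-id)
departments = [("d1", "Search"), ("d2", "Ads")]          # (dept-id, name)

tagged = (homogenize(employees, "left", lambda r: r[1])
          + homogenize(departments, "right", lambda r: r[0]))
joined = reduce_side_join(tagged)
print(joined)
```

Every record is copied into the tagged form before the join runs, which is where the extra disk space and communication come from.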
Map-Reduce-Merge Primitives
[Figure: primitive signatures, with annotations "key" and "join"]
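The slide's figure did not survive extraction. As best recalled from the paper, the three primitives have roughly the signatures below, where α, β, and γ denote dataset lineages; treat the exact symbols as an assumption rather than a quotation.

$$
\begin{aligned}
\text{map}:&\quad (k_1, v_1)_\alpha \to [(k_2, v_2)]_\alpha \\
\text{reduce}:&\quad (k_2, [v_2])_\alpha \to (k_2, [v_3])_\alpha \\
\text{merge}:&\quad \big((k_2, [v_3])_\alpha,\ (k_3, [v_4])_\beta\big) \to [(k_4, v_5)]_\gamma
\end{aligned}
$$

Compared with plain map-reduce, reduce here passes its key through to the output, so that merge can combine the reduced outputs of two different lineages on their keys.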
Focusing on Merge
- Two sets of inputs generated by multiple reducers:
  - Which α reducers and β reducers match?
  - How to get the next key-value pair?
  - Customized preprocessing for inputs?
  - Merging algorithm?
- All of these are customizable
Focusing on Merge
- Two sets of inputs generated by multiple reducers:
  - Partition Selector: Which α reducers and β reducers match?
  - Iterator: How to get the next key-value pair?
  - Processor: Customized preprocessing for inputs?
  - Merger: Merging algorithm?
- All of these are customizable
Example: Emp & Dept
[Figure: Employee and Department datasets]
Partition Selector
- LHS: reduce key: (dept-id, emp-id); partition key: dept-id
- RHS: reduce key: dept-id; partition key: dept-id
- Assuming the number of reducers is the same, LHS reducer K matches RHS reducer K
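The matching rule can be sketched as follows. This assumes both sides partition on dept-id with the same deterministic hash function; `partition`, `partition_selector`, and the choice of CRC32 are illustrative assumptions, not the paper's API.

```python
import zlib

def partition(dept_id: str, num_reducers: int) -> int:
    """Deterministic hash partitioner on the partition key (dept-id)."""
    return zlib.crc32(dept_id.encode("utf-8")) % num_reducers

def partition_selector(merger_id: int, num_lhs: int, num_rhs: int):
    """Return the (LHS reducer ids, RHS reducer ids) that merger `merger_id` reads."""
    if num_lhs != num_rhs:
        raise ValueError("this sketch assumes equal reducer counts on both sides")
    return [merger_id], [merger_id]

NUM_REDUCERS = 4
employees = [("d1", "e1"), ("d2", "e2"), ("d1", "e3")]   # reduce key: (dept-id, emp-id)
departments = ["d1", "d2"]                                # reduce key: dept-id

# Both sides are partitioned on dept-id only, so matching keys share a reducer number.
emp_partitions = {(d, e): partition(d, NUM_REDUCERS) for d, e in employees}
dept_partitions = {d: partition(d, NUM_REDUCERS) for d in departments}
print(emp_partitions, dept_partitions)
```

Because the employee side partitions on dept-id alone (even though its reduce key also includes emp-id), every employee lands in the same-numbered partition as its department.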
Processor
- Pre-processing for each input
  - E.g. building a hash table for hash join
- This example is sort-merge
  - Processor is empty
Iterator for sort-merge
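The slide's code figure is not recoverable. The sketch below shows one way a sort-merge iterator could walk two key-sorted inputs in step; it is not the paper's listing, and `sort_merge_iterator` and the sample data are assumptions.

```python
from itertools import groupby
from operator import itemgetter

def sort_merge_iterator(left, right):
    """Yield (key, left_values, right_values) for keys present in both inputs.

    Both inputs must be sequences of (key, value) pairs sorted by key.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:        # left key is behind: advance the left input
            lk, lg = next(left_groups, (None, None))
        elif lk > rk:      # right key is behind: advance the right input
            rk, rg = next(right_groups, (None, None))
        else:              # keys match: hand both groups to the merger
            yield lk, [v for _, v in lg], [v for _, v in rg]
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))

emps = [("d1", "e1"), ("d1", "e3"), ("d2", "e2"), ("d4", "e9")]   # sorted by dept-id
depts = [("d1", "Search"), ("d2", "Ads"), ("d3", "Maps")]         # sorted by dept-id
matches = list(sort_merge_iterator(emps, depts))
print(matches)
```

Each input is read once, front to back, which is why sort-merge needs no preprocessing step in the Processor.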
Merger
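The merger is the user-defined function that turns matched inputs into output records. The sketch below is a hypothetical merger for the Emp and Dept example; the output shape (emp-id mapped to department details) is an assumption, since the slide's own code is lost.

```python
def merger(dept_id, emp_ids, dept_names):
    """Combine matched values from both sides into output (key, value) pairs."""
    return [(emp_id, (dept_id, name)) for emp_id in emp_ids for name in dept_names]

# Matched groups as an iterator might deliver them: (key, LHS values, RHS values).
matched = [("d1", ["e1", "e3"], ["Search"]), ("d2", ["e2"], ["Ads"])]
output = [pair for group in matched for pair in merger(*group)]
print(output)
```

Note that the output key (emp-id) differs from the join key (dept-id); merge is free to emit a new key for the next phase.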
Other Iterators
- Nested-loop:
  - For each (k,v) of the first input, read all of the second input
  - Then rewind the second input and process the next (k,v) of the first input
- Hash join:
  - Read all of one input, then read all of the other input
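Both access patterns can be sketched on in-memory lists. The function names and sample data are illustrative assumptions; in the hash-join sketch, building the table plays the role of the Processor step.

```python
from collections import defaultdict

def nested_loop_pairs(first, second):
    """For each entry of `first`, scan all of `second` (re-iterating acts as the rewind)."""
    for k1, v1 in first:
        for k2, v2 in second:
            yield (k1, v1), (k2, v2)

def hash_join(build, probe):
    """Read all of `build` into a hash table, then stream `probe` against it."""
    table = defaultdict(list)
    for k, v in build:
        table[k].append(v)
    for k, v in probe:
        for build_value in table.get(k, []):
            yield k, build_value, v

first = [(1, "a"), (2, "b")]
second = [(1, "x"), (3, "y")]

all_pairs = list(nested_loop_pairs(first, second))
# Nested-loop supports non-equality predicates, e.g. a theta-join on k1 < k2.
theta = [(l, r) for l, r in all_pairs if l[0] < r[0]]
equi = list(hash_join(second, first))
print(len(all_pairs), theta, equi)
```

Nested-loop costs a full pass over the second input per entry of the first, but it is the only one of the two that handles arbitrary join predicates.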
Relation
- Relation R with an attribute set A
- A is broken down into a key part K and a value part V
Relational Operators
- Generalized selection: choosing a subset of records
  - Filtering can be done in the mapper, reducer, or merger
- Projection: choosing a subset of attributes
  - User-defined mapper: (k,v) → (k’,v’)
- Aggregation
  - Group-by is performed before reduce
  - Easy to implement aggregation in the reducer
- Joins (set union, intersection, difference, Cartesian product)
  - Sort-merge, hash join, nested-loop
- Rename
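Selection, projection, and aggregation map onto the phases roughly as sketched below. The employee records, the salary threshold, and the function names are made up for illustration.

```python
from collections import defaultdict

# (emp-id, dept-id, salary): hypothetical sample data
employees = [("e1", "d1", 100), ("e2", "d2", 80), ("e3", "d1", 50)]

def mapper(record):
    emp_id, dept_id, salary = record
    if salary >= 60:             # selection: keep a subset of records
        yield dept_id, salary    # projection: keep a subset of attributes

def reducer(dept_id, salaries):
    return dept_id, sum(salaries)   # aggregation over one group

# The framework's group-by happens between map and reduce.
groups = defaultdict(list)
for record in employees:
    for key, value in mapper(record):
        groups[key].append(value)

totals = [reducer(key, values) for key, values in sorted(groups.items())]
print(totals)
```

Roughly, this corresponds to a filtered `SUM(salary) ... GROUP BY dept-id` query; the join operators are the part that needs the merge phase.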
Partition Selector
- In general: LHS has R1 reducers, RHS has R2 reducers, performing a Cartesian-product-like operator
  - Suppose R1 ≤ R2; use R1 mergers, where merger j selects:
    - Input from LHS reducer j
    - Input from all RHS reducers
  - Remote reads: R1*(1+R2) = R1 + R1*R2
- Natural equi-join case:
  - Let R1 == R2 == R; use R mergers, where merger j selects LHS reducer j and RHS reducer j
  - Remote reads: 2*R
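A quick worked check of the remote-read counts, using made-up reducer counts (R1 = 4, R2 = 10). The function names are illustrative.

```python
def cartesian_remote_reads(num_mergers_side: int, other_side: int) -> int:
    """Each merger reads 1 reducer from its own side and all reducers of the other side."""
    return num_mergers_side * (1 + other_side)

def equi_join_remote_reads(r: int) -> int:
    """Each of the R mergers reads exactly one LHS and one RHS reducer."""
    return 2 * r

r1, r2 = 4, 10
reads_with_r1_mergers = cartesian_remote_reads(r1, r2)   # R1 + R1*R2
reads_with_r2_mergers = cartesian_remote_reads(r2, r1)   # R2 + R1*R2
equi_reads = equi_join_remote_reads(10)
print(reads_with_r1_mergers, reads_with_r2_mergers, equi_reads)
```

Launching mergers on the side with fewer reducers gives the lower count (44 versus 50 here), and the equi-join case grows linearly in R rather than with the product R1*R2.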
Combining Phases
- Entire workflow consists of multiple map-reduce-merge passes
- To avoid remote copying:
  - ReduceMap, MergeMap: co-locate the next mapper with the previous reducer or merger
  - ReduceMerge: co-locate the merger with one of the reducers
  - ReduceMergeMap
Map-Reduce-Merge Library
- Put common merge implementations into a library
  - Joins
  - Common iterators, etc.
Configuration API for building a Customized Workflow
- Map/reduce
- Map/reduce/merge
- Multiple map/reduce/merge
Webgraphs
- Each row: (URL, in-links, out-links)
  - Potentially large number of links
  - Only a few are needed for many operations
  - Store each column of the table in a separate file
- Reconstruct the table by join
  - E.g. compute the intersection of in-links and out-links
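A small sketch of the intersection example, assuming each column file has already been loaded as (URL, links) pairs. The URLs and the name `merge_columns` are made up for illustration.

```python
# Each "column file" holds (URL, links) pairs: hypothetical sample data.
in_links = [("a.com", ["b.com", "c.com"]), ("b.com", ["a.com"])]
out_links = [("a.com", ["c.com", "d.com"]), ("b.com", ["c.com"])]

def merge_columns(in_col, out_col):
    """Join the two columns on URL and keep links that appear in both."""
    out_by_url = dict(out_col)
    result = []
    for url, ins in in_col:
        if url in out_by_url:
            common = sorted(set(ins) & set(out_by_url[url]))
            result.append((url, common))
    return result

mutual = merge_columns(in_links, out_links)
print(mutual)
```

Only the two columns involved are read; any other columns of the webgraph table stay untouched on disk.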
TPC-H Query 2
After Combining Phases
Conclusion
- Extends map-reduce
- Supports relational operators
- However, the merge step seems quite complicated