
Post on 31-Dec-2015


Page 1: Efficient Skyline Computation in  MapReduce

Efficient Skyline Computation in MapReduce

Kasper Mullesgaard, Jens Laurits Pedersen, Hua Lu

Aalborg University

Yongluan Zhou

University of Southern Denmark

Page 2: Efficient Skyline Computation in  MapReduce

Skyline Query

• Application: multi-criteria decision making
• Tuple dominance: t1 dominates t2 (t1 ⊰ t2)
– iff t1 is not worse than t2 in all dimensions, and
– t1 is better than t2 in at least one dimension
• Skyline query:
– Given a dataset, returns all tuples that are not dominated by others
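The dominance test and the skyline definition above can be sketched in a few lines of Python. This is only an illustration: it assumes that smaller values are better in every dimension, and the names `dominates`, `skyline`, and the `hotels` data are made up for the example rather than taken from the slides.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """True if t1 dominates t2, assuming smaller is better in every dimension."""
    not_worse_anywhere = all(a <= b for a, b in zip(t1, t2))
    better_somewhere = any(a < b for a, b in zip(t1, t2))
    return not_worse_anywhere and better_somewhere


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every tuple that no other tuple dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach), both to be minimized.
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dropped because (60, 3) is better in both dimensions, and (90, 9) is dropped because (50, 8) is.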

Page 3: Efficient Skyline Computation in  MapReduce

Scaling Skyline Computation

• Customized solutions:
– Require arbitrary inter-node communication
– Need software stacks to harness a large cluster
– Unproven scalability
– Lack of fault tolerance
• General MapReduce platforms
– Availability of scalable systems, such as Hadoop
– A strict communication/synchronization model

Page 4: Efficient Skyline Computation in  MapReduce

MapReduce

Page 5: Efficient Skyline Computation in  MapReduce

Challenges of Skyline Computation using MapReduce

• To maximize parallelization
– Push more work to mappers, i.e. let mappers filter out more non-skyline points
– Ability to utilize multiple reducers
• However, global skylines cannot be determined by local information
– Without global information, mappers have very limited capabilities to filter out non-skyline points
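The limitation can be seen in a small in-memory simulation. This is a sketch rather than Hadoop code: it assumes smaller values are better, and `map_phase`, `reduce_phase`, and the two input splits are hypothetical names and data chosen for the illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """Smaller is better (assumption): no worse everywhere, better somewhere."""
    return (all(a <= b for a, b in zip(t1, t2))
            and any(a < b for a, b in zip(t1, t2)))


def local_skyline(points: List[Point]) -> List[Point]:
    """Skyline of one list of tuples, computed with pairwise comparisons."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def map_phase(splits: List[List[Point]]) -> List[List[Point]]:
    """Each simulated mapper sees only its own split and emits a local skyline."""
    return [local_skyline(split) for split in splits]


def reduce_phase(local_results: List[List[Point]]) -> List[Point]:
    """A single simulated reducer merges all local skylines into the global one."""
    candidates = [p for part in local_results for p in part]
    return local_skyline(candidates)


splits = [[(1, 9), (5, 5), (6, 6)], [(2, 2), (9, 1), (8, 8)]]
local = map_phase(splits)
final = reduce_phase(local)
print(local)  # [[(1, 9), (5, 5)], [(2, 2), (9, 1)]]
print(final)  # [(1, 9), (2, 2), (9, 1)]
```

The first mapper keeps (5, 5) because nothing in its own split dominates it. Only the reducer, which also sees (2, 2), can discard it.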

Page 6: Efficient Skyline Computation in  MapReduce

Grid Partitioning and Bit String Representation

Partition Dominance: pi ⊰ pj iff pi.max ⊰ pj.min

2 5 8

1 4 7

0 3 6

BSR = 011110100
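A rough Python sketch of how such a bit string could be derived follows. Several things here are assumptions and not stated on the slide: data is normalized to [0, 1) with smaller values better, partitions are numbered with the first dimension as the most significant digit (which matches the grid above), and bit i is 1 only if partition i is non-empty and not dominated by another non-empty partition. The function names and sample points are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Cell = Tuple[int, ...]


def dominates(t1: Sequence[float], t2: Sequence[float]) -> bool:
    """Smaller is better (assumption): no worse everywhere, better somewhere."""
    return (all(a <= b for a, b in zip(t1, t2))
            and any(a < b for a, b in zip(t1, t2)))


def cell_of(point: Sequence[float], ppd: int) -> Cell:
    """Grid coordinates of a point in [0, 1)^d with ppd partitions per dimension."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def partition_dominates(ci: Cell, cj: Cell) -> bool:
    """pi dominates pj iff pi.max dominates pj.min, expressed in grid units."""
    max_corner_of_i = tuple(c + 1 for c in ci)
    min_corner_of_j = cj
    return dominates(max_corner_of_i, min_corner_of_j)


def bit_string(points: List[Sequence[float]], ppd: int) -> str:
    """One bit per partition: 1 if non-empty and not dominated (assumed rule)."""
    dims = len(points[0])
    occupied = {cell_of(p, ppd) for p in points}
    bits = []
    # product() enumerates cells in index order: (0,0)=0, (0,1)=1, (0,2)=2, ...
    for cell in product(range(ppd), repeat=dims):
        alive = cell in occupied and not any(
            partition_dominates(other, cell) for other in occupied)
        bits.append("1" if alive else "0")
    return "".join(bits)


# Hypothetical points chosen so that the result matches the slide's example.
points = [(0.1, 0.5), (0.2, 0.9), (0.4, 0.1), (0.5, 0.5), (0.8, 0.2), (0.9, 0.9)]
print(bit_string(points, ppd=3))  # 011110100
```

In this sample the point (0.9, 0.9) falls into partition 8, which is non-empty but dominated by partition 1, so its bit stays 0. The actual data behind the slide's figure is not available, so the match with the BSR shown is by construction only.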

Page 7: Efficient Skyline Computation in  MapReduce

Bit String Generation

Page 8: Efficient Skyline Computation in  MapReduce

Determining Partitions Per Dimension (PPD)

• PPD is too high → very few tuples in each partition and too many partitions

• PPD is too low → too many tuples in each partition and less effective pruning

• Idea: generate bit strings for PPD from 2 to

– then choose the one with the most desirable number of tuples per partition
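The selection step might look like the sketch below. The upper limit of the PPD range is not recoverable from the slide, so `max_ppd` is a hypothetical parameter, as is `target_per_partition`. The sketch also interprets "most desirable" as "average number of tuples per non-empty partition closest to the target" and counts non-empty partitions directly instead of reading them off a bit string. All of these are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Cell = Tuple[int, ...]


def cell_of(point: Sequence[float], ppd: int) -> Cell:
    """Grid coordinates of a point in [0, 1)^d with ppd partitions per dimension."""
    return tuple(min(int(v * ppd), ppd - 1) for v in point)


def choose_ppd(points: List[Sequence[float]], max_ppd: int,
               target_per_partition: float) -> int:
    """Try PPD = 2..max_ppd and keep the one whose average number of tuples
    per non-empty partition is closest to the target (assumed criterion)."""
    best_ppd, best_gap = 2, float("inf")
    for ppd in range(2, max_ppd + 1):
        counts = Counter(cell_of(p, ppd) for p in points)
        average = len(points) / len(counts)  # tuples per non-empty partition
        gap = abs(average - target_per_partition)
        if gap < best_gap:
            best_ppd, best_gap = ppd, gap
    return best_ppd


# 16 hypothetical points laid out on a regular 4 x 4 lattice.
pts = [(i / 4 + 0.125, j / 4 + 0.125) for i in range(4) for j in range(4)]
print(choose_ppd(pts, max_ppd=4, target_per_partition=4))  # 2
print(choose_ppd(pts, max_ppd=4, target_per_partition=1))  # 4
```

With PPD = 2 the lattice gives 4 tuples per partition, and with PPD = 4 it gives exactly 1, which is why the two targets select different grids.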

Page 9: Efficient Skyline Computation in  MapReduce

Single Reducer

Page 10: Efficient Skyline Computation in  MapReduce

Multi-Reducer

• The single reducer still performs significant work for detecting the global skyline
– Limits the degree of parallelization
• Idea: independent partition group
– Anti-Dominating Region (ADR):

– Independent Partition Group: A set of partitions Pi is an IPG iff holds

– One reducer is responsible for each IPG.

Page 11: Efficient Skyline Computation in  MapReduce

Multi-Reducer

Page 12: Efficient Skyline Computation in  MapReduce

Generation of I.P.G.

• Idea: a partition pm is a maximum partition iff ∀p, pm ∉ p.ADR

• Procedure:
1. Find a maximum partition pm
2. Generate IPG = {pm} ∪ pm.ADR
3. Remove pm and repeat step 1
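The procedure can be sketched as follows. Two points are assumptions. First, the ADR of a partition is taken to be every other surviving partition whose grid coordinates are less than or equal to its own in all dimensions, i.e. the partitions that could hold tuples dominating its tuples (smaller is better). Second, the loop stops once every partition is covered by some group, whereas the slide only says to remove pm and repeat. `adr` and `generate_ipgs` are hypothetical names.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, ...]


def adr(p: Cell, partitions: Set[Cell]) -> Set[Cell]:
    """Assumed ADR of p: other partitions that are <= p in every dimension."""
    return {q for q in partitions
            if q != p and all(a <= b for a, b in zip(q, p))}


def generate_ipgs(partitions: Set[Cell]) -> List[Set[Cell]]:
    """Build independent partition groups from the surviving partitions."""
    remaining = set(partitions)
    groups: List[Set[Cell]] = []
    while remaining:
        # Step 1: a maximum partition is in no other remaining partition's ADR.
        pm = next(p for p in sorted(remaining, reverse=True)
                  if not any(p in adr(q, remaining) for q in remaining))
        # Step 2: the group is pm together with its whole ADR.
        group = {pm} | adr(pm, partitions)
        groups.append(group)
        # Step 3 (assumption): drop pm and everything the group already covers.
        remaining -= group
    return groups


# Surviving partitions of a 3 x 3 grid with bit string 011110100 (1, 2, 3, 4, 6).
cells = {(0, 1), (0, 2), (1, 0), (1, 1), (2, 0)}
for g in generate_ipgs(cells):
    print(sorted(g))
# [(1, 0), (2, 0)]
# [(0, 1), (1, 0), (1, 1)]
# [(0, 1), (0, 2)]
```

Partitions (1, 0) and (0, 1) each end up in two groups. Under the assumed ADR definition every group contains the full ADR of each of its members, so a reducer can decide skyline membership for its group without data from other groups.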

Page 13: Efficient Skyline Computation in  MapReduce

Implementation Issues

• More independent groups than #reducers
– Need to allocate them to the reducers, two options:
1. Load balancing
2. Minimizing duplicate data transmission
• Elimination of duplicated skyline outputs
– A grid partition appears in multiple IPGs
– Designate one IPG as the responsible group
• Load balancing
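Both issues can be illustrated with a small sketch covering the load-balancing option only. It assumes that the load of a group is estimated by its number of tuples, that groups are placed greedily on the least loaded reducer, and that a shared partition is owned by the containing group that currently owns the fewest partitions. The slides do not give these details, so the heuristics, the function names, and the sample loads are all hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, ...]


def assign_groups(group_loads: Dict[str, int],
                  num_reducers: int) -> Dict[int, List[str]]:
    """Greedy load balancing: largest group first, onto the least loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment: Dict[int, List[str]] = {r: [] for r in range(num_reducers)}
    ordered = sorted(group_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, load in ordered:
        total, reducer = heapq.heappop(heap)
        assignment[reducer].append(name)
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def responsible_groups(groups: Dict[str, Set[Cell]]) -> Dict[Cell, str]:
    """Pick exactly one group per partition to output that partition's skyline."""
    owned = {name: 0 for name in groups}
    owner: Dict[Cell, str] = {}
    for cell in sorted({c for g in groups.values() for c in g}):
        candidates = sorted(n for n, g in groups.items() if cell in g)
        best = min(candidates, key=lambda n: owned[n])  # fewest owned so far
        owner[cell] = best
        owned[best] += 1
    return owner


groups = {"A": {(2, 0), (1, 0)},
          "B": {(1, 1), (0, 1), (1, 0)},
          "C": {(0, 2), (0, 1)}}
loads = {"A": 40, "B": 70, "C": 30}  # hypothetical tuple counts per group

print(assign_groups(loads, num_reducers=2))  # {0: ['B'], 1: ['A', 'C']}
print(responsible_groups(groups))
```

A reducer would then emit skyline tuples only for the partitions its groups own, so a tuple in a shared partition such as (1, 0) is output once even though two groups process it.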

Page 14: Efficient Skyline Computation in  MapReduce

Experimental Setup

• 13 commodity machines
• Datasets with independent and anti-correlated distribution
• Comparisons:
– MR-BNL
– MR-Angle

Page 15: Efficient Skyline Computation in  MapReduce

#Dimensions

Independent data, cardinality: 1×10⁵

Page 16: Efficient Skyline Computation in  MapReduce

#Dimensions

Anti-correlated data, cardinality: 1×10⁵

Page 17: Efficient Skyline Computation in  MapReduce

Cardinality (independent data)

Dimensions: 3 Dimensions: 8

Page 18: Efficient Skyline Computation in  MapReduce

Cardinality (Anti-corr. data)

Dimensions: 3 Dimensions: 8

Page 19: Efficient Skyline Computation in  MapReduce

Number of Reducers

Page 20: Efficient Skyline Computation in  MapReduce

Summary

• Grid partitioning and bit strings
– Choose an appropriate number of partitions per dimension (PPD)
• Exploit independent groups to enable multiple reducers
– Good for cases with a large number of skyline tuples
– Merging independent groups
– Eliminate duplicate outputs