parallelizing multiple group by query in shared-nothing ... · introduction related works multiple...

39
Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Fu Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Fr´ e e ric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems Department (MAS) 2 SAP BusinessObjects Innovation Center June 22, 2010 1 / 39

Upload: others

Post on 18-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Parallelizing Multiple Group by Query in Shared-nothingEnvironment: A MapReduce Study Case

PAN Jie1 Yann LE BIANNIC 2 Frederic MAGOULES 1

1Ecole Centrale Paris-Applied Mathematics and Systems Department (MAS)

2SAP BusinessObjects Innovation Center

June 22, 2010

1 / 39

Page 2: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

2 / 39

Page 3: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

3 / 39

Page 4: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Introduction

Objective:

Cheap solution for multidimensional data analysis applications

Challenges:

Large data volume

Short response time (ex. within 5 seconds)

Use cheap commodity computers → scalability and fault-tolerance

Advantages of MapReduce:

Automatic scalability and fault-tolerance

Minimize the need of locking

Wide range of applicative environments: cluster, multi-core hardware,

4 / 39

Page 5: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

5 / 39

Page 6: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Related works

Related works

MapReduce’s implementations: GridGain, Hadoop...

Data query languages based on MapReduce: PigLatin, HiveQL

MapReduce-based analytic data warehouses: Greenplum, Aster

Combination of MapReduce and parallel databases: HadoopDB

This work focuses on:

MapReduce-based multiple group by query

6 / 39

Page 7: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

7 / 39

Page 8: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Multiple Group by Query

Multiple group-by query’s form:

select X, SUM(*), from R where condition group by X

Example:

select B,sum(A) from R where I>i group by B;

select C,sum(A) from R where I>i group by C;

select D,sum(A) from R where I>i group by D;

select E,sum(A) from R where I>i group by E;

select F,sum(A) from R where I>i group by F;

...

8 / 39

Page 9: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

9 / 39

Page 10: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Dataset used in multidimensional data analysis applications

10 / 39

Page 11: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Dataset used in multidimensional data analysis applications

11 / 39

Page 12: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Data partitioning

Horizontal Partitioning

Putting different entire rows into different partitions.

In our work : 1 horizontal partition = a certain number of entire rows (13Dimensions + 2 Measures).

Vertical Partitioning

Putting different columns into different partitions.

In our work : 1 vertical partition = 1 Dimension + 2 Measures .

12 / 39

Page 13: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Data Distribution

With horizontal partitioning:

Each worker node holds equal number of partitions (in this work, 1, 2, 10and 20 partitions/worker)

With vertical partitioning:

Small worker number: replicate all partitions over every worker node

Large worker number:1 partitions are divided into regions2 worker nodes are grouped into regions3 replicate the vertical partitions over nodes within region (hybrid partitioning)

13 / 39

Page 14: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Data Distribution under Vertical Partitioning

Figure: 3 partitions in 1 region vs 3 partitions in 3 regions

14 / 39

Page 15: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Indexation: Lucene Inverted Index

Figure: Lucene Index

15 / 39

Page 16: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

16 / 39

Page 17: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

GridGain’s MapReduce Framework

Multiple mappers one reducer

Figure: GridGain’s MapReduce framework

17 / 39

Page 18: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

GridGain MapReduce Data Flow

Figure: Data flow of GridGain MapReduce

18 / 39

Page 19: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Master’s Processings: Start-up and Closure

Figure: Start-up and closure processing on Master node

19 / 39

Page 20: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Workers’ Processings

Figure: The processing on worker node

20 / 39

Page 21: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Mapper’s Calculations under Horizontal Partitioning

Mapper: Filter + Aggregate over all the group by columns ⇀ List<FacetCollector1>

11 FacetCollector: aggregated values over 1 column’s distinct values21 / 39

Page 22: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Mapper’s Calculations under Horizontal Partitioning

Mapper: Filter + Aggregate over all the group by columns ⇀ List<FacetCollector1>

11 FacetCollector is a set of aggregated measure values over 1 dimension’s distinct values22 / 39

Page 23: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Reducer’s Calculations under Horizontal Partitioning

Reducer: aggregate over List<List<FacetCollector>>

23 / 39

Page 24: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Mapper’s Calculations under Vertical Partitioning

Filter + Aggregate over one group by column ⇀ FacetCollector

24 / 39

Page 25: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Reducer’s Calculations under Vertical Partitioning

Reducer: aggregate over List<FacetCollector>

25 / 39

Page 26: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

26 / 39

Page 27: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Environment Configuration

Softwares:GridGain version 2.1.1Java version 1.6.0JVM heap 1536Mmaximum

Hardwares: cluster ofaGrid5000’s Sophia site

CPU 2 single-coreat 2.0GHz

Memory 2GB RAMDisk 80GB

aAn infrastructure distributed in 9 sites around France,

for research in large-scale parallel and distributed systems(https://www.grid5000.fr/)

Data set1.2GB10 000 000 rows15 columns = 13 D + 2 M

Data preparationsHorizontal partitioning or verticalpartitioningIndexed with Lucene

MG-Query selectivities:1.05%, 9.96%, 18.6%, 43.1%

Aggregate over:5 group by columns

27 / 39

Page 28: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Speed-up Measurements: MapReduce-based IMPL. over HorizontallyPartitioned Data

Figure: Speed-up with horizontally partitioned data withoutcombiner.

Query SequentialExecution TimeQuery Executionselectivity time(ms)0.0105 8690.0996 12590.186 20970.431 4098

Observations:lrg. slectivity> sm. selectivity

sm. nbjob/node >lrg. nbjob/node

28 / 39

Page 29: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Speed-up Measurements: MapCombineReduce-based IMPL. overHorizontally Partitioned data

Figure: Speed-up with horizontally partitioned data withcombiner.

About combiner:Function: pre-aggregating

Advantage: reducecommunication cost

Observations:For sm. job no. (eg. 1, 2),

MR > MCR

For lrg. job no. (eg. 10, 20),

MCR > MR

29 / 39

Page 30: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Speed-up measurements: Vertically Partitioned Data

Figure: Speed-up with vertically partitioned data.

30 / 39

Page 31: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Speed-up measurements: Vertically Partitioned Data

Observations:

speed-up ↗ when worker nb ↗MR-based IMPL. > MCR-based IMPL.

lrg. selectivity queries > sm. selectivity queries

31 / 39

Page 32: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Performance Affecting Factors

Selectivity decides the workload of aggregation.When // aggregation, lrg. selectivity queries > sm. selectivity queries

nbjob/nodeRunning multiple mappers on one node may degrade performance because ofresource contention.

Job number 1 Mapper’son Average Execution

1 node time (ms)1 1702 2083 3544 4175 537

Table:

Query selectivity=0.01, vertical partitioning, 3 regions

32 / 39

Page 33: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

33 / 39

Page 34: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Cost Estimation Model (MapReduce-based IMPL.)

Startup cost (Master)Mapping jobs to nodes: Cm × nbmapper

Serialization: Cs × sizemapper × nbmapper

Coststartup = Cm · nbmapper + Cs · sizemapper · nbmapper

Each Mapper Job (Worker)De-serialization: Cd × sizemapper

Mapper cost: Costmapper

Serialization: Cs · sizeintermediate

Costworker = f (nbmapper/node) · (Cd · sizemapper + Costmapper + Cs · sizeintermediate)

Closure cost (Master)

De-serialization: Cd ·Pnbmapper

i=1 sizeintermediate i

Reducer: Costreducer

Costclosure = Cd ·Pnbmapper

i=1 sizeintermediate i + Costreducer

CommunicationCostcomm = Cn · (nbmapper · sizemapper +

Pnbmapper

i=1 sizeintermediate i )

34 / 39

Page 35: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Cost Estimation for Horizontal Partitioning-based IMPL.

Mapper CostCostmapper = Selectivity · sizeblock · [Cslct + (D + M) · Cread + M · nbGB · Cadd ]

Intermediate output sizesizeintermediate =

PnbGBi=1 nbDVi · sizeaggregate

Reducer costCostreducer = Cadd ·M · nbblock ·

PnbGBi=1 nbDVi

Total costCosthp = Selectivity · sizeblock · f (nbmapper/node) · [Cslct +(M +D)Cread +M ·nbGB ·Cadd ]+

PnbGBi=1 nbDVi ·[f (nbmapper/node)·(Cs +Cd +Cn)·sizeaggregate+Cadd ·M ·nbblock ]

35 / 39

Page 36: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Cost Estimation for Vertical Partitioning-based IMPL.

Mapper costCostmapper = Selectivity · sizeregion · [Cslct + (M + 1)Cread + M · Cadd ]

Intermediate output sizesizeintermediate = nbDVcurrentGB · sizeaggregate

Reducer costCostreducer = Cadd ·M · nbregion ·

PnbGBi=1 nbDVi

Total costCostvp = f (nbmapper/node) · Selectivity · sizeregion · [Cslct + (M + 1) · Cread + M ·Cadd)]f (nbmapper/node) · Cs · nbDVcurrentGB · sizeaggregate + (Cd + Cadd ·M + Cn) ·nbregion ·

PnbGBi=1 nbDVi

36 / 39

Page 37: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Outline

1 Introduction

2 Related works

3 Multiple Group by Query

4 Data Preparation

5 MapReduce jobs

6 Experimental Results

7 Cost Estimation Model

8 Conclusions and Future works

37 / 39

Page 38: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Conclusions

We realized multiple group by query with MapReduce (GridGain).

Prepare data by horizontal and vertical partitioning and indexing.

Measure speedup performance on Grid’5000.

Formal cost analysis.

Larger job number 6= better performance.

Performance affecting factors: selectivity and nbjob/node

38 / 39

Page 39: Parallelizing Multiple Group by Query in Shared-nothing ... · Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation

Introduction Related works Multiple Group by Query Data Preparation MapReduce jobs Experimental Results Cost Estimation Model Conclusions and Future works

Future works

Find more performance affecting factors (e.g. filtered data distribution).

Intermediate data compression.

Separate the index searching from filter and aggregate; Determine mappernumber on the fly, considering query selectivity.

nbjob/node factor to be examined on multi-core hardware.

39 / 39