
Experimental Evaluation of Multi-Round Matrix Multiplication on MapReduce

Matteo Ceccarello, Francesco Silvestri

Dip. di Ingegneria dell'Informazione, Università di Padova, Italy

5/1/2015 - ALENEX 2015

Why matrix multiplication?

- MapReduce is a common choice when dealing with Big Data
- The computation is organized in rounds:

  Map → Shuffle → Reduce

Monolithic algorithms: execute in a constant number of rounds.

Multi-round algorithms: the number of rounds depends on the input size.
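The three phases of a round can be sketched with a small in-memory simulation (illustrative Python, not part of the talk; the function names are hypothetical):

```python
from collections import defaultdict

def run_round(pairs, map_fn, reduce_fn):
    """Simulate one MapReduce round: Map -> Shuffle -> Reduce."""
    # Map: each input pair may emit any number of intermediate pairs
    intermediate = [out for kv in pairs for out in map_fn(*kv)]
    # Shuffle: group the intermediate pairs by key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce: each reducer sees one key together with all its values
    return [out for k, vs in sorted(groups.items()) for out in reduce_fn(k, vs)]

# Example: word count in a single round
docs = [(0, "a b a"), (1, "b c")]
counts = run_round(docs,
                   map_fn=lambda _, text: [(w, 1) for w in text.split()],
                   reduce_fn=lambda w, ones: [(w, sum(ones))])
# counts -> [('a', 2), ('b', 2), ('c', 1)]
```

A multi-round algorithm simply chains several such rounds, feeding the output of one round's reducers into the next round's mappers.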

Why matrix multiplication?

Not a performance benchmark: we are interested in investigating the behaviour of multi-round algorithms w.r.t. monolithic ones.

Why multi-round?

- Service market: start/stop computations while waiting for better prices
- Resource requirements: monolithic algorithms may require too many resources
- Fault tolerance: easier checkpointing

Outline

1. MapReduce
2. Our results
3. Algorithm
4. Implementation: Map and Reduce; Partitioner
5. Experiments: in-house cluster; Amazon EMR
6. Conclusions

Previous work

- Original work
  - "MapReduce: Simplified Data Processing on Large Clusters" [Dean, Ghemawat 2004]
- Models
  - "A model of computation for MapReduce" [Karloff, Suri, Vassilvitskii 2010]
  - "Sorting, searching, and simulation in the MapReduce framework" [Goodrich, Sitchinava, Zhang 2011]
  - "Space-round tradeoffs for MapReduce computations" [Pietracaprina, Pucci, Riondato, Silvestri, Upfal 2012]
- Experimental work
  - "HAMA: An Efficient Matrix Computation with the MapReduce Framework" [Sangwon, Yoon, Jaehong, Seongwook, Jin-Soo, Seungryoul]

MapReduce - Model

MR(m, ρ) [Pietracaprina+ 2012]

- m, local memory: constrains the amount of local work
- ρ, replication: constrains the communication in a single round
- The complexity is measured in number of rounds

MapReduce - Round

[Figure: structure of a single MapReduce round (Map → Shuffle → Reduce)]

Our results

- M3 (Matrix Multiplication on MapReduce): a Hadoop library for dense and sparse matrix multiplication
- Multi-round algorithms have performance comparable with monolithic ones
- Concentrating only on round number for performance evaluation may not be the best strategy

Algorithm: 3D dense

Data representation

- Input matrix: √n × √n, in blocks of size √m × √m
- Input represented as a collection of pairs 〈(i, ℓ, j); A_ij〉 for 0 ≤ i, j < √(n/m)

Algorithm

- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- There are (n/m)^(3/2) block products, partitioned into √(n/m) groups
- Round r: groups G_ℓ for rρ ≤ ℓ < (r+1)ρ
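The grouped sum above can be sketched as a toy in-memory simulation (illustrative Python, not the M3 Hadoop code; blocks are single numbers, i.e. block size 1, to keep the sketch short):

```python
def multiround_mm(A, B, rho):
    """Multiply q x q block matrices over q groups, rho groups per round."""
    q = len(A)                      # q = sqrt(n/m) blocks per dimension
    C = [[0] * q for _ in range(q)]
    rounds = 0
    for start in range(0, q, rho):  # round r handles groups [r*rho, (r+1)*rho)
        for h in range(start, min(start + rho, q)):
            for i in range(q):
                for j in range(q):
                    # accumulate C_ij = sum_h A_ih * B_hj
                    C[i][j] += A[i][h] * B[h][j]
        rounds += 1
    return C, rounds

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C1, r1 = multiround_mm(A, B, rho=1)   # multi-round: one group per round
C2, r2 = multiround_mm(A, B, rho=2)   # monolithic: all groups in one round
# C1 == C2 == [[19, 22], [43, 50]]; r1 == 2, r2 == 1
```

Both settings produce the same product; only the number of rounds (and hence the per-round communication) changes.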

Algorithm: 3D dense

- C_ij = Σ_{h=0}^{√(n/m)−1} A_ih · B_hj
- (n/m)^(3/2) products ⇒ √(n/m) groups
- Round r: groups ℓ ∈ [rρ, (r+1)ρ)

Complexity

Theorem 2 [Pietracaprina+ 2012]

The above MR(m, ρ) algorithm multiplies two √n × √n dense matrices in

  R = √n / (ρ√m) + 1

rounds.

ρ = √(n/m) ⇒ monolithic algorithm
ρ = 1 ⇒ multi-round algorithm
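Plugging sample sizes into the bound makes the two extremes concrete (the sizes below are illustrative, not measurements from the talk):

```python
import math

def rounds(n, m, rho):
    """Round bound R = sqrt(n)/(rho*sqrt(m)) + 1 from Theorem 2."""
    return math.ceil(math.sqrt(n) / (rho * math.sqrt(m))) + 1

n, m = 2**20, 2**10            # sqrt(n/m) = 32 block-rows
multi = rounds(n, m, 1)        # rho = 1            -> 33 rounds
mono  = rounds(n, m, 32)       # rho = sqrt(n/m)    -> 2 rounds
```

With ρ = 1 the replication (and so the per-round shuffle volume) is minimal at the cost of √(n/m) extra rounds; with ρ = √(n/m) the algorithm becomes monolithic.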

Algorithm: 2D dense (HAMA)

- C_i,j = A_i · B_j (block-row i of A times block-column j of B)
- The round complexity is higher:

  R = n / (ρm)  vs.  √n / (ρ√m) + 1
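A quick numeric comparison of the two bounds (illustrative sizes, not from the paper):

```python
import math

def rounds_2d(n, m, rho):
    """2D (HAMA-style) round bound: n / (rho * m)."""
    return math.ceil(n / (rho * m))

def rounds_3d(n, m, rho):
    """3D round bound: sqrt(n)/(rho*sqrt(m)) + 1."""
    return math.ceil(math.sqrt(n) / (rho * math.sqrt(m))) + 1

n, m, rho = 2**20, 2**10, 1
r_2d = rounds_2d(n, m, rho)    # 1024 rounds
r_3d = rounds_3d(n, m, rho)    # 33 rounds
```

At equal replication the 3D scheme needs asymptotically fewer rounds, which is what the Q5 experiment later confirms.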

Algorithm: 3D sparse

- Applies to Erdős–Rényi matrices
- The density is δ
- Adaptation of the 3D algorithm
- Can be extended to generic sparse matrices

Theorem

The above algorithm requires

  R = δn^(3/4) / (ρ√m) + 1

rounds, the expected shuffle size is 3ρδ²n^(3/2), and the expected reducer size is 3m.

Implementation - Partitioner

[Figure: number of pairs per reducer ID (0–31) for the naive and 3D partitioners]

3D partitioner:

  (i, h, j) → z = iρ√(n/m) + jρ + (h mod ρ)

where ρ = replication and m = subproblem size.
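A sketch of this partitioner (illustrative Python, not the M3 Hadoop Partitioner; the √(n/m) factor in the formula is reconstructed from the reducer-grid layout and is an assumption). For each round it maps every block product (i, h, j) to a distinct reducer in [0, ρ·(n/m)), so the load spreads evenly across reducers:

```python
def reducer_id(i, h, j, rho, q):
    """Reducer for block product (i, h, j); q = sqrt(n/m) blocks per side.
    NOTE: sketch of the reconstructed formula z = i*rho*q + j*rho + (h mod rho)."""
    return i * rho * q + j * rho + (h % rho)

q, rho = 4, 2   # 4x4 block grid, 2 groups per round
# All products handled in one round (groups 0 .. rho-1):
ids = {reducer_id(i, h, j, rho, q)
       for i in range(q) for j in range(q) for h in range(rho)}
# ids covers exactly the reducer IDs 0 .. rho*q*q - 1 = 0 .. 31
```

With q = 4 and ρ = 2 this yields exactly the 32 reducer IDs (0–31) shown on the x-axis of the figure, one product per reducer.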

Questions

Q1: How does the subproblem size affect the performance?

Q2: Does replication affect performance?

Q3: Which are the major factors affecting the running time?

Q4: Does the algorithm scale efficiently?

Q5: Is 3D better than 2D?

Q6: Does the sparse algorithm efficiently exploit the input sparsity?

Experimental setup

In-house cluster

- 16 machines
- 4-core Intel i7 Nehalem @ 3.07 GHz
- 12 GB RAM
- 6 disks of 1 TB at 7200 RPM in RAID0

Amazon EMR

- c3.8xlarge: compute-optimized instances
- i2.xlarge: storage-optimized instances

Q1: Time vs. subproblem size

Increasing the subproblem size reduces the time.

[Figure: running time (s) vs. local memory size, max/min series for 16000 and 32000 matrices]

Q2: Time vs. replication

Multi-round algorithms are comparable with monolithic ones.

[Figure: replication vs. time (s), 32k × 32k matrices]

Q3: Components cost

The most expensive component is communication.

[Figure: time (s) by component (communication, computation, infrastructure) vs. replication (inverse of the number of rounds), 32k × 32k matrices]

Q4: Scalability

The algorithm scales well with the number of machines.

[Figure: time (s) vs. number of machines (4–16), 16k × 16k matrices, replication 1, 2, 4]

Q5: 3D vs. 2D

The 3D algorithm is faster than the 2D approach, as predicted by the analysis.

[Figure: time (s) vs. replication for the 2D and 3D algorithms, 16k × 16k matrices]

Q6: Sparse

We can effectively leverage the sparsity of input matrices.

[Figure: time (s) vs. replication for sparse matrices of dimension 2^20, 2^22, 2^24]

A sneak peek at the future

Changing the platform makes space/round tradeoffs even more evident.

[Figure: time (s) vs. replication for multiplication of 64k × 64k matrices on Spark]

Conclusions

Results

- Multi-round algorithms can be comparable with monolithic ones
- Focusing only on round number may not lead to the best performance if it implies a large amount of communication
- Open source implementation: M3 - Matrix Multiplication on MapReduce, http://www.dei.unipd.it/m3

Ongoing work

- Test algorithms on other MapReduce implementations (Spark)