
Page 1: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel, v. 1.1

Tomasz Chodakowski,

1st Bristol Hadoop Workshop, 08-11-2010

Page 2: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Irregular Algorithms

● Map-reduce – a simplified model for “embarrassingly parallel” problems

– Easily separable into independent tasks

– Captured by static dependence graph

● Most graph algorithms are irregular, i.e.:

– Dependencies between tasks arise during execution

– “don't care non-determinism” - tasks can be executed in arbitrary order yet still yield correct results.

Page 3: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Irregular Algorithms

● Often operate on data structures with complex topologies:

– Graphs, trees, grids, ...

– Where “data elements” are connected by “relations”

● Computations on such structures depend strongly on relations between data elements

– primary source of dependencies between tasks

more in [ADP] “Amorphous Data-parallelism in Irregular Algorithms”

Page 4: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Relational Data

● Example relations between elements:

– social interactions (co-authorship, friendship)

– web links, document references

– linked data or semantic network relations

– geo-spatial relations

– ...

● Different from a relational model

– in that relations are arbitrary

Page 5: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Graph Algorithms Rough Classification

● Aggregation, feature extraction

– Not leveraging latent relations

● Network analysis (matrix-based, single relational)

– Geodesic (radius, diameter etc.)

– Spectral (eigenvector-based, centrality)

● Algorithmic/node-based algorithms

– Recommender systems, belief/label propagation

– Traversal, path detection, interaction networks, etc.

Page 6: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Iterative Vertex-based Graph Algorithms

● Iteratively:

– Compute local function of a vertex that depends on the vertex state and local graph structure (neighbourhood)

– and/or Modify local state

– and/or Modify local topology

– pass messages to neighbouring nodes

● -> “vertex-based computation”

● Amorphous Data-Parallelism [ADP] operator formulation:

“repeated application of neighbourhood operators in a specific order”
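A minimal sketch in Java of such a vertex-program contract may help fix the idea; all of the names here (VertexProgram, VertexContext, Message) are illustrative assumptions, not the API of any particular framework.

    // Minimal sketch of a vertex-centric ("think like a vertex") program.
    // All names are illustrative assumptions, not a specific framework's API.
    import java.util.List;

    interface Message { }

    interface VertexContext<S, M extends Message> {
        S getState();                            // local vertex state
        void setState(S newState);               // modify local state
        List<Long> getNeighbourIds();            // local graph structure (neighbourhood)
        void sendMessage(long targetId, M msg);  // signal neighbours for the next iteration
    }

    interface VertexProgram<S, M extends Message> {
        // Invoked once per active vertex per iteration/superstep,
        // with the messages received in the previous one.
        void compute(VertexContext<S, M> ctx, Iterable<M> incoming);
    }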

Page 7: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Recent applications/developments

● Google work on graph-based YouTube recommendations:

– Leveraging latent information

– Diffusing interest in sparsely labeled video clips

● User profiling, sentiment analysis

– Facebook likes, Hunch, Gravity, MusicMetric ...

Page 8: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: a directed graph labelled with positive integers, its structure split into two partitions (P1, P2). A time-space view shows workload and communication between the partitions over time; turquoise rectangles show the computational workload of a partition (work).]

Page 9: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: active vertices are shown in turquoise; signals being passed along relations are in light green; thick green lines show costly inter-partition communications (comm).]

Page 10: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the vertical grey line is a barrier synchronisation, which avoids race conditions (work, comm, barrier).]

Page 11: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: work, comm and barrier together form a BSP superstep. Vertices become active upon receiving a signal in the previous superstep.]

Page 12: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: after performing their local computation, the active vertices send signals to their neighbouring vertices.]

Page 13: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the superstep is closed by another barrier synchronisation (work, comm, barrier).]

Page 14: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the next superstep's local computation; newly activated vertices update their distances (work).]

Page 15: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the updated distances are signalled to neighbouring vertices (comm).]

Page 16: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: another barrier synchronisation closes the superstep.]

Page 17: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the final superstep's local computation (work).]

Page 18: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Single Source Shortest Path

[Figure: the complete time-space view of the run. Computation ends when there are no active vertices left.]
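The walkthrough above can be reproduced with a very small superstep loop. Below is a single-process sketch in Java over an assumed toy graph (the exact vertices and weights from the figures are not reproduced); it only illustrates the work/comm/barrier pattern, not the framework code shown later.

    import java.util.*;

    // Tiny illustration of BSP-style SSSP: in every superstep, active vertices
    // relax their distance and signal neighbours; the "barrier" is simply the
    // end of the loop body. The graph is an assumed example.
    public class SsspBspSketch {
        public static void main(String[] args) {
            // adjacency: source -> (target -> positive edge weight)
            Map<Integer, Map<Integer, Integer>> edges = new HashMap<>();
            edges.put(0, Map.of(1, 1, 2, 6, 3, 9));
            edges.put(1, Map.of(2, 4, 4, 3));
            edges.put(2, Map.of(5, 2));
            edges.put(3, Map.of(5, 1));
            edges.put(4, Map.of(5, 2));
            edges.put(5, Map.of());

            Map<Integer, Integer> dist = new HashMap<>();
            for (int v : edges.keySet()) dist.put(v, Integer.MAX_VALUE);

            // messages delivered at the start of the next superstep
            Map<Integer, List<Integer>> inbox = new HashMap<>();
            inbox.put(0, List.of(0));                 // activate the single source

            int superstep = 0;
            while (!inbox.isEmpty()) {                // no active vertices -> stop
                Map<Integer, List<Integer>> outbox = new HashMap<>();
                for (Map.Entry<Integer, List<Integer>> entry : inbox.entrySet()) {
                    int v = entry.getKey();           // "work": choose min. proposed distance
                    int minDist = Collections.min(entry.getValue());
                    if (minDist < dist.get(v)) {      // improvement: store and propagate
                        dist.put(v, minDist);
                        for (Map.Entry<Integer, Integer> e : edges.get(v).entrySet()) {
                            outbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                                  .add(minDist + e.getValue());   // "comm": signal neighbours
                        }
                    }
                }
                inbox = outbox;                       // "barrier": swap message queues
                superstep++;
                System.out.println("after superstep " + superstep + ": " + dist);
            }
        }
    }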

Page 19: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Bulk Synchronous Parallel

[Figure: time-space view of partitions P1, P2, ..., Pn over supersteps 0, 1, 2, 3, ..., with per-superstep work w0, w1, w2, w3, ..., communication h0, h1, h2, h3, ..., and barrier costs l0, l1, l2, l3, ...]

superstep n cost = w_n + h_n + l_n

i.e. the time to finish work on the slowest partition + the cost of bulk communication + the barrier synchronisation time.
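Summed over all supersteps this gives the total cost of a BSP computation. As a side note not on the slide, Valiant's bridging model [BSP] usually writes the communication term as h_n * g, with g the machine's per-message communication cost:

    % Total cost of a BSP computation with S supersteps.
    % w_n: work on the slowest partition in superstep n
    % h_n: largest number of messages sent or received by any partition
    % g:   per-message communication cost,  l_n: barrier synchronisation cost
    T = \sum_{n=0}^{S-1} \left( w_n + h_n \, g + l_n \right)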

Page 20: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Bulk Synchronous Parallel

● Advantages

– Simple and portable execution model

– Clear cost model

– No concurrency control, no data races, deadlocks, etc.

● Disadvantages

– Coarse-grained

● Depends on a large “parallel slack”

– Requires well-partitioned problem space for efficiency (well balanced partitions)

more in [BSP] “A bridging model for parallel computation”

Page 21: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Bulk Synchronous Parallel - extensions

● Combiners

– minimizing inter-node communication (h factor)

● Aggregators

– Computing global state (e.g. map/reduce)

And other extensions...
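For SSSP a combiner only needs to keep the minimum proposed distance per destination vertex, so each partition transmits at most one value per target instead of one per edge. A minimal sketch in Java; the combine/drain hooks are hypothetical and do not belong to any particular framework's API.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a message combiner for SSSP: many proposed distances to the same
    // destination collapse into their minimum before leaving the partition,
    // reducing the h (communication) term of the superstep cost.
    class DistanceCombiner {
        private final Map<Long, Integer> pending = new HashMap<>();

        // Called for every outgoing (destination, proposedDistance) pair.
        void combine(long destinationId, int proposedDistance) {
            pending.merge(destinationId, proposedDistance, Math::min);
        }

        // Called once at the end of the superstep's work phase: the combined
        // messages are what actually gets transmitted between partitions.
        Map<Long, Integer> drain() {
            Map<Long, Integer> out = new HashMap<>(pending);
            pending.clear();
            return out;
        }
    }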

Page 22: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Sample code

    public void superStep() {
        int minDist = this.isStartingElement() ? 0 : Integer.MAX_VALUE;
        for (DistanceMessage msg : messages()) {          // Choose min. proposed distance
            minDist = Math.min(minDist, msg.getDistance());
        }
        if (minDist < this.getCurrentDistance()) {        // If it improves the path, store and propagate
            this.setCurrentDistance(minDist);
            IVertex v = this.getElement();
            for (IEdge r : v.getOutgoingEdges(DemoRelationshipTypes.KNOWS)) {
                IElement recipient = r.getOtherElement(v);
                int rDist = this.getLengthOf(r);
                this.sendMessage(new DistanceMessage(minDist + rDist, recipient.getId()));
            }
        }
    }
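Note that the vertex code above only reads and writes its own state and sends messages along its outgoing edges; it never touches another vertex directly. That locality is what lets the runtime execute the active vertices of a superstep in any order, without concurrency control – the “don't care non-determinism” of the irregular-algorithms slide.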

Page 23: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

SSSP - Map-Reduce Naive

● Idea [DPMR]:

– In map phase:

● emit both signals and local vertex structure and state

– In reduce phase:

● gather signals and local vertex structure messages

● reconstruct vertex structure and state

Page 24: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

SSSP - Map-Reduce Naive

    def map(Id nId, Node N):
        // emit state and structure
        emit(nId, N.graphStateAndStruct)
        if (N.isActive):
            for (nbr : N.adjacencyL):
                // local computation
                dist := N.currDist + distToNbr
                // emit signals
                emit(nbr.id, dist)

    def reduce(Id rId, {m1, m2, ...}):
        new M; M.deactivate
        minDist := MAX_VALUE
        for (m in {m1, m2, ...}):
            if (m is Node): M := m           // state
            else if (m is Distance):         // signals
                minDist := min(minDist, m)
        if (M.currDist > minDist):
            M.currDist := minDist
            M.activate
        emit(rId, M)
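Each such map/reduce pass plays the role of one superstep, so a driver program has to resubmit the job until no vertex is reactivated. A hedged sketch of such a driver follows; SsspMapper, SsspReducer (given only as empty stubs standing in for the pseudocode above) and the SSSP/ACTIVE_VERTICES counter are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Iterative driver: one MapReduce job per "superstep", repeated until the
    // reducers report that no vertex was (re)activated.
    public class SsspNaiveDriver {

        // Bodies would implement the pseudocode above; omitted in this sketch.
        public static class SsspMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
        public static class SsspReducer extends Reducer<LongWritable, Text, LongWritable, Text> { }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String input = args[0];                        // initial graph on HDFS
            int iteration = 0;
            long active;
            do {
                String output = args[1] + "/iter-" + iteration;
                Job job = Job.getInstance(conf, "sssp-naive-" + iteration);
                job.setJarByClass(SsspNaiveDriver.class);
                job.setMapperClass(SsspMapper.class);
                job.setReducerClass(SsspReducer.class);
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(input));
                FileOutputFormat.setOutputPath(job, new Path(output));
                if (!job.waitForCompletion(true)) System.exit(1);

                // Reducers would bump this counter whenever they re-activate a vertex.
                active = job.getCounters()
                            .findCounter("SSSP", "ACTIVE_VERTICES").getValue();
                input = output;                            // next pass reads this pass's output
                iteration++;
            } while (active > 0);
        }
    }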

Page 25: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

SSSP - Map-Reduce Naive - Issues

● Cost associated with marshaling intermediate <k,v> pairs for combiners (which are optional)

– -> in-line combiner

● Need to pass the whole graph state and structure around

– -> “Shimmy trick” -- pin down the structure

● Partitions vertices without regard to graph topology

– -> cluster highly connected components together

Page 26: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Inline Combiners

● In job configure:

– Initialize a map<NodeId, Distance>;

● In job map operation:

– Do not emit interm. pairs ( emit(nbr.id, dist) ) ;

– Store them in the local map;

– Combine values in the same slots.

● In job close:

– Emit a value from each slot in the map to a corresponding neighbour

● emit(nbr.id, map[nbr.id])
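In Hadoop terms this is the in-mapper combining pattern: the map lives for the lifetime of the mapper task (configure/setup), values are folded into it during map, and it is flushed in close/cleanup. A sketch with LongWritable node ids and distances; the input record layout is assumed and its parsing elided.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-mapper ("inline") combiner sketch: proposed distances are folded into a
    // task-local map instead of being emitted once per edge, and only the minimum
    // per neighbour is written out when the task closes.
    public class InMapperCombinerSketch
            extends Mapper<LongWritable, Text, LongWritable, LongWritable> {

        private final Map<Long, Long> bestDistance = new HashMap<>();

        @Override
        protected void setup(Context context) {
            bestDistance.clear();              // "configure": start with an empty map
        }

        @Override
        protected void map(LongWritable nodeId, Text nodeRecord, Context context)
                throws IOException, InterruptedException {
            // 1. emit the node's structure/state unchanged (parsing elided),
            // 2. for each outgoing edge compute dist = currDist + edgeWeight and,
            //    instead of context.write(neighbourId, dist), call propose(neighbourId, dist).
        }

        private void propose(long neighbourId, long dist) {
            // combine values landing in the same slot: keep the minimum
            bestDistance.merge(neighbourId, dist, Math::min);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // "close": flush one combined value per slot to the corresponding neighbour
            for (Map.Entry<Long, Long> e : bestDistance.entrySet()) {
                context.write(new LongWritable(e.getKey()), new LongWritable(e.getValue()));
            }
        }
    }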

Page 27: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

“Shimmy trick”

● Store graph structure in a file system (no shuffle)

● Inspired by a parallel merge join

[Figure: parallel merge join. One dataset is sorted by join key, the other is sorted and partitioned by join key; matching partitions p1, p2, p3 are joined pairwise.]

Page 28: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

“Shimmy trick”

● Assume:

– Graph G representation sorted by node ids;

– G partitioned into n parts: G1, G2, ..., Gn

– Use the same partitioner as in MR

– Set the number of reducers to n

● The above gives us:

– Reducer Ri receives the same intermediate keys as those in graph partition Gi (in sorted order).
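The alignment works because the stored graph partitions G1..Gn and the job's intermediate keys are split by the same rule. With Hadoop's default HashPartitioner that rule is simply the key's hash modulo the number of reducers; a sketch of sharing that rule follows (the helper class is hypothetical, and it assumes LongWritable-style node-id keys).

    // The same rule must be used both when writing the graph partition files
    // G1..Gn and by the job's partitioner, so that reducer Ri sees exactly the
    // keys stored in Gi. This mirrors Hadoop's default HashPartitioner logic.
    final class PartitionRule {
        private PartitionRule() { }

        static int partitionFor(long nodeId, int numReducers) {
            // non-negative hash modulo reducer count
            return (Long.hashCode(nodeId) & Integer.MAX_VALUE) % numReducers;
        }
    }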

Page 29: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

“Shimmy trick”

    def configure():
        P.openGraphPartition()

    def reduce(Id rId, {m1, m2, ...}):
        repeat:
            (Id nId, Node N) <- P.read()
            if (nId != rId): N.deactivate; emit(nId, N)
        until: nId == rId
        minDist := MAX_VALUE
        for (m in {m1, m2, ...}):
            minDist := min(minDist, m)
        if (N.currDist > minDist):
            N.currDist := minDist
            N.activate
        emit(rId, N)

    def close():
        repeat:
            (Id nId, Node N) <- P.read()
            N.deactivate
            emit(nId, N)

Page 30: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

“Shimmy trick”

● Improvements:

– Files containing graph structure reside on dfs

– Reducers arbitrarily assigned to cluster machines

● -> remote reads.

● -> change the scheduler to assign key ranges to the same machines consistently.

Page 31: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Topology-aware Partitioner

● Choose a partitioner that:

– minimizes inter-block traffic;

– maximizes intra-block traffic;

– places adjacent nodes in the same block

● Difficult to achieve, particularly with many real-world datasets:

– Power-law distributions

– It is reported that state-of-the-art partitioners (e.g. ParMETIS) fail for such cases (???)
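As a contrast to a real graph partitioner, a very crude approximation is a range partitioner: if node ids were assigned so that neighbours tend to have nearby ids, contiguous id blocks go to the same reducer. The sketch below is hypothetical (MAX_NODE_ID and the id assignment are assumptions); a genuine topology-aware split would come from a tool such as METIS/ParMETIS.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Deliberately simple stand-in for a topology-aware partitioner: contiguous
    // node-id ranges go to the same partition, so adjacent nodes with nearby ids
    // tend to stay together.
    public class RangePartitioner extends Partitioner<LongWritable, Text> {
        private static final long MAX_NODE_ID = 1_000_000L;   // assumption for illustration

        @Override
        public int getPartition(LongWritable nodeId, Text value, int numPartitions) {
            long blockSize = (MAX_NODE_ID + numPartitions - 1) / numPartitions; // ceiling division
            int block = (int) (nodeId.get() / blockSize);
            return Math.min(block, numPartitions - 1);          // clamp the last range
        }
    }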

Page 32: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

MR Graph Processing Design Pattern

● [DPMR] reports a 60%-70% improvement over the naive implementation

● Solution closely resembles the BSP model

Page 33: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

BSP (inspired) implementations

● Google Pregel

– classic BSP, C++, production

● CMU GraphLab

– inspired by BSP, Java, multi-core

– consistency models, custom schedulers

● Apache Hama

– scientific computation package that runs on top of Hadoop, BSP, MS Dryad (?)

● Signal/Collect (Zurich University)

– Scala, not yet distributed

● ...

Page 34: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

Open questions

● What problems are particularly suitable for MR and which ones for BSP – where are the boundaries?

– Topology-based centrality algorithms (PageRank):

● Algebraic, matrix-based methods vs. vertex-based ones?

● When considering graph algorithms:

– MR user base vs. BSP ergonomics?

– Performance overheads?

● Relaxing the BSP synchronous schedule ->

“Amorphous data parallelism”

Page 35: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

POC, Sample Code

● Project Masuria (early stages, 2011-02)

– http://masuria-project.org/

– As much a POC of a BSP framework as it is a (distributed) OSGi playground.

● Sample code:

– https://github.com/tch/Cloud9 *

– [email protected]:tch_sandbox.git

– RunSSSPNaive.java

– RunSSSPShimmy.java *

* - expect (my) bugs

Based on the Cloud9 library by Jimmy Lin and Michael Schatz

Page 36: Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

References

● [ADP] “Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al.

● [BSP] “A bridging model for parallel computation”, Leslie G. Valiant

● [DPMR] “Design Patterns for Efficient Graph Algorithms in MapReduce”, Jimmy Lin and Michael Schatz