design patterns for efficient graph algorithms in...

29
Design Patterns for Efficient Graph Algorithms in MapReduce Algorithms in MapReduce Jimmy Lin and Michael Schatz Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Upload: others

Post on 15-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Design Patterns for Efficient Graph Algorithms in MapReduceAlgorithms in MapReduce

Jimmy Lin and Michael SchatzJimmy Lin and Michael SchatzUniversity of Maryland

Tuesday, June 29, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Page 2: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

@lintool

Page 3: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Talk OutlineGraph algorithms

Graph algorithms in MapReduceG ap a go t s ap educe

Making it efficient

Experimental resultsExperimental results

Page 4: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

What’s a graph?G = (V, E), where

V represents the set of vertices (nodes)E represents the set of edges (links)Both vertices and edges may contain additional information

Graphs are everywhere:Graphs are everywhere:E.g., hyperlink structure of the web, interstate highway system, social networks, etc.

Graph problems are everywhere:E.g., random walks, shortest paths, MST, max flow, bipartite matching clustering etcmatching, clustering, etc.

Page 5: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Source: Wikipedia (Königsberg)

Page 6: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Graph RepresentationG = (V, E)

Typically represented as adjacency lists:yp ca y ep ese ted as adjace cy stsEach node is associated with its neighbors (via outgoing edges)

1

2

1: 2, 41

3

,2: 1, 3, 43: 1

4

4: 1, 3

Page 7: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

“Message Passing” Graph AlgorithmsLarge class of iterative algorithms on sparse, directed graphs

At each iteration:Computations at each vertexPartial results (“messages”) passed (usually) along directed edgesComputations at each vertex: messages aggregate to alter state

Iterate until convergenceIterate until convergence

Page 8: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

A Few Examples…Parallel breadth-first search (SSSP)

Messages are distances from sourceEach node emits current distance + 1Aggregation = MIN

PageRankPageRankMessages are partial PageRank massEach node evenly distributes mass to neighborsAggregation = SUM

DNA Sequence assemblyMichael Schatz’s dissertation

Page 9: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

PageRank in a nutshell….Random surfer model:

User starts at a random Web pageUser randomly clicks on links, surfing from page to pageWith some probability, user randomly jumps around

PageRankPageRank…Characterizes the amount of time spent on any given pageMathematically, a probability distribution over pages

Page 10: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

PageRank: DefinedGiven page x with inlinks t1…tn, where

C(t) is the out-degree of tα is probability of random jumpN is the total number of nodes in the graph

⎞⎛ n tPR )(1 ∑=

−+⎟⎠⎞

⎜⎝⎛=

i i

i

tCtPR

NxPR

1 )()()1(1)( αα

X

t1

tt2

tn

Page 11: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Sample PageRank Iteration (1)

n2 (0.2)

0 1

n2 (0.166)Iteration 1

n1 (0.2) 0.1

0.1

0.1 0.1

0.066 0.0660.066

n1 (0.066)

n4 (0.2)

n3 (0.2)n5 (0.2)

0.2 0.2

n4 (0.3)

n3 (0.166)n5 (0.3)

Page 12: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Sample PageRank Iteration (2)

n2 (0.166)

0 033 0 083

n2 (0.133)Iteration 2

n1 (0.066)0.033

0.033

0.083 0.083

0.1 0.10.1

n1 (0.1)

n4 (0.3)

n3 (0.166)n5 (0.3)

0.3 0.166

n4 (0.2)

n3 (0.183)n5 (0.383)

Page 13: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

PageRank in MapReduce

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Mapn2 n4 n3 n5 n1 n2 n3n4 n5

Map

n2 n4n3 n5n1 n2 n3 n4 n5

Reduce

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Page 14: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

PageRank Pseudo-Code

Page 15: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Why don’t distributed algorithms scale?

Page 16: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Source: http://www.flickr.com/photos/fusedforces/4324320625/

Page 17: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Three Design PatternsIn-mapper combining: efficient local aggregation

Smarter partitioning: create more opportunitiesS a te pa t t o g c eate o e oppo tu t es

Schimmy: avoid shuffling the graph

Page 18: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

In-Mapper CombiningUse combiners

Perform local aggregation on map outputDownside: intermediate data is still materialized

Better: in-mapper combiningPreserve state across multiple map calls, aggregate messages in buffer, emit buffer contents at endDownside: requires memory management

buffer

configure

map

close

Page 19: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Better PartitioningDefault: hash partitioning

Randomly assign nodes to partitions

Observation: many graphs exhibit local structureE.g., communities in social networksBetter partitioning creates more opportunities for local aggregation

Unfortunately… partitioning is hard!Sometimes chick and eggSometimes, chick-and-eggBut in some domains (e.g., webgraphs) take advantage of cheap heuristicsFor webgraphs: range partition on domain-sorted URLs

Page 20: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Schimmy Design PatternBasic implementation contains two dataflows:

Messages (actual computations)Graph structure (“bookkeeping”)

Schimmy: separate the two data flows, shuffle only the messagesmessages

Basic idea: merge join between graph structure and messages

both relations consistently partitioned and sorted by join key

S TS1 T1 S2 T2 S3 T3

both relations consistently partitioned and sorted by join key

Page 21: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Do the Schimmy!Schimmy = reduce side parallel merge join between graph structure and messages

Consistent partitioning between input and intermediate dataMappers emit only messages (actual computation)Reducers read graph structure directly from HDFSReducers read graph structure directly from HDFS

intermediate data(messages)

intermediate data(messages)

intermediate data(messages)

from HDFS(graph structure)

from HDFS(graph structure)

from HDFS(graph structure)

S1 T1 S2 T2 S3 T3

ReducerReducerReducer

Page 22: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

ExperimentsCluster setup:

10 workers, each 2 cores (3.2 GHz Xeon), 4GB RAM, 367 GB diskHadoop 0.20.0 on RHELS 5.3

Dataset:First English segment of ClueWeb09 collection50.2m web pages (1.53 TB uncompressed, 247 GB compressed)Extracted webgraph: 1.4 billion links, 7.0 GBDataset arranged in crawl order

Setup:Measured per-iteration running time (5 iterations)100 partitions

Page 23: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Results

“Best Practices”

Page 24: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Results

+18%1.4b

674m

Page 25: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Results

+18%

-15%

1.4b

674m

Page 26: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Results

+18%

-15%

1.4b

674m

-60%86m

Page 27: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Results

+18%

-15%

1.4b

674m

-60%-69%86m

Page 28: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Take-Away MessagesLots of interesting graph problems!

Social network analysisBioinformatics

Reducing intermediate data is keyLocal aggregationBetter partitioningLess bookkeeping

Page 29: Design Patterns for Efficient Graph Algorithms in MapReducejimmylin/presentations/Lin-Hadoop...Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy LinJimmy Lin and Michael

Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs Workshop (MLG-2010), July 2010, Washington, D.C.

htt // d /http://mapreduce.me/

Source code available in Cloud9

htt // l d9lib /http://cloud9lib.org/

@lintool