
Page 1: 2010-Pregel

Pregel: A System for Large-Scale Graph Processing

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Google, Inc. SIGMOD ’10

7 July 2010. Presenter: Taewhi Lee

Page 2: 2010-Pregel

2 / 40

Outline

Introduction
Computation Model
Writing a Pregel Program
System Implementation
Experiments
Conclusion & Future Work

Page 3: 2010-Pregel

3 / 40

Outline

Introduction
Computation Model
Writing a Pregel Program
System Implementation
Experiments
Conclusion & Future Work

Page 4: 2010-Pregel

4 / 40

Introduction (1/2)

Source: SIGMETRICS ’09 Tutorial – MapReduce: The Programming Model and Practice, by Jerry Zhao

Page 5: 2010-Pregel

5 / 40

Introduction (2/2)

Many practical computing problems concern large graphs

MapReduce is ill-suited for graph processing

– Many iterations are needed for parallel graph processing

– Materializations of intermediate results at every MapReduce iteration harm performance

Large graph data: Web graph, transportation routes, citation relationships, social networks

Graph algorithms: PageRank, shortest path, connected components, clustering techniques

Page 6: 2010-Pregel

6 / 40

Single Source Shortest Path (SSSP)

Problem

– Find shortest path from a source node to all target nodes

Solution

– Single processor machine: Dijkstra’s algorithm
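
For reference, a minimal single-machine Dijkstra sketch (a standard priority-queue formulation; the graph encoding and function name are illustrative, not from the paper):

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// dist[v] = length of the shortest path from source to v (INF if unreachable).
// adj[u] holds (neighbor, edge weight) pairs for u's outgoing edges.
std::vector<long long> dijkstra(
    const std::vector<std::vector<std::pair<int, int>>>& adj, int source) {
  const long long INF = std::numeric_limits<long long>::max();
  std::vector<long long> dist(adj.size(), INF);
  // Min-heap of (tentative distance, vertex); stale entries are skipped lazily.
  std::priority_queue<std::pair<long long, int>,
                      std::vector<std::pair<long long, int>>,
                      std::greater<>> pq;
  dist[source] = 0;
  pq.push({0, source});
  while (!pq.empty()) {
    auto [d, u] = pq.top(); pq.pop();
    if (d > dist[u]) continue;            // stale heap entry
    for (auto [v, w] : adj[u]) {          // relax each outgoing edge (u, v, w)
      if (d + w < dist[v]) {
        dist[v] = d + w;
        pq.push({dist[v], v});
      }
    }
  }
  return dist;
}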

Page 7: 2010-Pregel

7 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: example five-node graph with source A; edge weights A→B 10, A→D 5, B→C 1, B→D 2, C→E 4, D→B 3, D→C 9, D→E 2, E→A 7, E→C 6; distance estimates: A = 0, all others = ∞]

Page 8: 2010-Pregel

8 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: same graph; distance estimates: A = 0, B = 10, C = ∞, D = 5, E = ∞]

Page 9: 2010-Pregel

9 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: same graph; distance estimates: A = 0, B = 8, C = 14, D = 5, E = 7]

Page 10: 2010-Pregel

10 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: same graph; distance estimates: A = 0, B = 8, C = 13, D = 5, E = 7]

Page 11: 2010-Pregel

11 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: same graph; distance estimates: A = 0, B = 8, C = 9, D = 5, E = 7]

Page 12: 2010-Pregel

12 / 40

Example: SSSP – Dijkstra’s Algorithm

[Figure: same graph; the algorithm terminates with final distances A = 0, B = 8, C = 9, D = 5, E = 7]

Page 13: 2010-Pregel

13 / 40

Single Source Shortest Path (SSSP)

Problem

– Find shortest path from a source node to all target nodes

Solution

– Single processor machine: Dijkstra’s algorithm

– MapReduce/Pregel: parallel breadth-first search (BFS)

Page 14: 2010-Pregel

14 / 40

MapReduce Execution Overview

Page 15: 2010-Pregel

15 / 40

Example: SSSP – Parallel BFS in MapReduce

[Figure: the same five-node example graph, source A]

Adjacency list:

A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)

Adjacency matrix (row = source, column = destination, entry = edge weight):

    A   B   C   D   E
A   -  10   -   5   -
B   -   -   1   2   -
C   -   -   -   -   4
D   -   3   9   -   2
E   7   -   6   -   -
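
As a concrete picture of this input, one record per vertex could look like the following sketch (field names and types are illustrative assumptions, not from the paper):

#include <string>
#include <utility>
#include <vector>

// One input record per vertex: its current distance estimate plus the
// adjacency list of (neighbor, edge weight) pairs.
struct VertexRecord {
  std::string id;                                  // e.g. "A"
  int dist;                                        // "inf" until reached
  std::vector<std::pair<std::string, int>> adj;    // e.g. {{"B", 10}, {"D", 5}}
};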

Page 16: 2010-Pregel

16 / 40

Example: SSSP – Parallel BFS in MapReduce

[Figure: the same example graph, source A]

Map input: <node ID, <dist, adj list>>

<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>

Map output: <dest node ID, dist>

<B, 10> <D, 5>
<C, inf> <D, inf>
<E, inf>
<B, inf> <C, inf> <E, inf>
<A, inf> <C, inf>

The map also re-emits each node's record so that the graph structure is available in the next iteration:

<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>

All map output is flushed to local disk!!
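
A minimal sketch of the map step of this parallel BFS (framework-independent: the Node record, the Emit helper, and the string encoding are illustrative assumptions, not Google's MapReduce API):

#include <iostream>
#include <limits>
#include <string>
#include <utility>
#include <vector>

const int INF = std::numeric_limits<int>::max();

struct Node {
  std::string id;
  int dist;                                        // current distance estimate
  std::vector<std::pair<std::string, int>> adj;    // (neighbor, edge weight)
};

// Illustrative stand-in for the framework's emit call.
void Emit(const std::string& key, const std::string& value) {
  std::cout << key << "\t" << value << "\n";
}

// Map: propose a tentative distance to every neighbor, and re-emit the
// node's own record so the graph structure reaches the reducer.
void Map(const Node& n) {
  for (const auto& [nbr, w] : n.adj) {
    int d = (n.dist == INF) ? INF : n.dist + w;
    Emit(nbr, "dist:" + (d == INF ? std::string("inf") : std::to_string(d)));
  }
  Emit(n.id, "node:...");   // serialized <dist, adj list> record (elided)
}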

Page 17: 2010-Pregel

17 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce input: <node ID, dist>

<A, <0, <(B, 10), (D, 5)>>>
<A, inf>

<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>

<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>

<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>

<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>

[Figure: the same example graph]

Page 18: 2010-Pregel

18 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce input: <node ID, dist>

<A, <0, <(B, 10), (D, 5)>>>
<A, inf>

<B, <inf, <(C, 1), (D, 2)>>>
<B, 10> <B, inf>

<C, <inf, <(E, 4)>>>
<C, inf> <C, inf> <C, inf>

<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, inf>

<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, inf>

[Figure: the same example graph]

Page 19: 2010-Pregel

19 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration

<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>

Flushed to DFS!!

[Figure: the same graph; distance estimates after the first iteration: A = 0, B = 10, C = ∞, D = 5, E = ∞]

Map output of the next iteration: <dest node ID, dist>

<B, 10> <D, 5>
<C, 11> <D, 12>
<E, inf>
<B, 8> <C, 14> <E, 7>
<A, inf> <C, inf>

The map again re-emits each node's record:

<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>

Flushed to local disk!!
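
A matching sketch of the reduce step, reusing the Node record from the map sketch above (again illustrative, not the actual MapReduce API): it keeps the minimum of all proposed distances and merges it into the node's record, which becomes the next iteration's map input.

// Reduce: for one node ID, combine the re-emitted node record with all
// proposed distances from the mappers, keeping the smallest.
Node Reduce(const std::string& id,
            const Node& record,                    // re-emitted graph structure
            const std::vector<int>& proposed) {    // tentative distances
  Node out = record;
  out.id = id;                                     // the key carries the node ID
  for (int d : proposed) {
    if (d < out.dist) out.dist = d;                // keep the minimum estimate
  }
  return out;                                      // flushed to the DFS
}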

Page 20: 2010-Pregel

20 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce input: <node ID, dist>

<A, <0, <(B, 10), (D, 5)>>>
<A, inf>

<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>

<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>

<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>

<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>

[Figure: the same graph; distance estimates: A = 0, B = 10, C = ∞, D = 5, E = ∞]

Page 21: 2010-Pregel

21 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce input: <node ID, dist>

<A, <0, <(B, 10), (D, 5)>>>
<A, inf>

<B, <10, <(C, 1), (D, 2)>>>
<B, 10> <B, 8>

<C, <inf, <(E, 4)>>>
<C, 11> <C, 14> <C, inf>

<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<D, 5> <D, 12>

<E, <inf, <(A, 7), (C, 6)>>>
<E, inf> <E, 7>

[Figure: the same graph; distance estimates: A = 0, B = 10, C = ∞, D = 5, E = ∞]

Page 22: 2010-Pregel

22 / 40

Example: SSSP – Parallel BFS in MapReduce

Reduce output: <node ID, <dist, adj list>> = Map input for next iteration

<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>

… the rest omitted …

Flushed to DFS!!

[Figure: the same graph; distance estimates after the second iteration: A = 0, B = 8, C = 11, D = 5, E = 7]

Page 23: 2010-Pregel

23 / 40

Outline

Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion & Future Work

Page 24: 2010-Pregel

24 / 40

Computation Model (1/3)

[Figure: Input → Supersteps (a sequence of iterations) → Output]

Page 25: 2010-Pregel

25 / 40

Computation Model (2/3)

“Think like a vertex”

Inspired by Valiant’s Bulk Synchronous Parallel model (1990)

Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel

Page 26: 2010-Pregel

26 / 40

Computation Model (3/3)

Superstep: the vertices compute in parallel

– Each vertex

  Receives messages sent in the previous superstep
  Executes the same user-defined function
  Modifies its value or that of its outgoing edges
  Sends messages to other vertices (to be received in the next superstep)
  Mutates the topology of the graph
  Votes to halt if it has no further work to do

– Termination condition

  All vertices are simultaneously inactive
  There are no messages in transit
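
To make the superstep loop and the termination condition concrete, here is a tiny single-process sketch (purely illustrative, not the Pregel implementation). The toy computation mirrors the paper's maximum-value example: every vertex repeatedly forwards the largest value it has seen, and votes to halt when it learns nothing new.

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

// One vertex: a value, its out-neighbors, and an "active" flag that turns
// false when the vertex votes to halt (a message processed in a later
// superstep can make it contribute again).
struct VertexState {
  int value;
  std::vector<int> out;
  bool active = true;
};

int main() {
  // Hard-coded 4-vertex ring with initial values 3, 6, 2, 1.
  std::map<int, VertexState> g = {
      {0, {3, {1}}}, {1, {6, {2}}}, {2, {2, {3}}}, {3, {1, {0}}}};
  std::map<int, std::vector<int>> inbox;   // messages for the current superstep

  for (int step = 0; ; ++step) {
    bool any_active = false, any_msgs = false;
    for (auto& [id, v] : g) any_active = any_active || v.active;
    for (auto& [id, m] : inbox) any_msgs = any_msgs || !m.empty();
    if (!any_active && !any_msgs) break;   // termination: quiescence everywhere

    std::map<int, std::vector<int>> outbox;
    for (auto& [id, v] : g) {
      int best = v.value;
      for (int m : inbox[id]) best = std::max(best, m);  // read incoming messages
      if (step == 0 || best > v.value) {                 // learned something new
        v.value = best;
        for (int n : v.out) outbox[n].push_back(best);   // tell the neighbors
        v.active = true;
      } else {
        v.active = false;                                // vote to halt
      }
    }
    inbox = std::move(outbox);             // delivered in the next superstep
  }
  for (auto& [id, v] : g) std::cout << id << ": " << v.value << "\n";  // all 6
}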

Page 27: 2010-Pregel

27 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: the same example graph; initial state: A = 0, all other vertices = ∞]

Page 28: 2010-Pregel

28 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: superstep 0: A sends the messages 10 to B and 5 to D]

Page 29: 2010-Pregel

29 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: after superstep 0: A = 0, B = 10, D = 5, C = E = ∞]

Page 30: 2010-Pregel

30 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: superstep 1: B sends 11 to C and 12 to D; D sends 8 to B, 14 to C, and 7 to E]

Page 31: 2010-Pregel

31 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: after superstep 1: A = 0, B = 8, C = 11, D = 5, E = 7]

Page 32: 2010-Pregel

32 / 40

Example: SSSP – Parallel BFS in Pregel

[Figure: superstep 2: the vertices updated in the previous superstep send new tentative distances (messages 9, 13, 14, 15 shown)]

Page 33: 2010-Pregel

33 / 40

Example: SSSP – Parallel BFS in Pregel

0

8

5

9

7

10

5

2 3

2

1

9

7

4 6

Page 34: 2010-Pregel

34 / 40

Example: SSSP – Parallel BFS in Pregel

0

8

5

9

7

10

5

2 3

2

1

9

7

4 6

13

Page 35: 2010-Pregel

35 / 40

Example: SSSP – Parallel BFS in Pregel

0

8

5

9

7

10

5

2 3

2

1

9

7

4 6

Page 36: 2010-Pregel

36 / 40

Differences from MapReduce

Graph algorithms can be written as a series of chained MapReduce invocations, but:

Pregel

– Keeps vertices & edges on the machine that performs computation

– Uses network transfers only for messages

MapReduce

– Passes the entire state of the graph from one stage to the next

– Needs to coordinate the steps of a chained MapReduce

Page 37: 2010-Pregel

37 / 40

Outline

Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion & Future Work

Page 38: 2010-Pregel

38 / 40

C++ API

Writing a Pregel program

– Subclassing the predefined Vertex class

[Figure: the Vertex base class, annotated: override Compute() ("Override this!"); incoming messages arrive via the message iterator ("in msgs"); outgoing messages are sent with SendMessageTo ("out msg")]
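
The figure on this slide is the Vertex base class from the paper; reconstructed approximately below (a declaration-only sketch, so exact types and details may differ from the original figure):

template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;   // override this!

  const string& vertex_id() const;
  int64 superstep() const;

  const VertexValue& GetValue();
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();

  void SendMessageTo(const string& dest_vertex,
                     const MessageValue& message);    // out msg
  void VoteToHalt();
};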

Page 39: 2010-Pregel

39 / 40

Example: Vertex Class for SSSP
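
The code on this slide is the shortest-path vertex from the paper; approximately (reconstructed, so minor details may differ), the user overrides Compute() to take the minimum over incoming distances and propagate improvements:

class ShortestPathVertex : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    // Start from 0 at the source, "infinity" everywhere else.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    // Take the minimum over all incoming tentative distances.
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      // Propagate the improved distance along every outgoing edge.
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();   // reactivated automatically if a message arrives later
  }
};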

Page 40: 2010-Pregel

40 / 40

Outline

Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion & Future Work

Page 41: 2010-Pregel

41 / 40

System Architecture

The Pregel system also uses the master/worker model

– Master

  Maintains the list of workers
  Recovers from worker faults
  Provides a Web-UI monitoring tool for job progress

– Worker

  Processes its task
  Communicates with the other workers

Persistent data is stored as files on a distributed storage system (such as GFS or BigTable)

Temporary data is stored on local disk

Page 42: 2010-Pregel

42 / 40

Execution of a Pregel Program

1. Many copies of the program begin executing on a cluster of machines

2. The master assigns a partition of the input to each worker

– Each worker loads the vertices and marks them as active

3. The master instructs each worker to perform a superstep

– Each worker loops through its active vertices and calls Compute() for each one

– Messages are sent asynchronously, but are delivered before the end of the superstep

– This step is repeated as long as any vertices are active, or any messages are in transit

4. After the computation halts, the master may instruct each worker to save its portion of the graph
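
A highly simplified sketch of this control flow from the master's point of view (structure and names are illustrative assumptions, not from the paper; fault tolerance, partitioning, and the superstep barrier are omitted):

#include <iostream>
#include <vector>

// Illustrative stand-in for one worker holding a partition of the graph.
// Bodies are stubs; a real worker would hold vertices, edges, and message queues.
struct Worker {
  long long active = 0;     // vertices still active in this partition
  long long pending = 0;    // messages waiting to be delivered

  void LoadPartition() { active = 1; }                 // step 2 (stubbed)
  void RunSuperstep(long long step) {                  // step 3 (stubbed)
    if (step >= 3) { active = 0; pending = 0; }        // pretend the work finishes
  }
  void SaveResults() { std::cout << "saved\n"; }       // step 4 (stubbed)
};

// Master-side driver: assign partitions, run supersteps until no vertex is
// active and no message is in transit, then ask workers to save their output.
void RunJob(std::vector<Worker>& workers) {
  for (auto& w : workers) w.LoadPartition();
  for (long long step = 0; ; ++step) {
    long long active = 0, pending = 0;
    for (auto& w : workers) {
      w.RunSuperstep(step);
      active  += w.active;
      pending += w.pending;
    }
    if (active == 0 && pending == 0) break;   // termination: global quiescence
  }
  for (auto& w : workers) w.SaveResults();
}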

Page 43: 2010-Pregel

43 / 40

Fault Tolerance

Checkpointing

– The master periodically instructs the workers to save the state of their partitions to persistent storage

  e.g., vertex values, edge values, incoming messages

Failure detection

– Using regular “ping” messages

Recovery

– The master reassigns graph partitions to the currently available workers

– The workers all reload their partition state from the most recent available checkpoint

Page 44: 2010-Pregel

44 / 40

Outline

Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion & Future Work

Page 45: 2010-Pregel

45 / 40

Experiments

Environment

– H/W: A cluster of 300 multicore commodity PCs

– Data: binary trees, log-normal random graphs (general graphs)

Naïve SSSP implementation

– The weight of all edges = 1

– No checkpointing

Page 46: 2010-Pregel

46 / 40

Experiments

SSSP – 1 billion vertex binary tree: varying # of worker tasks

Page 47: 2010-Pregel

47 / 40

Experiments

SSSP – binary trees: varying graph sizes on 800 worker tasks

Page 48: 2010-Pregel

48 / 40

Experiments

SSSP – random graphs: varying graph sizes on 800 worker tasks

Page 49: 2010-Pregel

49 / 40

Outline

Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion & Future Work

Page 50: 2010-Pregel

50 / 40

Conclusion & Future Work

Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms

Future work

– Relaxing the synchronicity of the model

  Not waiting for slower workers at inter-superstep barriers

– Assigning vertices to machines to minimize inter-machine communication

– Handling dense graphs in which most vertices send messages to most other vertices

Page 51: 2010-Pregel

Thank You!

Any questions or comments?