
Apache Giraph Large-scale Graph Processing on Hadoop

Claudio Martella

<[email protected]> @claudiomartella

Graphs are simple

A computer network

A social network

A semantic network

A map

Predicting break ups

[Figure: aggregation approach vs. graph approach]

Graphs are nasty.

Each vertex depends on its neighbours, recursively.

Recursive problems are nicely solved iteratively.

PageRank in MapReduce

• Record: < v_i, pr, [ v_j, ..., v_k ] >

• Mapper: emits < v_j, pr / #neighbours >

• Reducer: sums the partial values
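The three bullets above can be sketched as a single in-memory iteration. This is a minimal sketch, not the deck's implementation: the names `mapper`, `reducer`, `run_iteration` and the `"structure"`/`"share"` tags are hypothetical, the shuffle is simulated with a dictionary, and the damping factor is left out because the slide does not mention it.

```python
from collections import defaultdict


def mapper(record):
    """Emit one PageRank share per neighbour, plus the adjacency list itself."""
    vertex, pr, neighbours = record
    # The structure has to travel with the values so the next job can reuse it.
    yield vertex, ("structure", neighbours)
    for target in neighbours:
        yield target, ("share", pr / len(neighbours))


def reducer(vertex, values):
    """Sum the partial values and rebuild the record for the next iteration."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return (vertex, total, neighbours)


def run_iteration(records):
    """One simulated MapReduce job: map, shuffle (group by key), reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(vertex, values) for vertex, values in sorted(grouped.items())]


# Hypothetical three-vertex graph; every iteration is a full pass over all records.
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):
    records = run_iteration(records)
```

Note that the mapper re-emits the adjacency list on every pass, so the graph structure is shuffled again in each iteration even though it never changes.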

MapReduce dataflow

Drawbacks

• Each job is executed N times

• Job bootstrap

• Mappers send PR values and structure

• Extensive IO at input, shuffle & sort, output

Timeline

• Inspired by Google Pregel (2010)

• Donated to ASF by Yahoo! in 2011

• Top-level project in 2012

• 1.0 release in January 2013

• 1.1 release in November 2014

Plays well with Hadoop

Vertex-centric API

Shortest Paths

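Code

def compute(vertex, messages):
    # Smallest distance proposed by any incoming message
    minValue = float('inf')
    for m in messages:
        minValue = min(minValue, m)
    # Only a strictly shorter path updates the vertex and is propagated
    if minValue < vertex.getValue():
        vertex.setValue(minValue)
        for edge in vertex.getEdges():
            message = minValue + edge.getValue()
            sendMessage(edge.getTargetId(), message)
    # Go inactive until a new message arrives
    vertex.voteToHalt()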
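To see how the compute function above behaves across supersteps, here is a small self-contained simulation. It is a sketch under stated assumptions rather than Giraph's API: the `Vertex` class, the `run` driver, the `send_message` callback and the example graph are all hypothetical, every distance starts at infinity, and the source is seeded with a single message carrying distance 0.

```python
import math


class Vertex:
    """Hypothetical stand-in for a Giraph vertex."""

    def __init__(self, vertex_id, edges):
        self.id = vertex_id
        self.edges = edges        # {target_id: edge_weight}
        self.value = math.inf     # best known distance from the source


def compute(vertex, messages, send_message):
    min_value = min(messages, default=math.inf)
    if min_value < vertex.value:
        vertex.value = min_value
        for target, weight in vertex.edges.items():
            send_message(target, min_value + weight)
    # Implicit vote to halt: the vertex runs again only if a message arrives.


def run(graph, source):
    vertices = {v: Vertex(v, edges) for v, edges in graph.items()}
    inbox = {source: [0.0]}       # assumption: seed the source with distance 0
    supersteps = 0
    while inbox:                  # stop when no messages are in flight
        outbox = {}

        def send_message(target, message):
            outbox.setdefault(target, []).append(message)

        # Vertices without messages stay halted and are skipped.
        for vertex_id, messages in inbox.items():
            compute(vertices[vertex_id], messages, send_message)
        inbox = outbox            # barrier: messages are delivered next superstep
        supersteps += 1
    return {v: vertex.value for v, vertex in vertices.items()}, supersteps


graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2, "d": 6}, "c": {"d": 1}, "d": {}}
distances, supersteps = run(graph, "a")
```

Messages sent in one superstep are only visible in the next, which is why `outbox` replaces `inbox` at the end of each loop pass; the job ends when every vertex has halted and no messages remain.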

BSP & Giraph

Advantages

• No locks: message-based communication

• No semaphores: global synchronization

• Iteration isolation: massively parallelizable

Designed for iterations

• Stateful (in-memory)

• Only intermediate values (messages) sent

• Hits the disk at input, output, checkpoint

• Can go out-of-core

Giraph job lifetime

Architecture

Composable API

Checkpointing

No SPoFs

Giraph scales

ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Giraph is fast

• 100x faster than MapReduce (PageRank)

• Jobs run within minutes

• Given you have the resources ;-)

Serialised objects

Primitive types

• Autoboxing is expensive

• Object overhead (JVM)

• Use primitive types on your own

• Use primitive-type-based libraries (e.g. fastutil)

Sharded aggregators

Okapi

• Apache Mahout for graphs

• Graph-based recommenders: ALS, SGD, SVD++, etc.

• Graph analytics: graph partitioning, community detection, k-core, etc.

Thank you

<[email protected]> @claudiomartella

http://giraph.apache.org