
Page 1: Replication-based Fault-tolerance for Large-scale Graph Processing

Replication-based Fault-tolerance for Large-scale Graph Processing

Peng Wang, Kaiyuan Zhang, Rong Chen, Haibo Chen, Haibing Guan

Shanghai Jiao Tong University

Page 2: Replication-based Fault-tolerance for Large-scale Graph Processing

Graph

• Useful information in graphs

• Many applications
  – SSSP
  – Community Detection
  – …

Page 3: Replication-based Fault-tolerance for Large-scale Graph Processing

Graph computing

• Graphs are large
  – Require a lot of machines

• Fault tolerance is important

Page 4: Replication-based Fault-tolerance for Large-scale Graph Processing

How graph computing works

[Figure: two workers W1 and W2 run each iteration in lock-step: LoadGraph, then Compute → SendMsg → EnterBarrier → Commit → LeaveBarrier; the example graph's vertices 1–3 are partitioned across the workers, each with a master copy on one worker and a replica on the other.]

PageRank(i):
  // compute its own rank
  total = 0
  foreach (j in in_neighbors(i)):
    total = total + R[j] * Wji
  R[i] = 0.15 + total
  // trigger neighbors to run again
  if R[i] not converged then
    activate all neighbors

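To make the flow above concrete, here is a minimal single-machine sketch in Python of one vertex-centric superstep (my illustration, not Imitator or Hama code); graph.rank, graph.out_neighbors, and graph.weight are hypothetical helpers, and the update rule mirrors the slide's PageRank.

# Sketch of one synchronous superstep: Compute and SendMsg for every active
# vertex, then a barrier before the produced messages are consumed in the
# next superstep. All names on `graph` are assumed helpers for illustration.
def run_superstep(graph, inbox, active):
    outbox, next_active = {}, set()
    for v in active:
        # Compute: same rule as the PageRank pseudocode above.
        total = sum(inbox.get(v, []))
        new_rank = 0.15 + total
        converged = abs(new_rank - graph.rank[v]) < 1e-4
        graph.rank[v] = new_rank
        # SendMsg: push the fresh rank, scaled by edge weight, to out-neighbors.
        for nbr in graph.out_neighbors(v):
            outbox.setdefault(nbr, []).append(new_rank * graph.weight(v, nbr))
        # Trigger neighbors to run again if not converged.
        if not converged:
            next_active.update(graph.out_neighbors(v))
    # EnterBarrier / Commit / LeaveBarrier: in the distributed setting all
    # workers synchronize here before the next superstep starts.
    return outbox, next_active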

Page 5: Replication-based Fault-tolerance for Large-scale Graph Processing

Related work about fault tolerance

• Simple re-execution (MapReduce)
  – Does not fit: graph computation has complex data dependencies

• Coarse-grained FT (Spark)
  – Does not fit: graph computation makes fine-grained updates to each vertex

• State-of-the-art fault tolerance for graph computing: checkpointing
  – Used by Trinity, PowerGraph, Pregel, etc.

Page 6: Replication-based Fault-tolerance for Large-scale Graph Processing

How checkpoint works

[Figure: workers W1 and W2 load the graph, then write a checkpoint of the partition, topology, and vertex states to DFS at the global barrier of iteration X; when a machine crashes during iteration X+1, recovery reloads the lost state from the DFS checkpoint and execution resumes from iteration X.]
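As a rough illustration of this scheme (my sketch, with a hypothetical dfs handle rather than any real HDFS API), each worker writes its vertex states at the global barrier every few iterations and reloads them after a failure:

import pickle

CKPT_PERIOD = 4  # e.g. a checkpoint every 4 iterations, as in the case study later

def maybe_checkpoint(dfs, worker_id, iteration, vertex_states):
    # Executed at the global barrier: dump this worker's states to the DFS.
    if iteration % CKPT_PERIOD == 0:
        with dfs.open("/ckpt/%d/%d" % (iteration, worker_id), "wb") as f:
            pickle.dump(vertex_states, f)

def recover(dfs, worker_id, last_ckpt_iteration):
    # After a crash: reload the newest checkpoint and re-run from that iteration.
    with dfs.open("/ckpt/%d/%d" % (last_ckpt_iteration, worker_id), "rb") as f:
        return pickle.load(f)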

Page 7: Replication-based Fault-tolerance for Large-scale Graph Processing

Problems of checkpoint

• Large execution overhead
  – Large amount of state to write
  – Synchronization overhead

[Chart: execution time (sec) of PageRank on LiveJournal, broken down into compute, communication, synchronization, and checkpoint time, for checkpoint periods of none, 1, 2, and 4 iterations.]

Page 8: Replication-based Fault-tolerance for Large-scale Graph Processing

Problems of checkpoint

• Large overhead

• Slow recovery
  – A lot of I/O operations
  – Requires a standby node

[Chart: recovery time (seconds, 0–70) for checkpoint periods of 1, 2, and 3 iterations, compared with the average iteration time without checkpointing.]

Page 9: Replication-based Fault-tolerance for Large-scale Graph Processing

Observation and motivation

• Reuse existing replicas to provide fault tolerance

• Reusing existing replicas → small overhead

• Replicas distributed across different machines → fast recovery

[Chart: percentage of vertices without any replica for GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA (y-axis 0%–16%; labeled values include 0.84%, 0.96%, 0.26%, and 0.13%). Almost all the vertices have replicas.]

Page 10: Replication-based Fault-tolerance for Large-scale Graph Processing

Contribution

• Imitator: a replication-based fault-tolerance system for graph processing

• Small overhead
  – Less than 5% in all cases

• Fast recovery
  – Up to 17.5× faster than checkpoint-based recovery

Page 11: Replication-based Fault-tolerance for Large-scale Graph Processing

Outline

• Execution flow

• Replication management

• Recovery

• Evaluation

• Conclusion

Page 12: Replication-based Fault-tolerance for Large-scale Graph Processing

Normal execution flow

[Figure: the normal flow on each worker: LoadGraph, then per iteration Compute → SendMsg → EnterBarrier → Commit → LeaveBarrier.]

1. Adding FT support
2. Extending the normal synchronization messages (see the sketch below)
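A minimal sketch (my assumption of the idea, not Imitator's actual interface) of point 2: the commit-time messages that already synchronize replicas with their masters are extended to carry whatever extra state fault tolerance needs, so no separate checkpoint is written.

def commit_phase(worker, updates):
    # `worker.masters`, `worker.replica_locations`, `worker.activated` and
    # `worker.send` are assumed helpers for illustration.
    for vertex, new_value in updates.items():
        worker.masters[vertex] = new_value
        for machine in worker.replica_locations[vertex]:
            # Normal replica synchronization, extended with the activation
            # bit a replica would need to stand in for a lost master.
            worker.send(machine, ("sync", vertex, new_value,
                                  worker.activated.get(vertex, False)))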

Page 13: Replication-based Fault-tolerance for Large-scale Graph Processing

Failure before barrier

[Figure: a worker crashes before entering the global barrier. The surviving worker rolls back the current iteration, a newbie machine joins, and recovery rebuilds the crashed worker's state before execution continues.]

Page 14: Replication-based Fault-tolerance for Large-scale Graph Processing

Failure during barrier

[Figure: a worker crashes during the global barrier (at leaveBarrier). A newbie machine boots and recovery reconstructs the crashed worker's state before execution continues; unlike the previous case, no rollback is shown.]

Page 15: Replication-based Fault-tolerance for Large-scale Graph Processing

Management of replication

• Fault-tolerance replicas
  – Every vertex has at least f replicas to tolerate f failures

• Full-state replica (mirror)
  – An existing replica lacks meta information
  – Such as the locations of the other replicas

[Figure: a graph partitioned across Node1, Node2, and Node3; each vertex has a master on one node and replicas on the others. Example metadata for vertex 5: master on n2, replicas on n1 and n3.]
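A minimal sketch (my assumption) of the per-vertex metadata a full-state replica (mirror) would carry; the field names are illustrative and match the vertex-5 example above.

from dataclasses import dataclass, field

@dataclass
class VertexReplicationInfo:
    vertex_id: int
    master_node: str                                    # e.g. "n2"
    replica_nodes: list = field(default_factory=list)   # e.g. ["n1", "n3"]
    mirror_node: str = ""                                # the replica holding full state

# The example from the figure: vertex 5 with its master on n2 and replicas
# on n1 and n3 (the mirror location is chosen arbitrarily here).
v5 = VertexReplicationInfo(5, "n2", ["n1", "n3"], mirror_node="n1")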

Page 16: Replication-based Fault-tolerance for Large-scale Graph Processing

Optimization: selfish vertices

• The states of selfish vertices have no consumers
• Their states are decided only by their neighbors
• Optimization: recover their states by re-computation (see the sketch below)
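A minimal sketch (my assumption) of the optimization: since no other vertex reads a selfish vertex's state, a lost selfish vertex can be rebuilt by re-running its vertex program over its neighbors' recovered states instead of keeping an extra replica.

def recompute_selfish_vertex(graph, v):
    # Re-run the (PageRank-style) update once from the in-neighbors' states;
    # graph.rank, graph.weight, graph.in_neighbors are assumed helpers.
    total = sum(graph.rank[j] * graph.weight(j, v) for j in graph.in_neighbors(v))
    graph.rank[v] = 0.15 + total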

Page 17: Replication-based Fault-tolerance for Large-scale Graph Processing

How to recover

• Challenges
  – Parallel recovery
  – Consistent state after recovery

Page 18: Replication-based Fault-tolerance for Large-scale Graph Processing

Problems of recovery

[Figure: the three-node partitioning from the replication-management slide; Node3 crashes, losing the masters, mirrors, and replicas it hosted.]

• Which vertices have crashed?
• How to recover without a central coordinator?

Rules:
1. The master recovers its replicas
2. If the master crashed, the mirror recovers the master and the replicas

Replication location example: vertex 3 has its master on n3 and its mirror on n2 (see the sketch below)
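A minimal sketch (my assumption, reusing the illustrative VertexReplicationInfo fields from the replication-management sketch) of how every surviving node can apply the two rules locally, with no central coordinator:

def recovery_tasks(local_vertices, here, crashed_node):
    # Yield (vertex, role) pairs this node is responsible for regenerating.
    for v in local_vertices:
        if v.master_node == crashed_node and v.mirror_node == here:
            # Rule 2: the master crashed, so the mirror recovers the master
            # (and, from it, any replicas lost on the crashed node).
            yield v, "master"
        elif v.master_node == here and crashed_node in v.replica_nodes:
            # Rule 1: the master is alive and recovers its crashed replica.
            yield v, "replica"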

Page 19: Replication-based Fault-tolerance for Large-scale Graph Processing

Rebirth

[Figure: Rebirth recovery. Node3 crashes and a newbie machine takes its place. Following the rules above (the master recovers its replicas; if the master crashed, its mirror recovers the master and the replicas), the surviving nodes reconstruct the crashed node's masters and replicas on the newbie.]

Page 20: Replication-based Fault-tolerance for Large-scale Graph Processing

Problems of Rebirth

• Requires a standby machine
• Recovery targets a single newbie machine

Alternative: migrate the tasks to the surviving machines

Page 21: Replication-based Fault-tolerance for Large-scale Graph Processing

Migration

[Figure: Migration recovery. Node3 crashes; instead of using a newbie machine, its masters and replicas are redistributed to the surviving Node1 and Node2.]

Procedure:
1. Mirrors upgrade to masters and broadcast the change
2. Reload the missing graph structure and reconstruct
(see the sketch below)
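A minimal sketch (my assumption, with a hypothetical cluster API) of the two-step procedure:

def migrate_after_crash(local_vertices, here, crashed_node, cluster):
    # Step 1: every mirror of a crashed master upgrades itself to master
    # and broadcasts the new location to all surviving nodes.
    for v in local_vertices:
        if v.master_node == crashed_node and v.mirror_node == here:
            v.master_node = here
            cluster.broadcast(("new_master", v.vertex_id, here))
    # Step 2: reload the graph structure that only the crashed node held
    # (e.g. from the original input file) and reconstruct the lost replicas.
    cluster.reload_missing_edges(crashed_node)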

Page 22: Replication-based Fault-tolerance for Large-scale Graph Processing

Inconsistency after recovery

[Figure: example of inconsistency on workers W1 and W2. Master 2 on W2 has its Activated flag flipped from false to true by an incoming activation, while replica 2 on W1 still holds Activated = false and an older Rank (0.1 vs. 0.2); recovering vertex 2 from the replica would therefore lose the activation.]

Page 23: Replication-based Fault-tolerance for Large-scale Graph Processing

Replay Activation

[Figure: Replay Activation. Each master additionally records an ActNgbs flag noting that it activated its neighbors: master 1 on W1 (Rank 0.2, ActNgbs = true) replays this activation after recovery, restoring Activated = true on the recovered master 2 on W2.]
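A minimal sketch (my assumption) of Replay Activation: every master keeps an ActNgbs flag recording that it activated its neighbors, and after recovery the surviving masters re-send those activations so recovered vertices regain the Activated flags lost in the crash.

def replay_activations(local_masters, graph, send):
    # `local_masters` carry an illustrative act_ngbs flag; `send` delivers a
    # message to the worker that owns the target vertex.
    for v in local_masters:
        if v.act_ngbs:
            for nbr in graph.out_neighbors(v.vertex_id):
                send(nbr, "activate")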

Page 24: Replication-based Fault-tolerance for Large-scale Graph Processing

Evaluation

• 50 VMs (10 GB memory, 4 cores each)

• HDFS (3 replicas)

• Applications and datasets:

  Application   Graph      Vertices   Edges
  PageRank      GWeb       0.87M      5.11M
  PageRank      LJournal   4.85M      70.0M
  PageRank      Wiki       5.72M      130.1M
  ALS           SYN-GL     0.11M      2.7M
  CD            DBLP       0.32M      1.05M
  SSSP          RoadCA     1.97M      5.53M

Page 25: Replication-based Fault-tolerance for Large-scale Graph Processing

Speedup over Hama

• Imitator is based on Hama, an open-source clone of Pregel
  – Replication for dynamic computing [Distributed GraphLab, VLDB'12]

• Evaluated systems
  – Baseline: Imitator without fault tolerance
  – REP: Baseline + replication-based FT
  – CKPT: Baseline + checkpoint-based FT

[Chart: speedup over Hama (0–4×) for GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA.]

Page 26: Replication-based Fault-tolerance for Large-scale Graph Processing

Normal execution overhead

Replication has negligible execution overhead

Page 27: Replication-based Fault-tolerance for Large-scale Graph Processing

Communication Overhead

Page 28: Replication-based Fault-tolerance for Large-scale Graph Processing

Performance of recovery

[Chart: recovery execution time (seconds, axis 0–20) of CKPT, Rebirth, and Migration on GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA; two off-scale bars are labeled 41 and 56 seconds.]

Page 29: Replication-based Fault-tolerance for Large-scale Graph Processing

Recovery Scalability

[Chart: recovery time (seconds, 0–60) of Rebirth and Migration as the number of machines increases from 10 to 50.]

The more machines, the faster the recovery

Page 30: Replication-based Fault-tolerance for Large-scale Graph Processing

Simultaneous failure

[Chart: recovery time (seconds, 0–35) under one, two, and three simultaneous failures.]

[Chart: normal execution overhead (95%–111% of baseline) on GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA when configured to tolerate one, two, or three simultaneous failures.]

Add more replicas to tolerate the simultaneous failure of more than one machine.

Page 31: Replication-based Fault-tolerance for Large-scale Graph Processing

Case study

– Application: PageRank on the LiveJournal dataset
– A checkpoint every 4 iterations
– A failure is injected between the 6th and the 7th iteration

[Chart: finished iterations (0–20) versus execution time (0–160 seconds) for BASE, CKPT/4, REP, CKPT/4 + 1 failure, Rebirth + 1 failure, and Migration + 1 failure; annotations mark failure detection and replay, with recovery pauses labeled roughly 45, 8.8, and 2.6 seconds.]

Page 32: Replication-based Fault-tolerance for Large-scale Graph Processing

Conclusion

• Imitator: a graph engine that supports fault tolerance

• Imitator’s execution overhead is negligible because it leverages existing replicas

• Imitator’s recovery is fast because of its parallel recovery approach

Page 33: Replication-based Fault-tolerance for Large-scale Graph Processing

Backup

Page 34: Replication-based Fault-tolerance for Large-scale Graph Processing

Memory Consumption

Page 35: Replication-based Fault-tolerance for Large-scale Graph Processing

Partition Impact