
1

Fast Failure Recovery in Distributed Graph Processing Systems

Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor

2

Graph analytics

• Emergence of large graphs
  – The web, social networks, spatial networks, …
• Increasing demand for querying large graphs
  – PageRank, reverse web link analysis over the web graph
  – Influence analysis in social networks
  – Traffic analysis, route recommendation over spatial graphs

3

Distributed graph processing

• MapReduce-like systems
• Pregel-like systems
• GraphLab-related systems
• Others

4

Failures of compute nodes

Increasing graph size → more compute nodes → increase in the number of failed nodes

[Figure: failure probability vs. # of compute nodes (1 to 10,000), when the avg. failure time of a compute node is ~200 hours]

• Failure rate
  – # of failures per unit of time
  – 1/200 per hour
• Exponential failure probability
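A standard exponential failure model captures the trend shown above: with a per-node failure rate of 1/200 per hour, the probability that at least one of n nodes fails within t hours is 1 − e^(−n·t/200). A minimal sketch (the one-hour duration below is a hypothetical choice, not taken from the slide):

```python
import math

def cluster_failure_probability(num_nodes, duration_hours, mtbf_hours=200.0):
    """Probability that at least one of `num_nodes` nodes fails within
    `duration_hours`, assuming independent exponentially distributed
    failures with a mean time between failures of `mtbf_hours` per node."""
    rate = 1.0 / mtbf_hours                     # per-node failures per hour
    return 1.0 - math.exp(-num_nodes * rate * duration_hours)

# Example: how quickly the failure probability grows with cluster size.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(cluster_failure_probability(n, duration_hours=1.0), 3))
```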

5

Outline

• Motivation & background
• Failure recovery problem
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions

6

Pregel-like distributed graph processing systems

• Graph model
  – G = (V, E)
  – P: partitions
• Computation model
  – A set of supersteps
  – Invoke the compute function for each active vertex
  – Each vertex can
    • Receive and process messages
    • Send messages to other vertices
    • Modify its value, its state (active/inactive), and its outgoing edges

[Figure: example graph with vertices A-J, grouped into partitions P1-P5 (vertex/subgraph view) and distributed across compute nodes N1 and N2]
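As a concrete illustration of the computation model, here is a minimal vertex-centric compute function in the Pregel style. This is a generic sketch (a min-value propagation example with illustrative names), not Giraph's API or the paper's code:

```python
class Vertex:
    """Minimal Pregel-style vertex; names and behavior are illustrative."""

    def __init__(self, vertex_id, value, out_edges):
        self.id = vertex_id
        self.value = value
        self.out_edges = out_edges        # ids of neighboring vertices
        self.active = True
        self.outbox = []                  # (target_id, message) pairs

    def send_message(self, target, message):
        self.outbox.append((target, message))

    def vote_to_halt(self):
        self.active = False               # reactivated when a message arrives

    def compute(self, superstep, messages):
        """Invoked once per superstep for every active vertex."""
        # 1. Receive and process messages: keep the smallest value seen.
        new_value = min([self.value] + list(messages))
        changed = new_value < self.value
        self.value = new_value
        # 2. Send messages to other vertices along outgoing edges.
        if superstep == 0 or changed:
            for neighbor in self.out_edges:
                self.send_message(neighbor, self.value)
        # 3. Modify its state: deactivate until new messages arrive.
        if not changed and superstep > 0:
            self.vote_to_halt()
```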

7

Failure recovery problem

• Running example
  – All the vertices compute and send messages to all their neighbors in every superstep
  – N1 fails while the job executes superstep 12
  – Two states: record the latest superstep each vertex has completed when the failure occurs (Sf) and when the failure is recovered (Sf*)
• Problem statement
  – For a failure F(Nf, sf), recover the vertex states from Sf to Sf*

[Figure: example graph with vertices A-J distributed across compute nodes N1 (A-F) and N2 (G-J)]

Sf                    Sf*
A-F: 10; G-J: 12      A-J: 12
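To make the two states concrete, here is a tiny sketch of Sf and Sf* for the running example as plain dictionaries (purely illustrative):

```python
# Latest superstep each vertex has completed.  N1 hosts A-F and fails during
# superstep 12, so A-F fall back to superstep 10 (the last state that can be
# restored), while G-J on the healthy node N2 have completed superstep 12.
S_f      = {**{v: 10 for v in "ABCDEF"}, **{v: 12 for v in "GHIJ"}}  # at failure
S_f_star = {v: 12 for v in "ABCDEFGHIJ"}                             # recovery target

def recovered(state, target):
    """Recovery is complete once every vertex reaches its target superstep."""
    return all(state[v] >= target[v] for v in target)

print(recovered(S_f, S_f_star))   # False: A-F still have to redo supersteps 11-12
```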

8

Challenging issues

• Cascading failures
  – New failures may occur during the recovery phase
  – How to handle all the cascading failures, if any?
    • Existing solution: treat each cascading failure as an individual failure and restart from the latest checkpoint
• Recovery latency
  – Re-execute lost computations to achieve state Sf*
  – Forward messages during recomputation
  – Recover cascading failures
  – How to perform recovery with minimized latency?

9

Existing recovery mechanisms

• Checkpoint-based recovery
  – During normal execution
    • Every compute node flushes its own graph-related information to reliable storage at the beginning of every checkpointing superstep (e.g., C+1, 2C+1, …, nC+1)
  – During recovery
    • Let c+1 be the latest checkpointing superstep
    • Use healthy nodes to replace the failed ones; all the compute nodes roll back to the latest checkpoint and re-execute the lost computations since then (i.e., from superstep c+1 to sf)

Simple to implement! Can handle cascading failures!

Replays lost computations over the whole graph!

Ignores partially recovered workload!
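A sketch of the checkpoint-based rollback described above. The cluster/node calls (`replace`, `load_checkpoint`, `run_superstep`) are hypothetical placeholders, not Giraph APIs:

```python
def latest_checkpoint_superstep(s_f, interval):
    """Latest checkpointing superstep c+1 <= s_f, assuming checkpoints are
    taken at supersteps 1, interval+1, 2*interval+1, ..."""
    return ((s_f - 1) // interval) * interval + 1

def checkpoint_based_recovery(cluster, failed_nodes, s_f, interval):
    """All nodes roll back to the latest checkpoint and redo supersteps
    c+1 .. s_f over the whole graph (illustrative API)."""
    c_plus_1 = latest_checkpoint_superstep(s_f, interval)
    cluster.replace(failed_nodes)               # healthy standbys take over
    for node in cluster.nodes:                  # every node rolls back
        node.load_checkpoint(c_plus_1)
    for superstep in range(c_plus_1, s_f + 1):  # replay lost computations
        cluster.run_superstep(superstep)
```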

10

Existing recovery mechanisms

• Checkpoint + log
  – During normal execution
    • Besides checkpointing, every compute node logs its outgoing messages at the end of each superstep
  – During recovery
    • Use healthy nodes (replacements) to replace the failed ones
    • Replacements:
      – redo the lost computation and forward messages among each other
      – forward messages to all the nodes in superstep sf
    • Healthy nodes:
      – hold their original partitions and help redo the lost computation by forwarding locally logged messages to the failed vertices
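A sketch of the local message logging that the checkpoint + log scheme adds during normal execution. The class and its methods are illustrative, not a real Giraph component:

```python
class LocalMessageLog:
    """Per-node log of outgoing messages, written at the end of each
    superstep (illustrative; a real system would write to local disk)."""

    def __init__(self):
        self._log = {}                            # superstep -> [(src, dst, msg)]

    def append(self, superstep, src, dst, msg):
        self._log.setdefault(superstep, []).append((src, dst, msg))

    def replay(self, superstep, failed_vertices):
        """During recovery, re-send only the logged messages destined for
        vertices that are being recomputed; nothing else is recomputed here."""
        for src, dst, msg in self._log.get(superstep, []):
            if dst in failed_vertices:
                yield src, dst, msg
```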

11

Existing recovery mechanisms

• Checkpoint + log
  – Suppose the latest checkpoint is made at the beginning of superstep 11; N1 (A-F) fails at superstep 12
  – During recovery
    • Superstep 11: A-F perform computation and send messages to each other; G-J send messages to A-F
    • Superstep 12: A-F perform computation and send messages along their outgoing edges; G-J send messages to A-F

[Figure: example graph with vertices A-J distributed across compute nodes N1 and N2]

Less computation and communication cost!

Overhead of local logging (negligible)!

Limited parallelism: the replacements handle all the lost computation!

12

Outline

• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions

13

Our solution

• Partition-based failure recovery
  – Step 1: generate a reassignment for the failed partitions
  – Step 2: recompute the failed partitions
    • Every node is informed of the reassignment
    • Every node loads its newly assigned failed partitions from the latest checkpoint and redoes the lost computations
  – Step 3: exchange partitions
    • Re-balance the workload after recovery
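A high-level sketch of the three steps, under the same hypothetical cluster API used in the earlier checkpoint-based sketch; `generate_reassignment` stands in for the reassignment generation step discussed later:

```python
def partition_based_recovery(cluster, failed_partitions, s_f, interval):
    """Partition-based failure recovery, step by step (illustrative API)."""
    # Step 1: generate a reassignment of the failed partitions to nodes.
    reassignment = generate_reassignment(cluster, failed_partitions)

    # Step 2: recompute the failed partitions in parallel.
    cluster.broadcast(reassignment)                   # inform every node
    c_plus_1 = latest_checkpoint_superstep(s_f, interval)
    for partition, node in reassignment.items():      # partition -> new owner
        node.load_partition_from_checkpoint(partition, c_plus_1)
    for superstep in range(c_plus_1, s_f + 1):        # redo lost computations only
        cluster.run_recovery_superstep(superstep)

    # Step 3: exchange partitions to re-balance the workload after recovery.
    cluster.rebalance(reassignment)
```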

14

Recompute failed partitions

• In superstep i, every compute node iterates through its active vertices. For each vertex u, we:
  – perform computation for vertex u only if:
    • its state after the failure satisfies Sf(u) < i
  – forward a message from u to v only if:
    • Sf(v) < i + 1; or
    • i = sf (the final superstep of recovery)

Intuition: v will need this message to perform computation in superstep i+1
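The two rules, written out with the reconstructed conditions above (the exact predicates are not spelled out on the slide, so treat this as an interpretation):

```python
def should_compute(u, i, S_f):
    """Recompute vertex u in recovery superstep i only if it has not yet
    completed that superstep, i.e. Sf(u) < i."""
    return S_f[u] < i

def should_forward(u, v, i, S_f, s_f):
    """Forward a message from u to v only if v will consume it when it
    computes in superstep i+1 (Sf(v) < i+1), or if this is the last
    recovery superstep (i = sf), after which normal execution resumes."""
    return S_f[v] < i + 1 or i == s_f
```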

15

Example

• N1 fails in superstep 12
  – Redo supersteps 11 and 12

[Figure: example graph with vertices A-J on compute nodes N1 and N2, shown for the two recovery steps labelled below]

(1) reassignment (2) recomputation

Less computation and communication cost!

16

Handling cascading failures

• N1 fails in superstep 12
• N2 fails in superstep 11 during recovery

[Figure: example graph with vertices A-J on compute nodes N1 and N2, shown for the two recovery steps labelled below]

(1) reassignment

(2) recomputation

No need to recover A and B since they have already been recovered!

The same recovery algorithm can be used to recover any failure!

17

Reassignment generation

• When a failure occurs, how do we compute a good reassignment for the failed partitions?
  – Minimize the recovery time
• Calculating the recovery time is complicated because it depends on:
  – The reassignment for the failure
  – Cascading failures
  – The reassignment for each cascading failure

No knowledge about future cascading failures!

18

Our insight

• When a failure occurs (it can be a cascading failure), we prefer a reassignment that benefits the remaining recovery process, considering all the cascading failures that have occurred so far
• We collect the state S after the failure and measure the minimum time Tlow needed to achieve Sf*
  – Tlow provides a lower bound on the remaining recovery time

19

Estimation of Tlow

• Tlow covers the computation and communication time of the remaining recovery
  – Ignore downtime (similar across different recovery methods)
• To estimate computation and communication time, we need to know:
  – Which vertices will perform computation
  – Which messages will be forwarded (across different nodes)
• Maintain relevant statistics in the checkpoint
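A rough sketch of how Tlow could be estimated from statistics kept with the checkpoint. The additive per-superstep cost model, the `stats` accessors, and the unit costs are all assumptions made for illustration; the slide only says which quantities are needed:

```python
def estimate_t_low(assignment, stats, state, c_plus_1, s_f,
                   compute_cost, message_cost):
    """Lower bound on the remaining recovery time under `assignment`,
    which maps every partition to a compute node.

    `state[p]` is the latest superstep partition p has completed;
    `stats.vertices(p, i)` and `stats.messages(p, i)` are hypothetical
    accessors for the per-superstep statistics kept in the checkpoint.
    """
    t_low = 0.0
    for i in range(c_plus_1, s_f + 1):
        compute_load = {}      # per-node count of vertices recomputed in step i
        network_load = {}      # per-node count of cross-node messages in step i
        for p, node in assignment.items():
            if state[p] < i:   # partition p must redo superstep i
                compute_load[node] = compute_load.get(node, 0) + stats.vertices(p, i)
            for q, count in stats.messages(p, i).items():   # p -> q traffic
                if assignment[q] != node and (state[q] < i + 1 or i == s_f):
                    network_load[node] = network_load.get(node, 0) + count
        # Supersteps are synchronized, so the slowest node sets the pace.
        t_low += (max(compute_load.values(), default=0) * compute_cost +
                  max(network_load.values(), default=0) * message_cost)
    return t_low
```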

20

Reassignment generation problem

• Given a failure, find a reassignment that minimizes Tlow
  – Problem complexity: NP-hard
  – Different from the graph partitioning problem
    • Assignment, not partitioning
    • Not a static graph: depends on runtime vertex states and messages
    • No "balance" requirement
• Greedy algorithm
  – Start with a random reassignment for the failed partitions and reach a better one (with a smaller Tlow) by "moving" the failed partitions
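A sketch of the greedy strategy: start from a random reassignment of the failed partitions and keep "moving" single partitions while that lowers Tlow. The single-partition move neighborhood and stopping rule are assumptions; `t_low` is any scoring function, such as the estimator sketched earlier:

```python
import random

def greedy_reassignment(failed_partitions, nodes, t_low):
    """Greedy local search for a reassignment with a small T_low."""
    # Start with a random reassignment of the failed partitions.
    assignment = {p: random.choice(nodes) for p in failed_partitions}
    best_cost = t_low(assignment)

    improved = True
    while improved:
        improved = False
        for p in failed_partitions:
            for node in nodes:
                if node == assignment[p]:
                    continue
                candidate = {**assignment, p: node}   # "move" p to another node
                cost = t_low(candidate)
                if cost < best_cost:                  # keep only improving moves
                    assignment, best_cost = candidate, cost
                    improved = True
    return assignment
```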

21

Outline

• Motivation & background
• Problem statement
  – Challenging issues
  – Existing solutions
• Solution
  – Reassignment generation
  – In-parallel recomputation
  – Workload rebalance
• Experimental results
• Conclusions

22

Experimental evaluation

• Experiment settings
  – In-house cluster with 72 nodes, each with one Intel X3430 2.4GHz processor, 8GB of memory, and two 500GB SATA hard disks, running Hadoop 0.20.203.0 and Giraph 1.0.0
• Comparisons
  – PBR (our proposed solution), CBR (checkpoint-based recovery)
• Benchmark tasks
  – K-means
  – Semi-clustering
  – PageRank
• Datasets
  – Forest
  – LiveJournal
  – Friendster

23

PageRank results

[Figures: logging overhead; single node failure]

24

PageRank results

[Figures: multiple node failure; cascading failure]

25

PageRank results (communication cost)

[Figures: multiple node failure; cascading failure]

26

Conclusions

• Develop a novel partition-based recovery method to parallelize the failure recovery workload for distributed graph processing
• Address challenges in failure recovery
  – Handle cascading failures
  – Reduce recovery latency
• Reassignment generation problem
  – Greedy strategy

27

Thank You!

Q & A
