Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery
Florin Dinu, T. S. Eugene Ng
Rice University
Hadoop is Widely Used*
• Image Processing
• Protein Sequencing
• Web Indexing
• Machine Learning
• Advertising Analytics
• Log Storage and Analysis
• Recent research work
* Source: http://wiki.apache.org/hadoop/PoweredBy (2010)
Compute-Node Failures Are Common
“ ... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours”
Jeff Dean – Google I/O 2008
“5.0 average worker deaths per job”
Jeff Dean – Keynote I – PACT 2006
Failures cost revenue, reputation, and user experience.
Compute-node failures are common and damaging, and Hadoop is widely used.
How does Hadoop behave under compute-node failures?
Inflated, variable and unpredictable job running times. Sluggish failure detection.
What are the design decisions responsible? This work answers that question.
Focus of This Work
Task Tracker failures:
• Loss of intermediate data
• Loss of running tasks
• Data Nodes not failed
Types of failures:
• Task Tracker process fail-stop failures
• Task Tracker node fail-stop failures
Single failures:
• Expose mechanisms and their interactions
• Findings also apply to multiple failures
[Architecture diagram: NameNode and JobTracker masters; a Task Tracker running Mapper and Reducer tasks, co-located with a Data Node]
Declaring a Task Tracker Dead
• Heartbeats from Task Tracker to Job Tracker, usually every 3s
• Every 200s, the Job Tracker checks whether heartbeats have not been sent for at least 600s
• On expiry: restart running tasks, restart completed maps
• Conservative design
[Timeline: heartbeat silence measured against the <200s, <400s, <600s, >600s check boundaries]
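The interaction of the 200s check period and the 600s expiry threshold can be sketched as follows. This is a simplified model (not Hadoop's actual Java code): it assumes the Job Tracker's scan fires exactly every 200s and ignores heartbeat jitter.

```python
import math

CHECK_INTERVAL = 200   # Job Tracker scans for expired trackers every 200s
EXPIRY = 600           # a tracker is dead if silent for at least 600s

def detection_time(failure_time):
    """Seconds between a Task Tracker failing and being declared dead."""
    # The tracker expires at failure_time + EXPIRY; it is declared dead
    # at the first periodic check at or after that instant.
    first_check = math.ceil((failure_time + EXPIRY) / CHECK_INTERVAL) * CHECK_INTERVAL
    return first_check - failure_time

# Detection time varies between 600s and 800s depending on where the
# failure falls relative to the check schedule:
assert detection_time(0) == 600    # failure aligned with a check
assert detection_time(1) == 799    # failure just after a check
assert detection_time(100) == 700
```

This one-check-period slack is exactly the spread the later experiments observe between detection at ~600s and ~800s.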
Declaring a Task Tracker Dead
Variable failure detection time: depending on where the failure falls relative to the periodic checks, detection takes ~600s to ~800s.
[Two timelines against the <200s/<400s/<600s/>600s check boundaries: detection time ~600s vs. ~800s]
Declaring Map Output Lost
• Uses notifications from running reducers to the Job Tracker
• A notification is a message that a specific map output is unavailable
• Restart map M to re-compute its lost output when:
#notif(M) > 0.5 × #running reducers and #notif(M) > 3
Conservative design. Static parameters.
[Timeline: reducers notify the Job Tracker after a map output becomes unreachable]
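The restart condition above is small enough to state directly in code. A hedged sketch (the function name is illustrative, not Hadoop's actual identifier):

```python
def map_output_lost(notifications, running_reducers):
    """Simplified form of the rule for re-executing map M: enough of the
    running reducers must have reported M's output unavailable."""
    return notifications > 0.5 * running_reducers and notifications > 3

# With 14 running reducers, at least 8 notifications are needed:
assert map_output_lost(8, 14) is True
assert map_output_lost(7, 14) is False   # 7 is not > 0.5 * 14
# With few reducers, the absolute floor of 3 dominates:
assert map_output_lost(3, 4) is False    # 3 is not > 3
assert map_output_lost(4, 4) is True
```

Note how both thresholds are static: they do not adapt to job size or to the cause of the missed fetches.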
Reducer Notifications
A notification signals that a specific map output is unavailable.
On connection error (R1):
• re-attempt the connection
• send a notification when nr of attempts % 10 == 0
• exponential wait between attempts: wait = 10 × (1.3)^nr_failed_attempts
• usually 416s needed for 10 attempts
On read error (R2):
• send a notification immediately
Conservative design. Static parameters.
[Diagram: reducers R1 and R2 fail to fetch map output M5 and notify the Job Tracker]
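The ~416s figure can be reproduced from the backoff formula. A sketch, assuming the exponential wait is applied after each of the failed attempts 1 through 9 (i.e. before the 10th attempt that triggers the notification):

```python
def backoff_before_notification():
    """Total wait before the 10th failed connection attempt triggers a
    notification, under the wait = 10 * 1.3**n schedule (n = attempts
    already failed; assumption: waits for n = 1..9 are summed)."""
    return sum(10 * 1.3 ** n for n in range(1, 10))

total = backoff_before_notification()
assert round(total) == 416   # matches the ~416s quoted on the slide
```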
Declaring a Reducer Faulty
A reducer is faulty if (simplified version):
#shuffles failed > 0.5 × #shuffles attempted
and
(#shuffles succeeded < 0.5 × #shuffles necessary, or the reducer stalled for too long)
Ignores the cause of failed shuffles. Static parameters.
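The simplified health check can be written out as a predicate. A sketch with illustrative names (not Hadoop's actual identifiers):

```python
def reducer_faulty(failed, attempted, succeeded, necessary, stalled):
    """Simplified version of the reducer-health check."""
    too_many_failures = failed > 0.5 * attempted
    too_little_progress = succeeded < 0.5 * necessary
    return too_many_failures and (too_little_progress or stalled)

# A reducer early in the shuffle that repeatedly hits a failed tracker
# is declared faulty:
assert reducer_faulty(failed=3, attempted=4, succeeded=1,
                      necessary=160, stalled=False) is True
# The same failure ratio late in the shuffle does not kill the reducer:
assert reducer_faulty(failed=3, attempted=4, succeeded=100,
                      necessary=160, stalled=False) is False
```

The predicate never asks *why* shuffles failed, which is the root of the induced-death behavior shown later.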
Experiment: Methodology
• 15-node, 4-rack testbed in the OpenCirrus* cluster
• 14 compute nodes, 1 reserved for Job Tracker and Name Node
• Sort job, 10GB input, 160 maps, 14 reducers, 200 runs/experiment
• Job takes 220s in the absence of failures
• Inject a single Task Tracker process failure randomly between 0 and 220s
* https://opencirrus.org/ – the HP/Intel/Yahoo! Open Cloud Computing Research Testbed
Experiment: Results
Large variability in job running times.
[CDF of job running times under failure; the runs cluster into groups G1–G7]
Group G1 – few reducers impacted
Slow recovery when few reducers are impacted.
• M1 was copied by all reducers before the failure.
• After the failure, the re-executed reducer attempt R1_1 cannot access M1.
• R1_1 alone needs to send 3 notifications: ~1250s.
• The Task Tracker is declared dead only after 600–800s.
[Diagram: R1_1 fails to fetch M1 and sends Notif(M1) to the Job Tracker; R2 and R3 already hold M1]
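The ~1250s figure is roughly three full notification cycles. A rough sketch, under the assumption that each notification costs one full ~416s backoff cycle of 10 attempts (i.e. the backoff repeats for each block of 10 attempts; the real schedule may differ):

```python
# One cycle: exponential waits before the 10th failed attempt triggers
# a notification (wait = 10 * 1.3**n for failed attempts n = 1..9).
one_cycle = sum(10 * 1.3 ** n for n in range(1, 10))   # ~416s

# Three notifications are needed before the map output is declared lost
# when only one reducer is reporting:
three_notifications = 3 * one_cycle
assert 1240 < three_notifications < 1260   # ~1250s, as on the slide
```

Because this single-reducer path (~1250s) is far slower than tracker expiry (600–800s), G1 jobs end up waiting on the dead-tracker declaration instead.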
Group G2 – timing of failure
The timing of the failure relative to the Job Tracker's checks impacts job running time.
[Timelines for G1 and G2: the failure is injected at the same point (170s), but the 600s expiry falls on different 200s check boundaries]
200s difference between G1 and G2.
Group G3 – early notifications
Early notifications increase job running time variability.
• G1: notifications sent after 416s
• G3: early notifications => map outputs declared lost
Causes:
• Code-level race conditions
• Timing of a reducer's shuffle attempts
[Timelines contrasting a regular notification (416s) with an early notification (<416s)]
Group G4 & G5 – many reducers impacted
Job running time under failure varies with the number of reducers impacted.
• G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead.
• G5: same as G4, but early notifications are sent.
[Diagram: reducers send Notif(M1,M2,M3,M4,M5) to the Job Tracker after the failure]
Induced Reducer Death
A reducer is faulty if (simplified version):
#shuffles failed / #shuffles attempted > 0.5
and
(#shuffles succeeded / #shuffles necessary < 0.5, or the reducer stalled for too long)
• If the failed Task Tracker is contacted among the first Task Trackers => the reducer dies
• If the failed Task Tracker is attempted too many times => the reducer dies
A failure can induce other failures in healthy reducers. CPU time and network bandwidth are unnecessarily wasted.
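To see why contacting the failed Task Tracker early in the shuffle is fatal, plug small numbers into the simplified condition. The function below restates it for the example (illustrative names, not Hadoop's code):

```python
def reducer_faulty(failed, attempted, succeeded, necessary):
    # Simplified check, ignoring the stall clause.
    return failed > 0.5 * attempted and succeeded < 0.5 * necessary

# A healthy reducer whose first two shuffle attempts happen to hit the
# dead tracker: 2 of 2 attempts failed, 0 of 160 outputs fetched so far.
assert reducer_faulty(failed=2, attempted=2, succeeded=0, necessary=160)

# The same two failures after most of the shuffle already succeeded
# leave the reducer alive:
assert not reducer_faulty(failed=2, attempted=100, succeeded=98, necessary=160)
```

The reducer in the first case is perfectly healthy; only the order in which it happened to contact trackers kills it.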
56 vs 14 Reducers
Job running times are spread out even more. Increased chance of induced reducer death or early notifications.
[CDF of job running times with 56 reducers vs. 14]
Simulating Node Failure
Without TCP RST packets (a dead node, unlike a killed process, sends none), all affected tasks wait for the Task Tracker to be declared dead.
[CDF of job running times under simulated node failure]
Lack of Adaptivity
Recall:
• Notification sent after 10 attempts
Inefficiency:
• A static, one-size-fits-all solution cannot handle all situations
• Efficiency varies with the number of reducers
A way forward:
• Use more detailed information about current job state
Conservative Design
Recall:
• Declare a Task Tracker dead after at least 600s
• Send a notification after 10 attempts and 416 seconds
Inefficiency:
• Assumes most problems are transient
• Sluggish response to permanent compute-node failure
A way forward:
• Additional information should be leveraged
• Network state information
• Historical information of compute-node behavior [OSDI '10]
Simplistic Failure Semantics
• Lack of TCP connectivity = problem with tasks
Inefficiency:
• Cannot distinguish between multiple causes for lack of connectivity
• Transient congestion
• Compute-node failure
A way forward:
• Decouple failure recovery from overload recovery
• Use AQM/ECN to provide extra congestion information
• Allow direct communication between application and infrastructure
Thank you