1 © Copyright 2015 EMC Corporation. All rights reserved.
RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Ao Ma
Fred Douglis
Guanlin Lu
Darren Sawyer
EMC
Surendar Chandra
Windsor Hsu
Datrium
Disk failures are commonplace
• Whole-disk failure
• Partial failure
RAID is widely deployed
• Protect data against failures with redundancy
Pervasive RAID Protection
Storage systems are evolving
• Increased use of less reliable drives causes more
whole-disk failures
• Increasing disk capacity results in more sector errors
Solution
• Add extra redundancy (RAID5, RAID6, …)
– Ensure data reliability at the cost of storage efficiency
RAID Overview
Is adding extra redundancy an efficient solution?
Analyzed 1 million SATA disks and revealed
• Failure modes degrading RAID reliability
• Reallocated sectors reflect disk reliability deterioration
• Disk failure is predictable
Built RAIDSHIELD, an active defense mechanism
• Reconstruct failing disk before it’s too late!
• PLATE: single-disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
What We Did
Background
Disk failure analysis
RAIDSHIELD:
• Identify failure indicator
• Reallocated Sector (RS) characterization
• Single disk proactive protection
• Disk group proactive protection
Outline
Disk failure does not follow a fail-stop model
The production systems studied define failure as
• Connection is lost
• An operation exceeds the timeout threshold
• Write fails
Whole-disk Failure Definition
• Each disk drive model is denoted as <family-capacity>
• Relative sizes within a family are ordered by the capacity number
– E.g., A-2 is larger than A-1
Disk Model   Population (Thousands)   First Deployment   Log Length (Months)
A-1          34                       06/2008            60
A-2          165                      11/2008            60
B-1          100                      06/2008            48
C-1          93                       10/2010            36
C-2          253                      12/2010            36
D-1          384                      09/2011            21
Disk Data Collection
What Do Real Disk Failures Look Like?
Distribution of Lifetime of Failed Drives
[Figure: two histograms, one for A-1 and one for A-2. X-axis: Month, in 6-month bins from 0-6 to 54-60. Y-axis: Percentage (%).
Bar labels in bin order, A-1: 0, 0, 1, 2, 4, 15, 34, 29, 11, 3
Bar labels in bin order, A-2: 0, 0, 0, 1, 4, 24, 46, 20, 3, 1]
A large fraction of failed drives are found at a similar age
The number of affected disks keeps growing
• About 10% of disks get sector errors by the 3rd year
Sector error numbers increase continuously
• Average error count increases by 25% to 300% year over year
Increasing Frequency of Sector Errors
Drives fail at a similar age
• Failure rate is not constant
• A high risk of multiple simultaneous failures
Increasing frequency of sector errors
• Exacerbates the risk of reconstruction failures
Passive Redundancy is Inefficient
Ensuring reliability in the worst case requires
adding considerable extra redundancy, making it
unattractive from a cost perspective
Motivation
• Ensure data safety with minimal redundancy
• Proactively recognize impending failures and
migrate vulnerable data in advance
Methodology
• Identify indicator of impending failure
• Indicator characterization
• Proactive protection
RAIDSHIELD, The Proactive Protection
Potential indicators
• Various disk errors
Criteria of a good indicator
• It occurs much more frequently on failed
disks than on working disks
Approach
• Quantify how well the error count separates
failed disks from working ones
– Deciles comparison is used
Identify Failure Indicator
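A deciles comparison like the one described on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names and the sample error counts are hypothetical, and the "inclusive" quantile method is an arbitrary choice.

```python
from statistics import quantiles


def deciles(values):
    """Return the nine decile cut points (10th to 90th percentile) of a sample."""
    return quantiles(values, n=10, method="inclusive")


def compare_deciles(failed_counts, working_counts):
    """Pair the deciles of failed and working disks as (decile, failed, working)."""
    return list(zip(range(1, 10), deciles(failed_counts), deciles(working_counts)))


# Hypothetical per-disk error counts, invented for illustration only.
failed = [0, 2, 5, 9, 20, 40, 75, 120, 300, 800, 1500]
working = [0, 0, 0, 0, 0, 0, 1, 1, 2, 4, 10]

for decile, failed_value, working_value in compare_deciles(failed, working):
    print(f"decile {decile}: failed={failed_value:.1f} working={working_value:.1f}")
```

An error type is a useful indicator when the failed-disk deciles sit well above the working-disk deciles across most of the range, not just in the tail.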
Failed disks have more media errors than working ones
The discrimination is not significant enough
Media Error Comparison
[Figure: media error count by decile for A-2, failed disks versus working disks. X-axis: Deciles (1 to 9). Y-axis: Media Error Count.
Failed disks: 2, 3, 5, 9, 15, 22, 32, 47, 86
Working disks: 1, 1, 1, 2, 3, 4, 7, 13, 30]
[Figure: reallocated sector count by decile for A-2, failed disks versus working disks. X-axis: Deciles (1 to 9). Y-axis: Reallocated Sector Count.
Failed disks: 2, 23, 87, 187, 327, 522, 812, 1242, 2025
Working disks: 0, 0, 0, 0, 0, 1, 2, 6, 29]
RS is strongly correlated with disk failures
Reallocated Sector (RS) Comparison
Most failed drives tend to have a larger number of
RS than working ones
RS is strongly correlated with whole-disk failures,
followed by media errors, pending sector errors
and uncorrectable sector errors
Correlation Between Sector Errors
And Whole-disk Failure
RS is a strong indicator of impending disk failure
Larger RS count implies higher failure rate in
two-month observation window
Disk Failure Rate Given Different RS Count
[Figure: disk failure rate versus RS count for A-2. X-axis: RS count (0 to 600 in steps of 40). Y-axis: Disk Failure Rate (%).
Point labels in order of increasing RS count: 1.7, 67, 75, 80, 83, 85, 86, 89, 90, 90, 91, 92, 93, 93, 94, 95]
RS Characterization (1)
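One way to compute a failure rate conditioned on RS count is sketched below. The slide does not say exactly how disks are grouped, so this assumes the rate is taken over all disks whose RS count is at least a given value and that each record already says whether the disk failed within the observation window. The function name and the sample records are hypothetical.

```python
def failure_rate_at_or_above(disks, rs_min):
    """Percentage of disks with RS count >= rs_min that failed within the window.

    `disks` is a list of (rs_count, failed_in_window) pairs.
    Returns None when no disk reaches rs_min.
    """
    selected = [failed for rs_count, failed in disks if rs_count >= rs_min]
    if not selected:
        return None
    return 100.0 * sum(selected) / len(selected)


# Hypothetical records, invented for illustration only.
disks = [
    (0, False), (0, False), (0, False), (5, False),
    (45, True), (50, False), (130, True), (400, True),
]

for rs_min in (0, 40, 120):
    print(rs_min, failure_rate_at_or_above(disks, rs_min))
```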
Larger RS count, faster to fail
[Figure: Disk Failure Time Given Different RS Count. Axes: RS Count and Time Margin (Days). Series: 10%, 25%, median, 75%, 90%.]
RS Characterization (2)
RS count indicates the degree of disk
reliability deterioration
Use the RS count to predict impending
disk failure in advance
PLATE: Single Disk Proactive Protection
Simulation Result: Failures Captured Rate Given Different RS Threshold
[Figure: percentage (%) of failures predicted and of false positives versus RS threshold. Data labels:]
RS threshold   Failures predicted (%)   False positive (%)
20             70.1                     4.5
40             66.6                     2.8
60             64                       2.1
80             61.8                     1.7
100            59.9                     1.4
200            52.1                     0.8
300            47                       0.7
400            42.6                     0.4
500            39                       0.3
600            36.9                     0.27
Both the predicted failure and false positive rates decrease as the threshold increases
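The threshold trade-off in this simulation can be sketched as follows. This is an illustrative approximation, not the authors' simulator: it assumes a disk is flagged once its RS count reaches the threshold, that "failures predicted" is the share of failed disks flagged, and that "false positive" is the share of working disks flagged. The function name and the sample records are hypothetical.

```python
def evaluate_threshold(disks, threshold):
    """Return (percent of failed disks flagged, percent of working disks flagged).

    `disks` is a list of (rs_count, did_fail) pairs; a disk is flagged when
    its RS count is at least `threshold`.
    """
    failed = [rs for rs, did_fail in disks if did_fail]
    working = [rs for rs, did_fail in disks if not did_fail]
    if not failed or not working:
        raise ValueError("need at least one failed and one working disk")
    predicted = sum(rs >= threshold for rs in failed)
    false_positives = sum(rs >= threshold for rs in working)
    return (100.0 * predicted / len(failed),
            100.0 * false_positives / len(working))


# Hypothetical records, invented for illustration only.
disks = [
    (500, True), (250, True), (30, True), (0, True),
    (300, False), (25, False), (0, False), (0, False),
]

for threshold in (20, 200, 600):
    print(threshold, evaluate_threshold(disks, threshold))
```

Raising the threshold can only shrink both flagged sets, which is why both rates fall together. Choosing a threshold means trading missed failures against unnecessary replacements.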
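PLATE Deployment Result: Causes of Recovery Incidents
[Figure: bars of recovery-incident causes, Without Proactive Protection versus With Proactive Protection. Y-axis: Percentage (%). Legend: Hardware Failures, Others, Triple Failures, Eliminated Triple Failures. Triple Failures: 80 without protection, 10 with protection. Eliminated Triple Failures: 70. The two remaining segments are labeled 5 and 15 in both bars.]
Single-disk proactive protection eliminates about 70% of RAID failures, equivalent to 88% of the triple-disk failures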
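The two percentages on this slide are consistent under one reading of the chart labels, taken here as an assumption: triple failures make up 80% of recovery incidents without proactive protection, and 70 of those percentage points are eliminated with it.

```latex
\[
\frac{\text{eliminated triple failures}}{\text{triple failures without protection}}
  = \frac{70}{80} = 0.875 \approx 88\%
\]
```

The 10 points left over match the "10% remaining triple failures" that motivate ARMOR.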
10% remaining triple failures
• PLATE misses RAID failures caused by multiple less reliable
drives whose RS counts haven’t exceeded the threshold
Triage
• Prioritize disk groups with highest risk
Motivation of ARMOR:
The RAID Group Proactive Protection
[Figure: four disk groups (DG1 to DG4) of four disks each, labeled 1-1 through 4-4, with some disks marked "?" or "X". Legend: Threat of Failure, Imminent Failure, Good Disk. Group labels: Healthy DG1, Imminent Failure of DG2, Protected DG3, Possible Failure of DG4.]
Single disk protection (PLATE): Replace 2-3, 2-4, 3-4
• Can’t identify DG4 nor the difference between DG2 and DG3
Group protection (ARMOR): Replace DG4 or increase redundancy
• Protect DG4 and recognize the difference between DG2 and DG3
Disk Group Protection Example
Calculate the single disk failure probability
• Conditional probability through Bayes Theorem
Calculate the probability of a vulnerable RAID
• Combination of those single disk probabilities through joint probability
ARMOR Methodology
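The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and input probabilities are hypothetical, disks are treated as failing independently, and a group is counted as vulnerable when at least `k` of its disks fail (the slide does not state the cutoff).

```python
def disk_failure_probability(p_fail, p_rs_given_fail, p_rs):
    """Bayes' theorem: P(fail | RS) = P(RS | fail) * P(fail) / P(RS).

    "RS" stands for the observed RS evidence, e.g. a count at or above some value.
    """
    if p_rs <= 0:
        raise ValueError("P(RS) must be positive")
    return p_rs_given_fail * p_fail / p_rs


def prob_at_least(probabilities, k):
    """P(at least k disks fail), assuming the disks fail independently.

    Builds the distribution of the number of failures one disk at a time.
    """
    dist = [1.0]  # dist[j] = P(exactly j failures among the disks seen so far)
    for p in probabilities:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)   # this disk survives
            nxt[j + 1] += q * p       # this disk fails
        dist = nxt
    return sum(dist[k:])


# Hypothetical numbers, invented for illustration only.
p_disk = disk_failure_probability(p_fail=0.02, p_rs_given_fail=0.6, p_rs=0.03)
group = [p_disk, 0.2, 0.1, 0.05]
print(p_disk, prob_at_least(group, k=2))
```

Ranking disk groups by this probability gives the triage order: groups with several moderately risky disks can outrank a group with a single disk that is close to failing.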
The discrimination shows ARMOR is effective at recognizing endangered DGs
In practice, it identifies most DG failures that are not predicted by PLATE
Evaluation
[Figure: deciles distribution of the calculated probability for disk groups. X-axis: Deciles (1 to 9). Y-axis: Probability.
Groups with more than one failure: 0.25, 0.33, 0.39, 0.44, 0.46, 0.5, 0.63, 0.73, 0.93
Groups without failure: 0.15, 0.2, 0.23, 0.25, 0.27, 0.28, 0.3, 0.31, 0.32]
Google reports that SMART metrics such as reallocated sectors strongly suggest an impending failure, but also finds that half of the failed disks show no such errors [Pinheiro’07]
• Different workload and RAID rewrite
Disk failure prediction
• Average maximum latency [Goldszmidt’12]
• SMART failure prediction [Murray’05, Hughes’02]
Related Work
We analyzed 1 million SATA drives
• Observe failure modes degrading RAID reliability
• Reveal RS count reflects the disk reliability deterioration
• Disk failure is predictable
We built RAIDSHIELD, an active defense mechanism
• PLATE: single disk proactive protection
– Deployment eliminates 70% of RAID failures
• ARMOR: disk group proactive protection
– Recognize vulnerable RAID groups
– Hope to deploy in future
Is adding extra redundancy an efficient solution?
• Use as much redundancy as needed to ensure availability
• Proactive replacement should decrease the level needed
Summary
RAIDShield: Characterizing, Monitoring, and
Proactively Protecting Against Disk Failures
Questions?
Acknowledgements: Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau,
Data Domain engineering team, members of AD and CTO office, Stephen Manley
Contact: [email protected]