TRANSCRIPT
Resilience at Scale: The Importance of Real World Data
Bianca Schroeder
Computer Science Department, University of Toronto
Bianca Schroeder © Apr 21, 2023
Reliability is important
• Failures are frustrating and expensive.
• Might get worse in the future with increasing scale & component count.
• Why has there not been more progress?
Failures are not very well understood
• “Much academic and corporate research is based on anecdotes and back-of-the-envelope calculations.” [Schwarz06]
• “Most papers use simplistic assumptions about component failures ...” [Patterson99]
• Why? No publicly available data on failures in real systems.
Examples from real-world data
• Types of failures covered:
• Cluster node outages (records of more than 23,000 outages)
• Storage failures (data covering more than 100,000 drives)
• DRAM errors
[FAST 07] Joint w/ Gibson. Best paper award.
[SciDAC 07] Joint w/ Gibson.
[FAST 08] Joint w/ Bairavasundaram. Best paper award.
[TOS 08] Joint w/ Bairavasundaram et al.
[DSN 06] Joint w/ Gibson.
[TDSC 08] Joint w/ Gibson.
[Sigmetrics 09] Joint w/ Pinheiro, Weber. Best presentation award.
The data:
• Hard drive failures
  – Data covers > 100,000 drives
  – SATA, FC, SCSI
  – Enterprise and HEC environments
• Errors in DRAM
  – Data read differently from how it was written
  – Both correctable & uncorrectable, soft & hard
  – Data covers all of Google’s fleet
  – DDR1, DDR2, FBDIMM
  – 5 different manufacturers, 6 hardware platforms
Frequency of errors in today’s systems
• Example 1: [Sigmetrics’09] DRAM errors in the field
[Figure: number of correctable errors (CEs) per year by hardware platform (A–F; some N/A)]
  – Dominated by hard errors, not soft errors
  – Not getting worse with newer generations
• Example 2: [FAST’06,TOS’07] HDD replacements in the field
[Figure: HDD replacement rates, field data vs. lab tests, by drive type incl. SATA]
  – SATA not less reliable than SCSI & FC
• Accelerated lab tests and vendor data sheets are not enough
• Need real field data!
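The gap between data-sheet numbers and field data can be made concrete: under the exponential lifetime model, a vendor MTTF implies an annualized failure rate (AFR). A minimal sketch; the 1,000,000-hour MTTF is a hypothetical, if typical, data-sheet figure, not a value from the talk:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def implied_afr(mttf_hours):
    """Annualized failure rate implied by a data-sheet MTTF,
    assuming exponentially distributed (memoryless) lifetimes."""
    return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

# Hypothetical data-sheet value of 1,000,000 hours
print(round(100 * implied_afr(1_000_000), 2))  # -> 0.87 (% per year)
```

Field replacement rates reported for comparable populations are typically several times this implied figure, which is the slide's point: data sheets alone mislead.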
Effect of age?
• Nominal lifetime: 5 years
• Theory: Little effect during nominal lifetime
• Practice: [FAST’06,Sigmetrics’09]
  – Surprisingly early wear-out
  – Infant mortality no concern
[Figures: failure rates vs. age for HDD replacements and DRAM errors]
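The theory-vs.-practice contrast here is about the shape of the hazard (instantaneous failure) rate over a device's life. A small sketch using the Weibull form, where shape < 1 gives infant mortality and shape > 1 gives wear-out; the parameter values are illustrative, not fitted to the talk's data:

```python
def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# shape > 1: hazard rises with age -> wear-out (seen surprisingly early in the field)
assert weibull_hazard(4.0, 5.0, 2.0) > weibull_hazard(1.0, 5.0, 2.0)
# shape < 1: hazard falls with age -> infant mortality (not a concern in the data)
assert weibull_hazard(4.0, 5.0, 0.5) < weibull_hazard(1.0, 5.0, 0.5)
```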
Effect of temperature?
• Theory: Effect known from lab experiments
• Practice: [FAST’06,Sigmetrics’09] Unclear effect in the field
[Figures: error rate over time for HDD replacements and DRAM errors]
• Similar results for latent sector errors in hard drives
Statistical properties?
• Theory: Poisson process
  – independent failures
  – exponential time between failures
• Practice: [FAST’06,Sigmetrics’09] Correlations
  – autocorrelation
  – long-range dependence
  – long tails in time between failures
[Figure: expected number of failures in a week vs. number of failures in the previous week (SMALL/MEDIUM/LARGE systems); data vs. independence assumption]
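One simple way to probe the Poisson assumption on a failure trace is the coefficient of variation (CV) of the times between failures: an exponential distribution has CV = 1, while the long-tailed behavior seen in the field inflates it well past 1. A self-contained sketch on synthetic data; the Pareto alternative and its shape parameter are illustrative, not a fit to the talk's traces:

```python
import random
import statistics

def cv(samples):
    """Coefficient of variation: stdev / mean (exactly 1 for an exponential)."""
    return statistics.pstdev(samples) / statistics.mean(samples)

random.seed(0)
n = 100_000
# Poisson-process world: exponential gaps between failures
exp_gaps = [random.expovariate(1.0) for _ in range(n)]
# Long-tailed alternative: Pareto gaps (shape 1.5, illustrative)
pareto_gaps = [random.paretovariate(1.5) for _ in range(n)]

print(round(cv(exp_gaps), 2))   # close to 1.0
print(cv(pareto_gaps) > 1.5)    # heavy tail pushes CV well above 1
```

On a real trace the same statistic, plus the autocorrelation of weekly failure counts, distinguishes the two worlds the slide contrasts.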
Failures are not very well understood
• Failures often look different from common assumptions
• Even for basic properties, such as frequency
• Impact of factors such as age, workload, environmental factors, etc.
• Statistical properties
• Found this to be true for various types of errors:
• Hard drive replacements
• Memory errors
• Cluster node outages
• Latent sector errors
• Data corruption
• Does it matter?
Probability of a RAID failure
• Depends on probability of second failure during reconstruction.
• Approach 1: Use datasheet MTTF and exponential distribution.
[Figure: probability (%) of a RAID failure vs. reconstruction time (1h, 3h, 6h), y-axis up to 6 × 10⁻³; Appr. 1 and data]
Probability of a RAID failure
• Depends on probability of second failure during reconstruction.
• Approach 1: Use datasheet MTTF and exponential distribution.
[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); Appr. 1, Weibull, and data]
Probability of a RAID failure
• Depends on probability of second failure during reconstruction.
• Approach 1: Use datasheet MTTF and exponential distribution.
• Approach 2: Use measured MTTF and exponential distribution.
[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); Appr. 1, Appr. 2, and data]
Probability of a RAID failure
• Depends on probability of second failure during reconstruction.
• Approach 1: Use datasheet MTTF and exponential distribution.
• Approach 2: Use measured MTTF and exponential distribution.
• Approach 3: Use Weibull distribution fit to data.
[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); Appr. 1, Appr. 2, Appr. 3, and data]
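The three approaches above can be put side by side in a few lines. Everything numeric here is hypothetical: the 1,000,000-hour data-sheet MTTF, the lower "measured" MTTF, the Weibull parameters, and the 7-survivor RAID group are placeholders rather than the talk's fitted values; the point is only how the modeling choice changes the answer.

```python
import math

def p_exponential(t_hours, mttf_hours, n_survivors):
    """P(at least one of n_survivors fails within t), exponential lifetimes."""
    return 1 - math.exp(-n_survivors * t_hours / mttf_hours)

def p_weibull(t_hours, scale, shape, n_survivors):
    """Same, with Weibull time-between-failures measured from the last failure.
    shape < 1 captures the clustering seen in field data: another failure is
    most likely soon after one has just happened."""
    p_one = 1 - math.exp(-((t_hours / scale) ** shape))
    return 1 - (1 - p_one) ** n_survivors

t, n = 6.0, 7  # 6-hour rebuild, 7 surviving drives (hypothetical group)
a1 = p_exponential(t, 1_000_000, n)                         # Approach 1
a2 = p_exponential(t, 300_000, n)                           # Approach 2
a3 = p_weibull(t, scale=300_000, shape=0.7, n_survivors=n)  # Approach 3
print(a1 < a2 < a3)  # each refinement raises the estimated risk
```

Whether shape < 1 is the right Weibull regime depends on what is being modeled (time between failures in a population vs. an individual drive's age); the comparison against measured data on the slides is what grounds that choice.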
Probability of a RAID failure
• Depends on probability of second failure during reconstruction.
• Approach 1: Use datasheet MTTF and exponential distribution.
• Approach 2: Use measured MTTF and exponential distribution.
• Approach 3: Use Weibull distribution fit to data.
[Figure: same comparison on a larger scale, probability (%) up to 1.6 × 10⁻²; Appr. 1, Appr. 2, Appr. 3, and data]
Conclusion
• Failures often not well understood
• It matters when designing systems!
• Need real world data!
The computer failure data repository (CFDR)
• Gather & publish real failure data
• Community effort
• Usenix clearinghouse
• Data on all aspects of system failure
• Anonymized as needed
• www.cfdr.usenix.org
Do you have any data to contribute?