Page 1: Resilience at Scale: The importance of real world data Bianca Schroeder Computer Science Department University of Toronto

Resilience at Scale: The importance of real world data

Bianca Schroeder

Computer Science Department, University of Toronto

Page 2

Bianca Schroeder © Apr 21, 2023

Reliability is important

• Failures are frustrating and expensive.

• Might get worse in the future with increasing scale & component count

• Why has there not been more progress?

Page 3

Failures are not very well understood

• “Much academic and corporate research is based on anecdotes and back of the envelope calculations” [Schwarz06].

• “Most papers use simplistic assumptions about component failures ..” [Patterson99].

• Why? No publicly available data on failures in real systems.

Page 4

Examples from real-world data

• Types of failures covered:

– Cluster node outages (records of more than 23,000 outages)

– Storage failures (data covering more than 100,000 drives)

– DRAM errors

[FAST 07] Joint w/ Gibson. Best paper award.

[SciDAC 07] Joint w/ Gibson.

[FAST 08] Joint w/ Bairavasundaram. Best paper award.

[TOS 08] Joint w/ Bairavasundaram et al.

[DSN 06] Joint w/ Gibson.

[TDSC 08] Joint w/ Gibson.

[Sigmetrics 09] Joint w/ Pinheiro, Weber. Best presentation award.

Page 5

The data:

• Hard drive failures

– Data covers > 100,000 drives

– SATA, FC, SCSI

– Enterprise and HEC environment

• Errors in DRAM

– Data read differently from how it was written

– Both correctable & uncorrectable, soft & hard

– Data covers all of Google’s fleet

– DDR1, DDR2, FBDIMM

– 5 different manufacturers, 6 hardware platforms

Page 6

Frequency of errors in today’s systems

• Example 1: [Sigmetrics’09] DRAM errors in the field

[Figure: Correctable errors (CEs); number of CEs / year by hardware platform (A to F)]

– Dominated by hard errors, not soft errors

– Not getting worse with newer generations

• Example 2: [FAST’06,TOS’07] HDD replacements in the field

[Figure labels: Data sheet, Field, Lab tests, SATA]

– SATA not less reliable than SCSI & FC

• Accelerated lab tests and vendor data sheets are not enough

• Need real field data!
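One way to put a data sheet figure and field data on the same scale is to convert both to an annual rate. The sketch below assumes the usual conversion from MTTF to a nominal annualized failure rate (hours per year divided by MTTF) and compares it with an observed annualized replacement rate. The function names and all numbers are hypothetical, chosen only for illustration; they are not taken from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def datasheet_afr(mttf_hours: float) -> float:
    """Nominal annualized failure rate implied by a data sheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def observed_arr(replacements: int, drive_years: float) -> float:
    """Annualized replacement rate observed in the field."""
    return replacements / drive_years


# Hypothetical inputs, for illustration only (not from the talk).
mttf_hours = 1_000_000   # data sheet MTTF
replacements = 300       # drives replaced in the observation window
drive_years = 10_000.0   # total drive-years of operation observed

afr = datasheet_afr(mttf_hours)
arr = observed_arr(replacements, drive_years)
print(f"data sheet AFR: {afr:.2%}")
print(f"observed ARR:   {arr:.2%}")
print(f"field / data sheet ratio: {arr / afr:.1f}x")
```

A ratio well above 1 would indicate that the data sheet understates what operators see in the field.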

Page 7

Effect of age?

Nominal lifetime – 5 years

• Theory:

Little effect during nominal lifetime

• Practice: [FAST’06,Sigmetrics’09]

Surprisingly early wear-out

Infant mortality no concern

[Figure panels: HDD replacements; DRAM errors]

Page 8

Effect of temperature?

• Theory:

Effect known from lab experiments

• Practice: [FAST’06,Sigmetrics’09]

Unclear effect in the field

[Figure panels: HDD replacements; DRAM errors. Axis labels: Time, Error rate]

• Similar results for latent sector errors in hard drives

Page 9

Statistical properties?

• Theory:

Poisson process

- independent failures

- exponential time between failures

• Practice: [FAST’06,Sigmetrics’09]

Correlations

- autocorrelation

- long-range dependence

Long tails in time between failures.

[Figure: expected number of failures in a week (y-axis, 0 to 5) vs. number of failures in previous week (x-axis: SMALL, MEDIUM, LARGE); series: Data, Independence]
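The comparison between observed data and the independence assumption can be sketched as follows. Under independence, the expected number of failures in a week does not depend on the previous week, so the baseline is simply the overall weekly mean. The code groups weeks by the previous week's count and also computes a lag-1 autocorrelation. The weekly counts and the bucket boundaries are hypothetical, and the function names are illustrative; none of them come from the talk.

```python
from statistics import mean


def lag1_autocorrelation(counts):
    """Sample lag-1 autocorrelation of a sequence of weekly failure counts."""
    m = mean(counts)
    num = sum((a - m) * (b - m) for a, b in zip(counts, counts[1:]))
    den = sum((c - m) ** 2 for c in counts)
    return num / den if den else 0.0


def expected_next_week(counts, buckets):
    """Mean failures in week t+1, grouped by the bucket of week t's count.

    `buckets` maps a label to an inclusive (low, high) range of counts.
    Buckets with no matching weeks map to None.
    """
    result = {}
    for label, (low, high) in buckets.items():
        following = [nxt for prev, nxt in zip(counts, counts[1:])
                     if low <= prev <= high]
        result[label] = mean(following) if following else None
    return result


# Hypothetical weekly failure counts: quiet weeks with occasional bursts.
weekly = [0, 1, 0, 0, 5, 6, 4, 0, 1, 0, 0, 7,
          5, 0, 0, 1, 0, 4, 6, 5, 0, 0, 1, 0]
# Hypothetical bucket boundaries for SMALL / MEDIUM / LARGE.
buckets = {"SMALL": (0, 1), "MEDIUM": (2, 4), "LARGE": (5, float("inf"))}

print("independence baseline:", round(mean(weekly), 2))
print("lag-1 autocorrelation:", round(lag1_autocorrelation(weekly), 2))
for label, value in expected_next_week(weekly, buckets).items():
    print(label, value)
```

If failures were independent, every bucket would sit near the baseline and the autocorrelation would be close to zero.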

Page 10

Failures are not very well understood

• Failures often look different from common assumptions

– Even for basic properties, such as frequency.

– Impact of factors such as age, workload, environmental factors, etc.

– Statistical properties

• Found this to be true for various types of errors:

– Hard drive replacements

– Memory errors

– Cluster node outages

– Latent sector errors

– Data corruption

• Does it matter?

Page 11

Probability of a RAID failure

• Depends on probability of second failure during reconstruction.

• Approach 1: Use datasheet MTTF and exponential distribution.

[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); y-axis 0 to 6 × 10^-3; legend includes Appr. 1 and Data]


Page 13

Probability of a RAID failure

• Depends on probability of second failure during reconstruction.

• Approach 1: Use datasheet MTTF and exponential distribution.

• Approach 2: Use measured MTTF and exponential distribution.

[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); y-axis 0 to 6 × 10^-3; legend: Appr. 1, Appr. 2, Data]
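Approaches 1 and 2 can be sketched with the same formula, differing only in which MTTF is plugged in. The sketch assumes independent drives with exponentially distributed lifetimes, so the probability that at least one surviving drive fails during a rebuild of length t is 1 - exp(-n·t/MTTF), where n is the number of surviving drives. The function name, the array size, and both MTTF values are hypothetical and not taken from the talk.

```python
import math


def p_second_failure(mttf_hours, rebuild_hours, surviving_drives):
    """P(at least one surviving drive fails during the rebuild window),
    assuming independent drives with exponential lifetimes."""
    rate = surviving_drives / mttf_hours  # combined failure rate per hour
    # 1 - exp(-rate * t), computed with expm1 for accuracy at tiny values
    return -math.expm1(-rate * rebuild_hours)


# Hypothetical inputs, for illustration only (not from the talk).
DATASHEET_MTTF = 1_000_000  # hours, as a vendor might state it (Approach 1)
MEASURED_MTTF = 300_000     # hours, as field data might suggest (Approach 2)
SURVIVING = 7               # drives left in the array after the first failure

for hours in (1, 3, 6):
    a1 = p_second_failure(DATASHEET_MTTF, hours, SURVIVING)
    a2 = p_second_failure(MEASURED_MTTF, hours, SURVIVING)
    print(f"{hours}h  approach 1: {a1:.2e}  approach 2: {a2:.2e}")
```

Both variants still assume independence and a constant failure rate, which is what the comparison with real data calls into question.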

Page 14

Probability of a RAID failure

• Depends on probability of second failure during reconstruction.

• Approach 1: Use datasheet MTTF and exponential distribution.

• Approach 2: Use measured MTTF and exponential distribution.

• Approach 3: Use Weibull distribution fit to data.

[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); y-axis 0 to 6 × 10^-3; legend: Appr. 1, Appr. 2, Appr. 3, Data]

Page 15

Probability of a RAID failure

• Depends on probability of second failure during reconstruction.

• Approach 1: Use datasheet MTTF and exponential distribution.

• Approach 2: Use measured MTTF and exponential distribution.

• Approach 3: Use Weibull distribution fit to data.

[Figure: probability (%) vs. reconstruction time (1h, 3h, 6h); y-axis 0 to 1.6 × 10^-2; legend: Appr. 1, Appr. 2, Appr. 3, Data]
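To see why the choice of distribution matters, the sketch below compares an exponential and a Weibull model with the same mean time between failures. It treats failures in the array as a renewal process and asks how likely the next failure is within t hours of the previous one. A Weibull shape below 1 means a decreasing hazard rate, so failures cluster shortly after one another. The shape value 0.7 and the mean of 1,000 hours are hypothetical choices for illustration, not fitted values from the talk, and the function names are illustrative.

```python
import math


def weibull_scale_for_mean(mean_hours, shape):
    """Weibull scale parameter that yields the given mean.

    Uses mean = scale * Gamma(1 + 1/shape).
    """
    return mean_hours / math.gamma(1.0 + 1.0 / shape)


def p_failure_within(t_hours, mean_hours, shape=1.0):
    """P(next failure occurs within t_hours of the previous one).

    shape == 1.0 reduces to the exponential distribution.
    """
    scale = weibull_scale_for_mean(mean_hours, shape)
    # Weibull CDF: 1 - exp(-(t / scale) ** shape)
    return -math.expm1(-((t_hours / scale) ** shape))


# Hypothetical parameters, for illustration only (not from the talk).
MEAN_TBF = 1_000.0  # mean time between failures in the array, hours
SHAPE = 0.7         # Weibull shape < 1: decreasing hazard rate

for hours in (1, 3, 6):
    expo = p_failure_within(hours, MEAN_TBF, shape=1.0)
    weib = p_failure_within(hours, MEAN_TBF, shape=SHAPE)
    print(f"{hours}h  exponential: {expo:.2e}  Weibull: {weib:.2e}")
```

With the same mean, the Weibull model with shape below 1 assigns a noticeably higher probability to a second failure within a short reconstruction window than the exponential model does.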

Page 16

Conclusion

• Failures often not well understood

• It matters when designing systems!

• Need real world data!

Page 17

The computer failure data repository (CFDR)

• Gather & publish real failure data

• Community effort

• Usenix clearinghouse

• Data on all aspects of system failure

• Anonymized as needed

• www.cfdr.usenix.org

Do you have any data to contribute?