Advanced Topics in Storage Systems

Disk Failures

Based on:

• Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? - Bianca Schroeder and Garth A. Gibson. FAST 2007

• Failure Trends in a Large Disk Drive Population - Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso, Google Inc. FAST 2007

Presented by: Yaroslav Kagansky

Lecture Contents

• Research methodology in this field.

• MTTF and AFR: widely used, yet not very precise.

• Various factors that affect a disk's lifetime.

• SMART data analysis and its ability to predict future disk failures.

• Conclusions and my point of view.

A few words about the papers

• Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
  • Focuses on the accuracy of MTTF and AFR and on other common assumptions about disk failures.
  • Based on hardware replacement and warranty service logs.
  • Examines various rotation speeds and interfaces (SATA, SCSI, FC).
  • Data was collected from several different organizations.

• Failure Trends in a Large Disk Drive Population
  • Focuses on building a disk failure prediction model.
  • Data was collected by a software daemon running on Google's servers.
  • Examines only inexpensive disks (5400 and 7200 rpm SATA drives).
  • Based on data from Google only.

Research methodology

• How should we define a ‘disk failure’?
  • Both papers define a failure event the same way: a drive is considered to have failed if it was replaced as part of a repair procedure.

• A hard drive is a very complicated system
  • Large amounts of data are needed in order to reach sound conclusions.

• How was the data collected?
  • Google’s system (next slide)
  • Hardware replacement and warranty service logs.

• Ignoring bad batches

The complexity of a storage system

Google’s data collection system

The daemon collects various types of information from Google's servers. The data is stored in a central repository (GFS) for later analysis. The data is analyzed with the MapReduce framework.

Reliability metrics

• Annualized Failure Rate (AFR)
  • The percentage of disk drives in a population that fail in a test, scaled to a per-year estimate.
  • Typically based on extrapolation from accelerated life test data of small populations or from returned-unit databases. Provided by the vendors.
  • Accelerated life tests don't take environmental factors into account.
  • They are poor predictors of actual failure rates.

• Mean Time To Failure (MTTF)
  • The MTTF is estimated as the number of power-on hours per year divided by the AFR.
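The relationship between MTTF and AFR can be sketched in a few lines. This is only an illustration of the formula above, assuming a drive that is powered on all year (8,760 hours). The function names are made up for this example, and the 1,500,000-hour value is included only because it reproduces a datasheet AFR of about 0.58%.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 power-on hours, assuming the drive is always on


def afr_from_mttf(mttf_hours: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Return the AFR as a fraction (0.00876 means 0.876%)."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return hours_per_year / mttf_hours


def mttf_from_afr(afr: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Return the MTTF in hours for an AFR given as a fraction."""
    if afr <= 0:
        raise ValueError("afr must be positive")
    return hours_per_year / afr


if __name__ == "__main__":
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} h -> AFR {afr_from_mttf(mttf):.2%}")
```

So an MTTF of 1,000,000 hours corresponds to an AFR of roughly 0.88%, which means that out of 1,000 always-on drives about 9 would be expected to fail each year.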

AFR inaccuracy

The figure shows a significant discrepancy between the observed annual replacement rate (ARR) and the datasheet AFR for all data sets. While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs, by data set and type, are up to a factor of 15 higher than the datasheet AFRs.
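As a rough sketch of how such a comparison could be computed from replacement logs, the snippet below treats the ARR as replacements per drive-year of service. That definition, and both function names, are assumptions made for this illustration; the only numbers taken from the slide are the worst-case 13.5% and 0.88%.

```python
def annual_replacement_rate(replacements: int, drive_years: float) -> float:
    """Replacements per drive-year of service, as a fraction (assumed definition)."""
    if drive_years <= 0:
        raise ValueError("drive_years must be positive")
    return replacements / drive_years


def discrepancy_factor(observed_arr: float, datasheet_afr: float) -> float:
    """How many times higher the observed rate is than the datasheet rate."""
    if datasheet_afr <= 0:
        raise ValueError("datasheet_afr must be positive")
    return observed_arr / datasheet_afr


if __name__ == "__main__":
    # Worst case quoted on the slide: 13.5% observed vs. 0.88% on the datasheet.
    print(round(discrepancy_factor(0.135, 0.0088), 1))  # prints 15.3
```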

Cumulative operating time

Failure rates of hardware products typically follow a “bathtub curve”, with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle. The figure shows the failure rate pattern that is expected over the life cycle of hard drives.

Age-dependent replacement rates

• Replacement rates in all years (except the first) are higher than the datasheet AFR.

• Replacement rates rise significantly as the drives age.

Age-dependent replacement rates

• A steadily increasing replacement rate contradicts the common assumption that the replacement rate stays steady after the first year.

• The figure on the previous slide shows that the early onset of wear-out seems to have a much stronger impact on lifecycle replacement rates than ‘infant mortality’.

Utilization

• We define ‘utilization’ as the fraction of time the drive is active out of the time it is powered on

• We would expect a very strong correlation between high utilization and higher failure rates.

• But the results paint a more complex picture than that.

Utilization

• Only the very young and very old disk groups appear to show the expected behavior
  • It is possible that the failure modes associated with higher utilization are more prominent early in a drive’s lifetime. The drives that survive the infant mortality phase are the least susceptible to those failure modes.

• Previous observations of high correlation between utilization and failures were based on extrapolations from manufacturers’ accelerated life experiments. Those experiments are likely to better model early-life failure characteristics, and as such they agree with the trend observed for the young age groups.

Temperature

• Temperature is often quoted as the most important environmental factor affecting disk drive reliability.
  • Previous studies have indicated that temperature deltas as low as 15 °C can nearly double disk drive failure rates.

• But again, the results are very surprising.

Temperature

• Temperature affects failure rates only at the high end of the measured range, and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.

• We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.

Failure rates and the Poisson process

• The Poisson assumption implies that the number of failures during a given time interval (e.g. a week or a month) follows a Poisson distribution.
  • A key property of a Poisson process is that failures are independent of one another.
  • A Poisson process also implies exponentially distributed times between failures, which the observed data does not fit.

• The researchers found strong correlation between the numbers of failures in consecutive weeks and months, which contradicts the independence assumption.
  • The correlation coefficient between consecutive weeks is 0.72, and between consecutive months it is 0.79.
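One simple way to check the independence assumption on a replacement log is to correlate each period's failure count with the next period's count. The sketch below does that with a hand-written Pearson correlation, and also computes the variance-to-mean ratio, which is close to 1 for Poisson-distributed counts. The weekly counts are invented for illustration, and this is not necessarily the exact procedure the researchers used.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lag1_autocorrelation(counts):
    """Correlation between each period's count and the following period's count."""
    return pearson(counts[:-1], counts[1:])


def dispersion_index(counts):
    """Sample variance divided by the mean; close to 1 for Poisson counts."""
    m = mean(counts)
    variance = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    return variance / m


if __name__ == "__main__":
    # Hypothetical weekly replacement counts, not taken from either paper.
    weekly = [2, 3, 5, 8, 9, 7, 4, 2, 1, 1, 3, 6, 9, 10, 8, 5]
    print(f"lag-1 autocorrelation: {lag1_autocorrelation(weekly):.2f}")
    print(f"dispersion index:      {dispersion_index(weekly):.2f}")
```

For independent failures the lag-1 autocorrelation should be close to zero, so values such as 0.72 or 0.79 point to bursts of failures that a Poisson model would not produce.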

Correlation between failures

Number of disk replacements in a week, depending on the number of disk replacements in the previous week.

• The fact that failure rates aren’t steady over the lifetime of the system may explain the poor fit to a Poisson process.

Using SMART data to predict failures

• SMART
  • Self-Monitoring, Analysis and Reporting Technology

• The researchers tried to build a disk failure prediction model from data that can be read from a disk’s SMART parameters.

• They tried to find the SMART parameters that have the strongest correlation with future failures.

• Can we build a reliable failure prediction model based on SMART only?

Scan Errors

• Large scan error counts can be indicative of surface defects, and therefore are believed to be indicative of lower reliability.

• They found that the group of drives with scan errors is ten times more likely to fail than the group with no errors.

• They also found that the more scan errors a disk has, the lower its chance of survival.

Reallocation Counts

• When the drive’s logic believes that a sector is damaged (typically as a result of recurring soft errors or a hard error), it can remap the faulty sector number to a new physical sector drawn from a pool of spares.

• Reallocation counts reflect the number of times this has happened, and are seen as an indication of drive surface wear.

• The researchers found that after their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts.

Probational Counts

• Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems. Probational counts can therefore be seen as a softer error indication.

• Drives with non-zero probational counts are 16 times more likely to fail within 60 days than drives with zero probational counts.
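The "N times more likely to fail" figures quoted for scan errors, reallocation counts and probational counts can be read as a ratio of failure probabilities between two groups of drives. The sketch below shows that calculation. The function names and all the counts are hypothetical and chosen only to give a round ratio; they are not data from the paper.

```python
def failure_probability(failed: int, total: int) -> float:
    """Fraction of a group of drives that failed within the observation window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def relative_risk(failed_flagged: int, total_flagged: int,
                  failed_clean: int, total_clean: int) -> float:
    """Failure probability of flagged drives divided by that of clean drives."""
    p_clean = failure_probability(failed_clean, total_clean)
    if p_clean == 0:
        raise ValueError("no failures in the clean group; ratio is undefined")
    return failure_probability(failed_flagged, total_flagged) / p_clean


if __name__ == "__main__":
    # Hypothetical counts: 70 of 1,000 flagged drives fail within 60 days,
    # compared with 50 of 10,000 drives that show no such SMART count.
    print(f"{relative_risk(70, 1_000, 50, 10_000):.1f}x more likely to fail")
```

Note that a high ratio does not by itself make a signal a good predictor: in this made-up example 93% of the flagged drives still survive the window.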

Other parameters that were studied

• The researchers also examined other parameters, but did not find a strong correlation between them and disk failures.
  • Seek errors: a seek error occurs when a disk drive fails to properly track a sector and needs to wait for another revolution to read from or write to it.
    • For some manufacturers, there is no correlation between failure rates and seek errors.
  • Power cycles: the power cycles indicator counts the number of times a drive is powered up and down.
    • For disks up to 2 years old there is no significant correlation between failures and a high power cycle count, but for drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%.

Predictive Power of SMART Parameters

• Given how strongly correlated some SMART parameters were found to be with higher failure rates, they were hopeful that accurate predictive failure models based on SMART signals could be created.

• However:
  • Out of all failed drives, over 56% have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation count, and probational count.
  • In other words, models based only on those signals can never predict more than half of the failed drives.
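The argument is a bound on recall: a failed drive whose strong signals are all zero looks the same as a healthy drive to any model that sees only those signals, so it cannot be flagged. With 56% of failed drives in that situation, the best achievable recall is 1 - 0.56 = 0.44. A minimal sketch of the same calculation on a log of failed drives follows; the dictionary keys and the sample records are hypothetical.

```python
def max_recall(failed_drives, signals):
    """Upper bound on recall for any model that looks only at `signals`.

    Each drive is a dict mapping a signal name to its count. A failed drive
    with no non-zero signal cannot be told apart from a healthy drive.
    """
    if not failed_drives:
        raise ValueError("need at least one failed drive")
    flagged = sum(
        1 for drive in failed_drives
        if any(drive.get(name, 0) > 0 for name in signals)
    )
    return flagged / len(failed_drives)


# Hypothetical key names standing in for the strong SMART signals.
SIGNALS = ("scan_errors", "reallocations", "offline_reallocations", "probational")

# Hypothetical SMART counts for five drives that later failed.
failed = [
    {"scan_errors": 3},
    {"reallocations": 1, "probational": 2},
    {},
    {"power_cycles": 40},  # not one of the strong signals
    {},
]

if __name__ == "__main__":
    print(f"at most {max_recall(failed, SIGNALS):.0%} of these failures are predictable")
```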

Conclusions

• It is very difficult to conduct serious research in the field of disk failures
  • A lot of data needs to be collected.
  • Not much related work has been done in this field; most of it is vendors’ technical papers.

• AFR, MTTF and some common assumptions about disk failures tend to be incorrect.
  • The effect of temperature on the failure rate.
  • Correlation between disk failures.

• SMART parameters can be used for building a disk failure prediction model
  • Even the most indicative parameters presented here could not predict more than half of the failures.

• It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracies.

My point of view

• Research in this field is very important
  • A lot of resources could be saved if we were able to predict disk failures.

• How can we make research in this field easier?

• Neither paper presents a good prediction model.
  • Both only criticize the current situation.
  • A good continuation of both papers would be to present a prediction model and examine its results.

• Not enough detail is given about the software running on the machines that were tested
  • (e.g., which OS and programs those servers were running)

• What about home users and small organizations?
  • Maybe the MTTF/AFR is more accurate for those users.