
    Disk failures in the real world:

    What does an MTTF of 1,000,000 hours mean to you?

    Bianca Schroeder, Garth A. Gibson

    CMU-PDL-06-111

    September 2006

    Parallel Data Laboratory

    Carnegie Mellon University

    Pittsburgh, PA 15213-3890

    Abstract

    Component failure in large-scale IT installations such as cluster supercomputers or internet service providers is becoming an ever

    larger problem as the number of processors, memory chips and disks in a single cluster approaches a million.

    In this paper, we present and analyze field-gathered disk replacement data from five systems in production use at three organizations,

    two supercomputing sites and one internet service provider. About 70,000 disks are covered by this data, some for an entire lifetime

    of 5 years. All disks were high-performance enterprise disks (SCSI or FC), whose datasheet MTTF of 1,200,000 hours suggests a nominal annual failure rate of at most 0.75%.

    We find that in the field, annual disk replacement rates exceed 1%, with 2-4% common and up to 12% observed on some systems.

    This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF, and that it can be quite variable from installation to installation. We also find evidence that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a

    significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often

    assumed not to set in until after 5 years of use.

    In our statistical analysis of the data, we find that the time between failures is not well modeled by an exponential distribution, since the

    empirical distribution exhibits higher levels of variability and decreasing hazard rates. We also find significant levels of correlation

    between failures, including autocorrelation and long-range dependence.

    Acknowledgements: We thank the members and companies of the PDL Consortium (including APC, EMC, Equallogic, Hewlett-Packard,

    Hitachi, IBM, Intel, Microsoft, Network Appliance, Oracle, Panasas, Seagate, and Sun) for their interest, insights, feedback, and support.


    Keywords: Disk failure data, failure rate, lifetime data, disk reliability, mean time to failure (MTTF),

    annualized failure rate (AFR).


    1 Motivation

    Despite major efforts, both in industry and in academia, high reliability remains a major challenge in running

    large-scale IT systems, and disaster prevention and cost of actual disasters make up a large fraction of the

    total cost of ownership. With ever larger server clusters, reliability and availability are a growing problem

    for many sites, including high-performance computing systems and internet service providers. A particu-

    larly big concern is the reliability of storage systems, for several reasons. First, failure of storage can not

    only cause temporary data unavailability, but in the worst case lead to permanent data loss. Second, many

    believe that technology trends and market forces may combine to make storage system failures occur more

    frequently in the future [19]. Finally, the size of storage systems in modern, large-scale IT installations has

    grown to an unprecedented scale with thousands of storage devices, making component failures the norm

    rather than the exception [5].

    Large-scale IT systems, therefore, need better system design and management to cope with more fre-

    quent failures. One might expect increasing levels of redundancy designed for specific failure modes [2],

    for example. Such designs and management systems are based on very simple models of component failure

    and repair processes [18]. Researchers today require better knowledge about statistical properties of storage

    failure processes, such as the distribution of time between failures, in order to more accurately estimate the

    reliability of new storage system designs.

    Unfortunately, many aspects of disk failures in real systems are not well understood, as it is just human nature not to advertise the details of one's failures. As a result, practitioners usually rely on vendor-specified mean-time-to-failure (MTTF) values to model failure processes, although many are skeptical of the accuracy of those models [3, 4, 27]. Too much academic and corporate research is based on anecdotes and back-of-the-envelope calculations, rather than empirical data [22].

    The work in this paper is part of a broader research agenda with the long-term goal of providing a better

    understanding of failures in IT systems by collecting, analyzing and making publicly available a diverse set

    of real failure histories from large-scale production systems. In our pursuit, we have spoken to a number of

    large production sites and were able to convince three of them to provide failure data from several of their

    systems.

    In this paper, we provide an analysis of five data sets we have collected, with a focus on storage-related failures. The data sets come from five different large-scale production systems at three different

    sites, including two large high-performance computing sites and one large internet services site. The data

    sets vary in duration from 1 month to 5 years and cover in total a population of more than 70,000 drives

    from four different vendors. All disk drives included in the data were either SCSI or fibre-channel drives,

    commonly represented as the most reliable types of disk drives.

    We analyze the data from three different aspects. We begin in Section 3 by asking how disk failure frequencies compare to those of other hardware component failures. In Section 4, we provide a quantitative analysis of disk failure rates observed in the field and compare our observations with common predictors and models used by vendors. In Section 5, we analyze the statistical properties of disk failures. We study

    correlations between failures and identify the key properties of the statistical distribution of time between

    failures, and compare our results to common models and assumptions on disk failure characteristics.

    2 Methodology

    2.1 Data sources

    Table 1 provides an overview of the five data sets used in this study. Data sets HPC1 and HPC2 were collected in two large cluster systems at two different organizations using supercomputers. Data sets COM1, COM2, and COM3 were collected at three different cluster systems at a large internet service provider. In


    2.2 Statistical methods

    We characterize an empirical distribution using two important metrics: the mean, and the squared coefficient of variation (C2). The squared coefficient of variation is a measure of the variability of a distribution and is defined as the squared standard deviation divided by the squared mean. The advantage of using the squared coefficient of variation as a measure of variability, rather than the variance or the standard deviation, is that it is normalized by the mean, and hence allows comparison of variability across distributions with different means.
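    To make the metric concrete, here is a minimal sketch (ours, not from the paper) of how the mean and squared coefficient of variation could be computed for a set of observed times between failures; the sample values are made up purely for illustration.

```python
import numpy as np

# Hypothetical times between disk failures, in hours (illustrative values only).
time_between_failures = np.array([12.0, 250.0, 3.5, 980.0, 41.0, 7.0, 1800.0, 65.0])

mean = time_between_failures.mean()
# Squared coefficient of variation: variance divided by the squared mean.
c2 = time_between_failures.var() / mean**2

print(f"mean = {mean:.1f} hours, C2 = {c2:.2f}")
# An exponential distribution has C2 = 1; values well above 1 indicate
# more variability than the exponential model would predict.
```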

    We also consider the empirical cumulative distribution function (CDF) and how well it is fit by four probability distributions commonly used in reliability theory: the exponential distribution; the Weibull distribution; the gamma distribution; and the lognormal distribution. We parameterize the distributions through maximum likelihood estimation and evaluate the goodness of fit by visual inspection, the negative log-likelihood, and the chi-square test.
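    As an illustration of this procedure, the following sketch (our own, not the authors' code) fits the four distributions to a sample of times between failures with SciPy's maximum likelihood estimators and compares their negative log-likelihoods; the data values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical times between failures, in hours (illustrative values only).
data = np.array([12.0, 250.0, 3.5, 980.0, 41.0, 7.0, 1800.0, 65.0, 530.0, 22.0])

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "gamma": stats.gamma,
    "lognormal": stats.lognorm,
}

for name, dist in candidates.items():
    # Maximum likelihood fit; fixing the location parameter at 0 keeps the fits comparable.
    params = dist.fit(data, floc=0)
    neg_log_likelihood = -np.sum(dist.logpdf(data, *params))
    print(f"{name:12s} params={np.round(params, 3)}  -logL={neg_log_likelihood:.1f}")
# By this criterion, the fit with the smallest negative log-likelihood is the best of the four.
```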

    Since we are interested in correlations between disk failures, we need a measure for the degree of correlation. The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF can, for example, be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later. The autocorrelation coefficient can range between 1 (high positive correlation) and -1 (high negative correlation).

    Another aspect of the failure process that we will study is long-range dependence. Long-range dependence measures the memory of a process, in particular how quickly the autocorrelation coefficient decays with growing lags. The strength of the long-range dependence is quantified by the Hurst exponent. A series exhibits long-range dependence if the Hurst exponent H lies between 0.5 and 1.
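    For concreteness, the lag-l autocorrelation of a daily failure-count series could be estimated as in the sketch below (ours, with made-up data); estimating the Hurst exponent robustly requires more machinery (e.g. R/S or variance plots) and is not shown.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation coefficient of `series` at the given lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    # Covariance at the given lag divided by the lag-0 covariance (the variance).
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# Hypothetical number of disk failures observed per day (illustrative only).
failures_per_day = np.array([0, 2, 1, 0, 0, 3, 5, 4, 1, 0, 0, 1, 2, 6, 3, 1, 0, 0, 1, 2])

for lag in (1, 2, 7):
    print(f"lag {lag}: ACF = {autocorrelation(failures_per_day, lag):+.2f}")
# Values near 0 at all lags are what a memoryless (Poisson) failure process predicts.
```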


    [Figure 1 plot: bar chart of observed AFR (%), from 0 to 25, for data sets HPC1, HPC2, COM1, COM2, and COM3, with horizontal reference lines at AFR=0.88 and AFR=0.73.]

    Figure 1: Comparison of datasheet AFRs (solid and dashed lines in the graph) and AFRs observed in the field (bars in the graph). The left-most bar in a set is the result of combining all types of disks in the data set.

    4 Disk failure rates

    4.1 Specifying disk reliability and failure frequency

    Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail in a test scaled to a per-year estimate; and the mean time to failure (MTTF). The AFR of a new product is typically estimated based on accelerated life and stress tests or based on field data from earlier products [1]. The MTTF is estimated as the number of power-on hours per year (see footnote 2) divided by the AFR. The MTTFs specified for today's highest-quality disks range from 1,000,000 hours to 1,400,000 hours, corresponding to AFRs of 0.63% to 0.88%.
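    To make the conversion explicit (our restatement of the estimate above): assuming 8,760 power-on hours per year, AFR ≈ 8,760 / MTTF, so an MTTF of 1,000,000 hours corresponds to 8,760 / 1,000,000 ≈ 0.88% per year, and an MTTF of 1,400,000 hours corresponds to 8,760 / 1,400,000 ≈ 0.63% per year.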

    The AFR and MTTF estimates of the manufacturer are included in a drive's datasheet, and we refer to them in the remainder as the datasheet AFR and the datasheet MTTF. In contrast, we will refer to the AFR and MTTF computed from the data sets as the observed AFR and observed MTTF, respectively.

    4.2 Disk failures and MTTF

    In the following, we study how field experience with disk failures compares to datasheet specifications of

    disk reliability. Figure 1 shows the datasheet AFRs (horizontal solid and dashed lines) and the observed AFR

    for each of the five data sets. For HPC1 and COM3, which cover different types of disks, the graph contains

    several bars, one for the observed AFR across all types of disk (left-most bar), and one for the AFR of each

    type of disk (remaining bars in the order of the corresponding entries in Table 1).

    We observe a significant discrepancy between the observed AFR and the datasheet AFR for all data

    sets. While the datasheet AFRs are either 0.73% or 0.88%, the observed AFRs range from 1.1% to as high as 25%. That is, the observed AFRs are higher than the datasheet AFRs by a factor of 1.5 to as much as a factor of 30.

    A striking observation in Figure 1 is the huge variation of AFRs across the systems, in particular the extremely large AFRs observed in system COM3. While 3,302 of the disks in COM3 were at all times less than 5 years old, 432 of the disks in this system were installed in 1998, making them at least 7 years old at the end of the data set. Since this is well outside the vendor's nominal lifetime for disks, it is not surprising

    that the disks might be wearing out. But even without the 432 obsolete disks, COM3 has quite large AFRs.

    Footnote 2: A common assumption for enterprise drives is that they are powered on 100% of the time. Our data set providers all believe that their disks are 100% powered on.


    Figure 2: Life cycle failure pattern of hard drives [27].

    The data for HPC1 covers almost exactly 5 years, the nominal lifetime, and exhibits an AFR signifi-

    cantly higher than the datasheet AFR (3.4% instead of 0.88%). The data for COM2 covers the first 2 years

    of operation and has an AFR of 3.1%, also much higher than the datasheet AFR of 0.88%.

    It is interesting to observe that the only system that comes close to the datasheet AFR is HPC2, which, with an observed AFR of 1.1%, deviates from the datasheet AFR by only 50%. After talking to people involved in running system HPC2, we identified as a possible explanation the potentially very low usage of

    the disks in HPC2. The disks in this data set are local disks on compute nodes, whose applications primarily

    use a separate, shared parallel file system, whose disks are not included in the data set. The local disks,

    which are included in the data set, are mostly used only for booting to the operating system, and fetching

    system executables/libs. Users are allowed to write only to a smallish /tmp area of the disks and are thought

    to do this rarely, and swapping almost never happens.

    Below we summarize the key observations of this section.

    Observation 1: Variance between datasheet MTTF and field failure data is larger than one might expect.

    Observation 2: For older systems (5-8 years of age), datasheet MTTFs can underestimate failure rates by as much as a factor of 30.

    Observation 3: Even during the first few years of a system's lifetime (


    [Figure 3 plots: two bar plots of the number of failures per year versus years of operation (1-5).]

    Figure 3: Number of failures observed per year over the first 5 years of system HPC1's lifetime, for the compute nodes (left) and the file system nodes (right).

    [Figure 4 plots: two bar plots of the number of failures per month versus months of operation (0-50).]

    Figure 4: Number of failures observed per month over the first 5 years of system HPC1's lifetime, for the compute nodes (left) and the file system nodes (right).

    7-12, and one for months 13-60.

    The goal of this section is to study, based on our field replacement data, how failure rates in large-scale

    installations vary over a system's life cycle.

    The best data set to study failure rates across the system life cycle is system HPC1. The reason is that

    this data set spans the entire first 5 years of operation of a large system. Moreover, in HPC1 the hard drive

    population is homogeneous, with all 3,406 drives in the system being nearly identical (except for having

    two different sizes, 17 vs 36 GB), and the population size remained the same over the 5 years, except for the

    small fraction of replaced components.

    We study the change of failure rates across system HPC1's lifecycle at two different granularities, on a per-month and a per-year basis, to make it easier to detect both short-term and long-term trends. Figure 3

    shows the yearly failure rates for the disks in the compute nodes of system HPC1 (left) and the file system

    nodes of system HPC1 (right). We make two interesting observations. First, failure rates in all years, except

    for year 1, are dramatically larger than the datasheet MTTF would suggest. The solid line in the graph

    represents the number of failures expected per year based on the datasheet MTTF. In year 2, disk failure rates are 20% larger than expected for the file system nodes, and a factor of two larger than expected for the compute nodes. In year 4 and year 5 (which are still within the nominal lifetime of these disks), the actual failure rates are 7 to 10 times higher than expected.
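    For a rough sense of scale (our arithmetic, not a figure reported in the paper): at the datasheet AFR of 0.88%, a population of 3,406 drives would be expected to see roughly 3,406 × 0.0088 ≈ 30 replacements per year in total, split between the compute-node and file-system-node populations shown in the two plots.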

    The second observation is that failure rates are rising significantly over the years, even during early

    years in the lifecycle. Failure rates nearly double when moving from year 2 to 3 or from year 3 to 4.

    This observation suggests that wear-out may start much earlier than expected, leading to steadily increasing

    failure rates during most of a system's useful life. This is an interesting observation because it does not

    agree with the common assumption that after the first year of operation, failure rates reach a steady state for


    a few years, forming the bottom of the bathtub.

    Next, we move to the per-month view of system HPC1's failure rates, shown in Figure 4. We observe that for the file system nodes, there is no detectable infant mortality: there are no failures observed during the first 12 months of operation. In the case of the compute nodes, infant mortality is limited to the first month of operation and is not above the steady-state estimate of the datasheet MTTF. Looking at the lifecycle after month 12, we again see continuously rising failure rates, instead of the expected bottom of the bathtub.

    Below we summarize the key observations of this section.

    Observation 4: Contrary to common and proposed models, hard drive failure rates don't enter a steady state after the first year of operation. Instead, failure rates seem to steadily increase over time.

    Observation 5: Early onset of wear-out seems to have a much stronger impact on lifecycle failure rates than infant mortality, even when considering only the first 3 or 5 years of a system's lifetime. Wear-out should therefore be incorporated into new standards for disk drive reliability. The new standard suggested by IDEMA does not take wear-out into account.

    5 Statistical properties of disk failures

    In the previous sections, we have focused on aggregate failure statistics, e.g. the average failure rate in a

    time period. Often one wants more information on the statistical properties of the time between failures than

    just the mean. For example, determining the expected time to failure for a RAID system requires an estimate of the probability of experiencing a second disk failure in a short period, that is, while reconstructing lost data from redundant data. This probability depends on the underlying probability distribution and may be poorly estimated by scaling an annual failure rate down to a few hours.
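    As a sketch of the kind of calculation at stake (ours, not the paper's), the back-of-the-envelope estimate cautioned against here assumes independent, exponentially distributed failure times and simply scales the failure rate down to the reconstruction window; all parameter values below are hypothetical.

```python
import math

def prob_second_failure(mttf_hours, surviving_disks, window_hours):
    """Probability that at least one of the surviving disks fails during the
    reconstruction window, under the (questionable) assumption of independent,
    exponentially distributed failure times."""
    rate = surviving_disks / mttf_hours          # combined failure rate, per hour
    return 1.0 - math.exp(-rate * window_hours)

# Hypothetical RAID group: datasheet MTTF of 1,000,000 hours, 7 surviving
# disks after one failure, 24-hour rebuild window (illustrative values only).
print(f"{prob_second_failure(1_000_000, 7, 24):.5f}")   # ~0.00017
```

    The sections below test exactly these assumptions (independence and exponential time between failures) against the HPC1 data.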

    The most common assumption about the statistical characteristics of disk failures is that they form a

    Poisson process, which implies two key properties:

    1. Failures are independent.

    2. The time between failures follows an exponential distribution.

    The goal of this section is to evaluate how realistic the above assumptions are. We begin by providing

    statistical evidence that disk failures in the real world are unlikely to follow a Poisson process. We then

    examine in Section 5.2 and Section 5.3 each of the two key properties (independent failures and exponential

    time between failures) independently and characterize in detail how and where the Poisson assumption

    breaks. In our study, we focus on the HPC1 data set, since this is the only data set that contains precise

    failure time stamps (rather than just repair time stamps).

    5.1 The Poisson assumption

    The Poisson assumption implies that the number of failures during a given time interval (e.g. a week or a

    month) is distributed according to the Poisson distribution. Figure 5 (left) shows the empirical CDF of the number of failures observed per month in the HPC1 data set, together with the Poisson distribution fit to the data's observed mean.

    We find that the Poisson distribution does not provide a good fit for the number of failures observed

    per month in the data, in particular for very small and very large numbers of failures. For example, under

    the Poisson distribution the probability of seeing 20 failures in a given month is less than 0.0024, yet we


    [Figure 5 (left) plot: empirical CDF, Pr(X <= x), of the number of failures observed per month in HPC1, together with the Poisson fit; x-axis: number of failures per month (0-40), y-axis: 0 to 1.]
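    A minimal sketch of the kind of check described above (our illustration, not the authors' code): fit a Poisson distribution to the observed monthly failure counts and compare the tail probability it predicts with the empirical frequency of large counts; the monthly counts below are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical number of disk failures per month (illustrative values only).
failures_per_month = np.array([2, 5, 1, 0, 7, 3, 22, 4, 6, 2, 1, 9, 3, 0, 5, 18, 2, 4])

# The maximum-likelihood Poisson fit simply uses the sample mean as lambda.
lam = failures_per_month.mean()

# Compare the empirical frequency of "large" months with the Poisson prediction.
threshold = 20
empirical = np.mean(failures_per_month >= threshold)
poisson_tail = 1.0 - stats.poisson.cdf(threshold - 1, lam)
print(f"lambda = {lam:.2f}")
print(f"P(X >= {threshold}): empirical = {empirical:.3f}, Poisson fit = {poisson_tail:.6f}")
```

    A large gap between the empirical frequency and the fitted Poisson tail probability is the kind of discrepancy described in the text for the HPC1 data.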