The evolving stable product: a contradiction in terms?
TRANSCRIPT
2
Agenda
• Process/Product Maturity Model
• Xyratex field reliability experience
• Overview of integration test methodology
• Effect of targeted testing
Xyratex Confidential
3
Over 30 years of experience in Storage Process Equipment design
Timeline (years marked on the slide: 1972, 1984, 1992, 1994, 1997, 2006)
IBM experience:
• IBM Disk Drive R&D in the UK
• IBM Disk Drive Production & Process design in the UK
• IBM Storage System R&D & Production in the UK
Xyratex experience:
• Xyratex Disk Drive Production Equipment design
• Xyratex OEM Storage Systems
Over 22 years of continuous Storage System Process Design
4
Process/Product Maturity Model
Chart: life-cycle curve with phases Introduction, Growth, Maturity, Decline
Change events plotted along the curve (tracks labeled Product and Process):
Release F/W V5
Pilot Python A (73G, 144G, 300G)
Pilot Cheetah 7 (73G, 144G, 300G)
Release firmware V7 & CPLD
Release F/W V10
Release F/W V11
Release F/W V12
Release F/W V13 & CPLD
Introduce Python A (73G, 144G, 300G)
Introduce Cheetah 7 (73G, 144G, 300G)
Release Immersion Tin raw card
Release F/W V14.30
Pilot F/W for Python A
Pilot F/W for Cheetah 7
Release F/W for Python A
Release F/W for Cheetah 7
Release F/W 16.01
Pilot Jake ATA Filer
Pilot Maxtor Calypso (250G)
Release F/W Package 21
Release F/W Package 22
Release F/W V14.30
Release F/W Package 26
Pilot Maxtor Sabre (250/320G)
Pilot Gen 1 Dongle
Pilot Hitachi K2 (500G)
Pilot Gen 2 Dongle
Introduction of RoHS ATA Card
Pilot Maxtor Sabre "Flashless" (250/320G)
Introduction of IBM Specific ATA-Filer
Release F/W Package 27
Pilot Maxtor Grizzly (500G)
Release F/W Package 30 & CPLD
Release Maxtor Sabre G2 F/W
Pilot Seagate Tonka 1.5 (250G)
RoHS (R5) BIP
Pilot Seagate Tonka 2 (500G)
Release F/W Package 32
Reality check: process and product continually evolve, and not all changes are validated, or even known about.
What is the risk mitigation plan?
Vertical axis: Volume
5
Drive "X" DPPM by failure category, April 2007 to April 2008 (chart with data table)
Axes: DPPM (0 to 7000) and Total DPPM (0 to 9000)

Category         Apr-07 May-07 Jun-07 Jul-07 Aug-07 Sep-07 Oct-07 Nov-07 Dec-07 Jan-08 Feb-08 Mar-08 Apr-08
2X                    0      0      0      0      0      0   1595      0      0      0      0      0      0
DNR                   0    241      0      0    152      0    598      0      0    163   2465    264    261
Glist                 0      0    178      0      0      0      0      0      0      0      0      0    131
Hardware              0      0    178      0      0     83      0      0      0      0      0      0      0
Other                 0      0    711      0      0    498    997   1228   5821   4412   4621    527    261
Process Induced       0      0   1245      0      0      0   1196      0      0      0      0      0      0
Recoverable           0      0      0      0      0      0      0      0      0      0      0      0      0
SMART                 0    241    356      0      0      0      0    614      0      0      0      0    392
Unrecoverable      1413   2410   1245   1279    610    912    598   1842      0    490   1232    396    784
Total DPPM         1413   2892   2668   1279    762   1493   3788   3683   5821   5065   8318   1187   1828
Mature product, failure mode introduction.
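The Total DPPM row can be cross-checked against the category rows. The sketch below copies the figures from the Drive "X" data table and compares each month's stated total with the sum of its categories. In the data as transcribed, the totals appear to match the category sum only when the Process Induced row is left out (to within 1 DPPM of rounding); that reading is an inference from the numbers, not something the slide states, and all variable and function names are illustrative.

```python
# Monthly DPPM figures copied from the Drive "X" data table (Apr-07 to Apr-08).
MONTHS = ["Apr-07", "May-07", "Jun-07", "Jul-07", "Aug-07", "Sep-07", "Oct-07",
          "Nov-07", "Dec-07", "Jan-08", "Feb-08", "Mar-08", "Apr-08"]

CATEGORIES = {
    "2X":              [0, 0, 0, 0, 0, 0, 1595, 0, 0, 0, 0, 0, 0],
    "DNR":             [0, 241, 0, 0, 152, 0, 598, 0, 0, 163, 2465, 264, 261],
    "Glist":           [0, 0, 178, 0, 0, 0, 0, 0, 0, 0, 0, 0, 131],
    "Hardware":        [0, 0, 178, 0, 0, 83, 0, 0, 0, 0, 0, 0, 0],
    "Other":           [0, 0, 711, 0, 0, 498, 997, 1228, 5821, 4412, 4621, 527, 261],
    "Process Induced": [0, 0, 1245, 0, 0, 0, 1196, 0, 0, 0, 0, 0, 0],
    "Recoverable":     [0] * 13,
    "SMART":           [0, 241, 356, 0, 0, 0, 0, 614, 0, 0, 0, 0, 392],
    "Unrecoverable":   [1413, 2410, 1245, 1279, 610, 912, 598, 1842, 0, 490,
                        1232, 396, 784],
}

TOTAL = [1413, 2892, 2668, 1279, 762, 1493, 3788, 3683, 5821, 5065, 8318,
         1187, 1828]


def category_sum(month_index, exclude=()):
    """Sum all category rows for one month, skipping any names in `exclude`."""
    return sum(row[month_index]
               for name, row in CATEGORIES.items()
               if name not in exclude)


# Absolute gap between the stated total and the recomputed sum, per month.
gaps_all = [abs(TOTAL[i] - category_sum(i)) for i in range(len(MONTHS))]
gaps_no_pi = [abs(TOTAL[i] - category_sum(i, exclude=("Process Induced",)))
              for i in range(len(MONTHS))]

# Month with the highest stated Total DPPM.
peak_month = MONTHS[TOTAL.index(max(TOTAL))]

if __name__ == "__main__":
    print("largest gap, all categories:", max(gaps_all))
    print("largest gap, excluding Process Induced:", max(gaps_no_pi))
    print("peak month:", peak_month, max(TOTAL))
```

A check like this is cheap to run whenever the table is regenerated, and it makes explicit which categories a headline DPPM figure actually includes.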
6
Xyratex Hard Disk Drive Reliability : Failure rate comparison
Industry AFR experience
Chart: Annual Failure Rate (AFR) by drive class; vertical axis ARR (%), 0% to 10%; reference lines for Enterprise target and ATA target

Series                   3 Months  6 Months  1 Year  2 Years  3 Years  4 Years  5 Years
XYR Enterprise FR        0.46%     0.51%     0.73%   1.32%    0.68%    1.04%    1.10%
Google paper Base ~AFR   2.80%     1.80%     1.75%   8%       8.70%    6%       7.40%

XYR ATA FR: 0.92%, 0.98%, 1.04%, 1.27%, 2.19% (five values in the source; their column alignment is not recoverable)
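The target values behind the Enterprise and ATA reference lines are not given in the transcript. Drive datasheets usually quote reliability as MTBF, so a common way to relate such a figure to an AFR is the constant-failure-rate (exponential) model. The following is a minimal sketch under that assumption, with continuous power-on (8760 hours per year); the MTBF values in the usage loop are hypothetical and not taken from the slides.

```python
import math

HOURS_PER_YEAR = 8760  # 24 h x 365 days, assuming continuous power-on


def afr_from_mtbf(mtbf_hours):
    """AFR as a fraction, assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr):
    """Inverse of afr_from_mtbf: the MTBF in hours implied by an AFR fraction."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # Hypothetical datasheet MTBF values, for illustration only.
    for mtbf in (500_000, 1_000_000):
        print(f"MTBF {mtbf:>9,} h -> AFR {afr_from_mtbf(mtbf):.2%}")
```

The exponential model ignores infant mortality and wear-out, which is exactly what age-binned field data such as the table above can expose.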
7
Example of the effect of CERT on FC HDD Field Reliability
• Drives subjected to our integration process with CERT have a lower failure rate in the field.
Chart: Cumulative failure rate over time
x-axis: Time (years), 0 to 5; y-axis: Cumulative failure rate (%), 0% to 5%
Series (each a Weibull-2P MLE SRM MED FM fit with data points): Drive X with CERT, Drive X No CERT, Drive Y with CERT, Drive Y No CERT
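The curves in this chart are two-parameter Weibull fits. As a rough illustration of what such a curve encodes, the sketch below evaluates the two-parameter Weibull cumulative failure fraction and hazard rate. The shape and scale values are hypothetical and are NOT fitted to the data on the slide; they are chosen only to show that a shape parameter below 1 gives a failure rate that falls with age, the pattern associated with infant mortality.

```python
import math


def weibull_cdf(t, beta, eta):
    """Cumulative fraction failed by time t for a two-parameter Weibull."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / eta) ** beta))


def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate at time t (requires t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


# Hypothetical parameters, not taken from the slide.
BETA = 0.7   # shape < 1: hazard decreases with age (early-life failures dominate)
ETA = 400.0  # scale, in years

if __name__ == "__main__":
    for years in (1, 2, 3, 4, 5):
        cdf = weibull_cdf(years, BETA, ETA)
        haz = weibull_hazard(years, BETA, ETA)
        print(f"{years} y: cumulative {cdf:.2%}, hazard {haz:.4%} per year")
```

Under this model, a screen that removes early-life failures before shipment would show up as a lower cumulative curve in the field, which is the comparison the chart makes between the CERT and No CERT populations.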
8
Example of the effect of CERT on ATA HDD Field Reliability
Chart: Cumulative rate over time
x-axis: Time (months), 0 to 12; y-axis: Cumulative failure rate (%), 0.0% to 1.5%
Series (each a Weibull-2P MLE SRM MED FM fit with data points): Drive Z with CERT, Drive Z No CERT
9
“Conventional” Process Flow, backed by deep storage knowledge
• 4 Stage Test Process
• Configure product prior to test
• Testing at subsystem level
• Apply environmental stresses
• CERT is key
• Targeted Workloads
• Automation & Data Collection
Process flow (boxes in slide order): Final Assembly -> BATs -> CERT -> Functional Test -> Hi-Pot & Safety Gnd -> Config -> ORT -> Pack/Ship
Box annotations: Basic Power-On Check; Production configuration and specification validation; Reliability Test; Safety Check
10
Targeted Testing Design Approach
• Test Function & Design Approach based on FMEA, Failure Analysis & Experience
• Identify faults associated with
  • Design
  • Process
  • Quality
  • Interoperability
Stress matrix (ratings H / M / L as given on the slide). The first five component columns fall under "Hard Disk Drive"; the remaining eight fall under "Other Storage Subsystem Components".

Stress Type           Drive  Rd/Wr  Disk  Motor /  Card   PSU  Fans  I/O   Cabling /   Plastics /  Memory  Card   Processor
                      Servo  Hd.          Bearing  Elec.             Card  Connectors  Mech                Elec.
IOPS                  H      H      L     L        H      L    L     H     L           L           H       M      H
Throughput (MB/s)     H      H      H     L        H      L    L     H     L           L           H       M      H
Power Cycle           M      M      L     H        H      H    M     H     L           L           H       H      H
Thermal Variation     H      H      M     H        H      H    H     H     H           M           H       H      H
Voltage Variation     L      L      L     M        H      M    L     H     L           L           M       H      L
Vibration             H      M      L     L        L      M    L     L     M           M           L       L      L
Redundancy Variation  L      L      L     L        L      H    H     H     L           L           L       M      L
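A matrix like this lends itself to a simple lookup structure, for example to ask which stress types are rated H for a given component when designing a targeted test. The sketch below encodes the ratings from the slide as one string per stress type. Reading H / M / L as high, medium and low is an assumption, as are the names `HDD Card Elec.` and `Subsystem Card Elec.`, which are introduced here only to tell apart the two "Card Elec." columns.

```python
# Component columns in slide order; the two "Card Elec." columns are renamed
# here (an assumption) so each key is unique.
COMPONENTS = [
    "Drive Servo", "Rd/Wr Hd.", "Disk", "Motor/Bearing", "HDD Card Elec.",
    "PSU", "Fans", "I/O Card", "Cabling/Connectors", "Plastics/Mech",
    "Memory", "Subsystem Card Elec.", "Processor",
]

# One rating letter per component, in the same order as COMPONENTS.
MATRIX = {
    "IOPS":                 "HHLLHLLHLLHMH",
    "Throughput (MB/s)":    "HHHLHLLHLLHMH",
    "Power Cycle":          "MMLHHHMHLLHHH",
    "Thermal Variation":    "HHMHHHHHHMHHH",
    "Voltage Variation":    "LLLMHMLHLLMHL",
    "Vibration":            "HMLLLMLLMMLLL",
    "Redundancy Variation": "LLLLLHHHLLLML",
}


def stresses_rated(component, rating="H"):
    """Return the stress types that carry `rating` for `component`."""
    idx = COMPONENTS.index(component)
    return [stress for stress, row in MATRIX.items() if row[idx] == rating]


def high_count(stress):
    """Number of components rated H under the given stress type."""
    return MATRIX[stress].count("H")


if __name__ == "__main__":
    print("Fans:", stresses_rated("Fans"))
    print("Disk:", stresses_rated("Disk"))
    broadest = max(MATRIX, key=high_count)
    print("Most H ratings:", broadest, high_count(broadest))
```

Keeping the matrix as data rather than prose makes it easy to re-run such queries when a rating is revised after failure analysis.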
CERT Profile (chart)
x-axis: Stage, numbered within each of Cycle 1, Cycle 2 and Cycle 3
Left axis: Temp (°C), 0 to 45, plotted as a line (Temperature Profile); right axis: 0 to 5
Legend:
• Data Compare SIOP
• Banded Sequential Write/Read Compare Test
• Certify Test Unit : Verify Disk
• Data Erasure Test
• Incremental Butterfly Write Test
• Simulated Workload SIOPs : File Server
• Simulated Workload SIOPs : OLTP
• System Ready Test
• Zero Disk
• Write NetApp Metadata
• Metadata Verification
• Certify Test Unit : Write Disk
• Temperature Profile
11
Summary
• Cannot become complacent with the testing regime
  • Evolution of a product only stops when the product's end of life is announced and the last product has shipped.
• Infant mortality failures can be screened out
  • A combination of system stress testing, customer-centric test suites and environmental stress can precipitate early-life failure modes.
• Quoted MTBF rates can be achieved
  • Our data confirms a decrease in early-life failures for drives subject to the Combined Environmental Reliability Test (CERT).
• Nearline drives are currently not as reliable as Enterprise drives
  • Nearline and Enterprise drives: the same difference?