TRANSCRIPT
25 Oct 2002 [email protected] HEPiX 1
Measurements of Hardware Reliability in the Fermilab Farms
HEPiX/HEPNT, Oct 25, 2002
S. Timm
Fermilab Computing Division
Operating Systems Support Dept
Scientific Computing Support Group
Introduction
• Four groups of Linux nodes have made it through their three-year life cycle (186 machines).
• All are from commodity “white box” vendors.
• Our goal: to measure the hardware failure rate and calculate the total cost of ownership.
Burn-in and Service
• All nodes are given a 30-day burn-in:
  – CPU test with seti@home
  – Disk test with bonnie
  – Network test with nettest
• Failures during the burn-in period are the vendor’s problem to fix (parts and labor).
• After the burn-in period there is a 3-year warranty on parts; Fermilab covers the labor through the on-site service provider Decision One.
• Lemon law: any node down for 5 straight days, or down on 5 separate occasions, must be completely replaced.
Definition of Hardware Fault
• Failure of hardware that renders the machine unusable.
• Hardware changed out during the burn-in period doesn’t count.
• Fan replacements (routine maintenance) don’t count.
• Sometimes we replaced a disk and it didn’t solve the problem; that still counts.
• Multiple service calls in the same incident count as a single hardware fault.
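The counting rules above amount to a small filter over service-call records. A minimal sketch; the records and field names below are invented purely to illustrate the bookkeeping:

```python
# Hypothetical service-call records; node names and fields are illustrative.
calls = [
    {"node": "fnpc3", "incident": 101, "during_burn_in": False, "routine": False},
    {"node": "fnpc3", "incident": 101, "during_burn_in": False, "routine": False},  # follow-up call, same incident
    {"node": "fnpc7", "incident": 102, "during_burn_in": True,  "routine": False},  # burn-in swap: excluded
    {"node": "fnpc9", "incident": 103, "during_burn_in": False, "routine": True},   # fan replacement: excluded
]

# A hardware fault is one unique incident, excluding burn-in swaps
# and routine maintenance, however many calls it generated.
faults = {c["incident"] for c in calls
          if not c["during_burn_in"] and not c["routine"]}
print(len(faults))  # 1 hardware fault
```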
Infant Mortality
• The routine hardware-call counts don’t include swap-outs during the burn-in period.
• We expect, and are prepared for, initial quality problems.
• During install and burn-in we have demanded and received total swap-outs of:
  – Motherboards (2 different times)
  – Cases (once)
  – Racks (once)
  – Power supplies (twice)
  – System disks (twice in the same group of nodes)
IDE/DMA Errors
• The ServerWorks LE chipset had a broken IDE controller.
• Observed in the following Pentium III boards: Tyan 2510, 2518, Intel STL2, SCB2, Supermicro 370DLE, ASUS CUR-DLS; basically anything for sale in 2001. (The Tyan 2518 was the best of a bad lot.)
• The hardware fault was observed in both Windows and Linux, and with a hardware logic analyzer: the chipset thought DMA was still on even though the drive had finished the transfer.
• The system was most sensitive when trying to write the system disk and swap at the same time.
IDE/DMA Errors, cont’d
• Behavior varied by disk drive: Seagate drives saw file corruption; Western Digital drives caused occasional system hangs; IBM drives were OK (up to the 2.4.9 kernel).
• The vendor did 2 complete system-disk swaps, first WD, then IBM.
• The problem reappears with a new 2.4.18 kernel “feature” that shuts down the drive and halts the machine if one of these errors occurs.
• Most IDE/DMA errors are not counted in the error summary below.
CPU Power: Fermi Cycles
• CPU clock-speed numbers are not comparable between Intel PIII, Intel Xeon (P4), and AMD Athlon MP.
• SPEC CPU2000 numbers don’t go back far enough for historical comparison.
• We define a PIII at 1 GHz = 1000 Fermi Cycles.
• The compilers Fermilab is tied to can’t deliver the full performance promised by the SPEC CPU2000 numbers; the AMD MP1800+ is faster than the Xeon 2.0 GHz.
• Performance is measured by the real performance of our applications on these systems.
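The definition above amounts to a simple rescaling. A minimal sketch, assuming a CPU’s rating comes from its measured application-throughput ratio against the baseline (the ratio used below is a hypothetical measurement, not real data):

```python
# By definition, a PIII at 1 GHz rates 1000 Fermi Cycles; any other CPU is
# rated by its measured application throughput relative to that baseline.
PIII_1GHZ_FERMI_CYCLES = 1000

def fermi_cycles(throughput_ratio_vs_piii_1ghz):
    """Rate a CPU from its measured speed relative to a PIII at 1 GHz."""
    return PIII_1GHZ_FERMI_CYCLES * throughput_ratio_vs_piii_1ghz

# Hypothetical measurement: a CPU running our applications 31.8% faster
# than the baseline would rate 1318 Fermi Cycles.
print(fermi_cycles(1.318))  # 1318.0
```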
Farms Buying History
Purch. Yr.  Type       # nodes  Cost ($)  Fermi Cyc.  FC / $
Jun 1998    PII 333         36    85,128      21,600    0.25
Sep 1999    PIII 500       150   409,400     141,900    0.34
Sep 2000    PIII 750        50   212,955      75,000    0.37
Jan 2001    PIII 800        40   110,410      64,000    0.57
Jun 2001    PIII 1000      136   341,060     272,000    0.80
Dec 2001    PIII 1000       32    61,980      64,000    0.96
Feb 2002    PIII 1266       16    33,768      42,176    1.24
Mar 2002    PIII 1266       32    77,760      84,352    1.08
Sep 2002    AMD 2000       240   403,000     810,240    2.01
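The last column can be recomputed from the cost and Fermi-cycle totals. A quick check (a couple of rows come out slightly different from the table, presumably due to rounding in the original figures):

```python
# Purchase history rows: (date, CPU type, nodes, cost in $, Fermi cycles)
purchases = [
    ("Jun 1998", "PII 333",    36,  85128,  21600),
    ("Sep 1999", "PIII 500",  150, 409400, 141900),
    ("Sep 2000", "PIII 750",   50, 212955,  75000),
    ("Jan 2001", "PIII 800",   40, 110410,  64000),
    ("Jun 2001", "PIII 1000", 136, 341060, 272000),
    ("Dec 2001", "PIII 1000",  32,  61980,  64000),
    ("Feb 2002", "PIII 1266",  16,  33768,  42176),
    ("Mar 2002", "PIII 1266",  32,  77760,  84352),
    ("Sep 2002", "AMD 2000",  240, 403000, 810240),
]

# Fermi cycles per dollar = total Fermi cycles / total cost
for when, cpu, nodes, cost, fc in purchases:
    print(f"{when}  {cpu:9s}  {fc / cost:.2f} FC/$")
```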
First Linux Farm
• 36 nodes; ran from 1998-2001.
• 32 hardware failures: 25 system disks, six power supplies, one memory.
• These nodes had only one disk, used for system, staging, swap, everything, and they swapped heavily due to low memory.
• Failures correlated with power outages.
• Rate: 0.024 failures/machine-month.
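The quoted rate is just failures divided by machine-months. A sketch, assuming roughly 37 months of service (the exact service period is not stated on the slide):

```python
def failure_rate(failures, machines, months):
    """Failures per machine-month."""
    return failures / (machines * months)

# First Linux farm: 32 failures on 36 nodes over an assumed ~37 months.
print(round(failure_rate(32, 36, 37), 3))  # 0.024, matching the quoted rate
```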
Mini-tower Farms, 1999
• 150 nodes, organized into 3 farms of 50: CDF, D0, and Fixed Target.
• Bought Sep 1999; just out of warranty as of Sep 2002.
• 140 nodes are still in the farm; statistics are based on them.
• 3 disks in each: one system and 2 data.
Mini-tower Farms, cont’d
• Fixed Target: 50 nodes, only 5 service calls over 3 years.
  – 1 memory problem, 1 bad data disk, 3 bad motherboards (one caused by a failed BIOS upgrade).
• CDF: 50 nodes, 19 service calls over 3 years.
  – 5 system disks, 2 power supplies, 9 data disks, 2 motherboards, 1 CPU.
• D0: 40 nodes, 18 service calls over 3 years.
  – 9 system disks, 2 power supplies, 3 data disks, 3 motherboards, 1 network card.
Failures per Month by Cluster

[Chart: failures per month, Sep 1999 - Sep 2002, for fnpc201-250, fncdf1-50, and fnd01-40.]

Frequency Distribution of Failures

[Chart: for each cluster, the number of months with 0-6 failures, for fnpc1-37, fnpc201-250, fncdf1-50, and fnd01-40.]
Analysis
• Four different failure rates:
  – Old farm: 0.024 failures/machine-month
  – FT farm: 0.0028 +/- 0.0012 failures/machine-month
  – CDF: 0.0083 +/- 0.0021 failures/machine-month
  – D0: 0.0130 +/- 0.0044 failures/machine-month
• Statistical analysis reveals that the distributions are not statistically consistent with each other, and also not Poisson.
• CDF and D0 are identical hardware in the same computer room.
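A Poisson-consistency check like the one mentioned above can be sketched as a chi-square goodness-of-fit test on a cluster’s monthly failure counts. The counts below are illustrative stand-ins, not the real cluster data, and the slide does not say which test was actually used:

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Illustrative monthly failure counts for one cluster (not the real data).
counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0]
lam = sum(counts) / len(counts)  # maximum-likelihood Poisson rate

# Chi-square statistic comparing the observed number of months with each
# failure count k against the Poisson expectation.
chi2 = 0.0
for k in range(max(counts) + 1):
    observed = counts.count(k)
    expected = len(counts) * poisson_pmf(k, lam)
    chi2 += (observed - expected) ** 2 / expected
print(chi2)
```

With expected counts this small the chi-square approximation is crude; an exact or better-suited test would be preferable on real data of this size.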
Analysis, Continued
• The failure rate could depend on any of the following:
  – Frequency of use (the D0 farm is typically loaded > 98%, others less)
  – Vigilance of system administrators in finding and addressing hardware errors
  – Phase of the moon
  – Dependability of the hardware
  – Cooling efficiency
Residual Value
• The latest farm purchase got us 2 Fermi Cycles per dollar.
• The residual value of the 140 nodes bought in 1999 is $70K: they could be replaced with 40 of the nodes we are buying today.
• Cost of electricity = 180 W * 150 machines * 26,280 hrs * $0.047/kWh = $33.3K.
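The electricity figure in the last bullet checks out arithmetically:

```python
# 180 W per machine, 150 machines, 3 years (26,280 hours), $0.047/kWh.
power_kw = 0.180
machines = 150
hours = 26280
dollars_per_kwh = 0.047

cost = power_kw * machines * hours * dollars_per_kwh
print(round(cost))  # 33349, i.e. the ~$33.3K quoted above
```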
Total Cost of Ownership
• Depreciation: $339K
• Maintenance: $20K (estimate)
• Electricity: $33K (estimate)
• Memory upgrades: $23K
• Total: $415K
• Personnel: 2 FTE * 3 years; how much?
• (Doesn’t count developer time or user time.)
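The line items above do sum to the quoted total (in $K, excluding the unquantified personnel cost):

```python
# Total cost of ownership line items, in thousands of dollars.
tco_thousands = {
    "depreciation": 339,
    "maintenance (estimate)": 20,
    "electricity (estimate)": 33,
    "memory upgrades": 23,
}
print(sum(tco_thousands.values()))  # 415, matching the $415K total
```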
Lessons Learned
• Hitech has been out of business for more than a year.
• Decision One was still able to get replacement parts from component vendors, at least for processors and disk drives.
• Decision One identified a replacement motherboard, since the original one is no longer manufactured.
• Conclusion: we can survive if a vendor doesn’t stay in business for the length of the 3-year warranty.
Cost Forecast for 2U Units
• Maintenance costs will be higher:
  – We have already racked up $10K of maintenance in 1.5 years of deployment on 64 CDF nodes, for example.
  – Dominated by memory upgrades and disk swaps.
2U Intel Boards
• 50 2U nodes for D0, bought Sep ’00.
• 9 power supplies replaced during burn-in.
• Since then: 1 system disk, 2 power supplies, 6 memory, 4 data disks, 6 motherboards, 1 network card.
• Four nodes have been to the shop > 3 times.
• 0.016 failures/machine-month.
• 23 nodes for CDF, bought Jan ’01.
• 1 system disk, 11 power supplies, 1 data disk, 1 network card so far.
• 0.031 failures/machine-month.
2U Supermicro Boards
• 64 nodes for CDF, bought Jun ’01.
• 10 system disks, 2 data disks, 3 motherboards, 1 floppy, 2 batteries.
• (Not to mention a total swap of the system disks, twice.)
• 0.010 failures/machine-month.
• 40 nodes for FT, bought Jun ’01.
• Only 1 problem so far: memory.
• 0.002 failures/machine-month.
• Identical hardware in the 2 groups, but the failure rate differs by a factor of five!
2U Tyan Boards
• 32 bought for D0, arrived Dec 28, 2001 (after being sent back for new motherboards and cases).
• 3 hardware calls so far, all system disks.
• 0.003 failures/machine-month.
• 16 bought for KTeV, arrived March ’02.
• 1 hardware call so far: a data disk.
• 0.009 failures/machine-month.
• 32 bought for CDF, arrived April ’02.
• 2 hardware calls so far: a system disk and a CPU.
Summary

FARM NAME             Failures/machine-month
fnpc1-37 (old)        0.024
fncdf1-50             0.010
fnd01-40              0.012
fnpc201-250           0.003
fnd051-100            0.015
fncdf51-73            0.027
fncdf91-154           0.018
fnpc51-90             0.002
fnd0101-132           0.009
fnpc1-16              0.008
fncdf75-90,155-170    0.009
Hardware Errors by Type

[Pie chart: hardware errors by type, with categories SYSDISK, PS, MEM, DATA, MB, OTH; the legible slice values are 55, 239, 21, and 177, but their assignment to categories is not recoverable from this transcript.]
Conclusions Thus Far
• We now format a disk and check it for bad blocks before placing a service call to replace it; this can often rescue a disk.
• At the moment, software-related hangs are a much greater problem than hardware errors, and more time-consuming to diagnose.
• With 750 machines and 0.01 failures/machine-month, we can expect 8 hardware failures per month.
• GRAND TOTAL: 10,692 machine-months so far, 0.0122 failures per machine-month.
• Machines currently running are averaging 0.0105 failures per machine-month.
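Two of the numbers above can be cross-checked; note that the ~130 total failures is an inference from the quoted rate, not a figure stated on the slide:

```python
# Grand total: 10,692 machine-months at 0.0122 failures per machine-month.
implied_failures = 10692 * 0.0122
print(round(implied_failures))  # 130 hardware failures to date (inferred)

# Forecast at today's scale: 750 machines at 0.01 failures/machine-month.
print(750 * 0.01)  # 7.5, which the slide rounds up to 8 per month
```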
Cluster Errors Over Time

[Chart: errors per cluster over time, Jul 1998 - Nov 2002, for fncdf75-170, fnpc1-16, fnd0101-132, fnpc51-90, fncdf91-154, fncdf51-73, fnd051-100, fnd01-40, fncdf1-50, fnpc201-250, and fnpc1-37.]
Machines vs. Time

[Chart: number of machines (0-700) vs. time, Jul 1998 - Nov 2002.]

Failures per Machine

[Chart: failures per machine (0-0.03) over time, Jul 1998 - Jul 2002.]