TRANSCRIPT
25 Oct 2002 [email protected] HEPiX 1
Measurements of Hardware Reliability in the Fermilab Farms
HEPiX/HEPNT, Oct 25, 2002
S. Timm
Fermilab Computing Division
Operating Systems Support Dept
Scientific Computing Support Group
Introduction
• Four groups of Linux nodes have made it through their three-year life cycle (186 machines).
• All are from commodity “white box” vendors.
• Our goal: to measure the hardware failure rate and calculate the total cost of ownership.
Burn-in and Service
• All nodes are given a 30-day burn-in:
  – CPU test with seti@home
  – Disk test with bonnie
  – Network test with nettest
• Failures during the burn-in period are the vendor’s problem to fix (parts and labor).
• After the burn-in period there is a 3-year warranty on parts; Fermilab covers the labor through the on-site service provider Decision One.
• Lemon law: any node down for 5 straight days, or down on 5 separate occasions, must be completely replaced.
Definition of Hardware Fault
• Failure of hardware that renders the machine unusable.
• Hardware changed out during the burn-in period doesn’t count.
• Fan replacements (routine maintenance) don’t count.
• Sometimes we replaced a disk and it didn’t solve the problem; that still counts.
• Multiple service calls in the same incident count as a single hardware fault.
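The counting rules above amount to a small filter over service-call records. A minimal sketch; the records and field names below are invented purely to illustrate the bookkeeping:

```python
# Hypothetical service-call records; node names and fields are illustrative.
calls = [
    {"node": "fnpc3", "incident": 101, "during_burn_in": False, "routine": False},
    {"node": "fnpc3", "incident": 101, "during_burn_in": False, "routine": False},  # follow-up call, same incident
    {"node": "fnpc7", "incident": 102, "during_burn_in": True,  "routine": False},  # burn-in swap: excluded
    {"node": "fnpc9", "incident": 103, "during_burn_in": False, "routine": True},   # fan replacement: excluded
]

# A hardware fault is one unique incident, excluding burn-in swaps
# and routine maintenance, however many calls it generated.
faults = {c["incident"] for c in calls
          if not c["during_burn_in"] and not c["routine"]}
print(len(faults))  # 1 hardware fault
```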
Infant Mortality
• The routine hardware-call counts don’t include swap-outs during the burn-in period.
• We expect, and are prepared for, initial quality problems.
• During install and burn-in we have demanded and received total swap-outs of:
  – Motherboards (2 different times)
  – Cases (once)
  – Racks (once)
  – Power supplies (twice)
  – System disks (twice in the same group of nodes)
IDE/DMA Errors
• The ServerWorks LE chipset had a broken IDE controller.
• Observed in the following Pentium III boards: Tyan 2510, 2518, Intel STL2, SCB2, Supermicro 370DLE, ASUS CUR-DLS; basically anything for sale in 2001. (The Tyan 2518 was the best of a bad lot.)
• The hardware fault was observed in both Windows and Linux, and with a hardware logic analyzer: the chipset thought DMA was still on even though the drive had finished the transfer.
• The system was most sensitive when trying to write the system disk and swap at the same time.
IDE/DMA Errors, cont’d
• Behavior varied by disk drive: Seagate drives saw file corruption; Western Digital drives caused occasional system hangs; IBM drives were OK (up to the 2.4.9 kernel).
• The vendor did 2 complete system-disk swaps, first WD, then IBM.
• The problem reappears with a new 2.4.18 kernel “feature” that shuts down the drive and halts the machine if one of these errors occurs.
• Most IDE/DMA errors are not counted in the error summary below.
CPU Power: Fermi Cycles
• CPU clock-speed numbers are not comparable between Intel PIII, Intel Xeon (P4), and AMD Athlon MP.
• SPEC CPU2000 numbers don’t go back far enough for historical comparison.
• We define a PIII at 1 GHz = 1000 Fermi Cycles.
• The compilers Fermilab is tied to can’t deliver the full performance promised by the SPEC CPU2000 numbers; the AMD MP1800+ is faster than the Xeon 2.0 GHz.
• Performance is measured by the real performance of our applications on these systems.
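The definition above amounts to a simple rescaling. A minimal sketch, assuming a CPU’s rating comes from its measured application-throughput ratio against the baseline (the ratio used below is a hypothetical measurement, not real data):

```python
# By definition, a PIII at 1 GHz rates 1000 Fermi Cycles; any other CPU is
# rated by its measured application throughput relative to that baseline.
PIII_1GHZ_FERMI_CYCLES = 1000

def fermi_cycles(throughput_ratio_vs_piii_1ghz):
    """Rate a CPU from its measured speed relative to a PIII at 1 GHz."""
    return PIII_1GHZ_FERMI_CYCLES * throughput_ratio_vs_piii_1ghz

# Hypothetical measurement: a CPU running our applications 31.8% faster
# than the baseline would rate 1318 Fermi Cycles.
print(fermi_cycles(1.318))  # 1318.0
```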
Farms Buying History
Purch. Yr.  Type       # nodes  Cost ($)  Fermi Cyc.  FC / $
Jun 1998    PII 333         36    85,128      21,600    0.25
Sep 1999    PIII 500       150   409,400     141,900    0.34
Sep 2000    PIII 750        50   212,955      75,000    0.37
Jan 2001    PIII 800        40   110,410      64,000    0.57
Jun 2001    PIII 1000      136   341,060     272,000    0.80
Dec 2001    PIII 1000       32    61,980      64,000    0.96
Feb 2002    PIII 1266       16    33,768      42,176    1.24
Mar 2002    PIII 1266       32    77,760      84,352    1.08
Sep 2002    AMD 2000       240   403,000     810,240    2.01
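The last column can be recomputed from the cost and Fermi-cycle totals. A quick check (a couple of rows come out slightly different from the table, presumably due to rounding in the original figures):

```python
# Purchase history rows: (date, CPU type, nodes, cost in $, Fermi cycles)
purchases = [
    ("Jun 1998", "PII 333",    36,  85128,  21600),
    ("Sep 1999", "PIII 500",  150, 409400, 141900),
    ("Sep 2000", "PIII 750",   50, 212955,  75000),
    ("Jan 2001", "PIII 800",   40, 110410,  64000),
    ("Jun 2001", "PIII 1000", 136, 341060, 272000),
    ("Dec 2001", "PIII 1000",  32,  61980,  64000),
    ("Feb 2002", "PIII 1266",  16,  33768,  42176),
    ("Mar 2002", "PIII 1266",  32,  77760,  84352),
    ("Sep 2002", "AMD 2000",  240, 403000, 810240),
]

# Fermi cycles per dollar = total Fermi cycles / total cost
for when, cpu, nodes, cost, fc in purchases:
    print(f"{when}  {cpu:9s}  {fc / cost:.2f} FC/$")
```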
First Linux Farm
• 36 nodes; ran from 1998-2001.
• 32 hardware failures: 25 system disks, six power supplies, one memory.
• These nodes had only one disk, used for system, staging, swap, everything, and they swapped heavily due to low memory.
• Failures correlated with power outages.
• Rate: 0.024 failures/machine-month.
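The quoted rate is just failures divided by machine-months. A sketch, assuming roughly 37 months of service (the exact service period is not stated on the slide):

```python
def failure_rate(failures, machines, months):
    """Failures per machine-month."""
    return failures / (machines * months)

# First Linux farm: 32 failures on 36 nodes over an assumed ~37 months.
print(round(failure_rate(32, 36, 37), 3))  # 0.024, matching the quoted rate
```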
Mini-tower Farms, 1999
• 150 nodes, organized into 3 farms of 50: CDF, D0, and Fixed Target.
• Bought Sep 1999; just out of warranty as of Sep 2002.
• 140 nodes are still in the farm; statistics are based on them.
• 3 disks in each: one system and 2 data.
Mini-tower Farms, cont’d
• Fixed Target: 50 nodes, only 5 service calls over 3 years.
  – 1 memory problem, 1 bad data disk, 3 bad motherboards (one caused by a failed BIOS upgrade).
• CDF: 50 nodes, 19 service calls over 3 years.
  – 5 system disks, 2 power supplies, 9 data disks, 2 motherboards, 1 CPU.
• D0: 40 nodes, 18 service calls over 3 years.
  – 9 system disks, 2 power supplies, 3 data disks, 3 motherboards, 1 network card.
Failures per Month by Cluster

[Chart: failures per month, Sep 1999 - Sep 2002, for fnpc201-250, fncdf1-50, and fnd01-40.]

Frequency Distribution of Failures

[Chart: for each cluster, the number of months with 0-6 failures, for fnpc1-37, fnpc201-250, fncdf1-50, and fnd01-40.]
Analysis
• Four different failure rates:
  – Old farm: 0.024 failures/machine-month
  – FT farm: 0.0028 +/- 0.0012 failures/machine-month
  – CDF: 0.0083 +/- 0.0021 failures/machine-month
  – D0: 0.0130 +/- 0.0044 failures/machine-month
• Statistical analysis reveals that the distributions are not statistically consistent with each other, and also not Poisson.
• CDF and D0 are identical hardware in the same computer room.
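A Poisson-consistency check like the one mentioned above can be sketched as a chi-square goodness-of-fit test on a cluster’s monthly failure counts. The counts below are illustrative stand-ins, not the real cluster data, and the slide does not say which test was actually used:

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Illustrative monthly failure counts for one cluster (not the real data).
counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0]
lam = sum(counts) / len(counts)  # maximum-likelihood Poisson rate

# Chi-square statistic comparing the observed number of months with each
# failure count k against the Poisson expectation.
chi2 = 0.0
for k in range(max(counts) + 1):
    observed = counts.count(k)
    expected = len(counts) * poisson_pmf(k, lam)
    chi2 += (observed - expected) ** 2 / expected
print(chi2)
```

With expected counts this small the chi-square approximation is crude; an exact or better-suited test would be preferable on real data of this size.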
Analysis, Continued
• The failure rate could depend on any of the following:
  – Frequency of use (the D0 farm is typically loaded > 98%, others less)
  – Vigilance of system administrators in finding and addressing hardware errors
  – Phase of the moon
  – Dependability of the hardware
  – Cooling efficiency
Residual Value
• The latest farm purchase got us 2 Fermi Cycles per dollar.
• The residual value of the 140 nodes bought in 1999 is $70K: they could be replaced with 40 of the nodes we are buying today.
• Cost of electricity = 180 W * 150 machines * 26,280 hrs * $0.047/kWh = $33.3K.
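The electricity figure in the last bullet checks out arithmetically:

```python
# 180 W per machine, 150 machines, 3 years (26,280 hours), $0.047/kWh.
power_kw = 0.180
machines = 150
hours = 26280
dollars_per_kwh = 0.047

cost = power_kw * machines * hours * dollars_per_kwh
print(round(cost))  # 33349, i.e. the ~$33.3K quoted above
```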
Total Cost of Ownership
• Depreciation: $339K
• Maintenance: $20K (estimate)
• Electricity: $33K (estimate)
• Memory upgrades: $23K
• Total: $415K
• Personnel: 2 FTE * 3 years; how much?
• (Doesn’t count developer time or user time.)
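The line items above do sum to the quoted total (in $K, excluding the unquantified personnel cost):

```python
# Total cost of ownership line items, in thousands of dollars.
tco_thousands = {
    "depreciation": 339,
    "maintenance (estimate)": 20,
    "electricity (estimate)": 33,
    "memory upgrades": 23,
}
print(sum(tco_thousands.values()))  # 415, matching the $415K total
```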
Lessons Learned
• Hitech has been out of business for more than a year.
• Decision One was still able to get replacement parts from component vendors, at least for processors and disk drives.
• Decision One identified a replacement motherboard, since the original one is no longer manufactured.
• Conclusion: we can survive if a vendor doesn’t stay in business for the length of the 3-year warranty.
Cost Forecast for 2U Units
• Maintenance costs will be higher:
  – We have already racked up $10K of maintenance in 1.5 years of deployment on 64 CDF nodes, for example.
  – Dominated by memory upgrades and disk swaps.
2U Intel Boards
• 50 2U nodes for D0, bought Sep ’00.
• 9 power supplies replaced during burn-in.
• Since then: 1 system disk, 2 power supplies, 6 memory, 4 data disks, 6 motherboards, 1 network card.
• Four nodes have been to the shop > 3 times.
• 0.016 failures/machine-month.
• 23 nodes for CDF, bought Jan ’01.
• 1 system disk, 11 power supplies, 1 data disk, 1 network card so far.
• 0.031 failures/machine-month.
2U Supermicro Boards
• 64 nodes for CDF, bought Jun ’01.
• 10 system disks, 2 data disks, 3 motherboards, 1 floppy, 2 batteries.
• (Not to mention a total swap of the system disks, twice.)
• 0.010 failures/machine-month.
• 40 nodes for FT, bought Jun ’01.
• Only 1 problem so far: memory.
• 0.002 failures/machine-month.
• Identical hardware in the 2 groups, but the failure rate differs by a factor of five!
2U Tyan Boards
• 32 bought for D0, arrived Dec 28, 2001 (after being sent back for new motherboards and cases).
• 3 hardware calls so far, all system disks.
• 0.003 failures/machine-month.
• 16 bought for KTeV, arrived March ’02.
• 1 hardware call so far: a data disk.
• 0.009 failures/machine-month.
• 32 bought for CDF, arrived April ’02.
• 2 hardware calls so far: a system disk and a CPU.
Summary

FARM NAME             Failures/machine-month
fnpc1-37 (old)        0.024
fncdf1-50             0.010
fnd01-40              0.012
fnpc201-250           0.003
fnd051-100            0.015
fncdf51-73            0.027
fncdf91-154           0.018
fnpc51-90             0.002
fnd0101-132           0.009
fnpc1-16              0.008
fncdf75-90,155-170    0.009
Hardware Errors by Type

[Pie chart: hardware errors by type, with categories SYSDISK, PS, MEM, DATA, MB, OTH; the legible slice values are 55, 239, 21, and 177, but their assignment to categories is not recoverable from this transcript.]
Conclusions Thus Far
• We now format a disk and check it for bad blocks before placing a service call to replace it; this can often rescue a disk.
• At the moment, software-related hangs are a much greater problem than hardware errors, and more time-consuming to diagnose.
• With 750 machines and 0.01 failures/machine-month, we can expect 8 hardware failures per month.
• GRAND TOTAL: 10,692 machine-months so far, 0.0122 failures per machine-month.
• Machines currently running are averaging 0.0105 failures per machine-month.
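Two of the numbers above can be cross-checked; note that the ~130 total failures is an inference from the quoted rate, not a figure stated on the slide:

```python
# Grand total: 10,692 machine-months at 0.0122 failures per machine-month.
implied_failures = 10692 * 0.0122
print(round(implied_failures))  # 130 hardware failures to date (inferred)

# Forecast at today's scale: 750 machines at 0.01 failures/machine-month.
print(750 * 0.01)  # 7.5, which the slide rounds up to 8 per month
```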
Cluster Errors Over Time

[Chart: errors per cluster over time, Jul 1998 - Nov 2002, for fncdf75-170, fnpc1-16, fnd0101-132, fnpc51-90, fncdf91-154, fncdf51-73, fnd051-100, fnd01-40, fncdf1-50, fnpc201-250, and fnpc1-37.]
Machines vs. Time

[Chart: number of machines (0-700) vs. time, Jul 1998 - Nov 2002.]

Failures per Machine

[Chart: failures per machine (0-0.03) over time, Jul 1998 - Jul 2002.]