Operational Machines: ASCI White
Presented to SOS7
Mark Seager
[email protected]
925-423-3141
ICCD ADH for Advanced Technology
Lawrence Livermore National Laboratory
This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.
Q1: Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?
• ASCI White is providing robust cycles to the tri-laboratory community
• Application performance relative to peak is less than expected
  – FMA is only 5-16% of floating point arithmetic instructions issued
  – Some users sacrifice (turn off) compiler optimization to have strict reproducibility
• Modern coding techniques lead to poor memory bandwidth utilization
  – Low cache-line payload utilization
  – OOP and non-uniform grids require several memory references per floating point operation
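The effect of the FMA share and of cache-line payload can be sketched with a little arithmetic. This is a rough illustration, not a measurement from the machine: it assumes peak is counted with every floating point instruction being an FMA worth two operations, and the 128-byte line size and 8-byte payload are assumed values chosen for the example. The function names are hypothetical.

```python
def peak_fraction_bound(fma_fraction):
    """Upper bound on the fraction of peak set by instruction mix alone.

    Assumption: peak counts every FP instruction as an FMA (2 flops);
    a non-FMA FP instruction delivers 1 flop.
    """
    flops_per_instruction = 2 * fma_fraction + (1 - fma_fraction)
    return flops_per_instruction / 2


def cache_line_payload(bytes_used, line_bytes):
    """Fraction of a fetched cache line that the code actually uses."""
    return bytes_used / line_bytes


# FMA shares quoted on the slide: 5% and 16% of FP instructions issued.
for share in (0.05, 0.16):
    bound = peak_fraction_bound(share)
    print(f"FMA share {share:.0%}: at most {bound:.1%} of peak")

# Assumed example: one 8-byte double used from an assumed 128-byte line.
print(f"payload utilization: {cache_line_payload(8, 128):.2%}")
```

Under these assumptions the instruction mix alone caps a code at a little over half of peak, before any memory bandwidth effects are counted.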
Q2: What is the MTBI?
[Figure: MTBF, 1/5/2001 through 2/5/2003. Axes: Hours (White, Frost, Ice), 0 to 180; Hr/Node, 0 to 120,000. Series: MTBF (hr), MTBF (hr/node), Linear (MTBF (hr/node)). Trend line: y = 30.629x - 1E+06, R² = 0.2034.]
NH-2 MTBF is about 26,000 hr/node, or 51 hours for White. Typical applications (of 1/3 machine size or smaller) run for weeks at a time.
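The relation between the per-node and whole-machine figures can be sketched as follows. The slide does not state a node count; 512 nodes is an assumption, chosen because it is consistent with the ratio of the two quoted numbers. The survival estimate further assumes independent, exponentially distributed node failures, which is a modeling simplification and not something the slide claims.

```python
import math

NODE_MTBF_HOURS = 26_000  # per-node MTBF quoted on the slide
ASSUMED_NODES = 512       # assumption: node count is not given on the slide


def system_mtbf(node_mtbf_hours, nodes):
    """MTBF of a system built from independent, identical nodes."""
    return node_mtbf_hours / nodes


def survival_probability(run_hours, node_mtbf_hours, nodes_used):
    """Chance that no node in the job fails during the run.

    Assumption: failures are independent and exponentially distributed.
    """
    return math.exp(-run_hours * nodes_used / node_mtbf_hours)


white_mtbf = system_mtbf(NODE_MTBF_HOURS, ASSUMED_NODES)
third_of_machine = ASSUMED_NODES // 3
one_week_hours = 7 * 24
p_week = survival_probability(one_week_hours, NODE_MTBF_HOURS, third_of_machine)

print(f"whole-machine MTBF: about {white_mtbf:.0f} hours")
print(f"1/3-machine job, one week without a node failure: {p_week:.0%}")
```

Under these assumptions, a job on a third of the machine is more likely than not to see a node failure within a week, which is why multi-week runs depend on checkpoint and restart.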
What are the topmost reasons for HW interrupts?
[Figure: Count of Id (0 to 300) by Type/Subtype for White, weeks ending 7/26/2002 through 2/21/2003. Subtypes: HW-CPU, HW-IO, HW-LOCAL_DISK, HW-MEMORY, HW-MOTHERBOARD, HW-NODE_SWAP, HW-OTHER, HW-POWER_SUPPLY, HW-SSA_ADAPTER, HW-SSA_DISK, HW-SSA_OTHER, HW-SWITCH, HW-COLONY_ADAPT, HW-RIO, HW-POWER, HW-SSA_DISK-Hot, HW-DATARAM_MEMO, HW-3RD_PARTY_DI.]
What are the topmost reasons for SW interrupts?
[Figure: Count of Id (0 to 120) by Type/Subtype for White, weeks ending 7/26/2002 through 2/21/2003. SW subtypes: SW-COMM_SS, SW-GPFS, SW-LL_POE, SW-OS, SW-OTHER, SW-COMPILER. Local subtypes: LOCAL-DPCS, LOCAL-NETWORK, LOCAL-OTHER, LOCAL-POWER.]
What is the average utilization rate?
[Figure: Utilization (percent), 0 to 100, by month from Dec-96 through Feb-99. Series: Blue, Sky, White, Frost, Ice.]
Q3: What is the primary complaint, if any, from the users?
• Not enough time on the machine
  – Users want more access
• Scalability of MPI
  – MPI_ALLREDUCE
  – MPI_BARRIER
• Extremely long job startup