[ppt]operational machines: asci white - sandia national...

Post on 12-May-2018


Operational Machines: ASCI White

Presented to SOS7

Mark Seager
seager@llnl.gov
925-423-3141

ICCD ADH for Advanced Technology
Lawrence Livermore National Laboratory

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

Q1: Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?

• ASCI White is providing robust cycles to the tri-laboratory community
• Application performance relative to peak is less than expected
  – FMA is only 5-16% of floating point arithmetic instructions issued
  – Some users sacrifice (turn off) compiler optimization to have strict reproducibility
• Modern coding techniques lead to poor memory bandwidth utilization
  – Low cache-line payload utilization
  – OOP and non-uniform grids require several memory references per floating point operation
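The two effects named in the bullets above can be put into rough numbers. The following is a minimal sketch, not taken from the slides: it assumes peak is rated with every floating point instruction being an FMA (two flops each), and it assumes a 128-byte cache line with one 8-byte double used per line. The function names and both of those figures are illustrative assumptions.

```python
def fma_peak_fraction(fma_share):
    """Upper bound on fraction of peak when only `fma_share` of the
    floating point instructions are FMAs (2 flops) and the rest are
    single-flop operations. Assumes peak counts 2 flops per instruction."""
    if not 0.0 <= fma_share <= 1.0:
        raise ValueError("fma_share must be between 0 and 1")
    return (1.0 + fma_share) / 2.0


def cache_line_payload(bytes_used, line_bytes):
    """Fraction of each fetched cache line that the code actually uses."""
    if line_bytes <= 0 or not 0 <= bytes_used <= line_bytes:
        raise ValueError("need 0 <= bytes_used <= line_bytes and line_bytes > 0")
    return bytes_used / line_bytes


# FMA share range quoted on the slide: 5% to 16%.
low = fma_peak_fraction(0.05)
high = fma_peak_fraction(0.16)

# Assumption (not from the slides): 128-byte lines, one double used per line.
ASSUMED_LINE_BYTES = 128
payload = cache_line_payload(8, ASSUMED_LINE_BYTES)

print(f"FMA-limited ceiling: {low:.1%} to {high:.1%} of peak")
print(f"Cache-line payload: {payload:.2%}")
```

Under these assumptions the FMA mix alone caps achievable performance at a little over half of peak, before any memory bandwidth effects are counted.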

Q2: What is the MTBI?

[Figure: MTBF. Monthly data from 1/5/2001 to 2/5/2003. Left axis: Hours (White, Frost, Ice), 0 to 180. Right axis: Hr/Node, 0 to 120000. Series: MTBF (hr), MTBF (hr/node), and Linear (MTBF (hr/node)) with trend line y = 30.629x - 1E+06, R2 = 0.2034.]

NH-2 MTBF is about 26,000 hr/node, or 51 hours for White. Typical applications (of 1/3 machine size or smaller) run for weeks at a time.
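The step from 26,000 hr/node to 51 hours for the whole machine can be reproduced with a short calculation. This is a minimal sketch assuming independent node failures at a constant rate and a node count of 512; the node count is not stated on the slide and is an assumption here, as are the function name and the two-week job length.

```python
def system_mtbf(node_mtbf_hours, node_count):
    """MTBF of a group of nodes, assuming independent failures at a
    constant per-node rate: the group fails node_count times as often."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


NODE_MTBF_HOURS = 26_000      # from the slide
ASSUMED_NODE_COUNT = 512      # assumption, not stated on the slide

full_machine = system_mtbf(NODE_MTBF_HOURS, ASSUMED_NODE_COUNT)
third_machine = system_mtbf(NODE_MTBF_HOURS, ASSUMED_NODE_COUNT / 3)

# Hypothetical two-week job on one third of the machine.
job_hours = 14 * 24
expected_interrupts = job_hours / third_machine

print(f"Full machine MTBF: {full_machine:.0f} hours")
print(f"One-third machine MTBF: {third_machine:.0f} hours")
print(f"Expected hardware interrupts in a two-week job: {expected_interrupts:.1f}")
```

With these assumptions, a job on one third of the machine sees a failure roughly every six days, so a run lasting weeks should expect to be interrupted more than once.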

What are the topmost reasons for HW interrupts?

[Figure: Count of Id (0 to 300) by Type and Subtype, for Sect = whit, Imp = yes, stacked by week ending 7/26/2002 through 2/21/2003. Type: HW. Subtypes: HW-CPU, HW-IO, HW-LOCAL_DISK, HW-MEMORY, HW-MOTHERBOARD, HW-NODE_SWAP, HW-OTHER, HW-POWER_SUPPLY, HW-SSA_ADAPTER, HW-SSA_DISK, HW-SSA_OTHER, HW-SWITCH, HW-COLONY_ADAPT, HW-RIO, HW-POWER, HW-SSA_DISK-Hot, HW-DATARAM_MEMO, HW-3RD_PARTY_DI.]

What are the topmost reasons for SW interrupts?

[Figure: Count of Id (0 to 120) by Type and Subtype, for Sect = whit, Imp = yes, stacked by week ending 7/26/2002 through 2/21/2003. Types: SW, zLocal. Subtypes: SW-COMM_SS, SW-GPFS, SW-LL_POE, SW-OS, SW-OTHER, SW-COMPILER, LOCAL-DPCS, LOCAL-NETWORK, LOCAL-OTHER, LOCAL-POWER.]

What is the average utilization rate?

[Figure: Utilization (percent), 0 to 100, by month from Dec-96 to Feb-99. Series: Blue, Sky, White, Frost, Ice.]

Q3: What is the primary complaint, if any, from the users?

• Not enough time on the machine
  – Users want more access
• Scalability of MPI
  – MPI_ALLREDUCE
  – MPI_BARRIER

• Extremely long job startup
