Presentation Outline

A word or two about our program

Our HPC system acquisition process

Program benchmark suite

Evolution of benchmark-based performance metrics

Where do we go from here?

HPC Modernization Program

HPC Modernization Program Goals

DoD HPC Modernization Program

HPCMP Serves a Large, Diverse DoD User Community

519 projects and 4,086 users at approximately 130 sites

Requirements categorized in 10 Computational Technology Areas (CTA)

FY08 non-real-time requirements of 1,108 Habu-equivalents

156 users are self-characterized as “Other”

Computational Structural Mechanics – 437 Users

Electronics, Networking, and Systems/C4I – 114 Users

Computational Chemistry, Biology & Materials Science – 408 Users

Computational Electromagnetics & Acoustics – 337 Users

Computational Fluid Dynamics – 1,572 Users

Environmental Quality Modeling & Simulation – 147 Users

Signal/Image Processing – 353 Users

Integrated Modeling & Test Environments – 139 Users

Climate/Weather/Ocean Modeling & Simulation – 241 Users

Forces Modeling & Simulation – 182 Users

High Performance Computing Centers

4 Allocated Distributed Centers

Strategic Consolidation of Resources

4 Major Shared Resource Centers

HPCMP Center Resources

[Figure: HPCMP center resources, 1993 and 2007. Legend: MSRCs, ADCs (DCs).]

Total HPCMP End-of-Year Computational Capabilities

[Chart: capability in Habus by fiscal year (TI-XX), FY 01 through FY 07, for MSRCs and ADCs.]

[Chart: peak GFLOPS by fiscal year, 1993 through 2007, for MSRCs and ADCs.]

Note: Computational capability reflects available GFLOPS during fiscal year

HPC Modernization Program (MSRCs)

HPC Center, System, and Processors (as of August 2007)

Army Research Laboratory (ARL)

– Linux Networx Cluster: 256 PEs

– Linux Networx Cluster: 2,100 PEs

– IBM Opteron Cluster (C): 2,372 PEs

– SGI Altix Cluster (C): 256 PEs

– Linux Networx Cluster: 4,528 PEs

– Linux Networx Cluster (C): 3,464 PEs

Aeronautical Systems Center (ASC)

– SGI Origin 3900: 2,048 PEs

– SGI Origin 3900 (C): 128 PEs

– IBM P4 (C): 32 PEs

– SGI Altix Cluster: 2,048 PEs

– HP Opteron: 2,048 PEs

– SGI Altix: 9,216 PEs

Engineer Research and Development Center (ERDC)

– SGI Origin 3900: 1,024 PEs

– Cray XT3 (FY 07 upgrade): 8,192 PEs

– Cray XT4: 8,848 PEs

Naval Oceanographic Office (NAVO)

– IBM P4+: 3,456 PEs

– IBM 1600 P5 Cluster: 3,072 PEs

– IBM 1600 P5 Cluster (C): 1,920 PEs

Legend: FY03, FY04, FY05, FY06, FY07

HPC Modernization Program (ADCs)

HPC Center, System, and Processors (as of August 2007)

Army High Performance Computing Research Center (AHPCRC)

– Cray X1E: 1,024 PEs

– Cray XT3: 1,128 PEs

Arctic Region Supercomputing Center (ARSC)

– IBM Regatta P4: 800 PEs

– Sun x4600: 2,312 PEs

Maui High Performance Computing Center (MHPCC)

– Dell PowerEdge 1955: 5,120 PEs

Space & Missile Defense Command (SMDC)

– SGI Origin 3000: 736 PEs

– SGI Altix: 128 PEs

– West Scientific Cluster: 64 PEs

– IBM e1300 Cluster: 256 PEs

– IBM Regatta P4: 32 PEs

– Cray X1E: 128 PEs

– Atipa Linux Cluster: 256 PEs

– IBM Xeon Cluster: 128 PEs

– Cray XD1: 288 PEs

Legend: FY03, FY04, FY05, FY06

Overview of TI-XX Acquisition Process

Determination of Requirements, Usage, and Allocations

Choose application benchmarks, test cases, and weights

Vendors provide measured and projected times on offered systems

Measure benchmark times on DoD standard system

Measure benchmark times on existing DoD systems

Determine performance for each offered system on each application test case

Determine performance for each existing system on each application test case

Determine performance for each offered system

Usability/past performance information on offered systems

Collective Acquisition Decision

Use optimizer to determine price/performance for each offered system and combination of systems

Center facility requirements

Vendor pricing

Life-cycle costs for offered systems
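The slides do not say how the optimizer works. As a rough illustration only, the sketch below enumerates combinations of offered systems and picks the affordable one with the highest total score. The function name `best_combination`, the `(name, life_cycle_cost, score)` tuple layout, the single budget limit, and the assumption that scores add across systems are all hypothetical and are not taken from the presentation.

```python
from itertools import combinations


def best_combination(offers, budget):
    """Pick the affordable subset of offers with the highest total score.

    offers: list of (name, life_cycle_cost, score) tuples (hypothetical layout).
    Ties on total score are broken in favor of the lower total cost.
    Returns (names, total_cost, total_score).
    """
    best = ((), 0.0, 0.0)
    for r in range(1, len(offers) + 1):
        for combo in combinations(offers, r):
            cost = sum(o[1] for o in combo)
            total = sum(o[2] for o in combo)  # assumes scores are additive
            if cost <= budget and (total, -cost) > (best[2], -best[1]):
                best = (tuple(o[0] for o in combo), cost, total)
    return best


if __name__ == "__main__":
    # Illustrative numbers only, not real vendor data.
    offers = [("A", 10.0, 3.0), ("B", 6.0, 2.0), ("C", 5.0, 2.0)]
    names, cost, total = best_combination(offers, budget=11.0)
    print(names, cost, total, "cost per score unit:", cost / total)
```

Exhaustive enumeration is only practical for a small number of offers; a real acquisition would also have to fold in facility requirements and usability information, which this sketch ignores.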

TI-08 Synthetic Test Suite

CPUBench – Floating point execution rate

ICBench – Interconnect bandwidth and latency

LANBench – External network interface and connection bandwidth

MEMBench – Memory bandwidth (MultiMAPS)

OSBench – Operating system noise (PSNAP from LANL)

SPIOBench – Streaming parallel I/O bandwidth

TI-08 Application Benchmark Codes

AMR – Gas dynamics code

– (C++/FORTRAN, MPI, 40,000 SLOC)

AVUS (Cobalt-60) – Turbulent flow CFD code

– (Fortran, MPI, 19,000 SLOC)

CTH – Shock physics code

– (~43% Fortran/~57% C, MPI, 436,000 SLOC)

GAMESS – Quantum chemistry code


HYCOM – Ocean circulation modeling code


ICEPIC – Particle-in-cell magnetohydrodynamics code

– (C, MPI, 60,000 SLOC)

LAMMPS – Molecular dynamics code

– (C++, MPI, 45,400 SLOC)

OOCore – Out-of-core solver mimicking electromagnetics code


Overflow2 – CFD code originally developed by NASA


WRF – Multi-Agency mesoscale atmospheric modeling code

– (Fortran and C, MPI, 100,000 SLOC)

Application Benchmark History

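Benchmark codes by Computational Technology Area, FY 2003 through FY 2008

Computational Structural Mechanics

– FY 2003: CTH

– FY 2004: CTH

– FY 2005: RFCTH

– FY 2006: RFCTH

– FY 2007: CTH

– FY 2008: CTH

Computational Fluid Dynamics

– FY 2003: Cobalt60, LESLIE3D, Aero

– FY 2004: Cobalt60, Aero

– FY 2005: AVUS, Overflow2, Aero

– FY 2006: AVUS, Overflow2, Aero

– FY 2007: AVUS, Overflow2

– FY 2008: AVUS, Overflow2, AMR

Computational Chemistry, Biology, and Materials Science

– FY 2003: GAMESS, NAMD

– FY 2004: GAMESS, NAMD

– FY 2005: GAMESS

– FY 2006: GAMESS, LAMMPS

– FY 2007: GAMESS, LAMMPS

– FY 2008: GAMESS, LAMMPS

Computational Electromagnetics and Acoustics (five entries, in source order)

– OOCore

– OOCore

– OOCore

– OOCore, ICEPIC

– OOCore, ICEPIC

Climate/Weather/Ocean Modeling and Simulation

– FY 2003: NLOM

– FY 2004: HYCOM

– FY 2005: HYCOM, WRF

– FY 2006: HYCOM, WRF

– FY 2007: HYCOM, WRF

– FY 2008: HYCOM, WRF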

Determination of Performance

Establish a DoD standard benchmark time for each application benchmark case

– ERDC Cray dual-core XT3 (Sapphire) chosen as the DoD standard system

– Standard benchmark times on the DoD standard system are measured at 128 processors for standard test cases and 512 processors for large test cases

– Split in weight between standard and large application test cases will be made at 256 processors

Benchmark timings (at least four on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)

Benchmark timings may be extrapolated provided they are guaranteed, but at least two actual timings must be provided for each test case

Determination of Performance (cont.)

Curve fit: Time = A/N + B + C*N

– N = number of processing cores

– A/N = time for parallel portion of code (|| base)

– B = time for serial portion of code

– C*N = parallel penalty (|| overhead)

Constraints

– A/N ≥ 0: parallel base time is non-negative.

– Tmin ≥ B ≥ 0: serial time is non-negative and is not greater than the minimum observed time.


Curve fit approach

– For each value of B (Tmin ≥ B ≥ 0):

    Determine A: Time – B = A/N

    Determine C: Time – (A/N + B) = C*N

    Calculate fit quality:

$$\text{Fit Quality} = \frac{1.0}{\sum_{i=1}^{M} \left( T_i - \left( A/N_i + B + C \cdot N_i \right) \right)^2}$$

    where (N_i, T_i) = time T_i observed at N_i cores, and M = number of observed core counts

– Select the value of B with largest fit quality
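The scan over B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's actual fitting code: A and C are each obtained by a one-parameter least-squares fit, C is clamped at zero, fit quality is taken as the reciprocal of the summed squared residuals, and B is sampled on a uniform grid. The function name `fit_scaling_curve` and the `steps` parameter are hypothetical.

```python
def fit_scaling_curve(cores, times, steps=1000):
    """Fit Time = A/N + B + C*N by scanning B over [0, Tmin].

    cores, times: observed core counts N_i and times T_i.
    steps: number of grid intervals for B (an assumption of this sketch).
    Returns (quality, A, B, C) for the B with the largest fit quality.
    """
    t_min = min(times)
    inv_sq = sum(1.0 / n**2 for n in cores)
    n_sq = sum(float(n) ** 2 for n in cores)
    best = None
    for k in range(steps + 1):
        b = t_min * k / steps
        # Least-squares A for (T - B) = A/N, clamped so that A/N >= 0.
        a = max(sum((t - b) / n for n, t in zip(cores, times)) / inv_sq, 0.0)
        # Least-squares C for the remaining residual = C*N.
        # Clamping C at zero is an assumption of this sketch.
        c = max(sum((t - a / n - b) * n for n, t in zip(cores, times)) / n_sq, 0.0)
        sse = sum((t - (a / n + b + c * n)) ** 2 for n, t in zip(cores, times))
        quality = 1.0 / sse if sse > 0 else float("inf")
        if best is None or quality > best[0]:
            best = (quality, a, b, c)
    return best


if __name__ == "__main__":
    # Synthetic timings generated from A = 1000, B = 5, C = 0 (not measured data).
    cores = [64, 128, 256, 512]
    times = [1000.0 / n + 5.0 for n in cores]
    quality, a, b, c = fit_scaling_curve(cores, times)
    print(f"A={a:.2f} B={b:.3f} C={c:.6f}")
```

Because A and C are fitted one after the other rather than jointly, the result depends on the B grid; a finer grid trades run time for a closer match.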


Calculate score (in DoD standard system equivalents)

– C = number of compute cores in target system

– Cbase = number of compute cores in standard system

– Sbase = number of compute cores in standard execution

– STM = size-to-match = number of compute cores of target system required to match performance of Sbase cores of the standard system

$$\text{Score} = \frac{C}{C_{base}} \times \frac{S_{base}}{STM}$$
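As a worked illustration, the sketch below derives a size-to-match from a fitted curve Time = A/N + B + C*N and converts it into a score, assuming the score is (C / Cbase) × (Sbase / STM). The function names, the integer search for STM, and all numbers in the example are hypothetical and are not taken from the presentation.

```python
def size_to_match(a, b, c, t_standard, max_cores):
    """Smallest core count N with fitted time A/N + B + C*N <= t_standard.

    Returns None if no N up to max_cores matches the standard time.
    """
    for n in range(1, max_cores + 1):
        if a / n + b + c * n <= t_standard:
            return n
    return None


def score(target_cores, c_base, s_base, stm):
    """Score in standard-system equivalents: (C / Cbase) * (Sbase / STM)."""
    return (target_cores / c_base) * (s_base / stm)


if __name__ == "__main__":
    # Illustrative values only: a target system with 4,096 cores compared
    # against a standard system of 4,096 cores timed at 128 cores.
    stm = size_to_match(a=1000.0, b=5.0, c=0.0, t_standard=20.625, max_cores=4096)
    print("STM:", stm, "score:", score(4096, 4096, 128, stm))
```

A target that needs fewer cores than Sbase to match the standard time scores above its raw core ratio, which is the intent of expressing results in standard-system equivalents.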

AMR Large Test Case on HP Opteron Cluster

AMR Large Test Case on SGI Altix

AMR Large Test Case on Dell Xeon Cluster

Overflow-2 Standard Test Case on Dell Xeon Cluster

Overflow-2 Large Test Case on IBM P5+

ICEPIC Standard Test Case on SGI Altix

ICEPIC Large Test Case on SGI Altix

Comparison of HPCMP System Capabilities: FY 2003 - FY 2008

What’s Next?

Continue to evolve application benchmarks to represent accurately the HPCMP computational workload

Increase profiling and performance modeling to understand application performance better

Use performance predictions to supplement application benchmark measurements and guide vendors in designing more efficient systems
