Presentation Outline
A word or two about our program
Our HPC system acquisition process
Program benchmark suite
Evolution of benchmark-based performance metrics
Where do we go from here?
HPC Modernization Program
HPC Modernization Program Goals
DoD HPC Modernization Program
HPCMP Serves a Large, Diverse DoD User Community
519 projects and 4,086 users at approximately 130 sites
Requirements categorized in 10 Computational Technology Areas (CTA)
FY08 non-real-time requirements of 1,108 Habu-equivalents
156 users are self-characterized as “Other”
Computational Structural Mechanics – 437 Users
Electronics, Networking, and Systems/C4I – 114 Users
Computational Chemistry, Biology & Materials Science – 408 Users
Computational Electromagnetics & Acoustics – 337 Users
Computational Fluid Dynamics – 1,572 Users
Environmental Quality Modeling & Simulation – 147 Users
Signal/Image Processing – 353 Users
Integrated Modeling & Test Environments – 139 Users
Climate/Weather/Ocean Modeling & Simulation – 241 Users
Forces Modeling & Simulation – 182 Users
High Performance Computing Centers
4 Allocated Distributed Centers
Strategic Consolidation of Resources
4 Major Shared Resource Centers
Total HPCMP End-of-Year Computational Capabilities
HPCMP Center Resources
[Figure panels labeled 1993 and 2007; legend: MSRCs, ADCs (DCs)]
[Chart: Habus by Fiscal Year (TI-XX), FY 01 through FY 07; series: MSRCs, ADCs]
[Chart: Peak GFs by Fiscal Year, 1993 through 2007; series: MSRCs, ADCs]
Note: Computational capability reflects available GFLOPS during fiscal year
HPC Modernization Program (MSRCs)
As of: August 2007
HPC Center: System – Processors
Army Research Laboratory (ARL)
– Linux Networx Cluster: 256 PEs
– Linux Networx Cluster: 2,100 PEs
– IBM Opteron Cluster (C): 2,372 PEs
– SGI Altix Cluster (C): 256 PEs
– Linux Networx Cluster: 4,528 PEs
– Linux Networx Cluster (C): 3,464 PEs
Aeronautical Systems Center (ASC)
– SGI Origin 3900: 2,048 PEs
– SGI Origin 3900 (C): 128 PEs
– IBM P4 (C): 32 PEs
– SGI Altix Cluster: 2,048 PEs
– HP Opteron: 2,048 PEs
– SGI Altix: 9,216 PEs
Engineer Research and Development Center (ERDC)
– SGI Origin 3900: 1,024 PEs
– Cray XT3 (FY 07 upgrade): 8,192 PEs
– Cray XT4: 8,848 PEs
Naval Oceanographic Office (NAVO)
– IBM P4+: 3,456 PEs
– IBM 1600 P5 Cluster: 3,072 PEs
– IBM 1600 P5 Cluster (C): 1,920 PEs
Fiscal-year key: FY03, FY04, FY05, FY06, FY07
HPC Modernization Program (ADCs)
As of: August 2007
HPC Center: System – Processors
Army High Performance Computing Research Center (AHPCRC)
– Cray X1E: 1,024 PEs
– Cray XT3: 1,128 PEs
Arctic Region Supercomputing Center (ARSC)
– IBM Regatta P4: 800 PEs
– Sun x4600: 2,312 PEs
Maui High Performance Computing Center (MHPCC)
– Dell PowerEdge 1955: 5,120 PEs
Space & Missile Defense Command (SMDC)
– SGI Origin 3000: 736 PEs
– SGI Altix: 128 PEs
– West Scientific Cluster: 64 PEs
– IBM e1300 Cluster: 256 PEs
– IBM Regatta P4: 32 PEs
– Cray X1E: 128 PEs
– Atipa Linux Cluster: 256 PEs
– IBM Xeon Cluster: 128 PEs
– Cray XD1: 288 PEs
Fiscal-year key: FY03, FY04, FY05, FY06
Overview of TI-XX Acquisition Process
Determination of Requirements, Usage, and Allocations
Choose application benchmarks, test cases, and weights
Vendors provide measured and projected times on offered systems
Measure benchmark times on DoD standard system
Measure benchmark times on existing DoD systems
Determine performance for each offered system on each application test case
Determine performance for each existing system on each application test case
Determine performance for each offered system
Usability/past performance information on offered systems
Collective Acquisition Decision
Use optimizer to determine price/performance for each offered system and combination of systems
Center facility requirements
Vendor pricing
Life-cycle costs for offered systems
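The slides do not say how the optimizer works. As a rough illustration only, the sketch below does a brute-force search over combinations of offered systems, assuming each offer has a single life-cycle cost and a single performance score, that scores add across systems, and that there is a fixed budget. The function name `best_combination`, the tuple layout, and the budget constraint are all hypothetical; the actual process also weighs usability, past performance, and center facility requirements.

```python
from itertools import combinations


def best_combination(offers, budget):
    """Return (total_score, names) for the highest-scoring affordable subset.

    offers: list of (name, life_cycle_cost, score) tuples (hypothetical layout).
    budget: maximum total life-cycle cost (assumed constraint).
    """
    best = (0.0, ())
    for r in range(1, len(offers) + 1):
        for combo in combinations(offers, r):
            cost = sum(o[1] for o in combo)
            total = sum(o[2] for o in combo)  # assumes scores simply add
            if cost <= budget and total > best[0]:
                best = (total, tuple(o[0] for o in combo))
    return best


# Illustrative placeholder offers, not real vendor data.
example_offers = [("A", 10, 3.0), ("B", 7, 2.5), ("C", 5, 1.0)]
example_best = best_combination(example_offers, budget=12)
```

Exhaustive enumeration is only practical for a handful of offers; it is used here because it makes the price/performance trade-off easy to inspect.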
TI-08 Synthetic Test Suite
CPUBench – Floating point execution rate
ICBench – Interconnect bandwidth and latency
LANBench – External network interface and connection bandwidth
MEMBench – Memory bandwidth (MultiMAPS)
OSBench – Operating system noise (PSNAP from LANL)
SPIOBench – Streaming parallel I/O bandwidth
TI-08 Application Benchmark Codes
AMR – Gas dynamics code
– (C++/FORTRAN, MPI, 40,000 SLOC)
AVUS (Cobalt-60) – Turbulent flow CFD code
– (Fortran, MPI, 19,000 SLOC)
CTH – Shock physics code
– (~43% Fortran/~57% C, MPI, 436,000 SLOC)
GAMESS – Quantum chemistry code
– (Fortran, MPI, 330,000 SLOC)
HYCOM – Ocean circulation modeling code
– (Fortran, MPI, 31,000 SLOC)
ICEPIC – Particle-in-cell magnetohydrodynamics code
– (C, MPI, 60,000 SLOC)
LAMMPS – Molecular dynamics code
– (C++, MPI, 45,400 SLOC)
OOCore – Out-of-core solver mimicking electromagnetics code
– (Fortran, MPI, 39,000 SLOC)
Overflow2 – CFD code originally developed by NASA
– (Fortran, MPI, 83,600 SLOC)
WRF – Multi-Agency mesoscale atmospheric modeling code
– (Fortran and C, MPI, 100,000 SLOC)
Application Benchmark History
Computational Technology Area | FY 2003 | FY 2004 | FY 2005 | FY 2006 | FY 2007 | FY 2008
Computational Structural Mechanics | CTH | CTH | RFCTH | RFCTH | CTH | CTH
Computational Fluid Dynamics | Cobalt60, LESLIE3D, Aero | Cobalt60, Aero | AVUS, Overflow2, Aero | AVUS, Overflow2, Aero | AVUS, Overflow2 | AVUS, Overflow2, AMR
Computational Chemistry, Biology, and Materials Science | GAMESS, NAMD | GAMESS, NAMD | GAMESS | GAMESS, LAMMPS | GAMESS, LAMMPS | GAMESS, LAMMPS
Computational Electromagnetics and Acoustics | – | OOCore | OOCore | OOCore | OOCore, ICEPIC | OOCore, ICEPIC
Climate/Weather/Ocean Modeling and Simulation | NLOM | HYCOM | HYCOM, WRF | HYCOM, WRF | HYCOM, WRF | HYCOM, WRF
Determination of Performance
Establish a DoD standard benchmark time for each application benchmark case
– ERDC Cray dual-core XT3 (Sapphire) chosen as standard DoD system
– Standard benchmark times on DoD standard system measured at 128 processors for standard test cases and 512 processors for large test cases
– Split in weight between standard and large application test cases will be made at 256 processors
Benchmark timings (at least four on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)
Benchmark timings may be extrapolated provided they are guaranteed, but at least two actual timings must be provided for each test case
Determination of Performance (cont.)
Curve fit: Time = A/N + B + C*N
– N = number of processing cores
– A/N = time for parallel portion of code (|| base)
– B = time for serial portion of code
– C*N = parallel penalty (|| overhead)
Constraints
– A/N ≥ 0: parallel base time is non-negative.
– Tmin ≥ B ≥ 0: serial time is non-negative and is not greater than the minimum observed time.
Determination of Performance (cont.)
Curve fit approach
– For each value of B (Tmin ≥ B ≥ 0):
  Determine A: Time – B = A/N
  Determine C: Time – (A/N + B) = C*N
  Calculate fit quality:
$$\text{Fit Quality} = \frac{1.0}{\sum_{i=1}^{M} \left( T_i - (A/N_i + B + C \cdot N_i) \right)^2}$$
  where (Ni, Ti) = time Ti observed at Ni cores, and M = number of observed core counts
– Select the value of B with largest fit quality
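The scan over B can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual tool. It assumes that A and C are each found by one-parameter least squares, that B is sampled on a uniform grid of `steps` points, and that fit quality is the reciprocal of the sum of squared residuals of Time = A/N + B + C*N. The function name `fit_scaling_curve` and the grid resolution are hypothetical.

```python
def fit_scaling_curve(cores, times, steps=1000):
    """Fit Time = A/N + B + C*N by scanning B over [0, Tmin].

    Returns (quality, A, B, C) for the B value with the largest fit quality.
    """
    t_min = min(times)
    best = None
    for k in range(steps + 1):
        b = t_min * k / steps  # candidate serial time, 0 <= B <= Tmin
        # Least-squares A for (T - B) = A/N, clamped so that A/N >= 0.
        a = sum((t - b) / n for n, t in zip(cores, times))
        a /= sum(1.0 / n**2 for n in cores)
        a = max(a, 0.0)
        # Least-squares C for the remaining residual T - (A/N + B) = C*N.
        c = sum((t - a / n - b) * n for n, t in zip(cores, times))
        c /= sum(n**2 for n in cores)
        # Fit quality: reciprocal of the sum of squared residuals.
        sse = sum((t - (a / n + b + c * n)) ** 2 for n, t in zip(cores, times))
        quality = float("inf") if sse == 0 else 1.0 / sse
        if best is None or quality > best[0]:
            best = (quality, a, b, c)
    return best


# Synthetic timings generated from A = 1000, B = 10, C = 0 (illustrative only).
example_cores = [64, 128, 256, 512]
example_times = [1000.0 / n + 10.0 for n in example_cores]
example_fit = fit_scaling_curve(example_cores, example_times)
```

Fitting A and C one after the other mirrors the order of the steps on the slide. A joint least-squares fit of both coefficients for each B would be an alternative design.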
Determination of Performance (cont.)
Calculate score (in DoD standard system equivalents)
– C = number of compute cores in target system
– Cbase = number of compute cores in standard system
– Sbase = number of compute cores in standard execution
– STM = size-to-match = number of compute cores of target system required to match performance of Sbase cores of the standard system
$$\text{Score} = \frac{C}{C_{base}} \times \frac{S_{base}}{STM}$$
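As a hedged illustration, the score can be computed from a fitted curve as follows. The slides do not say how STM is derived. This sketch assumes it is the smallest core count N at which the fitted time A/N + B + C*N equals the standard benchmark time, and that the score is (C / Cbase) × (Sbase / STM). The function names and the example numbers are hypothetical. Note that `c` in `size_to_match` is the parallel-penalty coefficient of the curve fit, not the core count C.

```python
import math


def size_to_match(a, b, c, t_standard):
    """Smallest N with a/N + b + c*N == t_standard, or None if unreachable."""
    gap = t_standard - b
    if gap <= 0:
        return None  # serial time alone already exceeds the standard time
    disc = gap * gap - 4.0 * a * c
    if disc < 0:
        return None  # the fitted curve never gets down to the standard time
    # Smaller root of c*N^2 - gap*N + a = 0, in a form that also works for c == 0.
    return 2.0 * a / (gap + math.sqrt(disc))


def score(cores_target, cores_base, s_base, stm):
    """Score in standard-system equivalents: (C / Cbase) * (Sbase / STM)."""
    return (cores_target / cores_base) * (s_base / stm)


# Illustrative values only: a fitted curve with A = 1000, B = 10, C = 0 and a
# standard time of 20 for a 128-core run on an 8,192-core standard system.
example_stm = size_to_match(1000.0, 10.0, 0.0, 20.0)
example_score = score(4000, 8192, 128, example_stm)
```

An STM smaller than Sbase means the target system needs fewer cores than the standard system to reach the same time, which raises the score.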
AMR Large Test Case on HP Opteron Cluster
AMR Large Test Case on SGI Altix
AMR Large Test Case on Dell Xeon Cluster
Overflow-2 Standard Test Case on Dell Xeon Cluster
Overflow-2 Large Test Case on IBM P5+
ICEPIC Standard Test Case on SGI Altix
ICEPIC Large Test Case on SGI Altix
Comparison of HPCMP System Capabilities: FY 2003 - FY 2008
What’s Next?
Continue to evolve application benchmarks to represent accurately the HPCMP computational workload
Increase profiling and performance modeling to understand application performance better
Use performance predictions to supplement application benchmark measurements and guide vendors in designing more efficient systems