TRANSCRIPT
1
High Performance Technical Computing: Strategy & Technology
Dr. Jochen Krebs
Manager, HPTC Presales & Business Development
Compaq Computer GmbH
[email protected]@compaq.com
2
Agenda
HPTC Strategy
Alpha Microprocessors
AlphaServer Family
AlphaServer SC Series
CEA Project
3
Strategic Role of High Performance Computing
[Diagram: technical users (early adopters) drive technology; commercial users drive volume. Labels: Analysis (virtual prototype), Material & Process Data, Analysis (decision support), Transaction & Process Data, Visualization.]
4
Compaq Value Proposition for Scientific Computing
• Top sustained application performance
  – enabled by complete 64-bit solutions
  – optimized ISV applications
  – Alpha floating-point advantage
  – scalable shared- and distributed-memory computer systems
• Upward binary compatibility across platforms
• Leading cluster technology
• Pre-integrated and pre-tested complex systems
• Planning, deployment, and management services
5
Top100 Supercomputer Architectures (June 1999 Top500)
[Bar chart, vertical axis 0% to 60%]

Architecture   Percent CPUs   Percent GFlops   Percent Sites
Alpha          40.8%          50.4%            41.0%
Pentium        25.1%           9.5%             3.0%
RS6000         19.5%          17.1%            30.0%
MIPS            6.8%           5.3%             7.0%
Hitachi         5.6%           9.3%             6.0%
Fujitsu         1.8%           4.6%             7.0%
NEC             0.7%           3.3%             5.0%
SPARC           0.4%           0.5%             1.0%
PA-RISC         0.0%           0.0%             0.0%
6
High Performance Reduces the Time to Solution ...
• Compaq builds computers designed to sustain high rates of computation for scientific and numerically intensive applications
• Pay particular attention to the memory subsystem
  – Microprocessor level
  – SMP level
  – SAN level
• Low-latency and high-bandwidth memory operations across the machine
7
The Application's Perspective
Support two broad classes of parallel architectures
• SMP (shared-memory parallel)
  – tasks reside in shared memory accessible to all processors doing the work
  – programming with threads or using compiler directives (OpenMP)
  – parallelization of a serial program can be automatic (compiler or pre-processor) or manual
• DMP (distributed-memory parallel)
  – tasks reside in distinct compute nodes; synchronization requires exchanging data over a system area network (message passing)
  – programming is explicit and manual, using a message-passing paradigm like MPI (Message Passing Interface)
• SMP and DMP can be combined in one system
  – SMP nodes in a system area network
  – "Thread safe" communication libraries
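The contrast between the two programming models can be sketched in a few lines. This is only an illustration using Python's standard library as a stand-in: it is not OpenMP or MPI, and the function names `smp_sum` and `dmp_sum` are hypothetical. In the first function all workers touch one shared accumulator; in the second each worker owns a private copy of its data and communicates only by sending an explicit message.

```python
import queue
import threading


def smp_sum(data, n_workers=4):
    """Shared-memory style: workers read one shared list and update
    one shared accumulator, protected by a lock."""
    total = [0]
    lock = threading.Lock()

    def work(rank):
        partial = sum(data[rank::n_workers])  # read directly from shared data
        with lock:                            # synchronize on shared state
            total[0] += partial

    threads = [threading.Thread(target=work, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def dmp_sum(data, n_workers=4):
    """Message-passing style: each worker gets a private copy of its chunk
    (threads stand in for separate nodes) and sends its partial sum as a
    message; the caller receives and combines the messages."""
    inbox = queue.Queue()

    def work(local_chunk):
        inbox.put(sum(local_chunk))           # explicit "send"

    chunks = [list(data[r::n_workers]) for r in range(n_workers)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(n_workers))  # explicit "receive"
```

Both functions return the same sum; what differs is where the data lives and how the workers synchronize, which is the distinction the slide draws between SMP and DMP.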
8
Compaq Product Positioning in Large Scale Architectures
[Chart: solution integration versus cost per MFLOP/s (axis marks 2, 10, 20). Regions: low-cost clusters ("Beowulf"), midrange clusters, "Sierra". Other labels: Fast Ethernet, Myrinet/ServerNet, QSW; Linux, Tru64 UNIX; 64 node, 128 node, ...; Latency, Bandwidth.]
9
EV6/EV67 processors
• Advanced Architecture
  – Up to four instructions per cycle (4 Int, 2 FP)
  – Out-of-order execution, up to 80 instructions in flight
  – Speculative execution with sophisticated branch prediction
• High Performance Memory Subsystem
  – 64K I-cache & D-cache
  – Up to 32 in-flight loads + 32 stores
  – 5 GB/s external cache bandwidth, 3 GB/s memory bandwidth
• 15.2 million transistors
• 0.35 µm (Intel) or 0.25 µm (Samsung) process technologies
10
SPEC2000: 1-CPU
[Bar chart: SPEC2000 Comparison. SPECfp_2000 and SPECint_2000 (axis 0 to 600) for ES40 667 MHz, IBM SP Pwr3 375 MHz, Sun US-III 333 MHz, HP N4000 440 MHz, Intel Pent-III 733 MHz, SGI R12K 400 MHz.]
11
Memory sub-system is critical for high performance ...
• Out-of-order execution masks latency of access to memory
• Evolution from EV5 to EV6 increased system memory bandwidth per operation
  – EV56: 250-350 MB/s sustained
  – EV6: ~2 GB/s sustained
  – EV7: ~6 GB/s sustained
• Alpha RISC machines deliver a higher fraction of peak performance to applications
12
McCalpin Copy
[Chart: McCalpin SMP Memory Copy Bandwidth Comparison. MB/sec (axis 0 to 4000) versus number of CPUs (0 to 8) for ES40/667, ES40/500, AlphaServer 4100 5/600, GS140, AlphaServer 8400, IBM RS6000-591, SGI Origin 2K/300 MHz, HP N4000, Sun Ultra Enterprise 6000, Intel Alder Pentium Pro.]
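For readers unfamiliar with the copy test, the measurement idea can be sketched as follows. This is a rough single-threaded illustration, not the McCalpin STREAM benchmark itself: the function name and default sizes are hypothetical, and an interpreted language will not reproduce the figures of a compiled benchmark. It follows the STREAM convention of counting both the bytes read and the bytes written.

```python
import time


def copy_bandwidth_mb_per_s(n_bytes=64 * 1024 * 1024, repeats=5):
    """Time a large memory-to-memory copy and report MB/s (MB = 10**6 bytes).

    The buffers should be much larger than the caches so that the copy
    exercises main memory. The best of several runs is reported.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                      # bulk copy from src into dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)                # guard against a zero timer reading
    # Copy traffic: n_bytes read from src plus n_bytes written to dst.
    return 2 * n_bytes / best / 1e6


if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_mb_per_s():.0f} MB/s")
```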
13
Alpha Instructions Per Cycle Comparisons
Average IPCs
• EV4 - 0.5
• EV5 - 0.8
• EV6 - 1.5
• EV7 - 1.5
• EV8 - 2.9
Relative IPCs
• EV4 - 1.0
• EV5 - 1.6
• EV6 - 3.0
• EV7 - 3.0
• EV8 - 5.8
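The relative figures are the average IPCs normalized to EV4. A short check, with the dictionary and function names chosen here purely for illustration:

```python
# Average IPC values as listed on the slide.
average_ipc = {"EV4": 0.5, "EV5": 0.8, "EV6": 1.5, "EV7": 1.5, "EV8": 2.9}


def relative_ipc(average, baseline="EV4"):
    """Divide every average IPC by the baseline chip's average IPC."""
    base = average[baseline]
    return {chip: round(ipc / base, 1) for chip, ipc in average.items()}


if __name__ == "__main__":
    for chip, rel in relative_ipc(average_ipc).items():
        print(chip, rel)
```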
14
Complete Suite of HPTC Systems
Switched based system - 64-bit PCI I/O subsystems - Very Large Memory
Scalable clusters on Tru64 UNIX, OpenVMS and Linux
Modular system packaging - advanced systems management

DS Series
• 1-2 Processors
• Up to 4GB of memory
• 6 PCI slots

ES Series
• 1-4 Processors
• Up to 16GB of memory
• Up to 10 PCI slots

GS Series
• 1-32 Processors
• EV67 Powered
• Up to 128+GB of memory
• Up to 224 PCI slots

SC Series
• 64-4096 Processors
• EV67 Powered
• Up to 16 TB memory
• Up to 28K I/O slots
15
AlphaServer ES40 EV67 Refresh
• AlphaServer ES40 enhancements planned for 1Q00
  – 667MHz EV67
  – Double the speed & size of the cache to 8MB
  – Double the memory to 32GB
  – New 64-bit PCI Ultra2 SCSI RAID controllers & Ultra3-ready StorageWorks expansion
• Up to 35-45% performance improvement of 667MHz EV67 over 500MHz EV6
  – SPECint95 & SPECint2000 + 42% = 2-3x above Sun!
  – SPECfp95 + 60%, SPECfp2000 + 36%
  – HPTC apps 30-47% faster (MARC, Ansys, Charmm)
16
Next Generation AlphaServers: High-End System
• SMP systems up to 32/64+ CPUs
  – Supports multiple Alpha chip generations
  – Scalable memory and I/O BW
• Transparent Memory Hierarchy: avoids complexity of traditional NUMA systems by providing:
  – Small latency penalty for non-local memory access (<3:1)
  – Large global memory bandwidth
  – Maintains desirable characteristics of bus-based SMP systems, while providing scalability
• Modular design for investment protection and easy growth
17
AlphaServer GS160/GS320: Scalable Enterprise Server
• Up to 32 CPUs with 128GB of memory in 3-bay cabinet
• Up to sixty-four 64-bit PCI buses; 224 slots
• Subset configuration for fewer processors available
  – field-upgradeable to maximum capacity
[Diagram: eight building blocks, each labeled with four EV67 CPUs, four memory modules ("Mem") and "I/O Switch", connected through a Global Switch; Input Power and Power Supplies.]
18
Load Independent Performance: A decade of high performance switch technology
• New high-end AlphaServer innovative architecture supports a low-latency, high-performance switching fabric and infrastructure
  – Provides high CPU-count advantages while eliminating performance penalties associated with Non-Uniform Memory Access (NUMA) architecture
  – Advantage is high performance at the application level, without degradation under load
• No modification to applications required
  – Binary compatible with existing Tru64 UNIX and OpenVMS AlphaServers
  – Can move from any AlphaServer without recompilation and immediately take advantage of largest configuration
19
GS320 and HPTC
• Designed for sustained application performance
• Awesome memory bandwidth (52 GB/s)
• Low, uniform latency across the machine
• Latency stays flat under load
• Sustain heavy computational load maintaining delivered performance
• For large SMP-type applications
20
AlphaServer SC Series
Up to 4096 CPUs in a single system CY2000
21
Sierra Product Objectives
• Memory hierarchy able to support high sustained performance of scientific applications
• System software that presents a single system image to users, programmers, and system administrators
• Environment and tools to develop and run parallel applications that scale to multiple TeraFLOPS performance
22
"Compaq AlphaServer SC Series"
Scalable, distributed-memory parallel computer systems built from standard components
• AlphaServer SMP nodes
• QSW Elan/Elite system area network
• Tru64 UNIX, cluster file system and system administration software
• QSW parallel file system and resource management software
• Parallel application development environment and tools from Compaq and third parties
23
The Sierra Program: performance growth over time
Multiple generations of hardware building blocks
• Alpha microprocessor and AlphaServers
  – EV6, EV67, EV68, EV7, EV8, ...
  – ES40, GS series, ...
• AlphaServer SC Interconnect from QSW
  – 16 nodes, 128 nodes, 256 nodes, 512 nodes, …
  – 200 MB/s, 500 MB/s, 1GB/s, ...
24
Switching System Area Interconnect by Quadrics Supercomputers World
• Elan-3 PCI adapter
  – DMA driven
  – Get and put
  – >200 MB/s/rail bi-directional
• Elite "fat tree" switch
  – 8-way x-bar chips
  – 16 or 128 port package
  – Up to 20m cables
  – 0.035 µs switch latency
• Multiple virtual circuits and load balancing
• Latency: <3 µs DMA/shmem, <6 µs MPI
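To see how latency and bandwidth figures like these combine, a common first-order model treats the time for one message as a fixed latency plus the message size divided by the bandwidth. The sketch below assumes that linear model, takes MB to mean 10^6 bytes, and plugs in the slide's bounds (6 µs MPI latency, 200 MB/s per rail) as if they were exact values; the function names are hypothetical.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s):
    """First-order model: time = latency + size / bandwidth.

    With MB = 10**6 bytes, 1 MB/s equals 1 byte per microsecond, so
    bytes divided by MB/s gives microseconds directly.
    """
    return latency_us + message_bytes / bandwidth_mb_per_s


def effective_bandwidth_mb_per_s(message_bytes, latency_us, bandwidth_mb_per_s):
    """Bandwidth actually achieved once the fixed latency is included."""
    return message_bytes / transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s)


if __name__ == "__main__":
    # Assumed inputs: 6 us latency, 200 MB/s link bandwidth.
    for size in (64, 1200, 65536, 1048576):
        bw = effective_bandwidth_mb_per_s(size, 6.0, 200.0)
        print(f"{size:>8} bytes: {bw:6.1f} MB/s effective")
```

Under these assumptions a message of latency × bandwidth = 1200 bytes reaches half of the link bandwidth; shorter messages are dominated by latency, which is why the slide quotes both numbers.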
25
Job Mgt - RMS Overview
• Administration of partitions
  – creates, removes, dynamically reconfigures partitions (interactive, I/O serving, parallel, etc.)
  – access control
• System-wide accounting of resources
  – maintained in SQL database
• MPI-execution utility (prun)
  – MPI jobs are scheduled 1 process per CPU
  – MPI deployment when CPUs are available
• Time-sliced gang scheduling
26
Cluster File System (CFS)
• File system mounted on any node is visible to all nodes without race conditions
• Single namespace & security domain
• Each node is both a CFS server and CFS client
• Coherency is maintained by exchanging tokens
• Semantics are POSIX and X/OPEN compliant
• Performance depends on access type and pattern
  – Local files are more efficient
  – Read-only is more efficient than shared read/write
  – Locality of reference is rewarded
27
Parallel File System
• Aggregates CFS files into a single parallel file
• Enables striping a single logical file across multiple underlying local files
• Enables I/O performance to scale linearly with the number of file servers
[Diagram: a parallel file made up of a metafile and underlying CFS files 0 to 3. Normal I/O operations and MPI-IO are striped over multiple host files.]
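The striping idea can be illustrated with a small address-mapping sketch. The slide does not specify the actual layout, so the round-robin placement, the stripe size, and the function names below are assumptions made for illustration only.

```python
def locate(offset, stripe_size, n_files):
    """Map a byte offset in the logical file to (file index, offset within
    that underlying file), assuming stripes are dealt out round-robin."""
    stripe_index = offset // stripe_size
    file_index = stripe_index % n_files
    local_offset = (stripe_index // n_files) * stripe_size + offset % stripe_size
    return file_index, local_offset


def split_request(offset, length, stripe_size, n_files):
    """Break one logical read/write into per-file pieces.

    Returns a list of (file index, local offset, byte count) tuples; pieces
    that land on different files could be serviced in parallel.
    """
    pieces = []
    end = offset + length
    while offset < end:
        file_index, local_offset = locate(offset, stripe_size, n_files)
        # Stop at the end of the current stripe or the end of the request.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((file_index, local_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # Four underlying files as in the diagram; 64-byte stripes are an
    # arbitrary choice to keep the numbers small.
    for piece in split_request(offset=32, length=256, stripe_size=64, n_files=4):
        print(piece)
```

Because consecutive stripes land on different underlying files, a large sequential request is spread over all file servers, which is what lets aggregate I/O bandwidth grow with their number.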
28
Storage
• RAID sets ensure no single disk failure causes data to become unavailable
• Multiply connected RAID sets ensure no single node failure causes a RAID set to become unavailable
29
Compilers & Tools
• Compaq F90, C, C++, Java, …
• Shared memory
  – Parallelization within SMP node by OpenMP
  – 3rd party decomposition tools (KAI)
• Cray T3D/E-compatible shmem library
• MPI (MPI 2, MPI-I/O, thread-safe)
• Debugger: TotalView (Etnus, Inc.)
• Performance analysis: Vampir (PALLAS)
• Load Balancing: LSF (Platform), CODINE/GRD (Gridware)
30
SC Series Roadmap
[Chart: TFlops (axis 0 to 100) by year, 1999 to 2004]

Year   Alpha (MHz)   Number of SMP CPUs   Number of Nodes   TFlops
1999   500           4                    128               ~0.5
2000   >700          32                   128               ~7.0
2001   >1000         64                   256               ~30.0
2002   >1200         64                   256               ~40.0
2004   >1500         64                   256               ~100
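A back-of-the-envelope peak estimate helps relate the roadmap columns to one another. The sketch below multiplies clock rate, CPUs per node and node count, and assumes two floating-point operations per cycle per CPU (consistent with the "2 FP" issue width quoted for EV6/EV67, but an assumption for later chips). The ">" clock rates are lower bounds and are used here as if exact; the names are hypothetical.

```python
# (year, clock in MHz, CPUs per SMP node, number of nodes) from the roadmap.
roadmap = [
    (1999, 500, 4, 128),
    (2000, 700, 32, 128),
    (2001, 1000, 64, 256),
    (2002, 1200, 64, 256),
    (2004, 1500, 64, 256),
]


def peak_tflops(mhz, cpus_per_node, nodes, flops_per_cycle=2):
    """Theoretical peak = clock * flops per cycle * total CPU count.

    flops_per_cycle=2 is an assumption, not a figure from the roadmap.
    """
    return mhz * 1e6 * flops_per_cycle * cpus_per_node * nodes / 1e12


if __name__ == "__main__":
    for year, mhz, cpus, nodes in roadmap:
        print(year, round(peak_tflops(mhz, cpus, nodes), 2))
```

With these inputs the 1999, 2001 and 2002 rows come out close to the roadmap figures, while the 2000 and 2004 rows come out lower, so for those years the roadmap evidently assumes higher clock rates or more operations per cycle than this sketch does.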
31
CEA
• Created in 1945, the French Atomic Energy Commission (CEA) is a public research organization whose mission is to develop applications of the atom in the sectors of energy, industry, research, healthcare, environmental protection and defense.
• More than 16,000 researchers, engineers and other employees have committed themselves to this task, which covers the short, medium and long term.
• CEA informs, advises and makes its expertise available to government authorities. It studies a wide range of scientific and technical solutions so that decision makers in government and industry may have all the information they require at the right moment to make the decisions which are best adapted to present and future needs.
32
CEA - civil
• Business Requirements
  – Replace a Cray T3E/750 (280 CPUs)
  – Better speed and performance than the T3E: nearly four times faster on real applications
  – Everything finalized by next December 2
• Proposed solution
  – Sierra System:
    - 58 nodes
    - dual rail QSW switch
    - 2 TB disk space (CFS + PFS) on FC
    - RMS management interface
  – Each node:
    - ES40
    - 4 GB RAM
    - 4 x EV67 at 667 MHz
33
Proposed configuration:
• Compute nodes x 58: ES40
  – 4 x EV67 @ 667 MHz
  – 4GB RAM
  – system/swap disk 1 x 9.1GB
  – local filesystem 2 x 9.1GB
• Shared (DFS) filesystem: 1TB
[Diagram legend: Management SAN, QSW SAN, Internal Console SAN. Other labels: terminal server, SAN data switch, two HIPPI links, CEA Civil network.]
34
CEA Direction des Applications Militaires
• Most powerful supercomputer in Europe (5+ Teraflops)
• Simulate the condition over time of the French stockpile of nuclear arms
• Installation and startup will be managed by Compaq Services' Integration Center at Annecy, France
• The system is planned to be fully operational by the end of 2001.
35
Software & Application Development Tools - Key URLs
• http://www.digital.com/fortran/
• http://www.unix.digital.com/linux/
• http://www.testdrive.compaq.com/linux/
• http://www.digital.com/info/hpc/
• http://www.digital.com/info/hpc/news/news_fortran_linux_beta.html
• http://www.digital.com/info/hpc/apps/apptech.html
• http://www.digital.com/info/hpc/hptc/index.html (HPTC Info Center)
• http://www.unix.digital.com/developerstoolkit/ Developer's Toolkit
• http://www.unix.digital.com/dtk/ Developer's Toolkit Supplement
• http://www.unix.digital.com/tools.htm Tru64 Development Tools Summary
• http://www.unix.digital.com/linux/software.htm Linux Compilers
Better answers.