Massive Supercomputing Coping with Heterogeneity of
Modern Accelerators
Toshio Endo and Satoshi Matsuoka
Tokyo Institute of Technology, Japan
Accelerators for High Performance Computing
In HPC systems, power consumption has been and will remain a major concern
SIMD accelerators are promising for their excellent Flops/Watt ratio
ClearSpeed X620 accelerator: 80GFlops, 25W
NVIDIA GeForce 8800GTX: 375GFlops (single precision), ~200W
Heterogeneous Architectures (1)
HPC systems with only "special purpose" accelerators are infeasible: they do not directly support existing compilers, applications, MPI, Linux...
Heterogeneous architectures are attractive because they combine:
Generality from general-purpose CPUs
• Typically x86/x86-64 CPUs
Higher Flops/Watt ratio from accelerators
• ClearSpeed accelerators, GPGPUs, Cell BE...
Examples: IBM Roadrunner, TokyoTech TSUBAME
Heterogeneous Architectures (2)
Objectives: Running a large parallel application on heterogeneous systems
Questions: How can we utilize heterogeneous resources effectively? Are they scalable up to supercomputing scale?
AMD Opteron 880: 4.8GFlops peak / core + ClearSpeed X620 accelerator: 80GFlops peak
Overview of Our Work
We take a tightly-coupled program, Linpack
Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators
>60TFlops: The world's highest Linpack performance on heterogeneous supercomputers
NEC/Sun/ClearSpeed/Voltaire TokyoTech TSUBAME Supercomputer (2006)
SunFire X4600: 16 Opteron cores/node x 655 nodes
ClearSpeed CSX600 SIMD accelerator on PCI-X boards: 360 at installation, 648 now
Voltaire ISR9288 InfiniBand, 10Gbps
Storage units: 48 disks x 500GB each
102TFlops peak = Opteron 49.8TF + ClearSpeed 52.2TF
16th supercomputer in the world, 2nd in Asia (Top500 in Nov 2007)
Structure of TSUBAME Node with Heterogeneous Processors
SunFire X4600: 8 dual-core Opteron CPUs (16 cores)
ClearSpeed accelerator on PCI-X
InfiniBand interconnect
[System overview figure: 655 compute nodes with 16 Opteron cores each; 288-port 10Gbps InfiniBand switches x 6; 1.6PByte storage; cooling towers (~20 units)]
ClearSpeed Accelerator PCI-X accelerator boards
• CSX600 SIMD processor x 2 + 1GB DRAM on board
• 210MHz x 2FP x 96SIMD x 2 = 80.6GFlops peak
• Configurable up to 250MHz
• Power: 25W/board
Provided software:
• Cn programming language
• CSXL BLAS library <= used by this work
• CSFFT library
DGEMM Performance of Opteron and ClearSpeed
Benchmark: multiply of (M x B) x (B x M) matrices
[Graph: GOTO BLAS on Opteron (1 core), speed (GFlops, axis 0 to 5) vs. matrix size M, for B=960 and B=240]
[Graph: CSXL BLAS 2.50 on ClearSpeed, speed (GFlops, axis 0 to 70) vs. matrix size M, for B=864 and B=576]
An accelerator is equivalent to 14 CPU cores at peak
ClearSpeed performance is much more sensitive to matrix size!
- GOTO BLAS is by Kazushige Goto, U. Texas
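For concreteness, the measured kernel is just a standard DGEMM call of the shape (M x B) times (B x M). A minimal, self-contained sketch through the CBLAS interface follows; the sizes, the row-major layout, and the harness are illustrative assumptions, not the exact code behind the graphs (the talk's measurements used GOTO BLAS on the Opterons and CSXL BLAS on ClearSpeed underneath the same BLAS interface):

```c
/* Minimal sketch of the measured kernel: C (MxM) += A (MxB) * Bm (BxM). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int M = 4000, B = 864;                      /* example sizes from the slides */
    double *A  = malloc((size_t)M * B * sizeof *A);   /* M x B operand  */
    double *Bm = malloc((size_t)B * M * sizeof *Bm);  /* B x M operand  */
    double *C  = calloc((size_t)M * M, sizeof *C);    /* M x M result   */
    for (long i = 0; i < (long)M * B; i++) { A[i] = 1.0; Bm[i] = 1.0; }

    /* C = 1.0 * A * Bm + 1.0 * C, row-major storage */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, M, B, 1.0, A, B, Bm, M, 1.0, C, M);

    printf("C[0] = %.1f\n", C[0]);                    /* expect B = 864.0 */
    free(A); free(Bm); free(C);
    return 0;
}
```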
Linpack: Our Target Application Benchmark
Linpack is a numerical benchmark used in Top500
• Solve a dense N x N system of linear equations
HPL (High-Performance Linpack) by A. Petitet
• A well-known MPI parallel implementation
• Matrix multiply (DGEMM) computation is the most time-consuming; O(N^3) in total
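For reference (not on the slide), the operation count that HPL reports its GFlops figure against is roughly (2/3)N^3 + 2N^2, so the DGEMM-dominated trailing-matrix updates account for essentially all of the work as N grows.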
Data Decomposition in HPL
Matrix is uniformly distributed with 2D Block-Cyclic distribution
[Figure: matrix distribution over 6 (=2x3) processes; N is the matrix size, B the block size]
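A minimal sketch of what "2D block-cyclic" means in code; the P x Q grid shape, block size, and indices are illustrative, and this mirrors the standard mapping rather than HPL's actual source:

```c
/* Which process (p,q) in a P x Q grid owns global element (i,j) under a
 * 2D block-cyclic distribution with block size B?  Block rows are dealt
 * cyclically over the P process rows, block columns over the Q columns. */
#include <stdio.h>

static void owner(long i, long j, int B, int P, int Q, int *p, int *q)
{
    *p = (int)((i / B) % P);
    *q = (int)((j / B) % Q);
}

int main(void)
{
    int p, q;
    owner(1000, 5000, 240, 2, 3, &p, &q);   /* 2 x 3 grid as in the figure */
    printf("element (1000,5000) -> process (%d,%d)\n", p, q);
    return 0;
}
```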
Flow of HPL (simplified)
Performance is bottlenecked by the slowest process
HPL is designed for uniform systems
[Figure: each process repeatedly performs panel factorization etc., then matrix multiply (DGEMM) for its own data, then goes back to the top of the loop; MPI communication couples the processes at each step]
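To make the loop structure in the figure concrete, here is a toy, serial, non-pivoting blocked LU factorization. Real HPL adds partial pivoting, the 2D block-cyclic layout, and MPI broadcasts of each panel, but the shape is the same: a thin panel factorization followed by a DGEMM-like trailing-matrix update, which is where almost all the time goes.

```c
/* Toy right-looking blocked LU (no pivoting, no MPI) illustrating the flow above. */
#include <stdio.h>

#define N 8
#define B 2

static double A[N][N];

int main(void)
{
    /* diagonally dominant test matrix so LU without pivoting is safe */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N : 1.0;

    for (int k = 0; k < N; k += B) {
        /* "panel factorization": unblocked LU on columns k..k+B-1 */
        for (int c = k; c < k + B; c++)
            for (int r = c + 1; r < N; r++) {
                A[r][c] /= A[c][c];
                for (int j = c + 1; j < k + B; j++)
                    A[r][j] -= A[r][c] * A[c][j];
            }
        /* triangular solve on the block row k..k+B-1 for the trailing columns */
        for (int c = k; c < k + B; c++)
            for (int r = c + 1; r < k + B; r++)
                for (int j = k + B; j < N; j++)
                    A[r][j] -= A[r][c] * A[c][j];
        /* "matrix multiply (DGEMM) for own data": the dominant update */
        for (int i = k + B; i < N; i++)
            for (int j = k + B; j < N; j++)
                for (int c = k; c < k + B; c++)
                    A[i][j] -= A[i][c] * A[c][j];
    }
    printf("U[0][0]=%g  L[1][0]=%g\n", A[0][0], A[1][0]);
    return 0;
}
```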
Requirements on Heterogeneous System
HPL is designed for homogeneous systems, but:
Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator
Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not
We want to keep modifications to the HPL source code small
How can we run HPL efficiently on heterogeneous systems?
Three System Configurations
CPU-Only: no heterogeneity
Fully-Acc'd: intra-node heterogeneity
Half-Acc'd: intra-node + inter-node heterogeneity
Our Basic Policy (1/2)
For intra-node heterogeneity, we 'virtualize' heterogeneous processors at the library layer
Processors are providers of DGEMM performance
We control the mapping between processes and processors
[Figure: example of the mapping between processes and processors during DGEMM]
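A minimal sketch of the 'virtualized DGEMM' idea, under the assumption that the work is split along the columns of C and that the split ratio is a tunable constant. The ratio and the function name are illustrative; the accelerator path here simply falls back to cblas_dgemm so the sketch stays self-contained, where the real system would call the CSXL BLAS and overlap it with the CPU part.

```c
/* dgemm_split: the caller sees one DGEMM; underneath, the columns of C are
 * divided between an "accelerator share" and a "CPU share".               */
#include <stddef.h>
#include <stdio.h>
#include <cblas.h>

static double acc_fraction = 0.8;   /* tunable: fraction of columns for the accelerator */

static void dgemm_split(int M, int N, int K,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    int n_acc = (int)(N * acc_fraction);   /* columns handled by the accelerator */
    int n_cpu = N - n_acc;                 /* columns handled by the CPU BLAS    */

    if (n_acc > 0)                         /* accelerator share (CSXL BLAS in the real system) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    M, n_acc, K, 1.0, A, lda, B, ldb, 1.0, C, ldc);

    if (n_cpu > 0)                         /* CPU share (GOTO BLAS in the real system) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    M, n_cpu, K, 1.0, A, lda,
                    B + (size_t)n_acc * ldb, ldb, 1.0,
                    C + (size_t)n_acc * ldc, ldc);
}

int main(void)
{
    enum { M = 4, N = 5, K = 3 };
    double A[M * K], B[K * N], C[M * N];
    for (int i = 0; i < M * K; i++) A[i] = 1.0;
    for (int i = 0; i < K * N; i++) B[i] = 1.0;
    for (int i = 0; i < M * N; i++) C[i] = 0.0;

    dgemm_split(M, N, K, A, M, B, K, C, M);   /* column-major leading dimensions */
    printf("C[0] = %.1f\n", C[0]);            /* expect K = 3.0 */
    return 0;
}
```

The point of the virtualization is that HPL keeps calling an ordinary BLAS interface; only the mapping of that call onto CPU cores and accelerators changes underneath.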
Our Basic Policy (2/2)
For inter-node heterogeneity, we control the number of processes among nodes
• cf. CHARM++, AMPI from UIUC
We can keep the kernel workload of each process uniform (good for HPL), while accommodating the heterogeneity
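A small, hedged illustration of how the process count per node could be chosen so that each process represents roughly the same DGEMM capability; the peak numbers come from earlier slides, but the per-process target and the resulting counts are illustrative, not the configuration actually used:

```c
/* Pick processes-per-node so that capability per process is roughly uniform. */
#include <stdio.h>

int main(void)
{
    const double core_gflops = 4.8;              /* Opteron 880 peak per core (slide)     */
    const double acc_gflops  = 80.6;             /* ClearSpeed board peak (slide)          */
    const double per_process = 4 * core_gflops;  /* illustrative target: ~4 cores' worth   */

    double plain_node = 16 * core_gflops;              /* node without an accelerator */
    double acc_node   = 16 * core_gflops + acc_gflops; /* node with an accelerator    */

    printf("plain node      : %.1f process-equivalents\n", plain_node / per_process);
    printf("accelerated node: %.1f process-equivalents\n", acc_node / per_process);
    /* in practice these would be rounded to whole process counts and then fine-tuned */
    return 0;
}
```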
Careful Tuning is Necessary for Performance
Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary:
• Process granularity
• Process mapping
• Block size
We need different tuning for each system configuration
Tuning of Process Granularity
We can tune 'process granularity' as the number of BLAS threads per process
If processes are too coarse (each process uses many threads), it is more difficult to balance load among nodes
If they are too fine, HPL suffers from duplicated computation
[Figure: fine-grained vs. coarse-grained mapping of processes onto cores]
Tuning of Block Size
When block size B is small, ClearSpeed performance is heavily degraded
When B is too large, HPL suffers from large overhead for panel factorization
Tuning on “CPU-only” Case
The focus is bringing out the BLAS performance of the CPUs
[Node figure: 16 Opteron cores per node, x 648 nodes]
Block size is 240, which is good for GOTO BLAS
Tuning on "Fully-Acc'd" Case
Focus is:
• Process granularity / block size should be large enough for ClearSpeed BLAS
• Balancing among processes, while utilizing both kinds of processors
[Node figure: 16 Opteron cores plus a ClearSpeed accelerator; part of the CPU resources is used for PCI-X communication with the board; x 648 nodes]
Block size is 864, which is good for ClearSpeed BLAS
Tuning on "Half-Acc'd" Case
Focus is the balance between accelerated nodes and non-accelerated nodes
[Figure: nodes without ClearSpeed x 288; nodes with ClearSpeed (with CPU resources for PCI-X communication) x 360]
Block size is 864
Experimentation
648 SunFire X4600 nodes in TSUBAME
Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS
Three configurations:
• CPU Only: only Opteron CPUs are used
• Half Acc'd: only half the nodes are accelerated
• Fully Acc'd: all the nodes are accelerated
Experimental Results
[Graph: speed (TFlops) vs. number of nodes (40 to 648) for CPU Only, Half Acc'd, and Fully Acc'd]
38.18TF in "CPU-only"
48.88TF in "Half-Acc'd"
• +28% over CPU Only
63.51TF in "Fully-Acc'd"
• Check precise figures in the next Top500 in June
[Graph: relative speed (CPU Only = 1) vs. number of nodes for the three configurations]
Summary
Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated
>60TFlops Linpack performance is achieved
Our method works efficiently even when nodes are only partially accelerated
Future work: from hand-tuning to automatic tuning; other useful applications!