Massive Supercomputing Coping with Heterogeneity of
Modern Accelerators
Toshio Endo and Satoshi Matsuoka
Tokyo Institute of Technology, Japan
Accelerators for High Performance Computing
In HPC systems, power consumption has been and will remain a major concern
SIMD accelerators are promising for their excellent Flops/Watt ratio
ClearSpeed X620 accelerator: 80GFlops, 25W
NVIDIA GeForce 8800GTX: 375GFlops (single precision), ~200W
Heterogeneous Architectures (1)
HPC systems with only "special purpose" accelerators are infeasible: they do not directly support existing compilers, applications, MPI, Linux...
Heterogeneous architectures are attractive because they combine:
Generality from general-purpose CPUs
• Typically x86/x86-64 CPUs
Higher Flops/Watt ratio from accelerators
• ClearSpeed accelerators, GPGPUs, Cell BE...
Examples: IBM Roadrunner, TokyoTech TSUBAME
Heterogeneous Architectures (2)
Objectives: Running a large parallel application on heterogeneous systems
Questions: How can we utilize heterogeneous resources effectively? Are they scalable up to supercomputing scale?
AMD Opteron 880: 4.8GFlops peak / core + ClearSpeed X620 accelerator: 80GFlops peak
Overview of Our Work
We take a tightly-coupled program, Linpack
Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators
>60TFlops: The world's highest Linpack performance on heterogeneous supercomputers
NEC/Sun/ClearSpeed/Voltaire TokyoTech TSUBAME Supercomputer (2006)
SunFire X4600: 16 Opteron cores/node x 655 nodes
ClearSpeed CSX600 SIMD accelerator on PCI-X boards: 360 at installation, 648 now
Voltaire ISR9288 InfiniBand, 10Gbps
Storage units: 48 disks x 500GB each
102TFlops peak = Opteron 49.8TF + ClearSpeed 52.2TF
16th supercomputer in the world, 2nd in Asia (Top500 in Nov 2007)
Structure of TSUBAME Node with Heterogeneous Processors
SunFire X4600: 8 dual-core Opteron CPUs (16 cores)
ClearSpeed accelerator on PCI-X
InfiniBand interconnect
[System overview figure: 655 compute nodes with 16 Opteron cores each; 288-port 10Gbps InfiniBand switches x 6; 1.6PByte storage; cooling towers (~20 units)]
ClearSpeed Accelerator PCI-X accelerator boards
• CSX600 SIMD processor x 2 + 1GB DRAM on board
• 210MHz x 2FP x 96SIMD x 2 = 80.6GFlops peak
• Configurable up to 250MHz
• Power: 25W/board
Provided software:
• Cn programming language
• CSXL BLAS library <= used by this work
• CSFFT library
DGEMM Performance of Opteron and ClearSpeed
Benchmark: multiply of (M x B) x (B x M) matrices
[Graph: GOTO BLAS on Opteron (1 core), speed (GFlops, axis 0 to 5) vs. matrix size M, for B=960 and B=240]
[Graph: CSXL BLAS 2.50 on ClearSpeed, speed (GFlops, axis 0 to 70) vs. matrix size M, for B=864 and B=576]
An accelerator is equivalent to 14 CPU cores at peak
ClearSpeed performance is much more sensitive to matrix size!
- GOTO BLAS is by Kazushige Goto, U. Texas
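For concreteness, the measured kernel is just a standard DGEMM call of the shape (M x B) times (B x M). A minimal, self-contained sketch through the CBLAS interface follows; the sizes, the row-major layout, and the harness are illustrative assumptions, not the exact code behind the graphs (the talk's measurements used GOTO BLAS on the Opterons and CSXL BLAS on ClearSpeed underneath the same BLAS interface):

```c
/* Minimal sketch of the measured kernel: C (MxM) += A (MxB) * Bm (BxM). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int M = 4000, B = 864;                      /* example sizes from the slides */
    double *A  = malloc((size_t)M * B * sizeof *A);   /* M x B operand  */
    double *Bm = malloc((size_t)B * M * sizeof *Bm);  /* B x M operand  */
    double *C  = calloc((size_t)M * M, sizeof *C);    /* M x M result   */
    for (long i = 0; i < (long)M * B; i++) { A[i] = 1.0; Bm[i] = 1.0; }

    /* C = 1.0 * A * Bm + 1.0 * C, row-major storage */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, M, B, 1.0, A, B, Bm, M, 1.0, C, M);

    printf("C[0] = %.1f\n", C[0]);                    /* expect B = 864.0 */
    free(A); free(Bm); free(C);
    return 0;
}
```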
Linpack: Our Target Application Benchmark
Linpack is a numerical benchmark used in Top500
• Solve a dense N x N system of linear equations
HPL (High-Performance Linpack) by A. Petitet
• A well-known MPI parallel implementation
• Matrix multiply (DGEMM) computation is the most time-consuming; O(N^3) in total
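For reference (not on the slide), the operation count that HPL reports its GFlops figure against is roughly (2/3)N^3 + 2N^2, so the DGEMM-dominated trailing-matrix updates account for essentially all of the work as N grows.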
Data Decomposition in HPL
Matrix is uniformly distributed with 2D Block-Cyclic distribution
[Figure: matrix distribution over 6 (=2x3) processes; N is the matrix size, B the block size]
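A minimal sketch of what "2D block-cyclic" means in code; the P x Q grid shape, block size, and indices are illustrative, and this mirrors the standard mapping rather than HPL's actual source:

```c
/* Which process (p,q) in a P x Q grid owns global element (i,j) under a
 * 2D block-cyclic distribution with block size B?  Block rows are dealt
 * cyclically over the P process rows, block columns over the Q columns. */
#include <stdio.h>

static void owner(long i, long j, int B, int P, int Q, int *p, int *q)
{
    *p = (int)((i / B) % P);
    *q = (int)((j / B) % Q);
}

int main(void)
{
    int p, q;
    owner(1000, 5000, 240, 2, 3, &p, &q);   /* 2 x 3 grid as in the figure */
    printf("element (1000,5000) -> process (%d,%d)\n", p, q);
    return 0;
}
```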
Flow of HPL (simplified)
Performance is bottlenecked by the slowest process
HPL is designed for uniform systems
[Figure: each process repeatedly performs panel factorization etc., then matrix multiply (DGEMM) for its own data, then goes back to the top of the loop; MPI communication couples the processes at each step]
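To make the loop structure in the figure concrete, here is a toy, serial, non-pivoting blocked LU factorization. Real HPL adds partial pivoting, the 2D block-cyclic layout, and MPI broadcasts of each panel, but the shape is the same: a thin panel factorization followed by a DGEMM-like trailing-matrix update, which is where almost all the time goes.

```c
/* Toy right-looking blocked LU (no pivoting, no MPI) illustrating the flow above. */
#include <stdio.h>

#define N 8
#define B 2

static double A[N][N];

int main(void)
{
    /* diagonally dominant test matrix so LU without pivoting is safe */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? N : 1.0;

    for (int k = 0; k < N; k += B) {
        /* "panel factorization": unblocked LU on columns k..k+B-1 */
        for (int c = k; c < k + B; c++)
            for (int r = c + 1; r < N; r++) {
                A[r][c] /= A[c][c];
                for (int j = c + 1; j < k + B; j++)
                    A[r][j] -= A[r][c] * A[c][j];
            }
        /* triangular solve on the block row k..k+B-1 for the trailing columns */
        for (int c = k; c < k + B; c++)
            for (int r = c + 1; r < k + B; r++)
                for (int j = k + B; j < N; j++)
                    A[r][j] -= A[r][c] * A[c][j];
        /* "matrix multiply (DGEMM) for own data": the dominant update */
        for (int i = k + B; i < N; i++)
            for (int j = k + B; j < N; j++)
                for (int c = k; c < k + B; c++)
                    A[i][j] -= A[i][c] * A[c][j];
    }
    printf("U[0][0]=%g  L[1][0]=%g\n", A[0][0], A[1][0]);
    return 0;
}
```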
Requirements on Heterogeneous System
HPL is designed for homogeneous systems, but:
Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator
Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not
We want to keep modifications to the HPL source code small
How can we run HPL efficiently on heterogeneous systems?
Three System Configurations
CPU-Only: no heterogeneity
Fully-Acc'd: intra-node heterogeneity
Half-Acc'd: intra-node + inter-node heterogeneity
Our Basic Policy (1/2)
For intra-node heterogeneity, we 'virtualize' heterogeneous processors at the library layer
Processors are providers of DGEMM performance
We control the mapping between processes and processors
[Figure: example of the mapping between processes and processors during DGEMM]
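A minimal sketch of the 'virtualized DGEMM' idea, under the assumption that the work is split along the columns of C and that the split ratio is a tunable constant. The ratio and the function name are illustrative; the accelerator path here simply falls back to cblas_dgemm so the sketch stays self-contained, where the real system would call the CSXL BLAS and overlap it with the CPU part.

```c
/* dgemm_split: the caller sees one DGEMM; underneath, the columns of C are
 * divided between an "accelerator share" and a "CPU share".               */
#include <stddef.h>
#include <stdio.h>
#include <cblas.h>

static double acc_fraction = 0.8;   /* tunable: fraction of columns for the accelerator */

static void dgemm_split(int M, int N, int K,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    int n_acc = (int)(N * acc_fraction);   /* columns handled by the accelerator */
    int n_cpu = N - n_acc;                 /* columns handled by the CPU BLAS    */

    if (n_acc > 0)                         /* accelerator share (CSXL BLAS in the real system) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    M, n_acc, K, 1.0, A, lda, B, ldb, 1.0, C, ldc);

    if (n_cpu > 0)                         /* CPU share (GOTO BLAS in the real system) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    M, n_cpu, K, 1.0, A, lda,
                    B + (size_t)n_acc * ldb, ldb, 1.0,
                    C + (size_t)n_acc * ldc, ldc);
}

int main(void)
{
    enum { M = 4, N = 5, K = 3 };
    double A[M * K], B[K * N], C[M * N];
    for (int i = 0; i < M * K; i++) A[i] = 1.0;
    for (int i = 0; i < K * N; i++) B[i] = 1.0;
    for (int i = 0; i < M * N; i++) C[i] = 0.0;

    dgemm_split(M, N, K, A, M, B, K, C, M);   /* column-major leading dimensions */
    printf("C[0] = %.1f\n", C[0]);            /* expect K = 3.0 */
    return 0;
}
```

The point of the virtualization is that HPL keeps calling an ordinary BLAS interface; only the mapping of that call onto CPU cores and accelerators changes underneath.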
Our Basic Policy (2/2)
For inter-node heterogeneity, we control the number of processes among nodes
• cf. CHARM++, AMPI from UIUC
We can keep the kernel workload of each process uniform (good for HPL), while accommodating the heterogeneity
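A small, hedged illustration of how the process count per node could be chosen so that each process represents roughly the same DGEMM capability; the peak numbers come from earlier slides, but the per-process target and the resulting counts are illustrative, not the configuration actually used:

```c
/* Pick processes-per-node so that capability per process is roughly uniform. */
#include <stdio.h>

int main(void)
{
    const double core_gflops = 4.8;              /* Opteron 880 peak per core (slide)     */
    const double acc_gflops  = 80.6;             /* ClearSpeed board peak (slide)          */
    const double per_process = 4 * core_gflops;  /* illustrative target: ~4 cores' worth   */

    double plain_node = 16 * core_gflops;              /* node without an accelerator */
    double acc_node   = 16 * core_gflops + acc_gflops; /* node with an accelerator    */

    printf("plain node      : %.1f process-equivalents\n", plain_node / per_process);
    printf("accelerated node: %.1f process-equivalents\n", acc_node / per_process);
    /* in practice these would be rounded to whole process counts and then fine-tuned */
    return 0;
}
```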
Careful Tuning is Necessary for Performance
Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary:
• Process granularity
• Process mapping
• Block size
We need different tuning for each system configuration
Tuning of Process Granularity
We can tune 'process granularity' as the number of BLAS threads per process
If processes are too coarse (each process uses many threads), it is more difficult to balance load among nodes
If they are too fine, HPL suffers from duplicated computation
[Figure: fine-grained vs. coarse-grained mapping of processes onto cores]
Tuning of Block Size
When block size B is small, ClearSpeed performance is heavily degraded
When B is too large, HPL suffers from large overhead for panel factorization
Tuning on “CPU-only” Case
The focus is bringing out the BLAS performance of the CPUs
[Node figure: 16 Opteron cores per node, x 648 nodes]
Block size is 240, which is good for GOTO BLAS
Tuning on "Fully-Acc'd" Case
Focus is:
• Process granularity / block size should be large enough for ClearSpeed BLAS
• Balancing among processes, while utilizing both kinds of processors
[Node figure: 16 Opteron cores plus a ClearSpeed accelerator; part of the CPU resources is used for PCI-X communication with the board; x 648 nodes]
Block size is 864, which is good for ClearSpeed BLAS
Tuning on "Half-Acc'd" Case
Focus is the balance between accelerated nodes and non-accelerated nodes
[Figure: nodes without ClearSpeed x 288; nodes with ClearSpeed (with CPU resources for PCI-X communication) x 360]
Block size is 864
Experimentation
648 SunFire X4600 nodes in TSUBAME
Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS
Three configurations:
• CPU Only: only Opteron CPUs are used
• Half Acc'd: only half the nodes are accelerated
• Fully Acc'd: all the nodes are accelerated
Experimental Results
[Graph: speed (TFlops) vs. number of nodes (40 to 648) for CPU Only, Half Acc'd, and Fully Acc'd]
38.18TF in "CPU-only"
48.88TF in "Half-Acc'd"
• +28% over CPU Only
63.51TF in "Fully-Acc'd"
• Check precise figures in the next Top500 in June
[Graph: relative speed (CPU Only = 1) vs. number of nodes for the three configurations]
Summary
Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated
>60TFlops Linpack performance is achieved
Our method works efficiently even when nodes are only partially accelerated
Future work: from hand-tuning to automatic tuning; other useful applications!