pedraforca: a first arm + gpu cluster for...

$: Pedraforca: a First ARM + GPU Cluster for HPCon-demand.gputechconf.com/gtc/2013/presentations/S...64x Compute nodes 4x ARM Cortex-A9 1x NVIDIA Tesla K20 ... See a d\ eveloped first$
www.bsc.es

Pedraforca: a First

ARM + GPU Cluster for HPC

Nikola Puzovic, Alex Ramirez

We’ve hit the power wall

ALL computers are limited by power consumption

Multi-core – Fujitsu Ultra SPARC VIIIfx

– Intel SandyBridge

– AMD Bulldozer

Low-power processors – IBM BlueGene/Q

Compute accelerators – IBM Cell

– NVIDIA Tesla

– AMD Radeon

– Intel Xeon Phi

Energy-efficient approaches

Build the next HPC system on commodity and super-commodity components

– 100M tablets in 2012

– 750M smartphones in 2012

The next step in the commodity chain

HPC

Servers

Desktop

Mobile

Tegra 2

– Dual-core ARM Cortex-A9

– ULP Embedded GPU

Tegra 3

– Quad-core ARM Cortex-A9

– 12-core Embedded GPU

Tegra 4


– 72-core Embedded GPU

NVIDIA Tegra: Commodity CPU + GPU platform

Proof of concept – It is possible to deploy a cluster of

smartphone processors

Enable software stack development

Tibidabo: The first ARM multicore cluster

Q7 carrier board

2 x Cortex-A9

2 GFLOPS

1 GbE + 100 MbE

7 Watts

0.3 GFLOPS / W

Q7 Tegra 2

2 x Cortex-A9 @ 1GHz

2 GFLOPS

5 Watts (?)

0.4 GFLOPS / W

1U Rackable blade

8 nodes

16 GFLOPS

65 Watts

0.25 GFLOPS / W

2 Racks

32 blade containers 256 nodes

512 cores

9x 48-port 1GbE switch

512 GFLOPS

3.4 Kwatt

0.15 GFLOPS / W

Open source system software stack – Ubuntu/Debian Linux OS

– GNU compilers • gcc, g++, gfortran

– Scientific libraries • ATLAS, FFTW, HDF5,...

– Slurm cluster management

Runtime libraries – MPICH2, CUDA, …

– OmpSs toolchain*

Developer tools – Paraver, Scalasca

– Allinea DDT debugger

HPC System software stack on ARM

OmpSs runtime library (NANOS++)

GPU CPU GPU CPU

CPU GPU …

Source files (C, C++, FORTRAN, …)

gcc gfortran OmpSs … Compiler(s)

Executable(s)

CUDA OpenCL MPI

GASNet

Linux Linux Linux

FFTW HDF5 … … ATLAS Scientific libraries

Scalasca … Paraver

Developer tools

Cluster management (Slurm)

* S3232 - OmpSs: Leveraging CUDA and OpenCL to

Exploit Heterogeneous Clusters of Hardware

Accelerators. Thursday, 10:00, Marriott Ballroom 3

Porting applications to ARM

Application Domain Institution Prog. Model

Scalability ARM port MPI OpenMP Other

YALES2 Combustion CNRS/CORIA Y >32K

EUTERPE Fusion BSC Y Y >60K

SPECFEM3D Wave propagation CNRS Y CUDA, SMPSs >150K, >1K GPU

MP2C Multi-particle collision JSC Y >65K

BigDFT Elect. Structure CEA Y Y CUDA, OpenCL >2K, >300 GPU

Quantum Expresso Elect. Strcuture CINECA Y Y CUDA Good

PEPC Coulomg + gravitational

forces JSC Y Pthreads, SMPSs >300K

SMMP Protein folding JSC Y OpenCL

16K

ProFASI Protein folding JSC Y Good

COSMO Weather forecast CINECA Y Y

BQCD Particle physics LRZ Y Y ~300K

Porting full-scale HPC applications to ARM cluster requires minimal effort

Tegra3 SoC


– 6 PCIe lanes (gen1)

Quadro 1000M

– CUDA supported

1 GbE

First hybrid

ARM + CUDA

platform

CARMA: CUDA on ARM developer kit

CARMA Kit: Energy Efficiency

CARMA platform is much more energy-efficient than Tegra3 alone

Pedraforca v1: The first ARM + GPU cluster

Development cluster of 16 CARMA kits @ BSC

Pedraforca v1: Initial application performance results

Only 3.72 GLFOPS in Linpack … but – DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W)

– SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W)

– Low PCIe bandwidth (400 MB/s peak)

– No overlap of data transfers and computation

Pedraforca v2: Next generation ARM + GPU platform Tegra3 Q7 module 4x ARM Cortex-A9 @ 1.3 GHz

2GB DDR2

Mini-ITX carrier

4x PCIe Gen1

SATA 2.0

1 GbE

2.5” SSD

250 GB

SATA 3 MLC

NVIDIA Tesla K20

16x PCIe Gen3

1170 GFLOPS (peak)

Mellanox ConnectX-3

8x PCIe Gen3

40 Gb/s

Ethernet 1 Gb/s (service + storage)

InfiniBand 40 Gb/s (MPI)

Pedraforca: Rack enclosure

2x GbE switch

4x IB switch

Login nodes

Intel SandyBridge E5

64x Compute nodes

4x ARM Cortex-A9

1x NVIDIA Tesla K20

NFS Storage

Pedraforca: Interconnect

GbE network for service and storage

IB network for MPI – With extra ports to connect to other clusters …

GbE

GbE

IB

IB

IB

IB

Current GPU clusters

– Fixed ratio of CPU to GPU

– Unused GPU in not-accelerated apps

– Unused CPU in heavily accelerated apps

Decouple CPU from GPU

– Off-load kernels to remote GPU

– Direct GPU to GPU data transfers

• Orchestrated by light-weight ARM CPU

GPU-accelerated cluster vs. GPU-accelerator cluster

CPU GPU CPU GPU CPU GPU CPU GPU



Interconnection network

CPU CPU CPU CPU

GPU GPU GPU GPU

CPU CPU CPU CPU

CPU CPU CPU CPU

GPU GPU GPU GPU

Interconnection network

Conclusions

CARMA is not an HPC solution …

… but it enables software development already

Pedraforca is the second generation ARM + GPU prototype – GPU-accelerator cluster, instead of GPU-accelerated cluster

• ARM CPU used to orchestrate direct GPU to GPU communication

CPU + GPU integration is happening already – Embedded mobile platforms with OpenCL capable GPU

Get ready for your next generation CPU + GPU platforms!

We’re hiring!

Do you want to work on the next generation of energy-efficient

HPC systems?

Lead the way to the Exascale?

Change the HPC world forever?

http://www.bsc.es/about_bsc/employment/vacancies

– Senior Researchers in Energy-Efficient Supercomputers

– HPC Application Developers



pedraforca: a first arm + gpu cluster for...

Documents