
PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

HARDWARE (2/5)

Rob van Nieuwpoort

rob@cs.vu.nl

Schedule 3

1. Introduction, performance metrics & analysis

2. Many-core hardware, low-level optimizations

3. CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Hierarchical systems 4

Grid

Cluster

Node

Multiple GPUs per node

Multiple chips per GPU

Streaming multiprocessors

Hardware threads

...

(this course focuses on the node level and below)

Multi-core CPUs 5

General Purpose Processors 6

Architecture

Few fat cores

Vectorization

Streaming SIMD Extensions (SSE)

Advanced Vector Extensions (AVX)

Homogeneous

Stand-alone

Memory

Shared, multi-layered

Per-core cache and shared cache

Programming

Multi-threading

OS Scheduler

Coarse-grained parallelism
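To make the multi-threading bullets above concrete, here is a minimal POSIX-threads sketch of coarse-grained parallelism on a multi-core CPU. The worker function, the array, and the thread count are illustrative assumptions, not material from the slides; the OS scheduler decides which core runs each thread.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1024

static float data[N];
static float partialSums[NUM_THREADS];

// Each thread sums one contiguous chunk of the array (coarse-grained work).
static void* worker(void* arg) {
    int id = *(int*) arg;
    int chunk = N / NUM_THREADS;
    float sum = 0.0f;
    for (int i = id * chunk; i < (id + 1) * chunk; i++) {
        sum += data[i];
    }
    partialSums[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < N; i++) data[i] = 1.0f;

    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        pthread_create(&threads[t], NULL, worker, &ids[t]);
    }

    float total = 0.0f;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partialSums[t];
    }
    printf("total = %f\n", total);  // expect 1024.0
    return 0;
}

(Compile with e.g. gcc -pthread.)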

Intel 7

AMD Magny-Cours 8

AMD Magny-Cours

Two 6-core processors on a single chip

Up to four of these chips in a single compute node

48 cores in total

Non-uniform memory access

Per-core cache

Per-chip cache

Local memory

Remote memory (HyperTransport)

9

AMD Magny-Cours 10

AMD Magny-Cours 11

AWARI on the Magny-Cours 12

DAS-2

51 hours

72 machines / 144 cores

72 GB RAM in total

1.4 TB disk in total

Magny-Cours

45 hours

1 machine, 48 cores

128 GB RAM in 1 machine

4.5 TB disk in 1 machine

Less than 12 hours with new algorithm (needs more RAM)

Multi-core CPU programming

Threads

Pthreads, Java threads, …

OpenMP

MPI

OpenCL

Vectorization

Streaming SIMD Extensions (SSE)

Advanced Vector Extensions (AVX)

13
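As a minimal sketch of the OpenMP option listed above (the vectorAdd function mirrors the example used later in these slides, but this exact code is not from them): every loop iteration is independent, so the runtime can distribute the iterations over the available cores.

#include <omp.h>

void vectorAdd(int size, float* a, float* b, float* c) {
    // The pragma asks OpenMP to split the iterations over a team of threads.
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.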

Vectorizing with SSE

Assembly instructions

16 registers

C or C++: intrinsics

Name instruction, but not registers

Work on variables, not registers

Declare vector variables

14

Vectorizing with SSE examples

#include <xmmintrin.h>  // SSE intrinsics

float data[1024];  // assumed 16-byte aligned, as _mm_load_ps requires

// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);

// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();

// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);

// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data + 4);

[Figure: resulting vector contents — myVector0 = (0.0, 0.0, 0.0, 0.0), myVector1 = (0.0, 1.0, 2.0, 3.0), myVector2 = (4.0, 5.0, 6.0, 7.0).]

15

Vectorizing with SSE examples

// Add vectors 1 and 2; one instruction performs 4 FLOPs.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);

// Multiply vectors 1 and 2; one instruction performs 4 FLOPs.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);

// _mm_shuffle_ps: the two low elements of the result are selected from
// myVector1 (by the last two _MM_SHUFFLE arguments), the two high
// elements from myVector2 (by the first two).
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
                                  _MM_SHUFFLE(2, 3, 0, 1));

[Figure: with myVector1 = (0.0, 1.0, 2.0, 3.0) and myVector2 = (4.0, 5.0, 6.0, 7.0), the addition gives myVector3 = (4.0, 6.0, 8.0, 10.0), the multiplication gives myVector4 = (0.0, 5.0, 12.0, 21.0), and the shuffle produces a vector whose elements are picked from both inputs.]

16

Vector add

void vectorAdd(int size, float* a, float* b, float* c) {
  for (int i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
  }
}

17

Vector add with SSE: unroll loop

void vectorAdd(int size, float* a, float* b, float* c) {
  // Unrolled by four; assumes size is a multiple of 4.
  for (int i = 0; i < size; i += 4) {
    c[i+0] = a[i+0] + b[i+0];
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
  }
}

18

Vector add with SSE: vectorize loop

void vectorAdd(int size, float* a, float* b, float* c) {
  // Assumes size is a multiple of 4 and 16-byte aligned arrays.
  for (int i = 0; i < size; i += 4) {
    __m128 vecA = _mm_load_ps(a + i);      // load 4 elements from a
    __m128 vecB = _mm_load_ps(b + i);      // load 4 elements from b
    __m128 vecC = _mm_add_ps(vecA, vecB);  // add 4 elements
    _mm_store_ps(c + i, vecC);             // store 4 elements
  }
}

19
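The vectorized loop above assumes that size is a multiple of 4 and that the arrays are 16-byte aligned. The variant below is a sketch that drops both assumptions (it is not from the slides): it uses the unaligned load/store intrinsics and finishes the last few elements with a scalar tail loop.

#include <xmmintrin.h>

void vectorAddUnaligned(int size, float* a, float* b, float* c) {
  int i = 0;
  // Vectorized part: 4 elements per iteration, unaligned loads and stores.
  for (; i + 4 <= size; i += 4) {
    __m128 vecA = _mm_loadu_ps(a + i);
    __m128 vecB = _mm_loadu_ps(b + i);
    _mm_storeu_ps(c + i, _mm_add_ps(vecA, vecB));
  }
  // Scalar tail: the remaining 0..3 elements.
  for (; i < size; i++) {
    c[i] = a[i] + b[i];
  }
}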

The Cell Broadband Engine 20

Cell/B.E. 21

Cell/B.E. 22

Architecture

Heterogeneous

1 PowerPC (PPE)

8 vector-processors (SPEs)

Programming

User-controlled scheduling

6 levels of parallelism, all under user control

Fine- and coarse-grain parallelism

Cell/B.E. memory 23

“Normal” main memory

PPE: normal read / write

SPEs: Asynchronous manual transfers: DMA

Per-core fast memory: the Local Store (LS)

Application-managed cache

256 KB

128 x 128 bit vector registers

Roadrunner (IBM) 24

Los Alamos National Laboratory

#1 of top500 June 2008 – November 2009

Now #10

122,400 cores, 1.4 petaflops

First petaflops system

PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz

The Cell’s vector instructions

Differences with SSE

SPEs execute only vector instructions

More advanced shuffling (see the sketch after this list)

Not 16, but 128 registers!

Fused Multiply Add support

25
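As a sketch of the more advanced shuffling mentioned above: the SPU shuffle works at byte granularity, selecting each result byte from the 32 bytes of its two inputs via a pattern vector. The example below is an illustration rather than slide material; it reverses the element order of a four-float vector in a single shuffle.

#include <spu_intrinsics.h>

vector float reverse4(vector float v) {
  // Pattern byte values 0-15 select bytes of the first operand,
  // 16-31 select bytes of the second operand.
  vector unsigned char pattern = { 12, 13, 14, 15,  8,  9, 10, 11,
                                    4,  5,  6,  7,  0,  1,  2,  3 };
  return spu_shuffle(v, v, pattern);
}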

FMA instruction

Multiply-Add (MAD): D = A × B + C, where the intermediate product A × B is first rounded (digits truncated) and then added to C.

Fused Multiply-Add (FMA): D = A × B + C, where the full-precision product is retained and only the final sum is rounded (no loss of precision).

26
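A small C sketch of the difference described above, using the standard C99 fmaf function as a stand-in for a hardware FMA. The constants are chosen so that rounding the product first loses exactly the bits that the fused version keeps; the commented results assume float arithmetic on a typical SSE-based x86-64 target, and the compiler must not contract a * a + c into an FMA itself (e.g. gcc -ffp-contract=off).

#include <math.h>
#include <stdio.h>

int main(void) {
  float a = 1.0f + 0x1.0p-12f;     // 1 + 2^-12
  float c = -(1.0f + 0x1.0p-11f);  // -(1 + 2^-11)

  // Multiply-add: a * a is rounded to float first, so the 2^-24 term is lost.
  float mad = a * a + c;

  // Fused multiply-add: a single rounding at the end keeps the 2^-24 term.
  float fma = fmaf(a, a, c);

  printf("mad = %g, fma = %g\n", mad, fma);  // mad = 0, fma = 2^-24
  return 0;
}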

Cell Programming models

IBM Cell SDK

C + MPI

OpenCL

Many models from academia...

27

Cell SDK

Threads, but only on the PPE

Distributed memory

Local stores = application-managed cache!

DMA transfers

Signaling and mailboxes

Vectorization

28

Direct Memory Access (DMA)

Start an asynchronous DMA:
  mfc_get(local store address, main memory address, #bytes, tag);

Wait for the DMA to finish:
  mfc_write_tag_mask(1 << tag);   // tag is a tag-group id, 0..31
  mfc_read_tag_status_all();

DMA lists

Overlap communication with useful work

Double buffering

29

Vector sum

float vectorSum(int size, float* vector) {
  float result = 0.0f;
  for (int i = 0; i < size; i++) {
    result += vector[i];
  }
  return result;
}

30

Parallelization strategy

Partition problem into 8 pieces

(Assuming a chunk fits in the Local Store)

PPE starts 8 SPE threads

Each SPE processes 1 piece

Has to load data from PPE with DMA

PPE adds the 8 sub-results

31

Vector sum SPE code (1)

float vectorSum(int size, float* PPEVector) {
  float result = 0.0f;
  int chunkSize = size / NR_SPES;  // Partition the data.
  float localBuffer[chunkSize];    // Allocate a buffer in my private local store.
  int tag = 1;                     // DMA tag group id (valid tags are 0..31).

  // Points to my chunk in PPE memory.
  float* myRemoteChunk = PPEVector + chunkSize * MY_SPE_NUMBER;

32

Vector sum SPE code (2)

  // Copy the input data from the PPE (the DMA size is in bytes).
  mfc_get(localBuffer, myRemoteChunk, chunkSize * sizeof(float), tag);
  mfc_write_tag_mask(1 << tag);
  mfc_read_tag_status_all();

  // The real work.
  for (int i = 0; i < chunkSize; i++) {
    result += localBuffer[i];
  }
  return result;
}

33

Can we optimize this strategy? 34

Can we optimize this strategy? 35

Vectorization (see the sketch after this list)

Overlap communication and computation

Double buffering

Strategy:

Split in more chunks than SPEs

Let each SPE download the next chunk while processing the current chunk
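For the vectorization point above, the inner summation over the local buffer could be vectorized with the SPU intrinsics roughly as sketched below. This assumes the buffer is 16-byte aligned and chunkSize is a multiple of 4; sumChunkVectorized is a hypothetical helper, not code from the slides.

#include <spu_intrinsics.h>

// Sum chunkSize floats with 4-wide SPU vector adds.
static float sumChunkVectorized(float* localBuffer, int chunkSize) {
  vector float acc = spu_splats(0.0f);            // four running partial sums
  vector float* vecBuffer = (vector float*) localBuffer;

  for (int i = 0; i < chunkSize / 4; i++) {
    acc = spu_add(acc, vecBuffer[i]);             // four additions per instruction
  }

  // Reduce the four partial sums to a single scalar.
  return spu_extract(acc, 0) + spu_extract(acc, 1) +
         spu_extract(acc, 2) + spu_extract(acc, 3);
}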

DMA double buffering example (1)

float vectorSum(float* PPEVector, int size, int nrChunks) {
  float result = 0.0f;
  int chunkSize = size / nrChunks;
  int chunksPerSPE = nrChunks / NR_SPES;
  int firstChunk = MY_SPE_NUMBER * chunksPerSPE;
  int lastChunk = firstChunk + chunksPerSPE;

  // Allocate two buffers in my private local store.
  float localBuffer[2][chunkSize];
  int currentBuffer = 0;

  // Start asynchronous DMA of the first chunk (the buffer index is used as DMA tag).
  float* myRemoteChunk = PPEVector + firstChunk * chunkSize;
  mfc_get(localBuffer[currentBuffer], myRemoteChunk,
          chunkSize * sizeof(float), currentBuffer);

36

DMA double buffering example (2)

  for (int chunk = firstChunk; chunk < lastChunk; chunk++) {
    // Prefetch the next chunk asynchronously into the other buffer.
    if (chunk != lastChunk - 1) {
      float* nextRemoteChunk = PPEVector + (chunk + 1) * chunkSize;
      mfc_get(localBuffer[!currentBuffer], nextRemoteChunk,
              chunkSize * sizeof(float), !currentBuffer);
    }

    // Wait for the DMA of the current buffer to finish.
    mfc_write_tag_mask(1 << currentBuffer);
    mfc_read_tag_status_all();

    // The real work.
    for (int i = 0; i < chunkSize; i++) {
      result += localBuffer[currentBuffer][i];
    }

    currentBuffer = !currentBuffer;
  }
  return result;
}

37

Double and triple buffering

Read-only data

Double buffering

Read-write data

Triple buffering! (see the sketch after this list)

Work buffer

Prefetch buffer, asynchronous download

Finished buffer, asynchronous upload

General technique

On-chip networks

GPUs (PCI-e)

MPI (cluster)

38
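A sketch of the triple-buffering scheme for read-write data described above, in the same simplified DMA style as the earlier examples. mfc_put is the upload counterpart of mfc_get; processChunk is a hypothetical in-place computation, and buffer alignment details are omitted.

void processAllChunks(float* PPEVector, int chunkSize,
                      int firstChunk, int lastChunk) {
  float buffers[3][chunkSize];               // local-store buffers
  int work = 0, prefetch = 1, finished = 2;  // buffer roles; also used as DMA tags

  // Download the first chunk into the work buffer.
  mfc_get(buffers[work], PPEVector + firstChunk * chunkSize,
          chunkSize * sizeof(float), work);

  for (int chunk = firstChunk; chunk < lastChunk; chunk++) {
    // Prefetch the next chunk while working on the current one.
    if (chunk != lastChunk - 1) {
      mfc_get(buffers[prefetch], PPEVector + (chunk + 1) * chunkSize,
              chunkSize * sizeof(float), prefetch);
    }

    // Wait for the download of the work buffer and for the earlier
    // upload of the finished buffer (it becomes the prefetch target soon).
    mfc_write_tag_mask((1 << work) | (1 << finished));
    mfc_read_tag_status_all();

    processChunk(buffers[work], chunkSize);  // read-write computation in place

    // Upload the modified chunk asynchronously; this buffer is now "finished".
    mfc_put(buffers[work], PPEVector + chunk * chunkSize,
            chunkSize * sizeof(float), work);

    // Rotate the three roles: work -> finished, prefetch -> work, finished -> prefetch.
    int oldFinished = finished;
    finished = work;
    work = prefetch;
    prefetch = oldFinished;
  }

  // Wait for the last upload before returning.
  mfc_write_tag_mask(1 << finished);
  mfc_read_tag_status_all();
}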

Intel’s many-core platforms 39

Intel Single-chip Cloud Computer 40

Architecture

Tile-based many-core (48 cores)

A tile is a dual-core

Stand-alone

Memory

Per-core and per-tile

Shared off-chip

Programming

Multi-processing with message passing

User-controlled mapping/scheduling

Gain performance …

Coarse-grain parallelism

Multi-application workloads (cluster-like)

Intel Single-chip Cloud Computer 41

Intel SCC Tile

2 cores

16 KB L1 cache per core

256 KB L2 cache per core

8 KB message-passing buffer

On-chip network router

42

Intel's Larrabee

GPU based on x86 architecture

Hardware multithreading

Wide SIMD

Achieved 1 TFLOP/s sustained application performance (demonstrated at SC09)

Canceled as a graphics product in December 2009; re-targeted at the HPC market

43

Intel's Many Integrated Core (MIC)

May 2010: Larrabee + 80-core research chip + SCC → MIC

x86 vector cores

Knights Ferry: 32 cores, 128 threads, 1.2 GHz, 8 MB shared cache

Knights Corner: 22 nm, 50+ cores

44

GPU hardware introduction 45

CPU vs GPU 46

Movie

The Mythbusters

Jamie Hyneman & Adam Savage

Discovery Channel

Appearance at NVIDIA’s NVISION 2008
