HPC – Algorithms and Applications · Fundamentals – Parallel Architectures, Models, and Languages


Page 1: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages

Michael Bader

Winter 2015/2016

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 1

Page 2: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Part I

Parallel Architectures

(sorry, not always the latest ones . . . )

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 2

Page 3: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Multicore CPUs – Intel’s Nehalem Architecture

• example: quad-core CPU with shared and private caches
• simultaneous multithreading: 8 threads on 4 cores
• memory architecture: Quick Path Interconnect
  (replaced the Front Side Bus)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 3

Page 4: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Multicore CPUs – Intel’s Nehalem Architecture (2)

• NUMA (non-uniform memory access) architecture:
  CPUs have “private” memory, but uniform access to remote memory
• max. 25 GB/s bandwidth

(source: Intel – Nehalem Whitepaper)
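On such a NUMA system, data placement matters for the attainable bandwidth. As a small illustration (not from the slides; the function name and the use of OpenMP are assumptions for this example), the common “first-touch” strategy initialises data with the same thread layout that later computes on it, so that pages end up in the memory local to the threads that use them:

#include <omp.h>

/* Hedged sketch: on systems with a first-touch page-placement policy, letting
 * each thread initialise "its" part of the arrays places the pages in the
 * NUMA domain of the thread that will later work on them.                   */
void numa_friendly_init(double *x, double *y, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {   /* same static schedule as the compute loops */
        x[i] = 0.0;
        y[i] = 0.0;
    }
}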

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 4

Page 5: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Manycore CPU – Intel Xeon Phi Coprocessor

• coprocessor = works as an extension card on the PCI bus
• ≈ 60 cores, 4 hardware threads per core
• simpler architecture for each core, but
• wider vector computing unit (8 double-precision floats)
• next generation (Knights Landing) announced to be available as a standalone CPU

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 5

Page 6: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Manycore CPU – Intel “Knights Landing”

Unveiling Details of Knights Landing (Next Generation Intel® Xeon Phi™ Products)

• 2nd half ’15: 1st commercial systems
• Parallel Performance & Density: 3+ TFLOPS¹ in one package
• Compute: energy-efficient IA cores², microarchitecture enhanced for HPC³,
  3X single-thread performance vs Knights Corner⁴, Intel Xeon processor binary compatible⁵
• On-Package Memory: up to 16 GB at launch, 5X bandwidth vs DDR4⁷,
  1/3X the space⁶, 5X power efficiency⁶; jointly developed with Micron Technology
• Platform Memory: DDR4 bandwidth and capacity comparable to Intel® Xeon® processors
• Integrated Fabric; Intel® Silvermont arch. enhanced for HPC
  (processor package shown conceptually – not actual package layout)

Footnotes (Intel): All products, computer systems, dates and figures specified are preliminary, based on current expectations, and are subject to change without notice. ¹Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle; FLOPS = cores × clock frequency × floating-point operations per cycle. ²Modified version of the Intel® Silvermont microarchitecture currently found in Intel® Atom™ processors. ³Modifications include AVX-512 and 4 threads/core support. ⁴Projected peak theoretical single-thread performance relative to the 1st-generation Intel® Xeon Phi™ coprocessor 7120P (formerly codenamed Knights Corner). ⁵Binary compatible with Intel Xeon processors using the Haswell instruction set (except TSX). ⁶Projected results based on internal Intel analysis of Knights Landing memory vs Knights Corner (GDDR5). ⁷Projected result based on internal Intel analysis of the STREAM benchmark using a Knights Landing processor with 16 GB of ultra-high-bandwidth memory versus DDR4 memory only with all channels populated.

(source: Intel/Raj Hazra – ISC’14 keynote presentation)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 6

Page 7: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

GPGPU – NVIDIA Fermi

Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.

An Overview of the Fermi Architecture

The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating-point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

[Figure: Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).]

(source: NVIDIA – Fermi Whitepaper)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 7

Page 8: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

GPGPU – NVIDIA Fermi (2)

Third Generation Streaming Multiprocessor

The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

512 High Performance CUDA cores

Each SM features 32 CUDA processors—a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). Prior GPUs used IEEE 754-1985 floating-point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single- and double-precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double-precision FMA.

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units

Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

[Figure: Fermi Streaming Multiprocessor (SM) — instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 × 32-bit); 32 CUDA cores (each with dispatch port, operand collector, FP unit, INT unit, result queue); 16 load/store units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache.]

(source: NVIDIA – Fermi Whitepaper)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 8

Page 9: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

GPGPU – NVIDIA Fermi (3)

General Purpose Graphics Processing Unit:

• 512 CUDA cores (organized in 16 streaming multiprocessors)
• improved double-precision performance
• shared vs. global memory
• new: L1 and L2 cache (768 KB)
• trend from GPU towards CPU?

Memory Subsystem Innovations

NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache

Working with hundreds of GPU computing applications from various industries, we learned that while shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior.

Adding a true cache hierarchy for load/store operations presented significant challenges. Traditional GPU architectures support a read-only “load” path for texture operations and a write-only “export” path for pixel data output. However, this approach is poorly suited to executing general-purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read-after-write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write/“export” path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data.

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and a unified L2 cache that services all operations (load, store and texture). The per-SM L1 cache is configurable to support both shared memory and caching of local and global memory operations. The 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electrodynamic simulations) can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved performance over direct access to DRAM.

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 9

Page 10: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Future Parallel Computing Architectures?

Not exactly sure what the hardware will look like . . . (CPU-style, GPU-style, something new?)

However: massively parallel programming required
• revival of vector computing
  → several/many FPUs performing the same operation
• hybrid/heterogeneous architectures
  → different kinds of cores; dedicated accelerator hardware
• different access to memory
  → cache and cache coherency
  → small amount of memory per core
• new restrictions
  → power efficiency, heat management, . . .

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 10

Page 11: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Vector Computing in Different Flavours

Vector Processors: (heyday in the 1980s)
• large number of ALUs (arithmetic logic units)
• vector arguments fed by pipelining
• discontinued in the early 1990s: “attack of the killer micros”
  → “massively parallel” clusters of conventional CPUs

SIMD (Single Instruction Multiple Data) Registers:
• same operations executed on pairs/triples of 2, 4, . . . operands (ideally in one clock cycle)
• vector instructions generated by the compiler (“vectorization”)
  or (guided by) the programmer (e.g., via “intrinsics” – see the sketch below)
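As an illustration of programmer-guided vectorization (not part of the slides; the function name and the loop assumptions – n divisible by 4, non-aliasing pointers – are made up for this sketch), AVX intrinsics process four double-precision operands per instruction:

#include <immintrin.h>

/* Hedged sketch: add two double arrays with AVX intrinsics, 4 doubles at a time.
 * Assumes n is a multiple of 4 and that the pointers do not alias.             */
void vec_add_avx(double *z, const double *x, const double *y, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d a = _mm256_loadu_pd(&x[i]);                /* load 4 doubles     */
        __m256d b = _mm256_loadu_pd(&y[i]);
        _mm256_storeu_pd(&z[i], _mm256_add_pd(a, b));      /* z[i..i+3] = x + y  */
    }
}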

GPUs: SIMT (“Single Instruction Multiple Thread”):
• one instruction chain imposed on multiple lightweight cores
  (via warp scheduler/dispatch unit)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 11

Page 12: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Part II

Performance Evaluation

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 12

Page 13: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Speed-Up

Definition:
• T(p): runtime on p processors
• speed-up S(p) quantifies the improvement factor in processing speed:

  S(p) := T(1) / T(p),   typically: 1 ≤ S(p) ≤ p

Absolute vs. Relative Speed-Up:
• absolute speed-up: the best sequential algorithm for the mono-processor system is compared to the best parallel algorithm for the multi-processor system
• relative speed-up: compare the same (parallel) algorithm on mono- and multi-processor systems

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 13

Page 14: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Parallel Efficiency

Definition:
• efficiency E(p) relates the speed-up S(p) to the number of processors p:

  E(p) := S(p) / p

• indicates the relative improvement in processing speed
• typically: 0 ≤ E(p) ≤ 1
• again: absolute vs. relative efficiency
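A small helper (not from the slides; names and the example timings are made up) that turns measured runtimes into speed-up and efficiency:

#include <stdio.h>

/* Illustration: compute speed-up and parallel efficiency from measured runtimes. */
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void)
{
    double t1 = 120.0, tp = 8.5;     /* example timings in seconds (made up) */
    int p = 16;
    printf("S(%d) = %.2f, E(%d) = %.2f\n",
           p, speedup(t1, tp), p, efficiency(t1, tp, p));
    return 0;
}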

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 14

Page 15: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Scalability

• reduction of execution time, if the number of processors is increased
• quantitative: speed-up or parallel efficiency
• strong scalability: increase the number of processors for a fixed problem size
• weak scalability: increase the number of processors and increase the problem size
• qualitative: is there an improvement at all?

“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”

(Seymour Cray)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 15

Page 16: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Amdahl’s Law

Assumptions:
• the program consists of a sequential part s, 0 ≤ s ≤ 1, which cannot be parallelised (synchronisation, data I/O, etc.)
• the parallelisable part, 1 − s, can be perfectly parallelised (perfect speed-up on an arbitrary number of processors)
• execution time for the parallel program on p processors:

  T(p) = s · T(1) + ((1 − s)/p) · T(1)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 16

Page 17: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Amdahl’s Law (2)

• resulting speed-up:

  S(p) = T(1) / T(p) = T(1) / ( s · T(1) + ((1 − s)/p) · T(1) ) = 1 / ( s + (1 − s)/p )

• consider an increasing number of processors:

  lim_{p→∞} S(p) = lim_{p→∞} 1 / ( s + (1 − s)/p ) = 1/s

• Amdahl’s law: the speed-up is bounded by S(p) ≤ 1/s
• message: any inherently sequential part will destroy scalability once the number of processors becomes big enough
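As a small numerical illustration (not from the slides; the sequential fraction is made up), the following snippet evaluates Amdahl’s formula for s = 0.05:

#include <stdio.h>

/* Illustration: Amdahl speed-up S(p) = 1 / (s + (1-s)/p) for s = 0.05.
 * Even with 10000 processors the speed-up stays below the bound 1/s = 20. */
int main(void)
{
    double s = 0.05;
    int procs[] = {1, 10, 100, 1000, 10000};
    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        printf("p = %5d  ->  S(p) = %6.2f\n", p, 1.0 / (s + (1.0 - s) / p));
    }
    printf("bound: 1/s = %.1f\n", 1.0 / s);
    return 0;
}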

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 17

Page 18: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Gustafson’s Law

Assumptions:
• Amdahl: the sequential part stays the same for increased problem size
• Gustafson: assume that any sufficiently large problem can be efficiently parallelised
• fixed-time concept:
  • the parallel execution time is normalised to T(p) = 1
  • this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
• execution time on the mono-processor:

  T(1) = σ + p · (1 − σ)

• thus: the sequential part of the total work gets smaller with increasing p

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 18

Page 19: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Gustafson’s Law (2)

• resulting speed-up (as T(p) = 1):

  S(p) = σ + p · (1 − σ) = p − σ · (p − 1)

• resulting parallel efficiency:

  E(p) = S(p) / p = σ/p + (1 − σ) → 1 − σ   (for p → ∞)

• more realistic: larger problem sizes if more processors are available; parallelisable parts typically increase
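A corresponding sketch (not from the slides; σ is made up) for the scaled speed-up:

#include <stdio.h>

/* Illustration: Gustafson's scaled speed-up S(p) = p - sigma*(p - 1)
 * and efficiency E(p) = S(p)/p for a non-parallelisable fraction sigma = 0.05. */
int main(void)
{
    double sigma = 0.05;
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  ->  S(p) = %7.2f,  E(p) = %.3f\n",
               p, p - sigma * (p - 1), (p - sigma * (p - 1)) / p);
    return 0;
}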

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 19

Page 20: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Compute-Bound vs. Memory-Bound Performance

Consider a memory-bandwidth-intensive algorithm:
• you can do a lot more flops than operands can be read from memory
• operational intensity (or arithmetic intensity) of a code: number of performed flops per accessed byte

Memory-Bound Performance:
• arithmetic intensity smaller than the critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses

Compute-Bound Performance:
• enough computational work to “hide” memory latency
• speedup only possible by reducing operations
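As a concrete illustration (not from the slides; the kernel is a generic vector update), consider y[i] = a*x[i] + y[i] in double precision: each iteration performs 2 flops and moves 24 bytes (load x[i], load y[i], store y[i]), i.e., roughly 1/12 flop per byte – clearly memory-bound on current machines:

/* Illustration: operational intensity of a DAXPY-like loop.
 * Per iteration: 2 flops (multiply + add) vs. 24 bytes of memory traffic
 * (8-byte load of x[i], 8-byte load of y[i], 8-byte store of y[i])
 * => intensity ≈ 2/24 = 1/12 flop/byte (ignoring cache effects).        */
void daxpy(double a, const double *x, double *y, long n)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}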

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 20

Page 21: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

The Roofline Model

[Figure: roofline plot — GFlop/s (log scale) vs. operational intensity [Flops/Byte] (log scale, 1/8 . . . 64); a sloped “peak stream bandwidth” roof with ceilings “without NUMA” and “non-unit stride”, and a flat “peak FP performance” roof with ceilings “without instruction-level parallelism” and “without vectorization”; example kernels marked: SpMV, 5-pt stencil, matrix mult. (100×100).]

[Williams, Waterman, Patterson, 2008]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 21

Page 22: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Roofline Model – Comments

Drawing the Roofline Model:
• bandwidth phase: if a is the arithmetic intensity (flops per byte), then a memory throughput of b GB/s leads to b·a GFlop/s;
  we use a log-log plot: log(b·a) = log b + log a = log b + x (with x := log a);
  hence, in the log-log plot the bandwidth line has unit slope
• if the arithmetic intensity is 1 and the memory throughput is b GB/s, the CPU executes b GFlop/s
  ⇒ we can read the bandwidth value of the machine from the bandwidth part of the roofline at a = 1

Calculating the Arithmetic Intensity:
• the roofline model typically only counts accesses to main memory (stressed by the term “operational” intensity); accesses to cache are ignored
• thus: improving cache use can increase the arithmetic intensity of a kernel
• the roofline model can be adapted to consider a specific cache level instead of main memory

“Ceilings for Performance”:
• reduced memory bandwidth due to a non-optimal access pattern (or similar)
  → lower available memory bandwidth for your kernel
• code cannot exploit instruction-level parallelism, vectorization, fused multiply-add instructions, etc.
  → lower available peak performance
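A minimal sketch of the resulting bound (not from the slides; the machine numbers are arbitrary placeholders, not measurements): attainable performance is the minimum of peak performance and bandwidth × operational intensity.

#include <stdio.h>

/* Illustration: roofline bound min(peak, bandwidth * intensity). */
double roofline(double peak_gflops, double bandwidth_gbs, double intensity)
{
    double mem_bound = bandwidth_gbs * intensity;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    double peak = 500.0, bw = 50.0;                 /* GFlop/s, GB/s (made up)    */
    double kernels[] = {1.0/12, 0.5, 4.0, 16.0};    /* intensities in flops/byte  */
    for (int i = 0; i < 4; i++)
        printf("intensity %6.3f -> %.1f GFlop/s\n",
               kernels[i], roofline(peak, bw, kernels[i]));
    return 0;
}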

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 22

Page 23: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Part III

Parallel Models

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 23

Page 24: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

The PRAM Model(s)

[Figure: PRAM model — processors P1, P2, P3, . . ., Pn under central control, all accessing a shared memory.]

Concurrent or Exclusive Read/Write Access:
EREW  exclusive read, exclusive write
CREW  concurrent read, exclusive write
ERCW  exclusive read, concurrent write
CRCW  concurrent read, concurrent write

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 24

Page 25: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Exclusive/Concurrent Read and Write Access

[Figure: four schematic memory-access patterns — exclusive read, concurrent read, exclusive write, concurrent write — showing processors accessing memory cells X1 . . . X6, X, Y.]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 25

Page 26: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Minimum Search on the PRAM

“Binary Fan-In”:

[Figure: binary fan-in tree computing the minimum of the array (5, 3, 8, 4, 7, 9, 6, 10) by pairwise comparisons over log n steps; result: 3.]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 26

Page 27: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Minimum on the PRAM – Implementation

MinimumPRAM( L: Array[1..n] ) : Integer {
  ! n assumed to be 2^k
  ! Model: EREW PRAM
  for i from 0 to k−1 do {
    for j from 1 to n by 2^(i+1) do in parallel
      if L[j+2^i] < L[j]
        then L[j] := L[j+2^i];
      end if;
  }
  return L[1];
}

Complexity: T(n) = Θ(log n) on n/2 processors
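For comparison (not part of the slides; 0-based indexing and the function name are choices for this sketch), the same binary fan-in pattern in plain C, with the PRAM-parallel inner loop executed sequentially:

#include <assert.h>

/* Illustration: binary fan-in minimum in plain C. On a PRAM, every iteration
 * of the inner loop would run on its own processor; here it only shows the
 * index pattern. n must be a power of two.                                   */
int minimum_fan_in(int L[], int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (int stride = 1; stride < n; stride *= 2)      /* stride = 2^i          */
        for (int j = 0; j < n; j += 2 * stride)        /* "in parallel" on PRAM */
            if (L[j + stride] < L[j])
                L[j] = L[j + stride];
    return L[0];
}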

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 27

Page 28: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Conventions for PRAM Programs (Lockstep Execution of Loops and If-Statements)

Lockstep Execution of parallel for:
• Parallel for-loops (i.e., with the extension in parallel) are executed “in lockstep”.
• Any instruction in a parallel for-loop is executed at the same time (and “in sync”) by all involved processors.
• If an instruction consists of several substeps, all substeps are executed in sync.
• If an if-then-else statement appears in a parallel for-loop, all processors first evaluate the comparison at the same time. Then, all processors on which the condition evaluates as true execute the then branch. Finally, all processors on which the condition evaluates to false execute the else branch.

Lockstep Example:

for i from 1 to n do in parallel {
  if U[i] > 0
    then F[i] := (U[i]−U[i−1]) / dx
    else F[i] := (U[i+1]−U[i]) / dx
  end if
}

• First, all processors perform the comparison U[i] > 0.
• All processors where U[i] > 0 then compute F[i]; note that first all processors read U[i] and then all processors read U[i−1] (substeps!); hence, there is no concurrent read access!
• Afterwards, the else part is executed in the same manner by all processors with U[i] <= 0.

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 28

Page 29: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Parallel External Memory – Memory Scheme

[Figure: parallel external memory scheme — several CPUs, each with a private local cache (M words per cache), connected to a large external shared memory.]

[Arge, Goodrich, Nelson, Sitchinava, 2008]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 29

Page 30: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Parallel External Memory – History

Extension of the classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• CPU can only access a smaller working memory (cache, main memory, etc.) of M words each
• both organised as cache lines of size L words
• algorithmic complexity determined by memory transfers

Extension of the PRAM:
• multiple CPUs access a global shared memory (but locally distributed)
• EREW, CREW, CRCW classification (for local and external memory)
• similar programming model (e.g., synchronised execution)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 30

Page 31: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Parallel External Memory – Comments

Loads and Stores to/from the Cache → two options:
• “out of core” style: assume that it is possible to explicitly control which variables are loaded into cache and (not) evicted from cache
  → the algorithm can specify the cache behavior
• “cache oblivious” style: assume that the cache is intelligent and can even look into the future – variables will only be evicted if they are no longer used; if the cache needs to evict variables for capacity reasons, it will evict those variables that lead to the fewest overall loads & stores

Memory Complexity in the Parallel External Memory Model:
• count the number of cache-line transfers between memory and cache
• underlying assumption: memory access determines the execution time
• thus: prerequisite for applying the roofline model

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 31

Page 32: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Interconnection Networks

• multiple CPUs with private memory
• CPUs connected via an interconnection network
• new: the topology of the network is explicitly considered

Example: 1D mesh (linear array) of processors:

[Figure: processors P1, P2, . . ., P8 connected in a linear chain; each Pi is a CPU with private memory.]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 32

Page 33: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

2D Processor Mesh (Array)

[Figure: 5×5 processor mesh with processors P1,1 . . . P5,5 connected to their horizontal and vertical neighbours.]

Problem: Broadcast
• information transported by at most 1 processor per step
• phase 1: propagate along the first row
• phase 2: propagate along the columns

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 33

Page 34: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Broadcast on the 2D Mesh – Implementation

Broadcast2Dmesh( P[1,1]:X, n ) {
  ! Model: 2D mesh P[i,j] with n*n processors
  input: P[1,1]:X   ! element to be broadcast
  for j from 1 to n−1 do
    P[1,j+1]:X <<< P[1,j]:X
  for i from 1 to n−1 do
    for P[i,j]: 1<=j<=n do in parallel
      P[i+1,j]:X <<< P[i,j]:X
    end in parallel
  ! value of X is now available on each processor:
  output: P[i,j]:X, range 1<= i,j <= n
}

Time complexity: 2n−2 steps on an n×n mesh of processors
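For comparison with the MPI slides later on (not part of the original slides; it assumes n*n MPI processes with a row-major rank layout rank = i*n + j and a single double value x that initially lives on rank 0), a rough MPI sketch of the same two-phase broadcast with point-to-point messages only:

#include "mpi.h"

/* Hedged sketch: two-phase broadcast over an n x n process mesh.
 * Phase 1 passes x along the first row, phase 2 passes it down each column. */
void broadcast_2d_mesh(double *x, int n, int rank)
{
    int i = rank / n, j = rank % n;
    if (i == 0) {                                   /* phase 1: first row       */
        if (j > 0)
            MPI_Recv(x, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (j < n - 1)
            MPI_Send(x, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    }
    if (i > 0)                                      /* phase 2: down the column */
        MPI_Recv(x, 1, MPI_DOUBLE, rank - n, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (i < n - 1)
        MPI_Send(x, 1, MPI_DOUBLE, rank + n, 0, MPI_COMM_WORLD);
}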

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 34

Page 35: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Interconnection-Network Programs (Conventions for Execution)

Notation for variables:
• P[i,j]:X stresses that X is a local variable on processor P[i,j]
  → each processor has a distinct X (exactly one element of the array)
  → the array is stored in a distributed fashion on all processors
• in comparison to the PRAM: there X would need to be an array X[1..n]

Input/Output Configurations:
• output: P[i,j]:X, range 1<= i,j <= n means that for the given range of processor indices the processor-local variables X contain the (distributed) output
• input: P[i,k]: A,B,C, range P[i,k]: 1<=i,k<=p (compare the matrix multiplication example) means that the matrices A, B, and C are stored distributed on the given range of processors; each processor holds only one element (or block) of A, B, C in its local memory

Explicit Data Transfer (Send/Receive):
• P[i+1,j]:X <<< P[i,j]:X means that the content of the local variable X on processor P[i,j] is copied (transferred) into the local variable X on processor P[i+1,j]
  → in an MPI program, this might correspond to a pair of send and receive calls

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 35

Page 36: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Bulk Synchronous Parallelism

• suggested as a “bridging model” between software and hardware aspects
• hardware model:
  • multiple CPUs with private memory
  • CPUs connected via a point-to-point network

[Figure: several CPUs connected by a point-to-point network.]

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 36

Page 37: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Bulk Synchronous Parallelism – Execution Model

Computation organised into a sequence of “Super Steps”:
1. each CPU executes a sequence of operations (on local data, synchronised from the last super step)
2. CPUs send and receive point-to-point messages (no broadcasts or send/receive to/by multiple CPUs allowed)
3. synchronisation at a barrier

Goal:
• estimate the time for steps 1, 2, 3 based on CPU speed γ, bandwidth β, and latency λ
• estimated execution time thus: T = s · (n · γ + m · β + λ) for s super steps with n operations and m sent/received bytes per step
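A tiny illustration of this cost estimate (not from the slides; parameter names are just for this sketch):

/* Illustration: BSP cost estimate T = s * (n*gamma + m*beta + lambda).
 * gamma: time per operation, beta: time per transferred byte,
 * lambda: latency/barrier cost per super step.                        */
double bsp_time(int s, double n_ops, double m_bytes,
                double gamma, double beta, double lambda)
{
    return s * (n_ops * gamma + m_bytes * beta + lambda);
}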

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 37

Page 38: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Summary: Parallel Algorithmic Models

Each model highlights specific properties of the hardware:
• limits to parallelism (number of processors) → PRAM
• influence of (parallel) caches → Parallel External Memory
• compute- or memory-bound performance → Roofline
• transfer of messages → Interconnection Networks
• computation vs. communication time → BSP

“All models are wrong – some of them are useful”
• more than one model is required to analyze the performance of a parallel algorithm on real machines
• pick the model that best fits an anticipated bottleneck

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 38

Page 39: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Part IV

Parallel Languages

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 39

Page 40: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

OpenMP

• shared-memory application programming interface (API)
• extends sequential programs by directives that help the compiler generate parallel code
• available for Fortran and C/C++
• fork-join model: programs are executed by a team of cooperating threads
• memory is shared between threads, except for a few private variables

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 40

Page 41: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Matrix-Vector Product in OpenMP

void mvp(int m, int n, double* restrict y,
         double** restrict A, double* restrict x)
{
  int i, j;
  #pragma omp parallel for default(none) \
          shared(m,n,y,A,x) private(i,j)
  for (i=0; i<m; i++) {
    y[i] = 0.0;
    for (j=0; j<n; j++) {
      y[i] += A[i][j]*x[j];
    }
  } /*-- end of omp parallel for --*/
}

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 41

Page 42: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

OpenMP Directives

OpenMP directives are inserted as #pragma:

#pragma omp parallel for default(none) \
        shared(m,n,y,A,x) private(i,j)

Advantages:
• directives will be ignored by compilers that do not support OpenMP
• sequential and parallel program in the same code!
• incremental programming approach possible (add parallel code sections as required)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 42

Page 43: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

OpenMP’s Memory Model

#pragma omp parallel ... default(none) \
        shared(m,n,y,A,x) private(i,j)

• memory is usually shared between threads – here: matrix and vectors
• however, certain variables are private: loop variables (here: indices), temporary variables, etc.
• if not specified, default settings apply – here: default(none) to switch off all default settings
• the programmer is responsible for sorting out concurrent accesses! (even if default settings are used)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 43

Page 44: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Pthreads

• standardised programming interface according to POSIX (Portable Operating System Interface)
• the thread model is a generalised version of the UNIX process model (forking of threads on a shared address space)
• scheduling of threads to CPU cores is done by the operating system

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 44

Page 45: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Pthreads – Typical Methods

• pthread_create(...): the (main) thread creates a further thread that will start by executing a specified function
• pthread_join(...): the thread will wait until a specified thread has terminated (useful for synchronisation)
• pthread_cancel(...): cancel another thread
• functions to synchronise data structures:
  mutex: mutually exclusive access to data structures;
  cond: wait for or signal certain conditions
• functions to influence scheduling
• etc.

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 45

Page 46: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Programming Patterns

Pthreads allow arbitrary coordination of threads; however, certain programming patterns are common, e.g.:
• Master-Slave model:
  the master thread controls program execution and parallelisation by delegating work to slave threads
• Worker model:
  threads are not hierarchically organised, but distribute/organise the operations between themselves (example: jobs retrieved from a work pool)
• Pipelining model:
  threads are organised via input/output relations: certain threads provide data for others, etc.

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 46

Page 47: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication

#include <pthread.h>

typedef struct {
  int size, row, column;
  double (*MA)[8], (*MB)[8], (*MC)[8];
} matrix_type_t;

void thread_mult(matrix_type_t *work) {
  int i, row = work->row, col = work->column;
  work->MC[row][col] = 0.0;
  for (i=0; i<work->size; i++)
    work->MC[row][col] +=
        work->MA[row][i] * work->MB[i][col];
}

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 47

Page 48: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication (cont.)

void main() {
  double MA[8][8], MB[8][8], MC[8][8];
  pthread_t thread[8*8];
  for (int row=0; row<8; row++)
    for (int col=0; col<8; col++) {
      matrix_type_t *work = (matrix_type_t *) malloc( /* ... */ );
      work->size = 8; work->row = row; work->column = col;
      work->MA = MA; work->MB = MB; work->MC = MC;
      pthread_create(&(thread[col+8*row]), NULL,
                     (void*) thread_mult, (void*) work);
    };
  for (int i=0; i<8*8; i++) pthread_join(thread[i], NULL);
}

(example from: Rauber & Rünger: Parallele Programmierung)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 48

Page 49: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Java Threads

• object-oriented language design explicitly includes threads
• class Thread to represent threads – can be extended to implement customised threads (inherits start(), run() methods, etc.)
• interface Runnable: classes that implement Runnable, i.e., provide a method run(), can be used to create a thread:

  Thread th = new Thread(runnableObject);

• keyword synchronized for methods that should be treated as a critical region

• methods wait() and notify() in class Object

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 49

Page 50: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication

class MatMult extends Thread {
  static int a[][], b[][], c[][], n=3;
  int row;
  MatMult(int row) {
    this.row = row; this.start();
  }
  public void run() {
    for (int i=0; i<n; i++) {
      c[row][i] = 0;
      for (int j=0; j<n; j++)
        c[row][i] = c[row][i] + a[row][j]*b[j][i];
    };
  } /* class MatMult t.b.c. */

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 50

Page 51: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication (cont.)

  public static void main() {
    a = new int[n][n]; b = new int[n][n]; c = new int[n][n];
    /* ... initialise a,b,c ... */
    MatMult mat[] = new MatMult[n];
    for (int i=0; i<n; i++)
      mat[i] = new MatMult(i);
    try {
      for (int i=0; i<n; i++) mat[i].join();
    } catch (Exception E) { /* ... */ };
  }
}

(cmp. example in Rauber & Rünger: Parallele Programmierung)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 51

Page 52: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Unified Parallel C (UPC)

• extension of C (based on the ISO C99 standard)
• based on a distributed shared memory model: physically distributed memory with a shared address space
• PGAS language: “partitioned global address space”
• single program, multiple data: every program is executed in parallel on a specified number of threads
• variables are private by default, but can be declared as shared
• consistency model can be varied: strict vs. relaxed

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 52

Page 53: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication

Declaration of variables:

shared [N*N/THREADS] int a[N][N];
shared [N/THREADS]   int b[N][N];
shared [N*N/THREADS] int c[N][N];

Variables have an affinity towards threads:
• block-cyclic distribution of variables to threads
• a and c declared with a block size of N*N/THREADS
  → block-oriented distribution of rows to threads
• b declared with a block size of N/THREADS
  → block-oriented distribution of columns to threads

affinity can reflect the physical distribution of data

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 53

Page 54: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Example: Matrix Multiplication (cont.)

Code excerpt for matrix multiplication:

upc_forall (i=0; i<N; i++; &a[i][0])
  /* &a[i][0] specifies that the iteration will be executed by
     the thread that has affinity to a[i][0] */
  for (j=0; j<N; j++) {
    c[i][j] = 0;
    for (l=0; l<N; l++) c[i][j] += a[i][l]*b[l][j];
  };
upc_barrier;

(source: Rauber & Rünger: Parallele Programmierung)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 54

Page 55: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Further PGAS Languages

• Co-Array Fortran (CAF)→ will become part of the next Fortran standard

• Titanium (similar to UPC, but for Java)• X10: extension of Java;

globally asynchronous, locally synchronous: add “places”that execute threads

• Chapel• Fortress

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 55

Page 56: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Further Example: Intel PBB

“Intel Parallel Building Blocks”:
• language extensions & libraries for C/C++
• Intel Cilk Plus: language extension for simple loop- & task-oriented parallelism for C/C++
• Intel Threading Building Blocks: C++ template library to support task parallelism
• Intel Array Building Blocks: C++ template library to support vector parallelism

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 56

Page 57: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Examples – Intel Cilk Plus

Intel Cilk:

void mergesort(int a[], int left, int right) {
  if (left < right) {
    int mid = (left+right)/2;
    cilk_spawn mergesort(a, left, mid);
    mergesort(a, mid, right);
    cilk_sync;
    merge(a, left, mid, right);
  }
}

(source: Intel)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 57

Page 58: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Examples – Intel ArBB

Intel ArBB:

void
matvec_product(const dense<f32, 2>& matrix,
               const dense<f32>& vector,
               dense<f32>& result)
{
  result = add_reduce(matrix
             * repeat_row(vector, matrix.num_rows()));
}

(source: Intel)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 58

Page 59: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

Message Passing – MPI

Abstraction of a distributed-memory computer:
• an MPI run consists of a set of processes with separate memory spaces
• processes can exchange data by sending messages
• no explicit view of the network topology required

SIMD/MIMD?? . . . → MPMD/SPMD
• processes can run different programs (“codes”)
  → multiple program multiple data (MPMD)
• more common (and simpler): processes run instances of the same program (“code”)
  → single program multiple data (SPMD)

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 59

Page 60: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

MPI Example: “Hi there” . . .

#include "mpi.h"
int main( int argc, char **argv )
{
  int myrank;
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  if (myrank == 0)
    send_a_message();
  else if (myrank == 1)
    receive_a_message();
  MPI_Finalize();
}

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 60

Page 61: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

MPI Example: “Hi there” . . . (cont.)

void send_a_message() {
  char message[40];
  strcpy(message, "Mr. Watson, come here, I want you.");
  MPI_Send(message, strlen(message)+1, MPI_CHAR, 1,
           110, MPI_COMM_WORLD);
}
void receive_a_message() {
  char message[40];
  MPI_Status status;
  MPI_Recv(message, 40, MPI_CHAR, 0,
           110, MPI_COMM_WORLD, &status);
  printf( "received: %s:\n", message );
}

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 61

Page 62: HPC – Algorithms and Applications · (source: Intel/Raj Hazra – ISC’14 keynote presentation) Michael Bader: HPC – Algorithms and Applications Fundamentals – Parallel Architectures,

TUM – SCCS

References (Languages)

• T. Rauber, G. Rünger: Parallele Programmierung, Springer, 2007 (2. Aufl.)
• B. Chapman, G. Jost, R. van der Pas: Using OpenMP – Portable Shared Memory Parallel Programming, MIT Press, 2008
• Intel Parallel Building Blocks:
  http://software.intel.com/en-us/articles/intel-parallel-building-blocks/

Michael Bader: HPC – Algorithms and Applications

Fundamentals – Parallel Architectures, Models, and Languages, Winter 2015/2016 62