
Page 1:

Introduction to parallel computers and parallel programming

Introduction to parallel computers and parallel programming – p. 1

Page 2:

Content

A quick overview of modern parallel hardware

Parallelism within a chip:
Pipelining
Superscalar execution
SIMD
Multiple cores

Parallelism within a node:
Multiple sockets
UMA vs. NUMA

Parallelism across multiple nodes

A very quick overview of parallel programming

Note: The textbook [Quinn 2004] is outdated on parallel hardware; please read instead https://computing.llnl.gov/tutorials/parallel_comp/

Page 3:

First things first

CPU—central processing unit—is the “brain” of a computer

CPU processes instructions, many of which require data transfers from/to the memory on a computer

CPU integrates many components (registers, FPUs, caches...)

CPU has a “clock”, which at each clock cycle synchronizes the logicunits within the CPU to process instructions

Page 4:

An example of a CPU

Block diagram of an Intel Xeon Woodcrest CPU (core)

Page 5:

Instruction pipelining

Suppose every instruction has five stages (each needs one cycle)

Without instruction pipelining

With instruction pipelining

Page 6:

Superscalar execution

Multiple execution units ⇒ more than one instruction can finish per cycle

An enhanced form of instruction-level parallelism

Page 7:

Data

Data are stored in computer memory as sequences of 0s and 1s

Each 0 or 1 occupies one bit

8 bits constitute one byte

Normally, in the C language:
char: 1 byte
int: 4 bytes
float: 4 bytes
double: 8 bytes

Bandwidth—the speed of data transfer—is measured as number ofbytes transferred per second

Page 8:

SIMD

SISD: single instruction stream single data stream

SIMD: single instruction stream multiple data streams

Page 9:

An example of a floating-point unit

FP unit on a Xeon CPU

Page 10:

Multicore processor

Modern hardware technology can put several independent CPUs (called“cores”) on the same chip—a multicore processor

Intel Xeon Nehalem quad-core processor

Page 11:

Multi-threading

Modern CPU cores often have threading capability

Hardware support for multiple threads to be executed within a core

However, threads have to share the resources of a core:
computing units
caches
translation lookaside buffer

Page 12:

Vector processor

Another approach, different from multicore (and multi-threading)

Massive SIMD
Vector registers
Direct pipes into main memory with high bandwidth

Used to dominate high-performance computing hardware, but is now only a niche technology

Page 13:

Multi-socket

Socket—a connection on the motherboard into which a processor is plugged

Modern computers often have several sockets
Each socket holds a multicore processor
Example: Nehalem-EP (2×socket, quad-core CPUs, 8 cores in total)

For historical reasons, the OS maps each core to its own concept of a "CPU"

Page 14:

Shared memory

Shared memory: all CPU cores can access all memory as a global address space

Traditionally called “multiprocessor”

Two common definitions of SMP (dependent on context):
1. SMP (shared-memory processing)—multiple CPUs (or multiple CPU cores) share access, not necessarily at the same access speed, to the entire memory available on a single system
2. SMP (symmetric multi-processing)—multiple CPUs (or multiple CPU cores) have the same speed and latency to a shared memory

Page 15:

UMA

UMA—uniform memory access, one type of shared memory

Another name for symmetric multi-processing

Dual-socket Xeon Clovertown CPUs

Page 16:

NUMA

NUMA—non-uniform memory access, another type of shared memory

Several symmetric multi-processing units are linked together

Each core should access its closest memory unit as much as possible

Dual-socket Xeon Nehalem CPUs

Page 17:

Cache coherence

Important for shared-memory systems

If one CPU core updates a value in its private cache, all the othercores “know” about the update

Cache coherence is accomplished by hardware

Page 18:

“Competition” among the cores

Within a multi-socket multicore computer, some resources are shared

Within a socket, the cores share the last-level cache

The memory bandwidth is also shared to a great extent

Actual memory bandwidth measured on a 2×socket quad-core Xeon Harpertown:

# cores    1          2          4          6          8
BW         3.42 GB/s  4.56 GB/s  4.57 GB/s  4.32 GB/s  5.28 GB/s

Page 19:

Distributed memory

The entire memory consists of several disjoint parts

A communication network is needed in between

There is not a single global memory space

A CPU (core) can directly access its own local memory

A CPU (core) cannot directly access a remote memory

A distributed-memory system is traditionally called a "multicomputer"

Page 20:

Comparing shared memory and distributed memory

Shared memory:
User-friendly programming
Data sharing between processors
Not cost effective
Synchronization needed

Distributed memory:
Memory is scalable with the number of processors
Cost effective
Programmer responsible for data communication

Page 21:

Hybrid memory system

[Figure: three compute nodes, each containing two multicore sockets (four cores sharing a cache per socket) connected by a bus to the node's local memory; the compute nodes are linked by an interconnect network]

Page 22:

Different ways of parallel programming

Threads model using OpenMP:
Easy to program (inserting a few OpenMP directives)
Parallelism "behind the scene" (little user control)
Difficult to scale to many CPUs (NUMA, cache coherence)

Message passing model:
Many programming details (MPI or PVM)
Better user control (data & work decomposition)
Larger systems and better performance

Stream-based programming (for using GPUs)

Some special parallel languages:
Co-Array Fortran, Unified Parallel C, Titanium

Hybrid parallel programming

Page 23:

Designing parallel programs

Determine whether or not the problem is parallelizable

Identify "hotspots":
Where are most of the computations?
Parallelization should focus on the hotspots

Partition the problem

Insert collaboration (unless embarrassingly parallel)

Page 24:

Partitioning

Break the problem into “chunks”

Domain decomposition (data decomposition)

Functional decomposition

Page 25:

Examples of domain decomposition

Page 26:

Collaboration

Communication:
Overhead depends on both the number and size of messages
Overlap communication with computation, if possible
Different types of communications (one-to-one, collective)

Synchronization:
Barrier
Lock & semaphore
Synchronous communication operations

Page 27:

Load balancing

Objective: idle time is minimized

Important for parallel performance

Equal partitioning of work (and/or data)

Dynamic work assignment may be necessary

Page 28:

Granularity

Computations are typically separated from communications by synchronization events

Granularity: ratio of computation to communication

Fine-grain parallelism:
Individual tasks are relatively small
More overhead incurred
Might be easier for load balancing

Coarse-grain parallelism:
Individual tasks are relatively large
Advantageous for performance due to lower overhead
Might be harder for load balancing

Page 29:

Exercises

Problem 1 of INF3380 written exam V10

Problem 1 of INF3380 written exam V11

Exercise 1.5 from the textbook [Quinn 2004]

Let Tpipe (found in the above exercise) denote the time required to process n tasks using an m-stage pipeline, and let Tseq denote the time needed to process n tasks without pipelining. How large should n be in order to get a speedup of p?

Exercise 1.10 from the textbook [Quinn 2004]
