Introduction to Parallel Computing
Introduction, problems, models, performance

Jesper Larsson Träff <[email protected]>
TU Wien, Parallel Computing
WS17/18
Practical Parallel Computing

Mobile phones: "dual core", "quad core" (2012?), … octa-core (2016, Samsung Galaxy S7)

"Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S." (2007)

June 2011: Fujitsu K, 705,024 cores, 11 PFLOPS
June 2012: IBM BlueGene/Q, 1,572,864 cores, 16 PFLOPS
June 2016: Sunway TaihuLight, 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Parallelism everywhere! How to use all these resources? Limits?
As of ca. 2010: (almost) no sequential computer systems anymore: multi-core, GPU/accelerator enhanced, …

Why?
As of 2010, either (a) all applications must be parallelized, or (b) there must be enough independent applications that can run at the same time.

Challenge: How do I speed up Windows…?
"I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind."
From: "7 Myths About Quad-Core Phones" (SmartphonesUnlocked), Jessica Dolcourt, April 8, 2012

"The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency."
Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005
…many similar quotations can (still) be found
“Free lunch” (as in “There is no such thing as a free lunch”):
Exponential increase in single-core performance (18-24 month doubling rate, Moore’s “Law”), no software changes needed to exploit the faster processors
"The free lunch" was over ca. 2005:
• Instructions/clock limit reached in the late 1990s
• Power consumption limit reached from 2000
• Clock speed limit reached around 2003

But: the number of transistors per chip can continue to increase (>1 billion).

Solution(?): Put more cores on the chip: the "multi-core revolution".

[Figure: Kunle Olukotun (Stanford), ca. 2010]

Parallelism challenge: Solve a problem p times faster on p (slower, more energy-efficient) cores.
[Figure: processor trend data; Kunle Olukotun, 2010, and Karl Rupp (TU Wien), 2015]
New(er) server processors (2017):
• Intel "Skylake": 4-28 cores
• AMD "Naples": 8-32 cores
Single-core performance does still increase, somewhat, but much more slowly… [Figure: average, normalized SPEC benchmark numbers, see www.spec.org; Henk Poley, 2014, www.preshing.com]
2006: a factor 10³-10⁴ increase in integer performance over the 1978 high-performance processor (from Hennessy/Patterson, Computer Architecture).

Single-core processor development (the "free lunch") made life very difficult for parallel computing in the 1990s.
Increase in clock frequency ("technological advances") alone (factor 20-200?) does not explain the performance increase. Architectural ideas driving the sequential performance increase:

• Deep pipelining
• Superscalar execution (>1 instruction/clock) through multiple functional units
• Data parallelism through SIMD units (old HPC idea: vector processing), same instruction on multiple data per clock
• Out-of-order execution, speculative execution
• Branch prediction
• Caches (pre-fetching)
• Simplified/better instruction sets (better for the compiler)
• SMT/hyperthreading

Mostly fully transparent; at most the compiler needs to care: the "free lunch".
• Very diverse parallel architectures
• No single, commonly agreed upon abstract model (for designing and analyzing algorithms)
• Many different programming paradigms (models), different programming frameworks (interfaces, languages)

Has been so; is still so (largely).
Theoretical Parallel Computing
Parallelism was always there! What can be done? Limits?
Assume p processors instead of just one, reasonably connected (memory, network, …)
• How much faster can some given problem be solved? Can some problems be solved better?
• How? New algorithms? New techniques?
• Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
• Which assumptions are reasonable?
• Does parallelism give new insights into the nature of computation?
Sequential computing:

Algorithm in model (RAM)
↓
Concrete program (C, C++, Java, Haskell, Fortran, …)
↓
Concrete architecture

This is difficult, but analysis in the model (often) has some relation to concrete execution. "Gap": the standard model (RAM) may abstract from many concrete features of the "real" architecture.
vs. Parallel computing:

Algorithm in model A / model B / … / model Z
↓
Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …)
↓
Concrete architecture A, …, concrete architecture Z

This is extremely difficult; analysis in a model may have little relation to concrete execution. Huge gap: no "standard" abstract model (like the RAM).

Challenges:
• Algorithmic: Not all problems seem to be easily parallelizable
• Portability: Support for the same language/interface on different architectures (e.g., MPI in HPC)
• Performance portability (?)
Elements of parallel computing:
• Algorithmic: Find the parallelism
• Linguistic: Express the parallelism
• Practical: Validate the parallelism (correctness, performance)
• Technical: Run-time/compiler support for the paradigm
Why study parallel computing?

Parallel computing: Accomplish something with a coordinated set of processors under control of a parallel algorithm.

• It is inevitable: multi-core revolution, GPGPU paradigm, …
• It's interesting, challenging, highly non-trivial, full of surprises
• It is a key discipline of computer science (von Neumann; golden theory decade: 1980 to the early 1990s)
• It's ubiquitous (gates; architecture: pipelines, ILP, TLP; systems: operating systems, software), not always transparent
• It's useful: large, extremely computationally intensive problems, Scientific Computing, HPC
• …
Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands.

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Focus is on properties of the solution (time, size, energy, …) to a given, individual problem.
Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.
Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …
Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …
Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously.

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, process calculi, CSP, CCS, pi-calculus, lock/wait-freeness, …
Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness
Parallel vs. Concurrent computing

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve a given computational problem). Several processes interact through a shared resource: memory (locks, semaphores, data structures), a device, …

(Adapted from Madan Musuvathi)
Parallel computing: Focus on dividing a given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors, in coordination (synchronization, communication).
The "problem" of parallelization

How to divide a given problem (specification, algorithm, data) into subproblems that can be solved in parallel?

• How is the computation divided? Is coordination necessary? Does the sequential algorithm help/suffice?
• Where are the data? Is communication necessary?
Aspects of parallelization
• Algorithmic: How to divide computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
• Scheduling/Mapping: How are independent parts of the computation assigned to processors?
• Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?
• Communication: When must processors communicate? How?
• Synchronization: When must processors agree/wait?
Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?
Architectural: Which kinds of parallel machines can be realized? How do they look?
Levels of parallelization:
• Architecture: gates, logical and functional units (not here)
• Computational: Instruction-Level Parallelism (ILP) (not here)
• SIMD: explicit vector processing
• Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors: "parallel computing" (this lecture)
• Large-scale: coarse-grained tasks, multi-level, coupled application parallelism (not here)
Automatic parallelization(?)

Can't we just leave it all to the compiler/hardware? From a "high-level" problem specification or a sequential program to efficient code that can be executed on a parallel, multi-core processor:

• Successful only to a limited extent
• Compilers cannot invent a different algorithm
• Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2012
Explicitly parallel programming (today):
• Explicitly parallel code in some parallel language
• Support from parallel libraries
• (Domain-specific languages)
• The compiler does as much as a compiler can do

Lots of interesting problems and tradeoffs; an active area of research.
Some „given, individual problems“ for this lecture
Problem 1:
Matrix-vector multiplication: Given an n×m matrix A and an m-element vector v, compute the n-element vector u = Av
u[i] = ∑j A[i,j]·v[j]

Dimensions n, m large: obviously some parallelism. How to compute the sums ∑ in parallel? Access to the vector?
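For concreteness, a minimal sequential C sketch of this computation (function and variable names are illustrative, not from the lecture); the row loop is the obvious candidate for parallelization, while the inner sum is the part that needs care:

    /* u = A*v for an n x m matrix A (row-major) and an m-element vector v */
    void matvec(int n, int m, const double *A, const double *v, double *u)
    {
        for (int i = 0; i < n; i++) {    /* rows are independent: parallelizable */
            double sum = 0.0;
            for (int j = 0; j < m; j++)  /* a reduction: needs care in parallel */
                sum += A[(size_t)i*m + j] * v[j];
            u[i] = sum;
        }
    }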
Problem 1a:
Matrix-matrix multiplication: Given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB

C[i,j] = ∑k A[i,k]·B[k,j]
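Again as an illustrative sequential baseline (a sketch, not the lecture's reference code; the summation index is written l to avoid a clash with the dimension k):

    /* C = A*B, A is n x k, B is k x m, all row-major */
    void matmul(int n, int k, int m,
                const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {  /* the n*m entries of C are independent */
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += A[(size_t)i*k + l] * B[(size_t)l*m + j];
                C[(size_t)i*m + j] = sum;
            }
    }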
Problem 1b:
Solving sets of linear equations: Given a matrix A and a vector b, find x such that Ax = b.

Preprocess A such that a solution to Ax = b can easily be found for any b (LU factorization, …).
Problem 1c:
Sparse matrix-matrix multiplication: Given a sparse n×k matrix A and a sparse k×m matrix B, compute the n×m matrix product C = AB
C[i,j] = ∑k A[i,k]·B[k,j]
Problem 2:
Stencil computation (5-point stencil in 2d): Given an n×m matrix A, update as follows, and iterate until some convergence criterion is fulfilled, with suitable handling of the matrix border:

    iterate {
        for all (i,j) {
            A[i,j] <- (A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1])/4
        }
    } until (convergence)

Looks well-behaved, "embarrassingly parallel"? Data distribution? Conflicts on updates?

Use: Discretization and solution of certain partial differential equations (PDEs); image processing; …
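A sequential C sketch of one Jacobi-style sweep of this stencil, using a second buffer to avoid the update conflicts hinted at above (double buffering is one possible choice, not the only one; all names are illustrative):

    #include <math.h>

    /* One sweep over the interior of an n x m grid: Anew <- stencil(A).
       Returns the maximum pointwise change, usable as a convergence criterion. */
    double sweep(int n, int m, const double *A, double *Anew)
    {
        double maxdiff = 0.0;
        for (int i = 1; i < n-1; i++)      /* border rows/columns kept fixed here */
            for (int j = 1; j < m-1; j++) {
                double v = (A[(i-1)*m+j] + A[(i+1)*m+j]
                          + A[i*m+(j-1)] + A[i*m+(j+1)]) / 4.0;
                double d = fabs(v - A[i*m+j]);
                if (d > maxdiff) maxdiff = d;
                Anew[i*m+j] = v;
            }
        return maxdiff;
    }

The iteration then alternates the roles of A and Anew until maxdiff falls below a threshold; with two buffers, all (i,j) updates within a sweep are independent and could be done in parallel.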
Problem 3:
Merge two sorted arrays of size n and m into a sorted array of size n+m.

Easy to do sequentially, but the sequential algorithm looks… sequential.

Problem 3a: Sort by merging
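For reference, a sketch of the standard sequential two-pointer merge; each iteration depends on the outcome of the previous comparison, which is what makes it look so stubbornly sequential:

    /* Merge sorted a[0..n-1] and b[0..m-1] into out[0..n+m-1] */
    void merge(const int *a, int n, const int *b, int m, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < n && j < m)            /* each step depends on the previous */
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < n) out[k++] = a[i++];  /* copy leftovers */
        while (j < m) out[k++] = b[j++];
    }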
Problem 4:
Computation of all prefix-sums: Given an array A of size n of elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix-sums in array B:

B[i] = ∑0≤j<i A[j]

This implies a solution to the problem of computing just the total sum ∑. Parallel?

A: 1 2 3 4 5 6 7 …
B: x 1 3 6 10 15 21 28 … (all prefix-sums)
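A sequential sketch of the prefix-sums as defined above (here with + on integers, taking 0 as the "empty" first entry; the loop-carried dependence on the running sum is the obstacle to naive parallelization):

    /* Exclusive prefix-sums: B[i] = A[0] + ... + A[i-1] */
    void prefix_sums(const int *A, int *B, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            B[i] = sum;    /* sum of all elements strictly before index i */
            sum += A[i];   /* loop-carried dependence */
        }
    }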
Problem 5:
Sorting a sequence of objects ("reals", integers, objects with an order relation) stored in an array.

Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?
Parallel computing as a (theoretical) CS discipline
(Traditional) objective: Solve a given computational problem faster.

• Faster than what?
• What is a reasonable "model" for parallel computing?
• How fast can a given problem be solved? How many resources can be productively exploited?
• Are there problems that cannot be solved in parallel? Fast? At all?
• …
Architecture, machine, computing, programming models
Architecture/machine model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.

Computational/execution model: (Formal) framework for the design and analysis of algorithms for the computational system; cost model.
Main example: The RAM (Random-Access Machine)

[Diagram: processor P connected to memory M]

A processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory.

First assumption: unit cost for all operations (execution of instructions, access to memory).
Instructions:
• Memory: load, store
• Arithmetic (integer, real numbers)
• Logic: and, or, xor, shift
• Branch, compare, procedure call

Traditional RAM: same unit cost for all operations. Realistic? Useful?

Another useful model of computation: the Turing machine.
Aka von Neumann architecture or stored-program computer: data and program in the same memory ("programs as data"). John von Neumann (1903-57), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC.

"von Neumann bottleneck": Program and data are separate from the CPU; the processing rate is bounded by the memory rate.

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978); Turing Award lecture, 1977
With a cache between processor and memory: non-uniform memory cost (hierarchical memories: caches). Realistic? Perhaps more so.
Parallel RAM variations

Increased memory bandwidth: vector computer; the ALU operates on vectors instead of scalars. [Diagram: one processor P, multiple memory banks M, M, …, M]
Shared-memory model, bus-based: several processors P share one memory M over a bus (the "von Neumann bottleneck" again).
Shared-memory model, multi-port, network-based, …
• How are the processors synchronized?
• What can the memory do?
• What are the costs?
All processors operate under control of the same clock: synchronous, shared-memory model.

Parallel RAM (PRAM):
• All processors work in lock-step (all run the same program, or individual programs)
• Unit-time instructions and memory access (uniform)
PRAM conflict resolution: What happens if several processors access the same memory location in the same time step?
• EREW PRAM: Not allowed, neither read nor write
• CREW PRAM: Concurrent reads allowed, concurrent writes not
• CRCW PRAM: Both concurrent reads and writes allowed
CRCW PRAM (write) conflict resolution:
• COMMON: All processors that conflict must write the same value
• ARBITRARY: One write succeeds
• PRIORITY: A priority scheme determines which write succeeds

Realistic? Useful?
Not realistic (could possibly be emulated), but an extremely useful theoretical tool (late 1970s to mid 1990s): techniques and lower bounds. Not this lecture.

Joseph JáJá: An Introduction to Parallel Algorithms. Addison-Wesley, 1992, ISBN 0-201-54856-9
PRAM example: Fast maximum finding

Problem: Given n numbers in a shared-memory array a, find the maximum.

Idea: Perform all n² comparisons (a[i],a[j]) in parallel; eliminate those numbers that cannot be the maximum.

    1. par (0<=i && i<n)
           b[i] = true;                  // a[i] could be the maximum
    2. par (0<=i && i<n, 0<=j && j<n)
           if (a[i]<a[j]) b[i] = false;  // a[i] is not the maximum
    3. par (0<=i && i<n)
           if (b[i]) x = a[i];
Three parallel steps: O(1) time with n, n², and n PRAM processors, respectively. CRCW capability is needed in steps 2 and 3.

Theorem: On a COMMON CRCW PRAM, the maximum of n numbers can be found in O(1) time steps and O(n²) operations in total.
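The PRAM is an abstraction, but the idea can be mimicked on a real shared-memory machine. A hedged OpenMP sketch (not part of the lecture, and only a loose analogue: real threads are not synchronous, and the inner loop is kept sequential per i to avoid concurrent writes):

    #include <stdlib.h>
    #include <stdbool.h>

    /* Maximum of a[0..n-1] by all-pairs elimination, O(n^2) work.
       Compile with -fopenmp; without OpenMP the pragmas are simply ignored. */
    int max_allpairs(const int *a, int n)
    {
        bool *b = malloc(n * sizeof *b);
        int x = a[0];
        #pragma omp parallel for                 /* step 1 */
        for (int i = 0; i < n; i++) b[i] = true;
        #pragma omp parallel for                 /* step 2: each i owns b[i] */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (a[i] < a[j]) b[i] = false;
        for (int i = 0; i < n; i++)              /* step 3: sequential here; on a */
            if (b[i]) x = a[i];                  /* COMMON CRCW PRAM, a concurrent write */
        free(b);
        return x;
    }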
Note: A constant-time algorithm with polynomial resources (operations, processors).

Is this a good algorithm? Answer later.
Shared-memory model, banked, network-based, not synchronous:
• How to synchronize: atomic operations, barriers
• Memory consistency: When can some processor "see" what some other processor has written to memory?
Parallel memory access cost

UMA (Uniform Memory Access): Access time to a memory location is independent of the location and of the accessing processor, e.g., O(1), O(log M), … Example: the PRAM is UMA.

NUMA (Non-Uniform Memory Access): Access time depends on processor and location. Locality: for each processor, some locations can be accessed faster than other locations.
Parallel, distributed-memory RAM: Memory is distributed over the processors; each processor can directly access only its own, local memory. Processors communicate via an interconnection network.
The architecture model defines resources; it describes
• Composition of processor and functional units: ALU, FPU, registers, w-bit words vs. unlimited words, vector units (MMX, SSE, AVX)
• Types of instructions
• Memory system, caches
• …

The execution model/cost model specifies
• How instructions are executed
• (Relative) cost of instructions, memory accesses
• …

Level of detail/formality depends on the purpose: what is to be studied (complexity theory, algorithm design, …), what counts (instructions, memory accesses, …).
A parallel architecture model defines
• Synchronization between processors, synchronization operations
• Atomic operations, shared resources (memory, registers)
• Communication mechanisms: network topology, properties
• Memory: shared, distributed, hierarchical, …
• …

The cost model defines
• Cost of synchronization, atomic operations
• Cost of communication (latency, bandwidth, …)
• …
Yet another parallel architecture model: cellular automata, systolic arrays, …: simple processors without memory (finite state automata, FSA) operate in lock-step on a (potentially infinite) grid, with local communication only.

The state of cell (i,j) in the next step is determined by
• Its own state
• The states of the neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966
H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982
Flynn's taxonomy: Orthogonal classification of (parallel) architectures/models along instruction stream and data stream:

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972
SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g., the RAM).

SIMD: Single processor, single stream of instructions, operates on multiple data per instruction. Examples: the traditional vector computer, the PRAM (some variants).

MISD: Multiple instructions operate on a single data stream. Examples: pipelined architectures, streaming architectures(?), systolic arrays (a 1970s architectural idea). Some say this class is empty.

MIMD: Multiple instruction streams, multiple data streams.
Typical Flynn classification instances:
• SISD: one processor P with one memory M
• SIMD: one instruction unit driving several processing elements, each with memory
• MIMD: processors with memories, connected by a communication network
• MISD (?): a pipeline of processors over a single memory/data stream
Programming model: Abstraction, close to the programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics.

Execution model: When and how parallelism in the programming model is effected.
A parallel programming language or library ("interface") is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s).

Cost of operations: May be specified with the programming model; often by the architecture/computational model.
A parallel programming model defines, e.g.,
• Parallel resources, entities, units: processes, threads, tasks, …
• Expression of parallelism: explicit or implicit
• Level and granularity of parallelism
• Memory model: shared, distributed, hybrid
• Memory semantics ("when operations take effect/become visible")
• Data structures, data distributions
• Methods of synchronization (implicit/explicit)
• Methods and modes of communication
Examples:
1. Threads, shared memory, block-distributed arrays, fork-join parallelism
2. Distributed memory, explicit message passing, collective communication, one-sided communication ("RDMA"), PGAS(**)
3. Data-parallel SIMD, SPMD(*)
4. …

(*) SPMD: Single Program, Multiple Data
(**) PGAS: Partitioned Global Address Space

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, … (not this lecture)
Same Program, Multiple Data (SPMD)

Restricted MIMD model: All processors execute the same program.
• They may do so asynchronously; different processors may be in different parts of the program at any given time
• The same objects (procedures, variables) exist for all processors, so "remote procedure call", "active messages", "remote-memory access" make sense

Programming model concepts: active messages, remote procedure call, … (MPI: RMA/one-sided communication). Not this lecture.

F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988
Programming language/library/interface/paradigm (OpenMP, Cilk, MPI, …), with algorithmic support, "run-time"
↓
Programming model
↓
Architecture model
↓
"Real" hardware

Different architecture models can realize a given programming model. Closer fit: more efficient use of the architecture.

Challenge: A programming model that is useful and close to "realistic" architecture models, to enable realistic analysis/prediction.

Challenge: A language that conveniently realizes the programming model.
Examples:

OpenMP: programming interface/language for a shared-memory model, intended for shared-memory architectures; mostly "data parallel". Can be implemented with DSM (Distributed Shared Memory) on distributed-memory architectures, but performance is usually not good; requires a DSM implementation/algorithms.

MPI: interface/library for a distributed-memory model; can be used on shared-memory architectures, too. Needs algorithmic support (e.g., "collective operations").

Cilk: language (extended C) for a shared-memory model, for shared-memory architectures; "task parallel". Needs a run-time (e.g., "work-stealing").

See later.
Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve a given problem of input size n.
• Processors work independently (local memory, program)
• Communication and coordination incur some overhead

Focus on worst-case complexities and execution times.

Let some computational problem P(n) be given:
• Seq: sequential algorithm and implementation solving P
• Par: parallel algorithm and implementation solving P
Challenge and one main goal of parallel processing (… not to make life easy):

Be faster than the commonly used (good, best known, best possible, …) sequential algorithm and implementation, utilizing the p processors efficiently.
Main goal: Speeding up computations by parallel processing. Let
• Tseq(n): time for 1 processor to solve P(n)
• Tpar(p,n): time for p processors to solve P(n)

The speedup of Tpar over Tseq is defined as

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n).

Goal: Achieve as large a speedup as possible (for some n, all n), for as many processors as possible.
What exactly are Tseq(n) and Tpar(p,n)? What is "time" (number of instructions)?

• Time for some algorithm for solving the problem?
• Time for a specific algorithm for solving the problem?
• Time for the best known algorithm for the problem?
• Time for the best possible algorithm for the problem?
• Time for a specific input of size n; average case; worst case, …?
• Asymptotic time, large n, large p?
• Do constants matter, e.g., O(f(p,n)) or 25n/p + 3 ln(4(p/n))…?
Tseq(n):
• Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n; the number of instructions carried out is often termed the WORK
• Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Choose a sequential algorithm (theory); choose an implementation of this algorithm (practice).

Theory and practice: Always state the baseline sequential algorithm/implementation.
Examples (theory); standard, worst-case, asymptotic complexities:

• Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix-sums
• Tseq(n,m) = O(n+m): merging two sequences; BFS/DFS in a graph
• Tseq(n) = O(n log n): comparison-based sorting
• Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
• Tseq(n) = O(n³): matrix multiplication, input two n×n matrices (can be solved in o(n³), Strassen etc.)
• …

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009
New issue with modern processors: Is time always the same thing? The clock frequency may not be constant; it can depend, e.g., on the load of the system, an energy cap, "turbo mode", etc. Such factors are difficult to control.

Practice:
• Construct good input examples to measure running time Tseq(n); experimental methodology
• Worst case not always possible, not always interesting; best case?
• Experimental methods to get a stable, accurate Tseq(n): repeat the measurement many times (rule of thumb: average over at least 30 repetitions); see the sketch below

Experimental science: Always some assumptions about repeatability, regularity, determinism, …
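A minimal timing-harness sketch along these lines, using the POSIX monotonic clock (the benchmarked function kernel is a hypothetical placeholder):

    #include <stdio.h>
    #include <time.h>

    #define REPS 30                 /* rule of thumb from above */

    extern void kernel(int n);      /* hypothetical code under test */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    void benchmark(int n)
    {
        double sum = 0.0, best = 1e30;
        for (int r = 0; r < REPS; r++) {
            double t = seconds();
            kernel(n);
            t = seconds() - t;
            sum += t;
            if (t < best) best = t;
        }
        printf("n=%d: avg %.6f s, best %.6f s over %d reps\n",
               n, sum/REPS, best, REPS);
    }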
Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): The best possible absolute speed-up is linear in p.

Goal: Obtain (linear) speedup for as large p as possible (as a function of problem size n), for as many n as possible.
For speedup (and other complexity measures), distinguish:

• The problem P to be solved (mathematical specification)
• Some algorithm A that solves P
• The best possible (lower bound) algorithm A* for P, the best known algorithm A+ for P: the complexity of P
• An implementation of A on some machine M
Example: "data parallel" (SIMD) computation

Problem: sum of two n-element vectors:

    for (i=0; i<n; i++) {
        a[i] = b[i]+c[i];
    }

Complexity: Tseq(n) = Θ(n)

Algorithm/implementation: Split into p subproblems, each the sum of n/p elements.

Best possible parallelization: the sequential work divided evenly across the p processors:

Tpar(p,n) = Tseq(n)/p, so Sp(n) = p
With constant factors: Tpar(p,n) = c(n/p) for a constant c ≥ 1.

• Perfect speedup
• "Embarrassingly parallel"
• "Pleasantly parallel"
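On a real shared-memory system this loop could be parallelized, e.g., with OpenMP (an illustrative sketch; OpenMP itself is treated later):

    /* Element-wise vector sum: all iterations are independent */
    void vadd(int n, double *a, const double *b, const double *c)
    {
        #pragma omp parallel for   /* each thread gets a chunk of about n/p iterations */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }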
Work: the work, measured in instructions and/or time, that has to be carried out for a problem of size n; sequentially, time Tseq(n) from start to stop.
Perfect parallelization: the sequential work Tseq(n) evenly divided between p processors, no overhead, so Tpar(p,n) = Tseq(n)/p and

Sp(n) = Tseq(n)/(Tseq(n)/p) = p

Perfect speedup … and very rare in practice.
If the sequential work is unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n).

Define Tpar(p,n) = max Ti(n) over all processors, all assumed to start at the same time; Tpar is the time for the slowest processor to complete.
Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm: the total number of instructions performed by the p processors.

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time for which the p processors are reserved (: have to be paid for); geometrically, the area p × Tpar(p,n).
"Theorem": Perfect speedup Sp(n) = p is best possible and cannot be exceeded.

"Proof": Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could then be constructed by simulating the parallel algorithm on a single processor: the instructions of the p processors are carried out in some correct order, one after another, on the sequential processor. This contradicts that Tseq(n) was the best possible/known time.

Reminder: Speedup is calculated (measured) relative to the "best" sequential implementation/algorithm.
By assumption, C(n) = p*Tpar(p,n) < Tseq(n).

Simulation A: one step of P1, one step of P2, …, one step of Pp, one step of P1, …, for C(n) iterations.
Simulation B: steps of P1 until a communication/synchronization point, then steps of P2 until a communication/synchronization point, …
Both simulations yield a new sequential algorithm with time Tsim(n) < Tseq(n).

This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.
Lesson: Parallelism offers only "modest potential": speed-up cannot be more than p on p processors.

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

The construction shows that the total parallel work must be at least as large as the sequential work Tseq(n); otherwise a better sequential algorithm could be constructed.

Crucial assumptions: sequential simulation is possible (enough memory to hold the problem and the state of the parallel processors), the sequential memory behaves as the parallel memory, … This is NOT TRUE for real systems and real problems.
Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms. Some such simulation tools:
• SimGrid (INRIA)
• LogGOPSim (Hoefler et al.)
• …

Possible bachelor thesis subject.
Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm: the total number of instructions performed by some number of processors. The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time for which the p processors are occupied.

Definition: A parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup.

Definition: A parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has the potential for linear speedup (for some number of processors).
Proof (linear speed-up of a cost-optimal algorithm): Given a cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n)), it follows that Tpar(p,n) = c*Tseq(n)/p, so

Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to the best sequential algorithm. The smaller c, the closer the speedup is to perfect.
Given a work-optimal parallel algorithm with ∑Ti(n) = Tseq(n) and Tpar(p,n) = max Ti(n):
Execute it on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n)).
Proof idea (a work-optimal algorithm can have linear speed-up):
1. Take a work-optimal algorithm
2. Schedule the work items Ti(n) on p processors such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

The scheduling in step 2 is possible in principle, but may not be trivial.

Parallel algorithm design goal: A work-optimal parallel algorithm with as small a Tpar(p,n) as possible (and therefore large parallelism: many processors can be utilized).
Example: The CRCW PRAM maximum-finding algorithm (above) performs O(n²) operations (work), but sequential maximum finding requires only O(n) operations.

Speedup with perfect parallelization (we do not know yet whether this is possible):

Sp(n) = O(n)/O(n²/p) = O(p/n)

(Only small) linear speedup for fixed n, and not independent of n. Bad!
Example: Another non-work-optimal algorithm

DumbSort with T(n) = O(n²) that can be perfectly parallelized: Tpar(p,n) = O(n²/p). It is well known that Tseq(n) = O(n log n), with many algorithms and good implementations.

Sp(n) = (n log n)/(n²/p) = p (log n)/n

(Small) linear speedup for fixed n (but not independent of n). Non-work-optimal algorithm: the speed-up decreases with n. Bad!

Break-even: When is the parallel algorithm faster than the sequential one?

Tpar(p,n) < Tseq(n) ⇔ n²/p < n log n ⇔ n/p < log n ⇔ p > n/log n
The best known/best possible sequential algorithm is often difficult to parallelize:
• no redundant work (that could have been done in parallel)
• tight dependencies (that force things to be done one after another)

Lesson: It usually does not make sense to parallelize an inferior algorithm (although it is sometimes much easier).

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: Non-work-optimal algorithms can sometimes be useful, e.g., as subroutines.
Absolute vs. relative speedup

The ratio

SRelp(n) = Tpar(1,n)/Tpar(p,n)

is called the relative speedup, and expresses how well the parallel algorithm utilizes p processors (scalability).

Relative speedup is NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters.
Note: The literature (papers and books) is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior sequential implementation is incorrect!
Definition: T∞(n) is the smallest possible running time of the parallel algorithm given arbitrarily many processors. By definition, T∞(n) ≤ Tpar(p,n) for all p. Relative speedup is therefore limited by

SRelp(n) = Tpar(1,n)/Tpar(p,n) ≤ Tpar(1,n)/T∞(n)

Definition: The ratio Tpar(1,n)/T∞(n) is called the parallelism of the parallel algorithm. The parallelism is the largest number of processors that can be employed and still give linear relative speedup (assume Tpar(1,n)/T∞(n) < p in the inequality above).
Example: The CRCW PRAM maximum-finding algorithm again:

SRelp(n) = O(n²)/O(n²/p) = O(p); parallelism: O(n²)/O(1) = n²

This (terrible) parallel algorithm has linear relative speedup for p up to n² processors.
par (0<=i&&i<n) b[i] = true; // a[i] could bepar (0<=i&&i<n, 0<=j&&j<n)if (a[i]<a[j]) b[i] = false; // a[i] is not
par (0<=i&&i<n) if (b[i]) x = a[i];
Example: CRCW PRAM Maximum Finding algorithm
This (terrible) algorithm has linear relative speedup for p up to n2 processors
Nevertheless: Useful as a building block
There exist a work-optimal CRCW PRAM algorithm that runs in O(log log n) steps requiring only O(n) operations
Advanced material: Last fact about PRAM in this lecture
Absolute vs. relative speedup: work-optimality property

For work-optimal algorithms, absolute and relative speedup coincide, since Tpar(1,n) = O(Tseq(n)).
Parallel performance in practice

Speedup is an empirical quantity, "measured time", based on experiments (benchmarks):

Sp(n) = Tseq(n)/Tpar(p,n)

• Tseq(n): running time of a "reasonable", good, best available sequential implementation, on "reasonable" inputs
• Tpar(p,n): parallel running time measured over a number of experiments with different typical (worst-case?) inputs

Speedup is typically not independent of the problem size n, nor of the problem instance.
Parallelization most often incurs overheads Oi:
• Algorithmic: the parallel algorithm may do more work
• Coordination: communication and synchronization
• …

Note: Tpar(1,n) ≥ Tseq(n).
Absolute vs. relative speedup and scalability

Example: Good scalability and relative speedup, 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p, is reported; sounds good.

But what if Tpar(1,n) = 100*Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O((n log n)/p + log n)?

Even for a work-optimal Tpar(1,n) = 100*Tseq(n) = O(Tseq(n)), it would take at least 200 processors just to match the sequential algorithm with the reported relative speedup.
Empirical, relative speedup without an absolute performance baseline (and a comparison to a reasonable sequential algorithm and implementation) is misleading.

David H. Bailey: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review, Aug. 1991, pp. 54-55
Torsten Hoefler, Roberto Belli: Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. SC 2015: 73:1-73:12
Practical problems: Measuring time and work

Tpar is defined as the time for the last processor to complete, assuming all processors start at the same time.

Problem 1: To measure Tpar, efficient, accurate synchronization ("barrier synchronization") is needed; see the sketch below.
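One common pattern, sketched here with MPI (a minimal illustration, not a full benchmarking methodology; see the Hunold/Carpen-Amarie reference below):

    #include <mpi.h>

    /* Time one parallel phase as the maximum over all ranks */
    double time_phase(void (*phase)(void))
    {
        double t, tmax = 0.0;
        MPI_Barrier(MPI_COMM_WORLD);     /* try to start all ranks together */
        t = MPI_Wtime();
        phase();                         /* the parallel work being measured */
        t = MPI_Wtime() - t;
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        return tmax;                     /* valid on rank 0: the slowest rank's time */
    }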
Problem 2: Is this relevant? Are applications synchronized? Why not just measure the work?
Problem 3: How do the time spent per processor and the work (Work = ∑Ti) depend on the synchronization? A slow processor may cause other processors to wait.
[Figure: the same per-processor profile, now with a dependency between processors causing waiting]
Problem 1: Most parallel interfaces include synchronization mechanisms ("barriers") that may or may not be accurate. There is still much work on temporal, accurate synchronization.

Problem 2: Difficult to answer.

Problem 3: Not much is known; most algorithms and implementations assume and depend on some synchrony when evaluating performance.

Sascha Hunold, Alexandra Carpen-Amarie: Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Trans. Parallel Distrib. Syst. 27(12): 3617-3630 (2016)
If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but is still linear: Sp(n) = p/k.
NB: The diagrams denote cumulated time ("profile") over the whole Tpar execution, not a trace: computation, overhead, and idle time are spread over the whole execution.
Typical overhead: communication and coordination. The (smallest) time between coordination periods is called the granularity of the parallel computation.
Granularity:
• "Coarse-grained" parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
• "Fine-grained" parallel computation/algorithm: the time/number of instructions between coordination intervals is small
Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing.
The best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p: the best possible parallel time.
Sequential overhead as a constant fraction: The maximum speedup becomes severely limited if the overheads are sequential and a constant fraction:

Tseq(n) = (s+r)*Tseq(n), Tpar(p,n) ≥ s*Tseq(n) + r*Tseq(n)/p
Amdahl's Law (parallel version): Let a program A contain a fraction r that can be "perfectly" parallelized, and a fraction s = (1-r) that is "purely sequential", i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n.

Proof:
• Tseq(n) = (s+r)*Tseq(n)
• Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
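A tiny sketch for playing with the bound (pure arithmetic, nothing beyond the formula above):

    #include <stdio.h>

    /* Amdahl bound: speedup with sequential fraction s on p processors */
    double amdahl(double s, int p) { return 1.0 / (s + (1.0 - s)/p); }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 4)  /* with s = 0.1 this approaches 10 */
            printf("p=%4d  speedup=%5.2f\n", p, amdahl(0.1, p));
        return 0;
    }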
Typical victims of Amdahl's law:
• Sequential input/output can be a constant fraction
• Sequential initialization of global data structures
• Sequential processing of "hard-to-parallelize" parts of the algorithm, e.g., shared data structures

Amdahl's law limits the speed-up in such cases if they are a constant fraction of the total time, independent of problem size.
Example:
1. Processor 0: read input, some precomputation
2. Split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send the partial solution back to processor 0

Typical Amdahl sequential bottleneck: a constant sequential fraction (3 out of 4 steps) limits the speedup. [Figure: 9n of 10n total is parallelizable.] Amdahl: s = 0.1, speedup at most 10.

When interested in the parallel aspects, input/output and problem splitting are often explicitly not measured (see projects).
Example (contrived): K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

    // Sequential initialization
    x = (int*)calloc(n,sizeof(int));
    …
    // Parallelizable part
    do {
        for (i=0; i<n; i++) {
            x[i] = f(i);
        }
        // check for convergence
        done = …;
    } while (!done);

Tseq(n) = n+K+Kn (the n term: calloc zero-initializes the array sequentially)
Tpar(p,n) = n+K+Kn/p

Sequential fraction ≈ 1/(1+K), so Sp(n) -> 1+K
Example (contrived): The same computation, but with

    x = (int*)malloc(n*sizeof(int));

instead of calloc:

Tseq(n) = 1+K+Kn
Tpar(p,n) = 1+K+Kn/p

Sequential part ≈ 1/(1+n), so Sp(n) -> p for n > p, n -> ∞

Note: If the sequential part is constant (but not a constant fraction), Amdahl's law does not limit the speed-up.
Lesson: Be careful with system functions (calloc zero-initializes its memory in sequential O(n) time; malloc does not).
Avoiding Amdahl: Scaled speedup

The sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions). The sequential part may be constant, or grow only slowly with the problem size n. Thus, to maintain good speedup, the problem size should be increased with p.
Assume Tseq(n) = t(n)+T(n), with sequential part t(n) and perfectly parallelizable part T(n), and assume t(n)/T(n) -> 0 for n -> ∞. Then

Tpar(p,n) = t(n)+T(n)/p

and the speedup as a function of p and n is

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p for n -> ∞
Definition: Speedup as a function of p and n, with sequential and parallelizable times t(n) and T(n), is called scaled speed-up.

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing the problem size n.
Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n:

E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n))   (the denominator is the cost)

Remarks:
• E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
• E(p,n) = constant: linear speedup
• Cost-optimal algorithms have constant efficiency
Scalability

Definition: A parallel algorithm/implementation is strongly scaling if Sp(n) = Θ(p) (linear, independent of (sufficiently large) n).

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p) such that for n = Ω(f(p)), the efficiency E(p,n) remains constant. The function f is called the iso-efficiency function.

"Maintain efficiency by increasing the problem size as f(p) or more."
The scaled speed-up idea comes from many sources; some are:

John L. Gustafson: Reevaluating Amdahl's Law. Commun. ACM 31(5): 532-533 (1988)
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel & Distributed Technology 1(3): 12-21 (1993)
Summary: Stating parallel performance

It is convenient to state the parallel performance and scalability of a parallel algorithm/implementation as

Tpar(p,n) = O(T(n)/p + t(p,n))

T(n) represents the parallelizable part, t(p,n) the non-parallel part beyond which no improvement is possible, regardless of how many processors are used (parallelism = T(n)/t(p,n)).

The work of the algorithm is W = O(p(T(n)/p + t(p,n))) = O(T(n) + p*t(p,n)).

The algorithm is work-optimal if T(n) is O(Tseq(n)).
Example (again): Assume the convergence check takes O(log p) time. Then

Tpar(p,n) = Kn/p + K log p
E(p,n) ≈ Kn/(Kn + pK log p)

For n ≥ p log p, E(p,n) ≥ 1/2.

Weakly scalable: n has to increase as O(p log p) to maintain constant efficiency, i.e., O(log p) per processor (if balanced).
Examples (Tpar, speedup, optimality, efficiency) for a linear-time computation, Tseq(n) = n (constants ignored):

1. Tpar0(p,n) = n/p + 1 (embarrassingly "data parallel" computation, constant overhead)
2. Tpar1(p,n) = n/p + log p (logarithmic overhead, e.g., convergence check)
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p (linear overhead, e.g., data exchange)

Typical, good, work-optimal parallelizations.
[Plot: the five variants for n=128]
Cost-optimality:
1. Tpar0(p,n) = n/p+1: p*Tpar(p,n) = n+p = O(n) for p = O(n)
2. Tpar1(p,n) = n/p+log p: p*Tpar(p,n) = n+p log p = O(n) for p log p = O(n)
3. Tpar2(p,n) = n/p+log² p: p*Tpar(p,n) = n+p log² p = O(n) for p log² p = O(n)
4. Tpar3(p,n) = n/p+√p: p*Tpar(p,n) = n+p√p = O(n) for p√p = O(n)
5. Tpar4(p,n) = n/p+p: p*Tpar(p,n) = n+p² = O(n) for p² = O(n)
[Plots for n=128 (also with p up to 256), n=16384 (=16K=128²), and n=2097152 (=2M=128³)]
Efficiency, weak scaling

To maintain constant efficiency e = Tseq(n)/(p*Tpar(p,n)), n has to increase as
1. Tpar0(p,n) = n/p+1: f0(p) = e/(1-e)*p
2. Tpar1(p,n) = n/p+log p: f1(p) = e/(1-e)*(p log p)
3. Tpar2(p,n) = n/p+log² p: f2(p) = e/(1-e)*(p log² p)
4. Tpar3(p,n) = n/p+√p: f3(p) = e/(1-e)*(p√p)
5. Tpar4(p,n) = n/p+p: f4(p) = e/(1-e)*p²
[Plot: iso-efficiency functions for maintained efficiency e = 0.9 (90%)]
Matrix-vector multiplication parallelizations:

Tseq(n) = n²
Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
[Plots: n=100 and n=1000]
Some non-work-optimal parallel algorithms:

Amdahl case, linear sequential running time: TparA(p,n) = 0.9n/p + 0.1n

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1
2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1
[Plot: n1 = 128, n2 = 16384]
Limitations of empirical speedup measure
Empirical speedup assumes that Tseq(n) can be measured. For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system.

Scalability can then be assessed by other means:
• Stepwise speedup (1-1000 processors, 1000-100,000 processors)
• Other notions of efficiency
Lecture summary, checklist
• Models of parallel computation: Architecture, programming, cost
• Flynn’s taxonomy: MIMD, SIMD; SPMD
• Sequential baseline
• Speedup (in theory and practice)
• Work, Cost optimality
• Amdahl’s law
• Scaled speed-up, strong and weak scaling