
Page 1:

Introduction to Parallel Computing
Introduction, problems, models, performance

Jesper Larsson Träff, [email protected]
TU Wien, Parallel Computing

Page 2:

Mobile phones: "dual core", "quad core" (2012?), …

June 2012: IBM BlueGene/Q, 1,572,864 cores, 16 PFLOPS

"Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S." (2007)

June 2016: Sunway TaihuLight, 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Parallelism everywhere! How to use all these resources? Limits?

…octa-core (2016, Samsung Galaxy S7)

Practical Parallel Computing

Page 3:

As of ca. 2010: (almost) no sequential computer systems anymore: multi-core, GPU/accelerator enhanced, …

Why?

June 2011: Fujitsu K, 705,024 cores, 11 PFLOPS

Practical Parallel Computing

Page 4:

June 2011: Fujitsu K, 705,024 cores, 11 PFLOPS

As of 2010: a) all applications must be parallelized; or b) there must be enough independent applications that can run at the same time

Challenge: How do I speed up Windows…?

Practical Parallel Computing

Page 5:

“I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind”

"7 myths about quad-core phones" (SmartphonesUnlocked), by Jessica Dolcourt, April 8, 2012

Herb Sutter: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Dr. Dobb's Journal, 30(3), March 2005. "The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency."

…many similar quotations can (still) be found

Practical Parallel Computing

Page 6:

Herb Sutter: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Dr. Dobb's Journal, 30(3), March 2005. "The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency."

“Free lunch” (as in “There is no such thing as a free lunch”):

Exponential increase in single-core performance (18-24 month doubling rate, Moore’s “Law”), no software changes needed to exploit the faster processors

Page 7:

The "free lunch" was over… ca. 2005:

• Clock speed limit around 2003
• Power consumption limit from ca. 2000
• Instructions/clock limit since the late 1990s

But: The number of transistors per chip can continue to increase (>1 billion)

Solution(?): Put more cores on the chip, the "multi-core revolution"

Kunle Olukotun (Stanford), ca. 2010

Page 8:

Parallelism challenge: Solve a problem p times faster on p (slower, more energy-efficient) cores

Kunle Olukotun (Stanford), ca. 2010

The "free lunch" was over… ca. 2005:

• Clock speed limit around 2003
• Power consumption limit from ca. 2000
• Instructions/clock limit since the late 1990s

Page 9:

[Figure credits: Kunle Olukotun, 2010; Karl Rupp (TU Wien), 2015]

Page 10:

New(er) server processors (2017):

Intel "Skylake": 4-28 cores
AMD "Naples": 8-32 cores

Page 11:

Single-core performance does still increase, somewhat, but much more slowly… (Henk Poley, 2014, www.preshing.com)

Average, normalized SPEC benchmark numbers; see www.spec.org

Page 12:

2006: Factor 10³-10⁴ increase in integer performance over a 1978 high-performance processor

Single-core processor development ("free lunch") made life very difficult for parallel computing in the 1990s

From Hennessy/Patterson, Computer Architecture

Page 13:

Architectural ideas driving sequential performance increase

Increase in clock frequency ("technological advances") alone (factor 20-200?) does not explain the performance increase:

• Deep pipelining
• Superscalar execution (>1 instruction/clock) through multiple functional units
• Data parallelism through SIMD units (old HPC idea: vector processing), same instruction on multiple data per clock
• Out-of-order execution, speculative execution
• Branch prediction
• Caches (pre-fetching)
• Simplified/better instruction sets (better for the compiler)
• SMT/Hyperthreading

Mostly fully transparent; at most the compiler needs to care: the "free lunch"

Page 14:

• Very diverse parallel architectures
• No single, commonly agreed-upon abstract model (for designing and analyzing algorithms)

Many different programming paradigms (models), different programming frameworks (interfaces, languages)

• Has been so
• Is still so (largely)

Page 15:

Theoretical Parallel Computing

Parallelism was always there! What can be done? Limits?

Assume p processors instead of just one, reasonably connected (memory, network, …)

• How much faster can some given problem be solved? Can some problems be solved better?
• How? New algorithms? New techniques?
• Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
• Which assumptions are reasonable?
• Does parallelism give new insights into the nature of computation?

Page 16:

Sequential computing:

Algorithm in model → Concrete program (C, C++, Java, Haskell, Fortran, …) → Concrete architecture

… is difficult. Analysis in the model (often) has some relation to concrete execution.

"Gap": The standard model (RAM) may abstract from many concrete features of the "real" architecture

Page 17:

Sequential computing:

Algorithm in model → Concrete program (C, C++, Java, Haskell, Fortran, …) → Concrete architecture

… is difficult. Analysis in the model (often) has some relation to concrete execution.

vs. Parallel computing:

Algorithms in models A, B, …, Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architectures A, …, Z

Page 18:

Parallel computing:

Algorithms in models A, B, …, Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architectures A, …, Z

… is extremely difficult. Analysis in a model may have little relation to concrete execution.

Huge gap: No "standard" abstract model (e.g., RAM)

Page 19:

Parallel computing:

Algorithms in models A, B, …, Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architectures A, …, Z

Extremely difficult: Analysis in a model may have little relation to concrete execution.

Challenges:

• Algorithmic: Not all problems seem to be easily parallelizable
• Portability: Support for the same language/interface on different architectures (e.g., MPI in HPC)
• Performance portability (?)

Page 20:

Parallel computing:

Algorithms in models A, B, …, Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architectures A, …, Z

Elements of parallel computing:

• Algorithmic: Find the parallelism
• Linguistic: Express the parallelism
• Practical: Validate the parallelism (correctness, performance)
• Technical: Run-time/compiler support for the paradigm

Page 21:

Why study parallel computing?

Parallel computing: Accomplish something with a coordinated set of processors under control of a parallel algorithm

• It is inevitable: multi-core revolution, GPGPU paradigm, …
• It is interesting, challenging, highly non-trivial, full of surprises
• It is a key discipline of computer science (von Neumann; golden theory decade: 1980 to the early 1990s)
• It is ubiquitous (gates; architecture: pipelines, ILP, TLP; systems: operating systems, software), not always transparent
• It is useful: large, extremely computationally intensive problems, Scientific Computing, HPC
• …

Page 22:

Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Focus on properties of solution (time, size, energy, …) to given, individual problem

Page 23:

Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.

Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …

Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …

Page 24:

Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, process calculi, CSP, CCS, pi-calculus, lock/wait-freeness, …

Typical concerns: correctness (often formal), e.g., deadlock-freedom, starvation-freedom, mutual exclusion, fairness

Page 25:

Parallel vs. Concurrent computing

[Figure: processes contending for a shared resource: memory (locks, semaphores, data structures), device, …]

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve a given computational problem)

Given problem: specification, algorithm, data

Adapted from Madan Musuvathi

Page 26:

[Figure: a problem (specification, algorithm, data) divided into subproblems, each solved by a dedicated processor]

Parallel computing: Focus on dividing a given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors (in coordination)

Coordination: synchronization, communication

Page 27:

The “problem” of parallelization

How to divide a given problem into subproblems that can be solved in parallel?

• Specification?
• Algorithm?
• Data?

• How is the computation divided? Is coordination necessary? Does the sequential algorithm help/suffice?
• Where are the data? Is communication necessary?

Page 28:

Aspects of parallelization

• Algorithmic: How to divide computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?

• Scheduling/Mapping: How are independent parts of the computation assigned to processors?

• Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?

• Communication: When must processors communicate? How?
• Synchronization: When must processors agree/wait?

Page 29:

Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)

Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?

Architectural: Which kinds of parallel machines can be realized? How do they look?

Page 30:

Levels of parallelization

• Architecture: gates, logical and functional units
• Computational: Instruction-Level Parallelism (ILP)
• SIMD: explicit vector processing
• Functional, explicit: threads (possibly concurrent, sharing architectural resources), cores, processors
• Large-scale: coarse-grained tasks, multi-level, coupled application parallelism

"Parallel computing" in this lecture is the explicit, functional level; the architectural/ILP levels and large-scale application parallelism are not covered here.

Page 31:

Automatic parallelization(?): Can't we just leave it all to the compiler/hardware?

From a "high-level" problem specification or a sequential program, derive efficient code that can be executed on a parallel, multi-core processor.

Successful only to a limited extent:
• Compilers cannot invent a different algorithm
• Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers 2012

Page 32:

Explicit parallel programming (today):

• Explicitly parallel code in some parallel language
• Support from parallel libraries
• (Domain-specific languages)
• The compiler does as much as the compiler can do

Lots of interesting problems and tradeoffs; an active area of research

Page 33:

Some "given, individual problems" for this lecture

Problem 1:

Matrix-vector multiplication: Given an n×m matrix A and an m-element vector v, compute the n-element vector u = Av:

u[i] = ∑_j A[i,j]v[j]

Dimensions n, m large: obviously some parallelism. How to compute the sums ∑ in parallel? Access to the vector?
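For concreteness, a minimal sketch (ours, not from the slides) of the obvious row-wise parallelization, in C with OpenMP; the function name matvec and the row-major layout are our assumptions:

#include <omp.h>

/* u = A*v for an n×m row-major matrix A: the n row sums are
   independent, so they can be computed in parallel; v is only read. */
void matvec(int n, int m, const double *A, const double *v, double *u)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;              /* private per-row accumulator */
        for (int j = 0; j < m; j++)
            sum += A[(long)i*m + j] * v[j];
        u[i] = sum;
    }
}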

Page 34:

Problem 1a:

Matrix-matrix multiplication: Given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB:

C[i,j] = ∑_l A[i,l]B[l,j]
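A minimal sketch in the same spirit (ours, not from the slides): all n·m entries of C are independent, so both outer loops can be parallelized; names and the row-major layout are assumptions:

/* C = A*B, where A is n×k and B is k×m, all row-major. */
void matmul(int n, int k, int m,
            const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += A[(long)i*k + l] * B[(long)l*m + j];
            C[(long)i*m + j] = sum;
        }
}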

Page 35:

Problem 1b:

Solving sets of linear equations. Given matrix A and vector b, find x such that Ax = b

Preprocess A such that the solution to Ax = b can easily be found for any b (LU factorization, …)

Page 36:

Problem 1c:

Sparse matrix-matrix multiplication: Given a sparse n×k matrix A and a sparse k×m matrix B, compute the n×m matrix product C = AB:

C[i,j] = ∑_l A[i,l]B[l,j]

Page 37:

Problem 2:

Stencil computation: Given an n×m matrix A, update as follows, and iterate until some convergence criterion is fulfilled:

iterate {
  for all (i,j) {
    A[i,j] <- (A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1])/4
  }
} until (convergence)

with suitable handling of the matrix border

Looks well-behaved, "embarrassingly parallel"? Data distribution? Conflicts on updates?

Use: Discretization and solution of certain partial differential equations (PDEs); image processing; …

5-point stencil in 2d
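A minimal sketch (ours) of one parallel sweep, assuming a Jacobi-style update into a second buffer so that parallel updates read only old values; the function and buffer names are ours:

/* One 5-point stencil sweep from cur into next (borders kept fixed
   here; the slides leave border handling open). Every interior cell
   reads only cur, so the parallel loop is conflict-free. */
void stencil_step(int n, int m, const double *cur, double *next)
{
    #pragma omp parallel for
    for (int i = 1; i < n-1; i++)
        for (int j = 1; j < m-1; j++)
            next[(long)i*m + j] =
                (cur[(long)(i-1)*m + j] + cur[(long)(i+1)*m + j] +
                 cur[(long)i*m + (j-1)] + cur[(long)i*m + (j+1)]) / 4.0;
}
/* Iterate: swap the roles of cur and next each sweep until convergence. */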

Page 38:

Problem 3:

Merge two sorted arrays of size n and m into a sorted array of size n+m

Easy to do sequentially, but the sequential algorithm looks… sequential

Problem 3a: Sort by merging
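One classic parallel idea for Problem 3, sketched here (ours, not from the slides): each element's output position is its own index plus its rank in the other array, found by binary search, so all n+m placements are independent; ties are broken in favor of a to keep positions distinct. All names are ours:

/* rank helpers: number of x[t] < key, resp. <= key, by binary search */
static int rank_lt(const int *x, int len, int key) {
    int lo = 0, hi = len;
    while (lo < hi) { int mid = (lo+hi)/2; if (x[mid] < key) lo = mid+1; else hi = mid; }
    return lo;
}
static int rank_le(const int *x, int len, int key) {
    int lo = 0, hi = len;
    while (lo < hi) { int mid = (lo+hi)/2; if (x[mid] <= key) lo = mid+1; else hi = mid; }
    return lo;
}

/* merge sorted a[0..n-1] and b[0..m-1] into out[0..n+m-1] */
void parallel_merge(const int *a, int n, const int *b, int m, int *out)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i + rank_lt(b, m, a[i])] = a[i];   /* a wins ties */
    #pragma omp parallel for
    for (int j = 0; j < m; j++)
        out[j + rank_le(a, n, b[j])] = b[j];
}

Note that this does O((n+m) log(n+m)) work, more than the O(n+m) sequential merge; work-optimal parallel merging needs more care (see the discussion of work-optimality later).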

Page 39:

Problem 4:

Computation of all prefix-sums: Given an array A of size n of elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix-sums in array B:

B[i] = ∑_{0≤j<i} A[j]

Implies a solution to the problem of computing just the total sum ∑. Parallel?

A: 1 2 3 4 5 6 7 …
B: x 1 3 6 10 15 21 28 …   (all prefix-sums; B[0] is the empty sum)
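A minimal sketch (ours) of a two-pass, work-optimal parallelization with OpenMP: each thread sums its block, the p block sums are scanned, then each thread scans its own block with that offset. Function and variable names are ours; + is instantiated as integer addition, and we choose B[0] = 0 as the empty sum:

#include <omp.h>
#include <stdlib.h>

/* exclusive prefix sums: B[i] = A[0]+...+A[i-1], B[0] = 0 */
void prefix_sums(const int *A, int *B, int n)
{
    int maxp = omp_get_max_threads();
    int *blocksum = calloc((size_t)maxp + 1, sizeof(int));
    #pragma omp parallel
    {
        int p = omp_get_num_threads(), t = omp_get_thread_num();
        int lo = (int)((long long)n * t / p);
        int hi = (int)((long long)n * (t+1) / p);
        int s = 0;
        for (int i = lo; i < hi; i++) s += A[i];   /* pass 1: block sum */
        blocksum[t+1] = s;
        #pragma omp barrier
        #pragma omp single
        for (int i = 1; i <= p; i++)               /* scan the p block sums */
            blocksum[i] += blocksum[i-1];
        s = blocksum[t];                           /* sum of all earlier blocks */
        for (int i = lo; i < hi; i++) {            /* pass 2: local scan */
            B[i] = s;
            s += A[i];
        }
    }
    free(blocksum);
}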

Page 40:

Problem 5:

Sorting a sequence of objects ("reals", integers, objects with an order relation) stored in an array

Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix)?

Page 41:

Parallel computing as a (theoretical) CS discipline

(Traditional) objective: Solve a given computational problem faster

• Faster than what?
• What is a reasonable "model" for parallel computing?
• How fast can a given problem be solved? How many resources can be productively exploited?
• Are there problems that cannot be solved in parallel? Fast? At all?
• …

Page 42:

Architecture, machine, computing, programming models

Architecture/machine model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.

Computational/execution model: (Formal) framework for the design and analysis of algorithms for the computational system; cost model.

[Figure: these models mediate between the concrete program (C, C++, Java, Haskell, Fortran, …) and the concrete architecture]

Page 43:

Main example: The RAM (Random-Access Machine)

[Figure: processor P connected to memory M]

Processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory

Execution of instructions, access to memory — first assumption: unit cost for all operations

Page 44:

Main example: The RAM (Random-Access Machine)

[Figure: processor P connected to memory M]

Instructions:
• Memory: load, store
• Arithmetic (integer, real numbers)
• Logic: and, or, xor, shift
• Branch, compare, procedure call

Traditional RAM: Same unit cost for all operations

Realistic? Useful?

Another useful model of computation: the Turing machine

Page 45:

Main example: The RAM (Random-Access Machine)

[Figure: processor P connected to memory M]

Aka von Neumann architecture or stored-program computer: data and program in the same memory ("programs as data")

John von Neumann (1903-57), Report on the EDVAC, 1945; also Eckert & Mauchly, ENIAC

"von Neumann bottleneck": Program and data separate from the CPU; processing rate bounded by the memory rate.

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978), Turing Award Lecture, 1977

Page 46:

Main example: The RAM (Random-Access Machine)

[Figure: processor P with a cache between P and memory M]

Processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory

Non-uniform memory cost (hierarchical memories: caches)

Realistic? Perhaps more so

Page 47:

Parallel RAM variations

[Figure: one processor P connected to multiple memory banks M, M, M, …]

Increased memory bandwidth: Vector computer, the ALU operates on vectors instead of scalars

Page 48:

Parallel RAM variations

[Figure: processors P, P, P, … connected via a bus to a single shared memory M]

Shared-memory model, bus based: "von Neumann bottleneck"

Page 49:

Parallel RAM variations

[Figure: processors P, P, P, … connected via a multi-port network to shared memory M]

Shared-memory model, multi-port, network based, …

• How are the processors synchronized?
• What can the memory do?
• What are the costs?

Page 50:

Parallel RAM variations

[Figure: processors P, P, P, … connected to shared memory M, driven by a common clock]

All processors operate under control of the same clock: synchronous, shared-memory model

Parallel RAM (PRAM):
• All processors work in lock-step (all the same program, or individual programs)
• Unit-time instructions and memory access (uniform)

Page 51:

Parallel RAM variations

[Figure: processors P, P, P, … connected to shared memory M]

PRAM conflict resolution: What happens if several processors in the same time step access the same memory location?

• EREW PRAM: Not allowed, neither reads nor writes
• CREW PRAM: Concurrent reads allowed, concurrent writes not
• CRCW PRAM: Both concurrent reads and concurrent writes allowed

Page 52:

Parallel RAM variations

[Figure: processors P, P, P, … connected to shared memory M]

CRCW PRAM conflict resolution:

• COMMON: All processors that conflict must write the same value
• ARBITRARY: One (arbitrary) write succeeds
• PRIORITY: A priority scheme determines which write succeeds

Realistic? Useful?

Page 53:

Parallel RAM variations

[Figure: processors P, P, P, … connected to shared memory M]

Realistic? Useful?

Joseph JáJá: An Introduction to Parallel Algorithms. Addison-Wesley 1992, ISBN 0-201-54856-9

Not realistic (could possibly be emulated), but an extremely useful theoretical tool (late 1970s to mid 1990s): techniques and lower bounds

Not this lecture

Page 54:

PRAM Example: Fast maximum finding

Problem: Given n numbers in shared-memory array a, find the maximum

Idea: Perform all n² comparisons (a[i],a[j]) in parallel, eliminate those numbers that cannot be the maximum

1. par (0<=i&&i<n) b[i] = true;               // a[i] could be the maximum
2. par (0<=i&&i<n, 0<=j&&j<n)
     if (a[i]<a[j]) b[i] = false;             // a[i] is not the maximum
3. par (0<=i&&i<n) if (b[i]) x = a[i];
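A minimal shared-memory simulation of the three steps (ours, not from the slides), with OpenMP loops standing in for the PRAM's par construct. The concurrent writes in steps 2 and 3 mimic COMMON-CRCW behavior: all conflicting writers store the same value (false, resp. the maximum), which is benign here but formally a data race under the C/OpenMP memory model; and as a simulation it is of course not O(1) time:

#include <stdbool.h>
#include <stdlib.h>

int crcw_max(const int *a, int n)
{
    bool *b = malloc((size_t)n * sizeof(bool));
    int x = 0;
    #pragma omp parallel for                    /* step 1 */
    for (int i = 0; i < n; i++) b[i] = true;
    #pragma omp parallel for collapse(2)        /* step 2: n*n comparisons */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (a[i] < a[j]) b[i] = false;      /* all writers write false */
    #pragma omp parallel for                    /* step 3 */
    for (int i = 0; i < n; i++)
        if (b[i]) x = a[i];                     /* all writers write the max */
    free(b);
    return x;
}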

Page 55:

1. par (0<=i&&i<n) b[i] = true;               // a[i] could be the maximum
2. par (0<=i&&i<n, 0<=j&&j<n)
     if (a[i]<a[j]) b[i] = false;             // a[i] is not the maximum
3. par (0<=i&&i<n) if (b[i]) x = a[i];

Three parallel steps, O(1) time with n, n², and n PRAM processors, respectively. CRCW capability needed in steps 2 and 3.

Theorem: On a Common CRCW PRAM, the maximum of n numbers can be found in O(1) time steps and O(n²) operations in total

Page 56:

Theorem: On a Common CRCW PRAM, the maximum of n numbers can be found in O(1) time steps and O(n²) operations in total

Note: Constant-time algorithm with polynomial resources (operations, processors)

Is this a good algorithm? Answer later

Page 57:

Parallel RAM variations

[Figure: processors P, P, P, … connected via a network to memory banks M, M, M, …]

Shared-memory model, banked, network based, not synchronous

• How to synchronize: atomic operations, barriers
• Memory consistency: When can some processor "see" what some other processor has written into memory?

Page 58:

Parallel memory access cost

UMA (Uniform Memory Access): Access time to a memory location is independent of the location and of the accessing processor, e.g., O(1), O(log M), …

NUMA (Non-Uniform Memory Access): Access time depends on processor and location.

[Figures: shared memory with a single memory M (UMA) vs. distributed memory banks M, M, M, … (NUMA)]

Locality: For each processor, some locations can be accessed faster than other locations

Example: PRAM is UMA

Page 59:

Parallel, distributed-memory RAM

[Figure: processors P, P, P, …, each with its own local memory M, connected by an interconnection network]

Memory is distributed over processors, each processor can directly access only its own, local memory

Page 60:

Architecture model defines resources; describes
• Composition of processor, functional units: ALU, FPU, registers, w-bit words vs. unlimited, vector unit (MMX, SSE, AVX)
• Types of instructions
• Memory system, caches
• …

Execution model/cost model specifies
• How instructions are executed
• (Relative) cost of instructions, memory accesses
• …

Level of detail/formality depends on the purpose: what is to be studied (complexity theory, algorithms design, …), what counts (instructions, memory accesses, …)

Page 61:

Parallel architecture model defines
• Synchronization between processors
• Synchronization operations
• Atomic operations, shared resources (memory, registers)
• Communication mechanisms: network topology, properties
• Memory: shared, distributed, hierarchical, …
• …

Cost model defines
• Cost of synchronization, atomic operations
• Cost of communication (latency, bandwidth, …)
• …

Page 62:

Yet another parallel architecture model:

Cellular automaton, systolic array, …: Simple processors without memory (finite state automata, FSA) operate in lock step on a (potentially infinite) grid, local communication only

State of cell (i,j) in the next step is determined by
• Own state
• State of neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966
H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982
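A toy sketch (ours) of one synchronous step of such an automaton: two-state cells on an n×m grid with the von Neumann neighborhood above and an arbitrary local rule (here: majority, our choice). Since all cells update in lock step, the new states go into a second grid:

/* one lock-step update: next state of each cell from its own state
   and its (up to) four von Neumann neighbors */
void ca_step(int n, int m, const unsigned char *cur, unsigned char *next)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            int cnt = cur[i*m + j];
            if (i > 0)     cnt += cur[(i-1)*m + j];
            if (i < n - 1) cnt += cur[(i+1)*m + j];
            if (j > 0)     cnt += cur[i*m + (j-1)];
            if (j < m - 1) cnt += cur[i*m + (j+1)];
            next[i*m + j] = (cnt >= 3);   /* example rule: majority of 5 */
        }
}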

Page 63:

Flynn's taxonomy: Orthogonal classification of (parallel) architectures/models by instruction stream × data stream:

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972

Page 64:

SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g., RAM)

SIMD: Single processor, single stream of instructions, operates on multiple data per instruction. Example: traditional vector computer, PRAM (some variants)

MISD: Multiple instructions operate on a single data stream. Example: pipelined architectures, streaming architectures(?), systolic arrays (1970s architectural idea). Some say this class is empty.

MIMD: Multiple instruction streams, multiple data streams

Page 65:

Typical Flynn classification instances

[Figures: SISD — one processor P with memory M; SIMD — one instruction stream operating on multiple memories/data paths M, M, M; MIMD — multiple processors P with memories M, shared or distributed, connected by a communication network; MISD(?) — multiple processors P operating on a single memory/data stream M]

Page 66:

Programming model: Abstraction, close to a programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

Execution model: When and how parallelism in the programming model is effected

[Figure: these models mediate between the concrete program (C, C++, Java, Haskell, Fortran, …) and the concrete architecture]

Page 67:

Programming model: Abstraction, close to a programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

A parallel programming language or library ("interface") is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s)

Cost of operations: May be specified with the programming model; often by the architecture/computational model

Execution model: When and how parallelism in the programming model is effected

Page 68:

Parallel programming model defines, e.g.,

• Parallel resources, entities, units: processes, threads, tasks, …
• Expression of parallelism: explicit or implicit
• Level and granularity of parallelism
• Memory model: shared, distributed, hybrid
• Memory semantics ("when operations take effect/become visible")
• Data structures, data distributions
• Methods of synchronization (implicit/explicit)
• Methods and modes of communication

Page 69:

Examples:

1. Threads, shared memory, block-distributed arrays, fork-join parallelism
2. Distributed memory, explicit message passing, collective communication, one-sided communication ("RDMA"), PGAS(**)
3. Data parallel SIMD, SPMD(*)
4. …

(*) SPMD: Single Program, Multiple Data
(**) PGAS: Partitioned Global Address Space

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, …

Not this lecture

Page 70:

Same Program Multiple Data (SPMD)

F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988

Restricted MIMD model: All processors execute same program

• May do so asynchronously; different processors may be in different parts of the program at any given time
• The same objects (procedures, variables) exist for all processors, so "remote procedure call", "active messages", "remote-memory access" make sense

Programming model concepts: Active messages, remote procedure call, … (MPI: RMA/one-sided communication)

Not this lecture

Page 71:

[Figure: layers — programming language/library/interface/paradigm (OpenMP, MPI, Cilk) realizes a programming model, on top of an architecture model with algorithmic support/"run-time", on top of "real" hardware]

Different architecture models can realize a given programming model. Closer fit: more efficient use of the architecture.

Challenge: A programming model that is useful and close to "realistic" architecture models, to enable realistic analysis/prediction

Challenge: A language that conveniently realizes the programming model

Page 72:

Examples:

OpenMP: programming interface/language for the shared-memory model, intended for shared-memory architectures; mostly "data parallel"

Can be implemented with DSM (Distributed Shared Memory) on distributed-memory architectures, but performance is usually not good. Requires DSM implementation/algorithms

MPI: interface/library for the distributed-memory model; can be used on shared-memory architectures, too. Needs algorithmic support (e.g., "collective operations")

Cilk: language (extended C) for the shared-memory model, for shared-memory architectures; "task parallel". Needs a run-time (e.g., "work-stealing")

See later

Page 73:

Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve a given problem of input size n.

Focus on worst-case complexities and execution times

• Processors work independently (local memory, program)
• Communication and coordination incur some overhead

Let some computational problem P(n) be given:
• Seq: sequential algorithm and implementation solving P
• Par: parallel algorithm and implementation solving P

Page 74:

Challenge and one main goal of parallel processing

Be faster than the commonly used (good, best known, best possible, …) sequential algorithm and implementation, utilizing the p processors efficiently

… and not to make life easy

Page 75:

Let
• Tseq(n): time for 1 processor to solve P(n)
• Tpar(p,n): time for p processors to solve P(n)

The speedup of Tpar over Tseq is defined as

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (Note: Parameters p and n)

Main goal: Speeding up computations by parallel processing

Goal: Achieve as large speed-up as possible (for some n, all n) for as many processors as possible
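A small worked example (numbers ours, for illustration only):

S_p(n) = \frac{T_{seq}(n)}{T_{par}(p,n)}: \quad T_{seq}(n) = 120\,\mathrm{s},\ T_{par}(8,n) = 20\,\mathrm{s} \;\Rightarrow\; S_8(n) = 120/20 = 6

A speedup of 6 on p = 8 processors, below the best possible speedup of 8 (see the following slides).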

Page 76:

What exactly is Tseq(n), Tpar(p,n)? What is "time" (number of instructions)?

• Time for some algorithm for solving the problem?
• Time for a specific algorithm for solving the problem?
• Time for the best known algorithm for the problem?
• Time for the best possible algorithm for the problem?
• Time for a specific input of size n, average case, worst case, …?
• Asymptotic time, large n, large p?
• Do constants matter, e.g., O(f(p,n)) or 25n/p + 3 ln(4(p/n)) …?

Page 77:

Tseq(n):

• Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n

• The number of instructions carried out is often termed WORK

• Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Choose sequential algorithm (theory), choose an implementation of this algorithm (practice)

Theory and practice: Always state the baseline sequential algorithm/implementation

Page 78:

Examples (theory): standard, worst-case, asymptotic complexities

• Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix-sums
• Tseq(n,m) = O(n+m): merging of two sequences; BFS/DFS in a graph
• Tseq(n) = O(n log n): comparison-based sorting
• Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
• Tseq(n) = O(n³): matrix multiplication, input two n×n matrices (can be solved in o(n³), Strassen etc.)

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009

Page 79:

New issue with modern processors: Is time always the same thing? Clock frequency may not be constant; e.g., it can depend on the load of the system, energy cap, "turbo mode", etc. Such factors are difficult to control

Practice:
• Construct good input examples to measure running time Tseq(n); experimental methodology
• Worst case not always possible, not always interesting; best case?
• Experimental methods to get stable, accurate Tseq(n): repeat measurements many times (rule of thumb: average over at least 30 repetitions)

Experimental science: Always some assumptions about repeatability, regularity, determinism, …

Page 80:

Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): Best-possible absolute speed-up is linear in p

Goal: Obtain (linear) speedup for as large p as possible (as function of problem size n), for as many n as possible

Page 81:

For speedup (and other complexity measures), distinguish:

• Problem P to be solved (mathematical specification)
• Some algorithm A to solve P
• Best possible (lower bound) algorithm A* for P, best known algorithm A+ for P: the complexity of P
• Implementation of A on some machine M

Page 82:

Example: "data parallel" (SIMD) computation

Problem: sum of two n-element vectors

for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
}

[Figure: the problem divided into p subproblems, each the sum of n/p elements]

Complexity: Tseq(n) = Θ(n)

Best possible parallelization: sequential work divided evenly across p processors.

Algorithm/implementation: Tpar(p,n) = Tseq(n)/p, so Sp(n) = p
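A minimal sketch (ours) of the block distribution behind the picture, in C with OpenMP: thread t handles its own n/p consecutive indices; a plain "#pragma omp parallel for" on the loop would achieve the same:

#include <omp.h>

void vec_add(double *a, const double *b, const double *c, int n)
{
    #pragma omp parallel
    {
        int p  = omp_get_num_threads(), t = omp_get_thread_num();
        int lo = (int)((long long)n * t / p);        /* my block: ~n/p */
        int hi = (int)((long long)n * (t+1) / p);
        for (int i = lo; i < hi; i++)
            a[i] = b[i] + c[i];                      /* Tpar(p,n) ≈ c·n/p */
    }
}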

Page 83:

Example: "data parallel" (SIMD) computation (continued)

Problem: sum of two n-element vectors

for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
}

Algorithm/implementation: Tpar(p,n) = c(n/p) for constant c≥1

• Perfect speedup
• "Embarrassingly parallel"
• "Pleasantly parallel"

Page 84:

Work

[Figure: sequential execution from start to stop, taking time Tseq(n)]

The work, measured in instructions and/or time, that has to be carried out for a problem of size n

Page 85:

[Figure: the work Tseq(n) divided evenly among p processors]

Perfect parallelization: Sequential work evenly divided between p processors, no overhead, so Tpar(p,n) = Tseq(n)/p

Perfect speedup: Sp(n) = Tseq(n)/(Tseq(n)/p) = p

… and very rare in practice

Page 86:

[Figure: work unevenly divided among p processors; processor i is busy for time Ti, from common start to individual stop]

Sequential work is unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n)

Tpar is the time for the slowest processor to complete; all processors are assumed to start at the same time:

Define Tpar(p,n) = max Ti(n) over all processors, starting at the same time

Page 87:

[Figure: per-processor times Ti; the enclosing rectangle has area C(n) = p*Tpar(p,n)]

Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm: the total number of instructions performed by the p processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time in which the p processors are reserved (i.e., have to be paid for)

Page 88:

"Theorem": Perfect speedup Sp(n) = p is best possible and cannot be exceeded

"Proof": Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could then be constructed by simulating the parallel algorithm on a single processor: the instructions of the p processors are carried out in some correct order, one after another, on the sequential processor, in at most p*Tpar(p,n) < Tseq(n) time. This contradicts that Tseq(n) was the best possible/known time.

Reminder: Speedup is calculated (measured) relative to the "best" sequential implementation/algorithm

Page 89:

[Figure: per-processor times Ti on p processors]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Simulation A: one step of P1, one step of P2, …, one step of P(p-1), one step of P1, …, for C(n) iterations
Simulation B: steps of P1 until communication/synchronization, steps of P2 until communication/synchronization, …

Page 90:

[Figure: the parallel steps executed one after another on 1 processor]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Both simulations yield a new sequential algorithm with time Tsim(n) < Tseq(n)

This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm

Page 91:

Lesson: Parallelism offers only "modest potential"; speed-up cannot be more than p on p processors

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

The construction shows that the total parallel work must be at least as large as the sequential work Tseq; otherwise, a better sequential algorithm could be constructed.

Crucial assumptions: Sequential simulation is possible (enough memory to hold problem and state of parallel processors), sequential memory behaves as parallel memory, …

This is NOT TRUE for real systems and real problems

Page 92:

[Figure: the parallel execution simulated on 1 processor]

Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms

Some such simulation tools:
• SimGrid (INRIA)
• LogGOPSim (Hoefler et al.)
• …

Possible bachelor thesis subject

Page 93:

Wpar(n) = ∑Ti(n) is called the (parallel) work of the parallel algorithm: the total number of instructions performed by some number of processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time in which the p processors are occupied

Definition: A parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup

Definition: A parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has the potential for linear speedup (for some number of processors)

Page 94:

Proof (linear speed-up of a cost-optimal algorithm): Given a cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n))

This implies Tpar(p,n) = c*Tseq(n)/p, so

Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to best sequential algorithm. The smaller c, the closer the speedup to perfect

Page 95:

[Figure: per-processor times Ti on p processors]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Page 96:

[Figure: the same work redistributed over fewer processors]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Execute on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n))

Page 97:

Proof idea (work-optimal algorithm can have linear speed-up):

1. Work-optimal algorithm
2. Schedule work-items Ti(n) on p processors, such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

Parallel algorithms' design goal: A work-optimal parallel algorithm with as small Tpar(n) as possible (and therefore large parallelism: many processors can be utilized)

The scheduling in step 2 is possible in principle, but may not be trivial


Example: CRCW PRAM maximum finding algorithm

par (0<=i && i<n)
    b[i] = true;                  // a[i] could be the maximum
par (0<=i && i<n, 0<=j && j<n)
    if (a[i]<a[j]) b[i] = false;  // a[i] is not the maximum
par (0<=i && i<n)
    if (b[i]) x = a[i];

O(n²) operations (work), but sequential maximum finding requires only O(n) operations

Speedup with perfect parallelization (we do not yet know whether this is possible):

Sp(n) = O(n)/O(n²/p) = O(p/n)

(Only small) linear speedup for fixed n, and not independent of n

Bad!
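For concreteness, a sequential C simulation of this algorithm (a sketch of mine, not from the slides; the function name crcw_max is made up), in which every par step becomes an ordinary loop:

#include <stdbool.h>

/* Sequential simulation of the O(n^2)-work CRCW PRAM maximum finding;
   b is scratch space of n flags */
int crcw_max(const int a[], bool b[], int n) {
    int x = 0;
    for (int i = 0; i < n; i++)
        b[i] = true;                        /* a[i] could be the maximum */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (a[i] < a[j]) b[i] = false;  /* a[i] is not the maximum */
    for (int i = 0; i < n; i++)
        if (b[i]) x = a[i];  /* concurrent write: all surviving i agree */
    return x;
}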


Example: Another non-work-optimal algorithm

DumbSort with T(n) = O(n²) that can be perfectly parallelized, Tpar(p,n) = O(n²/p)

It is well known that Tseq(n) = O(n log n), with many algorithms and good implementations

Sp(n) = (n log n)/(n²/p) = p(log n)/n

(Small) linear speedup for fixed n (but not independent of n)

Break-even (when is the parallel algorithm faster than the sequential one?):
Tpar(p,n) < Tseq(n) ⇔ n²/p < n log n ⇔ n/p < log n ⇔ p > n/log n

Non-work-optimal algorithm: speedup decreases with n

Bad!


The best known/best possible sequential algorithm is often difficult to parallelize:
• no redundant work (that could have been done in parallel)
• tight dependencies (that force things to be done one after another)

Lesson: It usually does not make sense to parallelize an inferior algorithm (although it is sometimes much easier)

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms often have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: Non-work-optimal algorithms can sometimes be useful, as subroutines


Absolute vs. relative speedup

The ratio

SRelp(n) = Tpar(1,n)/Tpar(p,n)

is called the relative speedup, and expresses how well the parallel algorithm utilizes p processors (scalability)

Relative speedup is NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization.

Absolute speed-up is what eventually matters


Note: The literature (papers and books) is not always clear about this distinction.

It is easier to achieve and document good relative speedup.

Reporting speed-up relative to an inferior sequential implementation is incorrect!


Definition: T∞(n) is the smallest possible running time of the parallel algorithm given arbitrarily many processors. By definition, T∞(n) ≤ Tpar(p,n) for all p. The relative speedup is therefore limited by

SRelp(n) = Tpar(1,n)/Tpar(p,n) ≤ Tpar(1,n)/T∞(n)

Definition: The ratio Tpar(1,n)/T∞(n) is called the parallelism of the parallel algorithm.

The parallelism is the largest number of processors that can be employed and still give linear, relative speedup (for p > Tpar(1,n)/T∞(n), the bound above shows that SRelp(n) < p)


Example (the CRCW PRAM maximum finding algorithm from above): O(n²) operations (work), while sequential maximum finding requires only O(n) operations

SRelp(n) = O(n²)/O(n²/p) = O(p)

Parallelism: O(n²)/O(1) = n²

This (terrible) parallel algorithm has linear relative speedup for p up to n² processors



Nevertheless: Useful as a building block

There exists a work-optimal CRCW PRAM algorithm that runs in O(log log n) steps, requiring only O(n) operations

Advanced material: Last fact about PRAM in this lecture


Absolute vs. relative speedup

Work-optimality property:

For work-optimal algorithms, absolute and relative speedup coincide (up to constant factors), since Tpar(1,n) = O(Tseq(n))


Parallel performance in practice

Speedup is an empirical quantity, “measured time”, based on experiment (benchmark)

Tseq(n): Running time for “reasonable”, good, best available, sequential implementation, on “reasonable” inputs

Tpar(p,n): parallel running time measured for a number of experiments with different typical (worst-case?) inputs

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup is typically not independent of problem size n, and problem instance


[Figure: per-processor times Ti and overheads Oi on p processors; Tpar(p,n) vs. Tseq(n). Note Tpar(1,n) ≥ Tseq(n)]

Parallelization most often incurs overheads:
• Algorithmic: the parallel algorithm may do more work
• Coordination: communication and synchronization
• …


Absolute vs. relative speedup and scalability

Good scalability means relative speedup Tpar(1,n)/Tpar(p,n) = Θ(p)

Example: 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p is reported; sounds good.

But what if Tpar(1,n) = 100*Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O((n log n)/p + log n)?

Even for a work-optimal Tpar(1,n) = 100*Tseq(n) = O(Tseq(n)), it would take at least 200 processors just to be as fast as the sequential algorithm with the reported relative speedup


David H. Bailey:Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review, Aug. 1991, pp. 54-55

Torsten Hoefler, Roberto Belli: Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. SC 2015: 73:1-73:12

Empirical, relative speedup without an absolute performance baseline (and a comparison to a reasonable sequential algorithm and implementation) is misleading


Practical problems: Measuring time and work

[Figure: per-processor times Ti with overheads Oi, between common start and stop]

Tpar is defined as the time for the last processor to complete, assuming all processors start at the same time

Problem 1: To measure Tpar, efficient, accurate synchronization is needed ("barrier synchronization")
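For Problem 1, a minimal MPI-style sketch (my illustration, assuming an MPI installation; not from the slides): align the start with a barrier, time the region on each process, and take Tpar as the maximum over all processes:

#include <mpi.h>

/* Measure Tpar of work() as the time of the slowest process,
   after (approximately) aligning start times with a barrier */
double time_parallel(void (*work)(void)) {
    double local, tmax;
    MPI_Barrier(MPI_COMM_WORLD);   /* Problem 1: how accurate is this? */
    double start = MPI_Wtime();
    work();                        /* the region being measured */
    local = MPI_Wtime() - start;
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    return tmax;                   /* meaningful on rank 0 */
}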



Problem 2: Is this relevant? Are applications synchronized? Why not just measure the work?



Problem 3: How do the time spent per processor and the work depend on the synchronization? (A slow processor may cause other processors to wait)

Work = ∑Ti




Problem 1: Most parallel interfaces include synchronization mechanisms ("barrier") that may or may not be accurate. There is still much work on accurate time synchronization

Problem 2: Difficult to answer

Problem 3: Not much is known; most algorithms and implementations assume and depend on some synchrony when evaluating performance

Sascha Hunold, Alexandra Carpen-Amarie: Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Trans. Parallel Distrib. Syst. 27(12): 3617-3630 (2016)


[Figure: per-processor times Ti and overheads Oi on p processors; Tpar(p,n) vs. Tseq(n). Note Tpar(1,n) ≥ Tseq(n)]

If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but is still linear: Sp(n) = p/k


[Figure: cumulated computation (Ti), overhead (Oi), and idle time per processor on p processors; Tpar(p,n) vs. Tseq(n). Note Tpar(1,n) ≥ Tseq(n)]

NB: This denotes cumulated time ("profile") over the whole Tpar execution, not a trace. Computation, overhead, and idle time are spread over the whole execution




[Figure: as above, with overhead (Oi) and idle periods spread over the execution]

Typical overheads stem from communication and coordination.

The (smallest) time between coordination periods is called the granularity of the parallel computation


Granularity

• "Coarse-grained" parallel computation/algorithm: the time/number of instructions between coordination operations (synchronization operations, communication operations) is large (relative to total time or work)
• "Fine-grained" parallel computation/algorithm: the time/number of instructions between coordination operations is small


[Figure: per-processor times Ti with overheads Oi and idle times on p processors; Tpar(p,n) vs. Tseq(n)]

Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i, j is called load balancing


[Figure: perfectly balanced work Ti on p processors, Tpar(p,n) close to Tseq(n)/p]

The best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p

This is the best possible parallel time


Sequential overhead: constant fraction

[Figure: time Tseq(n) = (s+r)*Tseq(n) on p processors, with sequential part s*Tseq(n)]

Maximum speedup becomes severely limited, e.g., if overheads are sequential and a constant fraction:

Tpar(p,n) ≥ s*Tseq(n) + r*Tseq(n)/p


Amdahl's Law (parallel version): Let a program A contain a fraction r that can be "perfectly" parallelized, and a fraction s = (1-r) that is "purely sequential", i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n

Proof:
• Tseq(n) = (s+r)*Tseq(n)
• Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n) + r*Tseq(n)/p) = 1/(s + r/p) -> 1/s, for p -> ∞

G. Amdahl: Validity of the single processor approach to achievinglarge scale computing capabilities. AFIPS 1967
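As a quick illustration (mine, not from the slides), a few lines of C that evaluate the Amdahl bound 1/(s + r/p):

#include <stdio.h>

/* Amdahl's law: speedup with sequential fraction s on p processors */
static double amdahl(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    /* s = 0.1: the speedup approaches 1/s = 10, no matter how large p */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S = %.2f\n", p, amdahl(0.1, p));
    return 0;
}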


Typical victims of Amdahl's law:

• Sequential input/output could be a constant fraction
• Sequential initialization of global data structures
• Sequential processing of "hard-to-parallelize" parts of the algorithm, e.g., shared data structures

Amdahl's law limits the speedup in such cases if they are a constant fraction of the total time, independent of problem size


Example:

1. Processor 0: read input, some precomputation
2. Split problem into p parts of size n/p, send part i to processor i
3. All processors i: solve part i
4. All processors i: send partial solution back to processor 0

Typical Amdahl, sequential bottleneck: a constant sequential fraction (3 out of 4 steps) limits the speedup

[Figure: total time 10n, of which the parallelizable part is 9n]

Amdahl: s = 0.1, speedup at most 10

When interested in parallel aspects, input-output and problem splitting is often explicitly not measured (see projects)


Example (contrived): K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

// Sequential initialization
x = (int*)calloc(n, sizeof(int));
…
// Parallelizable part
do {
    for (i=0; i<n; i++) {
        x[i] = f(i);
    }
    // check for convergence
    done = …;
} while (!done);

Tseq(n) = n+K+Kn
Tpar(p,n) = n+K+Kn/p

Sequential fraction ≈ 1/(1+K), so Sp(n) -> 1+K
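Hedged sketch (mine, not from the slides): the parallelizable loop could for instance be annotated with OpenMP, assuming f is a pure function; the sequential initialization is left untouched (Amdahl!):

// Possible OpenMP parallelization of the loop above
#pragma omp parallel for
for (i = 0; i < n; i++) {
    x[i] = f(i);
}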


The same example with malloc instead of calloc:

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part
do {
    for (i=0; i<n; i++) {
        x[i] = f(i);
    }
    // check for convergence
    done = …;
} while (!done);

Tseq(n) = 1+K+Kn
Tpar(p,n) = 1+K+Kn/p

Sequential part ≈ 1/(1+n), so Sp(n) -> p when n>p, n -> ∞

Note: If the sequential part is constant (but not a constant fraction), Amdahl's law does not limit the speedup


Same code and same analysis as above; the only difference between the two variants is calloc vs. malloc: calloc zero-initializes the whole n-element array (O(n) sequential work), malloc does not.

Lesson: Be careful with system functions (calloc, malloc)


Avoiding Amdahl: Scaled speedup

The sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions)

The sequential part may be constant, or grow only slowly with problem size n. Thus, to maintain good speedup, problem size should be increased with p


Assume Tseq(n) = t(n)+T(n) with sequential part t(n) and perfectly parallelizable part T(n), and assume t(n)/T(n) -> 0 for n -> ∞

Tpar(p,n) = t(n)+T(n)/p

Now, the speedup as a function of p and n is:

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p for n -> ∞
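A worked instance (my example, with assumed t(n) = log n and T(n) = n):

S_p(n) = \frac{\log n + n}{\log n + n/p}
       = \frac{(\log n)/n + 1}{(\log n)/n + 1/p}
       \to p \quad \text{for } n \to \infty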


Lesson: Depending on how fast t(n)/T(n) converges to 0, linear speed-up can be achieved by increasing the problem size n

Definition: The speedup as a function of p and n, with sequential and parallelizable times t(n) and T(n), is called scaled speed-up


Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n:

E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n))

(the denominator p*Tpar(p,n) is the cost)

Remarks:
• E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
• E(p,n) = constant: linear speedup
• Cost-optimal algorithms have constant efficiency


Scalability

Definition: A parallel algorithm/implementation is strongly scaling if Sp(n) = Θ(p) (linear, independent of (sufficiently large) n)

Definition: A parallel algorithm/implementation is weakly scaling if there is a slowly growing function f(p), such that for n = Ω(f(p)), E(p,n) remains constant. The function f is called the iso-efficiency function

"Maintain efficiency by increasing problem size as f(p) or more"


The scaled speed-up idea comes from many sources; some are:

John L. Gustafson: Reevaluating Amdahl's Law. Commun. ACM 31(5): 532-533 (1988)

Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures. IEEE Parallel & Distributed Technology 1(3): 12-21 (1993)


Summary: Stating parallel performance

It is convenient to state parallel performance and scalability of a parallel algorithm/implementation as

Tpar(p,n) = O(T(n)/p+t(p,n))

T(n) represents the parallel part, t(p,n) the non-parallel part of the algorithm beyond which no improvement is possible, regardless of how many processors are used (parallelism = T(n)/t(p,n))

The work of the algorithm is W = O(p*(T(n)/p + t(p,n))) = O(T(n) + p*t(p,n))

The algorithm is work-optimal if T(n) is O(Tseq(n))
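For instance (my illustration), Tpar(p,n) = O(n/p + log p) for a linear-time problem, Tseq(n) = O(n), gives:

T(n) = n, \quad t(p,n) = \log p, \quad
\text{parallelism} = \frac{n}{\log p}, \quad
W = O(n + p \log p),

and the algorithm is work-optimal, since T(n) = O(Tseq(n)).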


Example (again): the iterative computation from above, now assuming the convergence check takes O(log p) time

Tpar(p,n) = Kn/p + K log p

E(p,n) ≈ Kn/(Kn + pK log p) = 1/(1 + (p log p)/n)

For n ≥ p log p, E(p,n) ≥ 1/2

Weakly scalable: n has to increase as O(p log p) to maintain constant efficiency, i.e., O(log p) elements per processor (if balanced)


Examples (Tpar, speedup, optimality, efficiency), for a linear-time computation, Tseq(n) = n (constants ignored). All are typical, good, work-optimal parallelizations:

1. Tpar0(p,n) = n/p + 1 (embarrassingly "data parallel" computation, constant overhead)
2. Tpar1(p,n) = n/p + log p (logarithmic overhead, e.g., convergence check)
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p (linear overhead, e.g., data exchange)
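The plots below can be reproduced with a few lines of C (my sketch, not part of the slides):

#include <math.h>
#include <stdio.h>

/* The five model running times, for Tseq(n) = n */
static double tpar(int k, double p, double n) {
    double l = log2(p);
    switch (k) {
    case 0:  return n/p + 1;        /* constant overhead */
    case 1:  return n/p + l;        /* logarithmic overhead */
    case 2:  return n/p + l*l;      /* log^2 overhead */
    case 3:  return n/p + sqrt(p);  /* sqrt overhead */
    default: return n/p + p;        /* linear overhead */
    }
}

int main(void) {
    double n = 128;
    for (int p = 1; p <= 128; p *= 2)
        for (int k = 0; k < 5; k++) {
            double S = n / tpar(k, p, n);  /* absolute speedup */
            printf("p=%3d Tpar%d: S=%6.2f E=%.2f\n", p, k, S, S/p);
        }
    return 0;
}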


[Plot: n = 128]


Cost-optimality:

1. Tpar0(p,n) = n/p + 1: p*Tpar(p,n) = n + p = O(n) for p = O(n)
2. Tpar1(p,n) = n/p + log p: p*Tpar(p,n) = n + p log p = O(n) for p log p = O(n)
3. Tpar2(p,n) = n/p + log² p: p*Tpar(p,n) = n + p log² p = O(n) for p log² p = O(n)
4. Tpar3(p,n) = n/p + √p: p*Tpar(p,n) = n + p√p = O(n) for p√p = O(n)
5. Tpar4(p,n) = n/p + p: p*Tpar(p,n) = n + p² = O(n) for p² = O(n)


[Plot: n = 128]


[Plot: n = 128, but p up to 256]


[Plot: n = 16384 (= 16K = 128²)]


[Plot: n = 2097152 (= 2M = 128³)]


[Plot: n = 128]


[Plot: n = 16384 (= 16K = 128²)]


[Plot: n = 2097152 (= 2M = 128³)]


Efficiency, weak scaling

To maintain a constant efficiency e = Tseq(n)/(p*Tpar(p,n)), n has to increase as:

1. Tpar0(p,n) = n/p + 1: f0(p) = e/(1-e) * p
2. Tpar1(p,n) = n/p + log p: f1(p) = e/(1-e) * p log p
3. Tpar2(p,n) = n/p + log² p: f2(p) = e/(1-e) * p log² p
4. Tpar3(p,n) = n/p + √p: f3(p) = e/(1-e) * p√p
5. Tpar4(p,n) = n/p + p: f4(p) = e/(1-e) * p²
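The common shape of these functions follows from one line of algebra (added here for completeness); writing the overhead per processor as o(p), so that Tpar(p,n) = n/p + o(p):

e = \frac{n}{p\,(n/p + o(p))} = \frac{n}{n + p\,o(p)}
\quad\Longleftrightarrow\quad
n = \frac{e}{1-e}\; p\,o(p)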


[Plot: maintained efficiency e = 0.9 (90%)]


Matrix-vector multiplication parallelizations

Tseq(n) = n²

Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
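One plausible reading of the Tpar0 model (my sketch, not necessarily the parallelization meant here): distribute the n rows over the processors, n²/p work each; the +n term then accounts, e.g., for every processor accessing the full input vector:

/* Row-wise parallel matrix-vector multiplication, y = A*x,
   A stored row-major as an n*n array, e.g., with OpenMP */
void matvec(int n, const double *A, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {   /* n/p rows per thread */
        double sum = 0.0;
        for (int j = 0; j < n; j++) /* n operations per row */
            sum += A[i*n + j] * x[j];
        y[i] = sum;
    }
}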


[Plot: n = 100]


[Plot: n = 1000]


Some non-work-optimal parallel algorithms

Amdahl case, linear sequential running time:
TparA(p,n) = 0.9n/p + 0.1n

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1
2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1


[Plot: n1 = 128, n2 = 16384]


Limitations of empirical speedup measure

Empirical speedup assumes that Tseq(n) can be measured.

For very large n and p, this may not be the case: A large HPC system has much more (distributed) main memory than any single-processor system

Scalability is then measured by other means:
• Stepwise speedup (1-1000 processors, 1000-100,000 processors)
• Other notions of efficiency


Lecture summary, checklist

• Models of parallel computation: Architecture, programming, cost

• Flynn’s taxonomy: MIMD, SIMD; SPMD

• Sequential baseline

• Speedup (in theory and practice)

• Work, Cost optimality

• Amdahl’s law

• Scaled speed-up, strong and weak scaling