Introduction to Parallel Computing
Introduction, problems, models, performance

Jesper Larsson Träff <[email protected]>
TU Wien, Parallel Computing
WS17/18
Practical Parallel Computing

Mobile phones: "dual core", "quad core" (2012?), … octa-core (2016, Samsung Galaxy S7)

"Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S." (2007)

June 2011: Fujitsu K, 705,024 cores, 11 PFLOPS
June 2012: IBM BlueGene/Q, 1,572,864 cores, 16 PFLOPS
June 2016: Sunway TaihuLight, 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Parallelism everywhere! How to use all these resources? Limits?
As of ca. 2010: (almost) no sequential computer systems anymore: multi-core, GPU/accelerator enhanced, …

Why?
As of 2010, either (a) all applications must be parallelized, or (b) there must be enough independent applications that can run at the same time.

Challenge: How do I speed up Windows…?
"I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind."
From: "7 Myths About Quad-Core Phones" (SmartphonesUnlocked), Jessica Dolcourt, April 8, 2012

"The biggest sea change in software development since the OO revolution is knocking at the door, and its name is concurrency."
Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005
…many similar quotations can (still) be found
“Free lunch” (as in “There is no such thing as a free lunch”):
Exponential increase in single-core performance (18-24 month doubling rate, Moore’s “Law”), no software changes needed to exploit the faster processors
"The free lunch" was over ca. 2005:
• Instructions/clock limit reached in the late 1990s
• Power consumption limit reached from 2000
• Clock speed limit reached around 2003

But: the number of transistors per chip can continue to increase (>1 billion).

Solution(?): Put more cores on the chip: the "multi-core revolution".

[Figure: Kunle Olukotun (Stanford), ca. 2010]

Parallelism challenge: Solve a problem p times faster on p (slower, more energy-efficient) cores.
[Figure: processor trend data; Kunle Olukotun, 2010, and Karl Rupp (TU Wien), 2015]
New(er) server processors (2017):
• Intel "Skylake": 4-28 cores
• AMD "Naples": 8-32 cores
Single-core performance does still increase, somewhat, but much more slowly… [Figure: average, normalized SPEC benchmark numbers, see www.spec.org; Henk Poley, 2014, www.preshing.com]
2006: a factor 10³-10⁴ increase in integer performance over the 1978 high-performance processor (from Hennessy/Patterson, Computer Architecture).

Single-core processor development (the "free lunch") made life very difficult for parallel computing in the 1990s.
Increase in clock frequency ("technological advances") alone (factor 20-200?) does not explain the performance increase. Architectural ideas driving the sequential performance increase:

• Deep pipelining
• Superscalar execution (>1 instruction/clock) through multiple functional units
• Data parallelism through SIMD units (old HPC idea: vector processing), same instruction on multiple data per clock
• Out-of-order execution, speculative execution
• Branch prediction
• Caches (pre-fetching)
• Simplified/better instruction sets (better for the compiler)
• SMT/hyperthreading

Mostly fully transparent; at most the compiler needs to care: the "free lunch".
• Very diverse parallel architectures
• No single, commonly agreed upon abstract model (for designing and analyzing algorithms)
• Many different programming paradigms (models), different programming frameworks (interfaces, languages)

Has been so; is still so (largely).
Theoretical Parallel Computing
Parallelism was always there! What can be done? Limits?
Assume p processors instead of just one, reasonably connected (memory, network, …)
• How much faster can some given problem be solved? Can some problems be solved better?
• How? New algorithms? New techniques?
• Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
• Which assumptions are reasonable?
• Does parallelism give new insights into the nature of computation?
Sequential computing:

Algorithm in model (RAM)
↓
Concrete program (C, C++, Java, Haskell, Fortran, …)
↓
Concrete architecture

This is difficult, but analysis in the model (often) has some relation to concrete execution. "Gap": the standard model (RAM) may abstract from many concrete features of the "real" architecture.
vs. Parallel computing:

Algorithm in model A / model B / … / model Z
↓
Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …)
↓
Concrete architecture A, …, concrete architecture Z

This is extremely difficult; analysis in a model may have little relation to concrete execution. Huge gap: no "standard" abstract model (like the RAM).

Challenges:
• Algorithmic: Not all problems seem to be easily parallelizable
• Portability: Support for the same language/interface on different architectures (e.g., MPI in HPC)
• Performance portability (?)
Elements of parallel computing:
• Algorithmic: Find the parallelism
• Linguistic: Express the parallelism
• Practical: Validate the parallelism (correctness, performance)
• Technical: Run-time/compiler support for the paradigm
Why study parallel computing?

Parallel computing: Accomplish something with a coordinated set of processors under control of a parallel algorithm.

• It is inevitable: multi-core revolution, GPGPU paradigm, …
• It's interesting, challenging, highly non-trivial, full of surprises
• It is a key discipline of computer science (von Neumann; golden theory decade: 1980 to the early 1990s)
• It's ubiquitous (gates; architecture: pipelines, ILP, TLP; systems: operating systems, software), not always transparent
• It's useful: large, extremely computationally intensive problems, Scientific Computing, HPC
• …
Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands.

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Focus is on properties of the solution (time, size, energy, …) to a given, individual problem.
Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.
Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …
Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …
Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously.

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, process calculi, CSP, CCS, pi-calculus, lock/wait-freeness, …
Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness
Parallel vs. Concurrent computing

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve a given computational problem). Several processes interact through a shared resource: memory (locks, semaphores, data structures), a device, …

(Adapted from Madan Musuvathi)
Parallel computing: Focus on dividing a given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors, in coordination (synchronization, communication).
The "problem" of parallelization

How to divide a given problem (specification, algorithm, data) into subproblems that can be solved in parallel?

• How is the computation divided? Is coordination necessary? Does the sequential algorithm help/suffice?
• Where are the data? Is communication necessary?
Aspects of parallelization
• Algorithmic: How to divide computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
• Scheduling/Mapping: How are independent parts of the computation assigned to processors?
• Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?
• Communication: When must processors communicate? How?
• Synchronization: When must processors agree/wait?
Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?
Architectural: Which kinds of parallel machines can be realized? How do they look?
Levels of parallelization:
• Architecture: gates, logical and functional units (not here)
• Computational: Instruction-Level Parallelism (ILP) (not here)
• SIMD: explicit vector processing
• Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors: "parallel computing" (this lecture)
• Large-scale: coarse-grained tasks, multi-level, coupled application parallelism (not here)
Automatic parallelization(?)

Can't we just leave it all to the compiler/hardware? From a "high-level" problem specification or a sequential program to efficient code that can be executed on a parallel, multi-core processor:

• Successful only to a limited extent
• Compilers cannot invent a different algorithm
• Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2012
Explicitly parallel programming (today):
• Explicitly parallel code in some parallel language
• Support from parallel libraries
• (Domain-specific languages)
• The compiler does as much as a compiler can do

Lots of interesting problems and tradeoffs; an active area of research.
Some „given, individual problems“ for this lecture
Problem 1:
Matrix-vector multiplication: Given an n×m matrix A and an m-element vector v, compute the n-element vector u = Av
u[i] = ∑j A[i,j]·v[j]

Dimensions n, m large: obviously some parallelism. How to compute the sums ∑ in parallel? Access to the vector?
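For concreteness, a minimal sequential C sketch of this computation (function and variable names are illustrative, not from the lecture); the row loop is the obvious candidate for parallelization, while the inner sum is the part that needs care:

    /* u = A*v for an n x m matrix A (row-major) and an m-element vector v */
    void matvec(int n, int m, const double *A, const double *v, double *u)
    {
        for (int i = 0; i < n; i++) {    /* rows are independent: parallelizable */
            double sum = 0.0;
            for (int j = 0; j < m; j++)  /* a reduction: needs care in parallel */
                sum += A[(size_t)i*m + j] * v[j];
            u[i] = sum;
        }
    }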
Problem 1a:
Matrix-matrix multiplication: Given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB

C[i,j] = ∑k A[i,k]·B[k,j]
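Again as an illustrative sequential baseline (a sketch, not the lecture's reference code; the summation index is written l to avoid a clash with the dimension k):

    /* C = A*B, A is n x k, B is k x m, all row-major */
    void matmul(int n, int k, int m,
                const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {  /* the n*m entries of C are independent */
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += A[(size_t)i*k + l] * B[(size_t)l*m + j];
                C[(size_t)i*m + j] = sum;
            }
    }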
Problem 1b:
Solving sets of linear equations: Given a matrix A and a vector b, find x such that Ax = b.

Preprocess A such that a solution to Ax = b can easily be found for any b (LU factorization, …).
Problem 1c:
Sparse matrix-matrix multiplication: Given a sparse n×k matrix A and a sparse k×m matrix B, compute the n×m matrix product C = AB
C[i,j] = ∑k A[i,k]·B[k,j]
Problem 2:
Stencil computation (5-point stencil in 2d): Given an n×m matrix A, update as follows, and iterate until some convergence criterion is fulfilled, with suitable handling of the matrix border:

    iterate {
        for all (i,j) {
            A[i,j] <- (A[i-1,j]+A[i+1,j]+A[i,j-1]+A[i,j+1])/4
        }
    } until (convergence)

Looks well-behaved, "embarrassingly parallel"? Data distribution? Conflicts on updates?

Use: Discretization and solution of certain partial differential equations (PDEs); image processing; …
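A sequential C sketch of one Jacobi-style sweep of this stencil, using a second buffer to avoid the update conflicts hinted at above (double buffering is one possible choice, not the only one; all names are illustrative):

    #include <math.h>

    /* One sweep over the interior of an n x m grid: Anew <- stencil(A).
       Returns the maximum pointwise change, usable as a convergence criterion. */
    double sweep(int n, int m, const double *A, double *Anew)
    {
        double maxdiff = 0.0;
        for (int i = 1; i < n-1; i++)      /* border rows/columns kept fixed here */
            for (int j = 1; j < m-1; j++) {
                double v = (A[(i-1)*m+j] + A[(i+1)*m+j]
                          + A[i*m+(j-1)] + A[i*m+(j+1)]) / 4.0;
                double d = fabs(v - A[i*m+j]);
                if (d > maxdiff) maxdiff = d;
                Anew[i*m+j] = v;
            }
        return maxdiff;
    }

The iteration then alternates the roles of A and Anew until maxdiff falls below a threshold; with two buffers, all (i,j) updates within a sweep are independent and could be done in parallel.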
Problem 3:
Merge two sorted arrays of size n and m into a sorted array of size n+m.

Easy to do sequentially, but the sequential algorithm looks… sequential.

Problem 3a: Sort by merging
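For reference, a sketch of the standard sequential two-pointer merge; each iteration depends on the outcome of the previous comparison, which is what makes it look so stubbornly sequential:

    /* Merge sorted a[0..n-1] and b[0..m-1] into out[0..n+m-1] */
    void merge(const int *a, int n, const int *b, int m, int *out)
    {
        int i = 0, j = 0, k = 0;
        while (i < n && j < m)            /* each step depends on the previous */
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < n) out[k++] = a[i++];  /* copy leftovers */
        while (j < m) out[k++] = b[j++];
    }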
Problem 4:
Computation of all prefix-sums: Given an array A of size n of elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix-sums in array B:

B[i] = ∑0≤j<i A[j]

This implies a solution to the problem of computing just the total sum ∑. Parallel?

A: 1 2 3 4 5 6 7 …
B: x 1 3 6 10 15 21 28 … (all prefix-sums)
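A sequential sketch of the prefix-sums as defined above (here with + on integers, taking 0 as the "empty" first entry; the loop-carried dependence on the running sum is the obstacle to naive parallelization):

    /* Exclusive prefix-sums: B[i] = A[0] + ... + A[i-1] */
    void prefix_sums(const int *A, int *B, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            B[i] = sum;    /* sum of all elements strictly before index i */
            sum += A[i];   /* loop-carried dependence */
        }
    }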
Problem 5:
Sorting a sequence of objects ("reals", integers, objects with an order relation) stored in an array.

Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?
Parallel computing as a (theoretical) CS discipline
(Traditional) objective: Solve a given computational problem faster.

• Faster than what?
• What is a reasonable "model" for parallel computing?
• How fast can a given problem be solved? How many resources can be productively exploited?
• Are there problems that cannot be solved in parallel? Fast? At all?
• …
Architecture, machine, computing, programming models
Architecture/machine model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.

Computational/execution model: (Formal) framework for the design and analysis of algorithms for the computational system; cost model.
Main example: The RAM (Random-Access Machine)

[Diagram: processor P connected to memory M]

A processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory.

First assumption: unit cost for all operations (execution of instructions, access to memory).
Instructions:
• Memory: load, store
• Arithmetic (integer, real numbers)
• Logic: and, or, xor, shift
• Branch, compare, procedure call

Traditional RAM: same unit cost for all operations. Realistic? Useful?

Another useful model of computation: the Turing machine.
Aka von Neumann architecture or stored-program computer: data and program in the same memory ("programs as data"). John von Neumann (1903-57), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC.

"von Neumann bottleneck": Program and data are separate from the CPU; the processing rate is bounded by the memory rate.

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978); Turing Award lecture, 1977
With a cache between processor and memory: non-uniform memory cost (hierarchical memories: caches). Realistic? Perhaps more so.
Parallel RAM variations

Increased memory bandwidth: vector computer; the ALU operates on vectors instead of scalars. [Diagram: one processor P, multiple memory banks M, M, …, M]
Shared-memory model, bus-based: several processors P share one memory M over a bus (the "von Neumann bottleneck" again).
Shared-memory model, multi-port, network-based, …
• How are the processors synchronized?
• What can the memory do?
• What are the costs?
All processors operate under control of the same clock: synchronous, shared-memory model.

Parallel RAM (PRAM):
• All processors work in lock-step (all run the same program, or individual programs)
• Unit-time instructions and memory access (uniform)
PRAM conflict resolution: What happens if several processors access the same memory location in the same time step?
• EREW PRAM: Not allowed, neither read nor write
• CREW PRAM: Concurrent reads allowed, concurrent writes not
• CRCW PRAM: Both concurrent reads and writes allowed
CRCW PRAM (write) conflict resolution:
• COMMON: All processors that conflict must write the same value
• ARBITRARY: One write succeeds
• PRIORITY: A priority scheme determines which write succeeds

Realistic? Useful?
Not realistic (could possibly be emulated), but an extremely useful theoretical tool (late 1970s to mid 1990s): techniques and lower bounds. Not this lecture.

Joseph JáJá: An Introduction to Parallel Algorithms. Addison-Wesley, 1992, ISBN 0-201-54856-9
PRAM example: Fast maximum finding

Problem: Given n numbers in a shared-memory array a, find the maximum.

Idea: Perform all n² comparisons (a[i],a[j]) in parallel; eliminate those numbers that cannot be the maximum.

    1. par (0<=i && i<n)
           b[i] = true;                  // a[i] could be the maximum
    2. par (0<=i && i<n, 0<=j && j<n)
           if (a[i]<a[j]) b[i] = false;  // a[i] is not the maximum
    3. par (0<=i && i<n)
           if (b[i]) x = a[i];
Three parallel steps: O(1) time with n, n², and n PRAM processors, respectively. CRCW capability is needed in steps 2 and 3.

Theorem: On a COMMON CRCW PRAM, the maximum of n numbers can be found in O(1) time steps and O(n²) operations in total.
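The PRAM is an abstraction, but the idea can be mimicked on a real shared-memory machine. A hedged OpenMP sketch (not part of the lecture, and only a loose analogue: real threads are not synchronous, and the inner loop is kept sequential per i to avoid concurrent writes):

    #include <stdlib.h>
    #include <stdbool.h>

    /* Maximum of a[0..n-1] by all-pairs elimination, O(n^2) work.
       Compile with -fopenmp; without OpenMP the pragmas are simply ignored. */
    int max_allpairs(const int *a, int n)
    {
        bool *b = malloc(n * sizeof *b);
        int x = a[0];
        #pragma omp parallel for                 /* step 1 */
        for (int i = 0; i < n; i++) b[i] = true;
        #pragma omp parallel for                 /* step 2: each i owns b[i] */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (a[i] < a[j]) b[i] = false;
        for (int i = 0; i < n; i++)              /* step 3: sequential here; on a */
            if (b[i]) x = a[i];                  /* COMMON CRCW PRAM, a concurrent write */
        free(b);
        return x;
    }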
Note: A constant-time algorithm with polynomial resources (operations, processors).

Is this a good algorithm? Answer later.
Shared-memory model, banked, network-based, not synchronous:
• How to synchronize: atomic operations, barriers
• Memory consistency: When can some processor "see" what some other processor has written to memory?
Parallel memory access cost

UMA (Uniform Memory Access): Access time to a memory location is independent of the location and of the accessing processor, e.g., O(1), O(log M), … Example: the PRAM is UMA.

NUMA (Non-Uniform Memory Access): Access time depends on processor and location. Locality: for each processor, some locations can be accessed faster than other locations.
Parallel, distributed-memory RAM: Memory is distributed over the processors; each processor can directly access only its own, local memory. Processors communicate via an interconnection network.
The architecture model defines resources; it describes
• Composition of processor and functional units: ALU, FPU, registers, w-bit words vs. unlimited words, vector units (MMX, SSE, AVX)
• Types of instructions
• Memory system, caches
• …

The execution model/cost model specifies
• How instructions are executed
• (Relative) cost of instructions, memory accesses
• …

Level of detail/formality depends on the purpose: what is to be studied (complexity theory, algorithm design, …), what counts (instructions, memory accesses, …).
A parallel architecture model defines
• Synchronization between processors, synchronization operations
• Atomic operations, shared resources (memory, registers)
• Communication mechanisms: network topology, properties
• Memory: shared, distributed, hierarchical, …
• …

The cost model defines
• Cost of synchronization, atomic operations
• Cost of communication (latency, bandwidth, …)
• …
Yet another parallel architecture model: cellular automata, systolic arrays, …: simple processors without memory (finite state automata, FSA) operate in lock-step on a (potentially infinite) grid, with local communication only.

The state of cell (i,j) in the next step is determined by
• Its own state
• The states of the neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966
H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982
Flynn's taxonomy: Orthogonal classification of (parallel) architectures/models along instruction stream and data stream:

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972
SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g., the RAM).

SIMD: Single processor, single stream of instructions, operates on multiple data per instruction. Examples: the traditional vector computer, the PRAM (some variants).

MISD: Multiple instructions operate on a single data stream. Examples: pipelined architectures, streaming architectures(?), systolic arrays (a 1970s architectural idea). Some say this class is empty.

MIMD: Multiple instruction streams, multiple data streams.
Typical Flynn classification instances:
• SISD: one processor P with one memory M
• SIMD: one instruction unit driving several processing elements, each with memory
• MIMD: processors with memories, connected by a communication network
• MISD (?): a pipeline of processors over a single memory/data stream
Programming model: Abstraction, close to the programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics.

Execution model: When and how parallelism in the programming model is effected.
A parallel programming language or library ("interface") is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s).

Cost of operations: May be specified with the programming model; often by the architecture/computational model.
A parallel programming model defines, e.g.,
• Parallel resources, entities, units: processes, threads, tasks, …
• Expression of parallelism: explicit or implicit
• Level and granularity of parallelism
• Memory model: shared, distributed, hybrid
• Memory semantics ("when operations take effect/become visible")
• Data structures, data distributions
• Methods of synchronization (implicit/explicit)
• Methods and modes of communication
Examples:
1. Threads, shared memory, block-distributed arrays, fork-join parallelism
2. Distributed memory, explicit message passing, collective communication, one-sided communication ("RDMA"), PGAS(**)
3. Data-parallel SIMD, SPMD(*)
4. …

(*) SPMD: Single Program, Multiple Data
(**) PGAS: Partitioned Global Address Space

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, … (not this lecture)
Same Program, Multiple Data (SPMD)

Restricted MIMD model: All processors execute the same program.
• They may do so asynchronously; different processors may be in different parts of the program at any given time
• The same objects (procedures, variables) exist for all processors, so "remote procedure call", "active messages", "remote-memory access" make sense

Programming model concepts: active messages, remote procedure call, … (MPI: RMA/one-sided communication). Not this lecture.

F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988
Programming language/library/interface/paradigm (OpenMP, Cilk, MPI, …), with algorithmic support, "run-time"
↓
Programming model
↓
Architecture model
↓
"Real" hardware

Different architecture models can realize a given programming model. Closer fit: more efficient use of the architecture.

Challenge: A programming model that is useful and close to "realistic" architecture models, to enable realistic analysis/prediction.

Challenge: A language that conveniently realizes the programming model.
Examples:

OpenMP: programming interface/language for a shared-memory model, intended for shared-memory architectures; mostly "data parallel". Can be implemented with DSM (Distributed Shared Memory) on distributed-memory architectures, but performance is usually not good; requires a DSM implementation/algorithms.

MPI: interface/library for a distributed-memory model; can be used on shared-memory architectures, too. Needs algorithmic support (e.g., "collective operations").

Cilk: language (extended C) for a shared-memory model, for shared-memory architectures; "task parallel". Needs a run-time (e.g., "work-stealing").

See later.
Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve a given problem of input size n.
• Processors work independently (local memory, program)
• Communication and coordination incur some overhead

Focus on worst-case complexities and execution times.

Let some computational problem P(n) be given:
• Seq: sequential algorithm and implementation solving P
• Par: parallel algorithm and implementation solving P
Challenge and one main goal of parallel processing (… not to make life easy):

Be faster than the commonly used (good, best known, best possible, …) sequential algorithm and implementation, utilizing the p processors efficiently.
Main goal: Speeding up computations by parallel processing. Let
• Tseq(n): time for 1 processor to solve P(n)
• Tpar(p,n): time for p processors to solve P(n)

The speedup of Tpar over Tseq is defined as

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n).

Goal: Achieve as large a speedup as possible (for some n, all n), for as many processors as possible.
What exactly are Tseq(n) and Tpar(p,n)? What is "time" (number of instructions)?

• Time for some algorithm for solving the problem?
• Time for a specific algorithm for solving the problem?
• Time for the best known algorithm for the problem?
• Time for the best possible algorithm for the problem?
• Time for a specific input of size n; average case; worst case, …?
• Asymptotic time, large n, large p?
• Do constants matter, e.g., O(f(p,n)) or 25n/p + 3 ln(4(p/n))…?
Tseq(n):
• Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n; the number of instructions carried out is often termed the WORK
• Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Choose a sequential algorithm (theory); choose an implementation of this algorithm (practice).

Theory and practice: Always state the baseline sequential algorithm/implementation.
Examples (theory); standard, worst-case, asymptotic complexities:

• Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix-sums
• Tseq(n,m) = O(n+m): merging two sequences; BFS/DFS in a graph
• Tseq(n) = O(n log n): comparison-based sorting
• Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
• Tseq(n) = O(n³): matrix multiplication, input two n×n matrices (can be solved in o(n³), Strassen etc.)
• …

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009
New issue with modern processors: Is time always the same thing? The clock frequency may not be constant; it can depend, e.g., on the load of the system, an energy cap, "turbo mode", etc. Such factors are difficult to control.

Practice:
• Construct good input examples to measure running time Tseq(n); experimental methodology
• Worst case not always possible, not always interesting; best case?
• Experimental methods to get a stable, accurate Tseq(n): repeat the measurement many times (rule of thumb: average over at least 30 repetitions); see the sketch below

Experimental science: Always some assumptions about repeatability, regularity, determinism, …
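A minimal timing-harness sketch along these lines, using the POSIX monotonic clock (the benchmarked function kernel is a hypothetical placeholder):

    #include <stdio.h>
    #include <time.h>

    #define REPS 30                 /* rule of thumb from above */

    extern void kernel(int n);      /* hypothetical code under test */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    void benchmark(int n)
    {
        double sum = 0.0, best = 1e30;
        for (int r = 0; r < REPS; r++) {
            double t = seconds();
            kernel(n);
            t = seconds() - t;
            sum += t;
            if (t < best) best = t;
        }
        printf("n=%d: avg %.6f s, best %.6f s over %d reps\n",
               n, sum/REPS, best, REPS);
    }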
Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): The best possible absolute speed-up is linear in p.

Goal: Obtain (linear) speedup for as large p as possible (as a function of problem size n), for as many n as possible.
For speedup (and other complexity measures), distinguish:

• The problem P to be solved (mathematical specification)
• Some algorithm A that solves P
• The best possible (lower bound) algorithm A* for P, the best known algorithm A+ for P: the complexity of P
• An implementation of A on some machine M
Example: "data parallel" (SIMD) computation

Problem: sum of two n-element vectors:

    for (i=0; i<n; i++) {
        a[i] = b[i]+c[i];
    }

Complexity: Tseq(n) = Θ(n)

Algorithm/implementation: Split into p subproblems, each the sum of n/p elements.

Best possible parallelization: the sequential work divided evenly across the p processors:

Tpar(p,n) = Tseq(n)/p, so Sp(n) = p
With constant factors: Tpar(p,n) = c(n/p) for a constant c ≥ 1.

• Perfect speedup
• "Embarrassingly parallel"
• "Pleasantly parallel"
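On a real shared-memory system this loop could be parallelized, e.g., with OpenMP (an illustrative sketch; OpenMP itself is treated later):

    /* Element-wise vector sum: all iterations are independent */
    void vadd(int n, double *a, const double *b, const double *c)
    {
        #pragma omp parallel for   /* each thread gets a chunk of about n/p iterations */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }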
Work: the work, measured in instructions and/or time, that has to be carried out for a problem of size n; sequentially, time Tseq(n) from start to stop.
Perfect parallelization: the sequential work Tseq(n) evenly divided between p processors, no overhead, so Tpar(p,n) = Tseq(n)/p and

Sp(n) = Tseq(n)/(Tseq(n)/p) = p

Perfect speedup … and very rare in practice.
If the sequential work is unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n).

Define Tpar(p,n) = max Ti(n) over all processors, all assumed to start at the same time; Tpar is the time for the slowest processor to complete.
Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm: the total number of instructions performed by the p processors.

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time for which the p processors are reserved (: have to be paid for); geometrically, the area p × Tpar(p,n).
"Theorem": Perfect speedup Sp(n) = p is best possible and cannot be exceeded.

"Proof": Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could then be constructed by simulating the parallel algorithm on a single processor: the instructions of the p processors are carried out in some correct order, one after another, on the sequential processor. This contradicts that Tseq(n) was the best possible/known time.

Reminder: Speedup is calculated (measured) relative to the "best" sequential implementation/algorithm.
By assumption, C(n) = p*Tpar(p,n) < Tseq(n).

Simulation A: one step of P1, one step of P2, …, one step of Pp, one step of P1, …, for C(n) iterations.
Simulation B: steps of P1 until a communication/synchronization point, then steps of P2 until a communication/synchronization point, …
Both simulations yield a new sequential algorithm with time Tsim(n) < Tseq(n).

This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.
Lesson: Parallelism offers only "modest potential": speed-up cannot be more than p on p processors.

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

The construction shows that the total parallel work must be at least as large as the sequential work Tseq(n); otherwise a better sequential algorithm could be constructed.

Crucial assumptions: sequential simulation is possible (enough memory to hold the problem and the state of the parallel processors), the sequential memory behaves as the parallel memory, … This is NOT TRUE for real systems and real problems.
Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms. Some such simulation tools:
• SimGrid (INRIA)
• LogGOPSim (Hoefler et al.)
• …

Possible bachelor thesis subject.
Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm: the total number of instructions performed by some number of processors. The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: the total time for which the p processors are occupied.

Definition: A parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup.

Definition: A parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has the potential for linear speedup (for some number of processors).
Proof (linear speed-up of a cost-optimal algorithm): Given a cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n)), it follows that Tpar(p,n) = c*Tseq(n)/p, so

Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to the best sequential algorithm. The smaller c, the closer the speedup is to perfect.
Given a work-optimal parallel algorithm with ∑Ti(n) = Tseq(n) and Tpar(p,n) = max Ti(n):
Execute it on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n)).
Proof idea (a work-optimal algorithm can have linear speed-up):
1. Take a work-optimal algorithm
2. Schedule the work items Ti(n) on p processors such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

The scheduling in step 2 is possible in principle, but may not be trivial.

Parallel algorithm design goal: A work-optimal parallel algorithm with as small a Tpar(p,n) as possible (and therefore large parallelism: many processors can be utilized).
Example: The CRCW PRAM maximum-finding algorithm (above) performs O(n²) operations (work), but sequential maximum finding requires only O(n) operations.

Speedup with perfect parallelization (we do not know yet whether this is possible):

Sp(n) = O(n)/O(n²/p) = O(p/n)

(Only small) linear speedup for fixed n, and not independent of n. Bad!
Example: Another non-work-optimal algorithm

DumbSort with T(n) = O(n²) that can be perfectly parallelized: Tpar(p,n) = O(n²/p). It is well known that Tseq(n) = O(n log n), with many algorithms and good implementations.

Sp(n) = (n log n)/(n²/p) = p (log n)/n

(Small) linear speedup for fixed n (but not independent of n). Non-work-optimal algorithm: the speed-up decreases with n. Bad!

Break-even: When is the parallel algorithm faster than the sequential one?

Tpar(p,n) < Tseq(n) ⇔ n²/p < n log n ⇔ n/p < log n ⇔ p > n/log n
The best known/best possible sequential algorithm is often difficult to parallelize:
• no redundant work (that could have been done in parallel)
• tight dependencies (that force things to be done one after another)

Lesson: It usually does not make sense to parallelize an inferior algorithm (although it is sometimes much easier).

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: Non-work-optimal algorithms can sometimes be useful, e.g., as subroutines.
Absolute vs. relative speedup

The ratio

SRelp(n) = Tpar(1,n)/Tpar(p,n)

is called the relative speedup, and expresses how well the parallel algorithm utilizes p processors (scalability).

Relative speedup is NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters.
Note: The literature (papers and books) is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior sequential implementation is incorrect!
Definition: T∞(n) is the smallest possible running time of the parallel algorithm given arbitrarily many processors. By definition, T∞(n) ≤ Tpar(p,n) for all p. Relative speedup is therefore limited by

SRelp(n) = Tpar(1,n)/Tpar(p,n) ≤ Tpar(1,n)/T∞(n)

Definition: The ratio Tpar(1,n)/T∞(n) is called the parallelism of the parallel algorithm. The parallelism is the largest number of processors that can be employed and still give linear relative speedup (assume Tpar(1,n)/T∞(n) < p in the inequality above).
Example: The CRCW PRAM maximum-finding algorithm again:

SRelp(n) = O(n²)/O(n²/p) = O(p); parallelism: O(n²)/O(1) = n²

This (terrible) parallel algorithm has linear relative speedup for p up to n² processors.
par (0<=i&&i<n) b[i] = true; // a[i] could bepar (0<=i&&i<n, 0<=j&&j<n)if (a[i]<a[j]) b[i] = false; // a[i] is not
par (0<=i&&i<n) if (b[i]) x = a[i];
Example: CRCW PRAM Maximum Finding algorithm
This (terrible) algorithm has linear relative speedup for p up to n2 processors
Nevertheless: Useful as a building block
There exist a work-optimal CRCW PRAM algorithm that runs in O(log log n) steps requiring only O(n) operations
Advanced material: Last fact about PRAM in this lecture
Absolute vs. relative speedup: work-optimality property

For work-optimal algorithms, absolute and relative speedup coincide, since Tpar(1,n) = O(Tseq(n)).
Parallel performance in practice

Speedup is an empirical quantity, "measured time", based on experiments (benchmarks):

Sp(n) = Tseq(n)/Tpar(p,n)

• Tseq(n): running time of a "reasonable", good, best available sequential implementation, on "reasonable" inputs
• Tpar(p,n): parallel running time measured over a number of experiments with different typical (worst-case?) inputs

Speedup is typically not independent of the problem size n, nor of the problem instance.
Parallelization most often incurs overheads Oi:
• Algorithmic: the parallel algorithm may do more work
• Coordination: communication and synchronization
• …

Note: Tpar(1,n) ≥ Tseq(n).
Absolute vs. relative speedup and scalability

Example: Good scalability and relative speedup, 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p, is reported; sounds good.

But what if Tpar(1,n) = 100*Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O((n log n)/p + log n)?

Even for a work-optimal Tpar(1,n) = 100*Tseq(n) = O(Tseq(n)), it would take at least 200 processors just to match the sequential algorithm with the reported relative speedup.
Empirical, relative speedup without an absolute performance baseline (and a comparison to a reasonable sequential algorithm and implementation) is misleading.

David H. Bailey: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review, Aug. 1991, pp. 54-55
Torsten Hoefler, Roberto Belli: Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. SC 2015: 73:1-73:12
Practical problems: Measuring time and work

Tpar is defined as the time for the last processor to complete, assuming all processors start at the same time.

Problem 1: To measure Tpar, efficient, accurate synchronization ("barrier synchronization") is needed; see the sketch below.
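One common pattern, sketched here with MPI (a minimal illustration, not a full benchmarking methodology; see the Hunold/Carpen-Amarie reference below):

    #include <mpi.h>

    /* Time one parallel phase as the maximum over all ranks */
    double time_phase(void (*phase)(void))
    {
        double t, tmax = 0.0;
        MPI_Barrier(MPI_COMM_WORLD);     /* try to start all ranks together */
        t = MPI_Wtime();
        phase();                         /* the parallel work being measured */
        t = MPI_Wtime() - t;
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        return tmax;                     /* valid on rank 0: the slowest rank's time */
    }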
Problem 2: Is this relevant? Are applications synchronized? Why not just measure the work?
Problem 3: How do the time spent per processor and the work (Work = ∑Ti) depend on the synchronization? A slow processor may cause other processors to wait.
[Figure: the same per-processor profile, now with a dependency between processors causing waiting]
Problem 1: Most parallel interfaces include synchronization mechanisms ("barriers") that may or may not be accurate. There is still much work on temporal, accurate synchronization.

Problem 2: Difficult to answer.

Problem 3: Not much is known; most algorithms and implementations assume and depend on some synchrony when evaluating performance.

Sascha Hunold, Alexandra Carpen-Amarie: Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Trans. Parallel Distrib. Syst. 27(12): 3617-3630 (2016)
If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but is still linear: Sp(n) = p/k.
NB: The diagrams denote cumulated time ("profile") over the whole Tpar execution, not a trace: computation, overhead, and idle time are spread over the whole execution.
Typical overhead: communication and coordination. The (smallest) time between coordination periods is called the granularity of the parallel computation.
Granularity:
• "Coarse-grained" parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
• "Fine-grained" parallel computation/algorithm: the time/number of instructions between coordination intervals is small
Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing.
The best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p: the best possible parallel time.
Sequential overhead as a constant fraction: The maximum speedup becomes severely limited if the overheads are sequential and a constant fraction:

Tseq(n) = (s+r)*Tseq(n), Tpar(p,n) ≥ s*Tseq(n) + r*Tseq(n)/p
Amdahl's Law (parallel version): Let a program A contain a fraction r that can be "perfectly" parallelized, and a fraction s = (1-r) that is "purely sequential", i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n.

Proof:
• Tseq(n) = (s+r)*Tseq(n)
• Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
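A tiny sketch for playing with the bound (pure arithmetic, nothing beyond the formula above):

    #include <stdio.h>

    /* Amdahl bound: speedup with sequential fraction s on p processors */
    double amdahl(double s, int p) { return 1.0 / (s + (1.0 - s)/p); }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 4)  /* with s = 0.1 this approaches 10 */
            printf("p=%4d  speedup=%5.2f\n", p, amdahl(0.1, p));
        return 0;
    }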
Typical victims of Amdahl's law:
• Sequential input/output can be a constant fraction
• Sequential initialization of global data structures
• Sequential processing of "hard-to-parallelize" parts of the algorithm, e.g., shared data structures

Amdahl's law limits the speed-up in such cases if they are a constant fraction of the total time, independent of problem size.
Example:
1. Processor 0: read input, some precomputation
2. Split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send the partial solution back to processor 0

Typical Amdahl sequential bottleneck: a constant sequential fraction (3 out of 4 steps) limits the speedup. [Figure: 9n of 10n total is parallelizable.] Amdahl: s = 0.1, speedup at most 10.

When interested in the parallel aspects, input/output and problem splitting are often explicitly not measured (see projects).
Example (contrived): K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

    // Sequential initialization
    x = (int*)calloc(n,sizeof(int));
    …
    // Parallelizable part
    do {
        for (i=0; i<n; i++) {
            x[i] = f(i);
        }
        // check for convergence
        done = …;
    } while (!done);

Tseq(n) = n+K+Kn (the n term: calloc zero-initializes the array sequentially)
Tpar(p,n) = n+K+Kn/p

Sequential fraction ≈ 1/(1+K), so Sp(n) -> 1+K
Example (contrived): The same computation, but with

    x = (int*)malloc(n*sizeof(int));

instead of calloc:

Tseq(n) = 1+K+Kn
Tpar(p,n) = 1+K+Kn/p

Sequential part ≈ 1/(1+n), so Sp(n) -> p for n > p, n -> ∞

Note: If the sequential part is constant (but not a constant fraction), Amdahl's law does not limit the speed-up.
Lesson: Be careful with system functions (calloc zero-initializes its memory in sequential O(n) time; malloc does not).
Avoiding Amdahl: Scaled speedup

The sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions). The sequential part may be constant, or grow only slowly with the problem size n. Thus, to maintain good speedup, the problem size should be increased with p.
Assume Tseq(n) = t(n)+T(n), with sequential part t(n) and perfectly parallelizable part T(n), and assume t(n)/T(n) -> 0 for n -> ∞. Then

Tpar(p,n) = t(n)+T(n)/p

and the speedup as a function of p and n is

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p for n -> ∞
Definition: Speedup as a function of p and n, with sequential and parallelizable times t(n) and T(n), is called scaled speed-up.

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing the problem size n.
Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n:

E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n))   (the denominator is the cost)

Remarks:
• E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
• E(p,n) = constant: linear speedup
• Cost-optimal algorithms have constant efficiency
Scalability

Definition: A parallel algorithm/implementation is strongly scaling if Sp(n) = Θ(p) (linear, independent of (sufficiently large) n).

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p) such that for n = Ω(f(p)), the efficiency E(p,n) remains constant. The function f is called the iso-efficiency function.

"Maintain efficiency by increasing the problem size as f(p) or more."
The scaled speed-up idea comes from many sources; some are:

John L. Gustafson: Reevaluating Amdahl's Law. Commun. ACM 31(5): 532-533 (1988)
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel & Distributed Technology 1(3): 12-21 (1993)
Summary: Stating parallel performance

It is convenient to state the parallel performance and scalability of a parallel algorithm/implementation as

Tpar(p,n) = O(T(n)/p + t(p,n))

T(n) represents the parallelizable part, t(p,n) the non-parallel part beyond which no improvement is possible, regardless of how many processors are used (parallelism = T(n)/t(p,n)).

The work of the algorithm is W = O(p(T(n)/p + t(p,n))) = O(T(n) + p*t(p,n)).

The algorithm is work-optimal if T(n) is O(Tseq(n)).
Example (again): Assume the convergence check takes O(log p) time. Then

Tpar(p,n) = Kn/p + K log p
E(p,n) ≈ Kn/(Kn + pK log p)

For n ≥ p log p, E(p,n) ≥ 1/2.

Weakly scalable: n has to increase as O(p log p) to maintain constant efficiency, i.e., O(log p) per processor (if balanced).
Examples (Tpar, speedup, optimality, efficiency) for a linear-time computation, Tseq(n) = n (constants ignored):

1. Tpar0(p,n) = n/p + 1 (embarrassingly "data parallel" computation, constant overhead)
2. Tpar1(p,n) = n/p + log p (logarithmic overhead, e.g., convergence check)
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p (linear overhead, e.g., data exchange)

Typical, good, work-optimal parallelizations.
[Plot: the five variants for n=128]
Cost-optimality:
1. Tpar0(p,n) = n/p+1: p*Tpar(p,n) = n+p = O(n) for p = O(n)
2. Tpar1(p,n) = n/p+log p: p*Tpar(p,n) = n+p log p = O(n) for p log p = O(n)
3. Tpar2(p,n) = n/p+log² p: p*Tpar(p,n) = n+p log² p = O(n) for p log² p = O(n)
4. Tpar3(p,n) = n/p+√p: p*Tpar(p,n) = n+p√p = O(n) for p√p = O(n)
5. Tpar4(p,n) = n/p+p: p*Tpar(p,n) = n+p² = O(n) for p² = O(n)
[Plots for n=128 (also with p up to 256), n=16384 (=16K=128²), and n=2097152 (=2M=128³)]
Efficiency, weak scaling

To maintain constant efficiency e = Tseq(n)/(p*Tpar(p,n)), n has to increase as
1. Tpar0(p,n) = n/p+1: f0(p) = e/(1-e)*p
2. Tpar1(p,n) = n/p+log p: f1(p) = e/(1-e)*(p log p)
3. Tpar2(p,n) = n/p+log² p: f2(p) = e/(1-e)*(p log² p)
4. Tpar3(p,n) = n/p+√p: f3(p) = e/(1-e)*(p√p)
5. Tpar4(p,n) = n/p+p: f4(p) = e/(1-e)*p²
[Plot: iso-efficiency functions for maintained efficiency e = 0.9 (90%)]
Matrix-vector multiplication parallelizations:

Tseq(n) = n²
Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
[Plots: n=100 and n=1000]
Some non-work-optimal parallel algorithms:

Amdahl case, linear sequential running time: TparA(p,n) = 0.9n/p + 0.1n

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1
2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1
[Plot: n1 = 128, n2 = 16384]
Limitations of empirical speedup measure
Empirical speedup assumes that Tseq(n) can be measured. For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system.

Scalability can then be assessed by other means:
• Stepwise speedup (1-1000 processors, 1000-100,000 processors)
• Other notions of efficiency
Lecture summary, checklist
• Models of parallel computation: Architecture, programming, cost
• Flynn’s taxonomy: MIMD, SIMD; SPMD
• Sequential baseline
• Speedup (in theory and practice)
• Work, Cost optimality
• Amdahl’s law
• Scaled speed-up, strong and weak scaling