
TEXT: Structured Parallel Programming: Patterns for Efficient Computation
M. McCool, A. Robison, J. Reinders
Chapter 1
CMPS 5433 – Parallel Algorithms
Dr. Ranette Halverson


TRANSCRIPT

Page 1:

TEXT: Structured Parallel Programming: Patterns for Efficient Computation

M. McCool, A. Robison, J. Reinders

Chapter 1

CMPS 5433 – Parallel Algorithms
Dr. Ranette Halverson

Page 2: Introduction

•Parallel? Sequential?
•In general, what does Sequential mean? Parallel?
•Is Parallel always better than Sequential?
•Examples?
•Parallel resources?
•How efficient is Parallel?
•How can you measure Efficiency?

Page 3: Parallel Computing Features

Automatic Parallelization vs. Explicit Parallel Programming

•Vector instructions
•Multithreaded cores
•Multicore processors
•Multiple processors
•Graphics engines
•Parallel co-processors
•Pipelining
•Multi-tasking

Page 4: All Modern Computers are Parallel

•All programmers must be able to program for parallel computers!
•Design & implement programs that are:
  •Efficient, reliable, maintainable
  •Scalable
•Will try to avoid hardware issues…

Page 5: Patterns for Efficient Computation

•Patterns: valuable algorithmic structures commonly seen in efficient parallel programs
  •AKA: algorithm skeletons
•Examples of patterns in sequential programming:
  •Linear search
  •Recursion ~~ Divide-and-Conquer
  •Greedy algorithms

Page 6: Goals of This Book

•Capture algorithmic intent
•Avoid mapping algorithms to particular hardware
•Focus on:
  •Performance
  •Low-overhead implementation
  •Achieving efficiency & scalability
•Think Parallel

Page 7: Scalable

•Scalable: the ability to efficiently apply an algorithm to "any" number of processors
•Considerations:
  •Bottleneck: a situation in which productive work is delayed due to limited resources relative to the amount of work; congestion occurring when work arrives more quickly than it can be processed
  •Overhead: the cost of implementing a (parallel) algorithm that is not part of the solution itself – often due to the distribution or collection of data & results

Page 8: Serialization (1.1)

•The act of putting a set of operations into a specific order
•Serial semantics & the serial illusion
  •But were computers really serial?
  •Have we become overly dependent on the sequential strategy?
•Pipelining, multi-tasking

Page 9: Can we ignore parallel strategies?

•Improved performance
  •The past
  •The future
•Why is my new, faster computer not faster?
•Deterministic algorithms
  •End result vs. intermediate results
  •Timing issues – relative vs. absolute

Page 10: Approach of THIS Book

•Structured patterns for parallelism
•Eliminate non-determinism as much as possible

Page 11: Serial Traps

Consider these two sequences – does order matter?

  A = B + 7
  Read Z
  C = X * 2

  A = B + 7
  Read A
  B = X * 2

(In the first, the three statements touch disjoint variables, so any ordering gives the same results; in the second, A is written both by the assignment and by Read A, and B is read before it is written, so reordering changes the results.)

•Programmers think serially (sequentially)
  •List of instructions, executed in serial order
  •Standard programming training
•Serial Trap: the assumption of serial ordering

Page 12: Example: Search the web for a particular search phrase

What does parallel_for imply? Are both correct? Will the results differ? What about time? (A library version of the parallel loop is sketched below.)

  for (i = 0; i < num_web_sites; ++i)
      search(search_phrase, website[i]);

  parallel_for (i = 0; i < num_web_sites; ++i)
      search(search_phrase, website[i]);
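For concreteness, a minimal sketch of what the parallel version might look like with Intel TBB, the library the book covers in Appendix C; the website list and the search() stub here are hypothetical stand-ins, not code from the book:

  #include <iostream>
  #include <string>
  #include <vector>
  #include <tbb/parallel_for.h>

  // Hypothetical stand-ins for the slide's website[] and search().
  std::vector<std::string> website = {"siteA", "siteB", "siteC"};

  void search(const std::string& phrase, const std::string& site) {
      std::cout << "searching " + site + " for \"" + phrase + "\"\n";
  }

  int main() {
      // Iterations may run concurrently and in any order; they are
      // independent of one another, which is what makes this loop safe.
      tbb::parallel_for(0, static_cast<int>(website.size()), [](int i) {
          search("search_phrase", website[i]);
      });
  }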

Page 13:

Can we apply "parallel" to the previous example? Results?

  parallel_do {
      A = B + 7
      Read A
      B = X * 2
  }

  parallel_do {
      A = B + 7
      Read Z
      C = X * 2
  }

(The first block's statements share A and B, so running them in parallel can produce different results on different runs; the second block's statements are independent, so parallel execution is safe.)

Page 14: Performance (1.2)

Complexity of a Parallel Algorithm:
•Time
  •How can we compare?
•Work
  •How do we split problems across processors?

Complexity of a Sequential Algorithm:
•Time complexity
  •Big Oh
  •What do we mean?
  •Why do we measure performance?

Page 15: Time Complexity

•"Time complexity" doesn't really mean time
•Assumptions made when measuring "time":
  •All instructions take the same amount of time
  •All computers are the same speed
•Are these assumptions true?
•So, what is Time Complexity, really?

Page 16: Example: Payroll Checks

Suppose we want to calculate & print 1000 payroll checks. Each employee's pay is independent of the others.

  parallel_for (i = 0; i < num_employees; ++i)
      Calc_Pay(empl[i]);

Any problems with this?

Page 17: Payroll Checks (cont'd)

•What if all employee data is stored in an array? Can it be accessed in parallel?
•What about printing the checks?
•What if the computer has 2000 processors?
•Is this application a good candidate for a parallel program?
•How much faster can we do the job?

Page 18: Example: Sum 1000 Integers

•How can you possibly sum a single list of integers in parallel? (One decomposition is sketched below.)
•What is the optimal number of processors to use?
•Consider the number of operations.
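One concrete answer, as a minimal sketch in plain C++ threads (my choice for illustration; the book's libraries would do this chunking automatically): give each of P workers a contiguous chunk, then combine the P partial sums.

  #include <iostream>
  #include <numeric>
  #include <thread>
  #include <vector>

  int main() {
      const std::size_t N = 1000, P = 4;    // 1000 integers, 4 workers
      std::vector<int> data(N, 1);          // example data: all ones
      std::vector<long> partial(P, 0);
      std::vector<std::thread> workers;

      for (std::size_t p = 0; p < P; ++p)
          workers.emplace_back([&, p] {
              // Each worker sums its own quarter of the data independently.
              std::size_t lo = p * N / P, hi = (p + 1) * N / P;
              partial[p] = std::accumulate(data.begin() + lo,
                                           data.begin() + hi, 0L);
          });
      for (auto& t : workers) t.join();

      // Combining the P partial sums is the "reduction" step (page 24);
      // P is small here, so a serial loop is fine.
      std::cout << std::accumulate(partial.begin(), partial.end(), 0L) << '\n';
  }

Note the trade-off the slide asks about: with 1000 processors each would perform a single addition, and nearly all of the time would go to distributing the data and combining the results.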

Page 19: Actual Time vs. Number of Instructions

Considerations of parallel processing:
•Communication
  •How is data distributed? How are results compiled?
  •Shared memory vs. local memory
  •One computer vs. a network (e.g., the SETI project)
•Work

Page 20: SPAN = Critical Path

•The time required to perform the longest chain of tasks that must be performed sequentially
•The SPAN limits the speed-up that can be accomplished via parallelism (see the bound below)
•See Figures 1.1 & 1.2
•Provides a basis for optimization
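In symbols, using the work-span notation the book develops later (an assumption of this note, not this slide): with T_1 the total work and T_inf the span, the running time on P processors can never beat the span, T_P >= T_inf, so

  speedup = T_1 / T_P <= T_1 / T_inf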

Page 21: Shared Memory & Locality (of reference)

•Shared memory model – the book's assumption
  •Communication is simplified
  •Causes bottlenecks ~~ Why? Locality
•Locality
  •Memory (data) accesses tend to be close together
  •Accesses near each other (in time & space) tend to be cheaper
•Communication – best when there is none
•More processors is not always better

Page 22: Amdahl's Law (Argument)

•Proposed in 1967
•Provides an upper bound on the speed-up attainable via parallel processing
•Sequential vs. parallel parts of a program
•The sequential portion limits parallelism (see the bound below)
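For reference, the standard statement of the bound: if a fraction s of the program must run serially, then on P processors

  speedup(P) = 1 / (s + (1 - s)/P) <= 1/s

For example, with s = 0.1 (10% serial), no number of processors can deliver more than a 10x speed-up.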

Page 23: Work-Span Model

Example: consider a program with 1000 operations.
•Are the ops independent?
•OFTEN, parallelizing adds to the total number of operations to be performed

3 components to consider when parallelizing:
•Total work
•Span
•Communication

Page 24: Reduction

•A communication strategy for "reducing" information held by n workers down to 1 worker
•O(log p), where p is the number of workers/processors
•Example: 16 processors hold 1 integer each & the integers need to be added together
•How can we effectively do this? (One sketch appears below.)
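A minimal sketch of the tree-shaped answer, written as serial C++ here so the data flow is visible (variable names are illustrative, not from the book): pair the 16 values up, add each pair, and repeat, so the sum completes in log2(16) = 4 rounds instead of 15 serial additions.

  #include <iostream>
  #include <vector>

  int main() {
      std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8,
                            9, 10, 11, 12, 13, 14, 15, 16};
      // Each pass halves the number of active values; within a pass the
      // additions are mutually independent, so all pairs could run at once.
      for (std::size_t active = v.size(); active > 1; active /= 2)
          for (std::size_t i = 0; i < active / 2; ++i)
              v[i] += v[i + active / 2];
      std::cout << v[0] << '\n';   // prints 136, the sum of 1..16
  }

With 8 processors, each of the 4 rounds takes a single addition step.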

Page 25: Broadcast

•The "opposite" of Reduction
•A communication strategy for distributing data from 1 worker to n workers
•Example: 1 processor holds 1 integer & the integer needs to be distributed to the other 15
•How can we effectively do this? (A doubling scheme is sketched below.)
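The mirror-image sketch, again as serial C++ standing in for the communication pattern (names illustrative): in each round, every worker that already holds the value forwards it to one that does not, so coverage doubles and all 16 workers have the integer after log2(16) = 4 rounds.

  #include <iostream>
  #include <vector>

  int main() {
      const std::size_t P = 16;
      std::vector<bool> has(P, false);
      has[0] = true;                        // worker 0 starts with the integer
      int rounds = 0;
      for (std::size_t covered = 1; covered < P; covered *= 2, ++rounds)
          for (std::size_t i = 0; i < covered && covered + i < P; ++i)
              has[covered + i] = true;      // worker i sends to worker covered+i
      std::cout << rounds << " rounds\n";   // prints "4 rounds"
  }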

Page 26: Load Balancing

•A process that ensures all processors have approximately the same amount of work/instructions
•Goal: to ensure processors are not idle

Page 27: Granularity

•A measure of the degree to which a problem is broken down into smaller parts, particularly for assigning each part to a separate processor in a parallel computer
•Coarse granularity: few parts, each relatively large
•Fine granularity: many parts, each relatively small

Page 28: Pervasive Parallelism & Hardware Trends (1.3.1)

Moore's Law – 1965, Gordon Moore of Intel
•The number of transistors on a silicon chip doubles every 2 years (approximately)
•Until 2004: increased transistor switching speeds → increased clock rates → increased performance
•Note Figures 1.3 & 1.4 in the text
•BUT around 2003, clock rates stopped increasing – at roughly 3 GHz

Page 29: Walls limiting single-processor performance

•Power wall: power consumption grows non-linearly with clock speed
•Instruction-level parallelism (ILP) wall: no new low-level (automatic) parallelism left to exploit – only constant-factor increases
  •Superscalar instructions
  •Very Long Instruction Word (scheduled by the compiler)
  •Pipelining – max ~10 stages
•Memory wall: the growing gap between processor & memory speeds
  •Latency
  •Bandwidth

Page 30: Historical Trends (1.3.2)

•Virtual memory
•Prefetching
•Cache

•Benchmarks: standard programs used to compare performance on different computers or to demonstrate a computer's capabilities

•Early parallelism – WW2
•Newer HW features:
  •Larger word size
  •Superscalar
  •Vector
  •Multithreading
  •Parallel ALUs
  •Pipelines
  •GPUs

Page 31: Explicit Parallel Programming (1.3.3)

•Serial traps: unnecessary assumptions deriving from serial programming & execution
  •The programmer assumes serial execution, so gives no consideration to possible parallelism
  •Later, it is not possible to parallelize
•Automatic parallelism ~ different strategies:
  •Fully automatic – no help from the programmer – the compiler does it all
  •Parallel constructs – "optional"

Page 32: Examples of Automatic Parallelism

Consider the following simple program:

1. A = B + C
2. D = 5 * X + Y
3. E = P + Z / 7
4. F = B + X
5. G = A * 10

Rule 1: Parallelize a block of sequential instructions with no repeated variables.
Rule 2: Parallelize any block of sequential instructions if the repeated variables don't change (never appear on the left).

What about instruction 5? What can we do? (One legal grouping is sketched below.)
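One legal grouping under these rules (a sketch; other schedules are possible): instructions 1–4 share only B and X, and neither is ever written (neither appears on the left), so Rule 2 lets the four run together. Instruction 5 reads A, which instruction 1 writes, so it must wait for the group to finish:

  parallel_do {
      A = B + C
      D = 5 * X + Y
      E = P + Z / 7
      F = B + X
  }
  G = A * 10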

Page 33: Parallelization Problems

•Pointers allow a data structure to be distributed across memory
  •Parallel analysis is very difficult
•Loops can accommodate, restrict, or hide possible parallelization

  void addme(int n, double a[n], double b[n], double c[n])
  {
      int i;
      for (i = 0; i < n; ++i)
          a[i] = b[i] + c[i];
  }

Page 34: Examples

What happens here? The call aliases the arrays: with a+1 as the destination and a as both sources, iteration i computes a[i+1] = a[i] + a[i], so each iteration depends on the one before it – a loop-carried dependence that makes naive parallelization unsafe.

  double a[10];
  a[0] = 1;
  addme(9, a + 1, a, a);   /* destination a+1 overlaps sources a */

  void addme(int n, double a[n], double b[n], double c[n])
  {
      int i;
      for (i = 0; i < n; ++i)
          a[i] = b[i] + c[i];
  }

Page 35: Examples

Mandatory Parallelism vs. Optional Parallelism

"Explicit parallel programming constructs allow algorithms to be expressed without specifying unintended & unnecessary serial constraints."

  void callme()
  {
      foo();        /* serial: bar() starts only after foo() returns */
      bar();
  }

  void callme()
  {
      cilk_spawn foo();   /* optional parallelism: foo() may run in
                             parallel with the call to bar() */
      bar();
  }

Page 36: Structured Pattern-based Programming (1.4)

•Patterns: commonly recurring strategies for dealing with particular problems
  •Tools for parallel problems
  •Goals: parallel scalability, good data locality, reduced overhead
•Patterns are common across CS: OOP, natural language processing, data structures, SW engineering

Page 37: Patterns

•Abstractions: strategies or approaches which help to hide certain details & simplify problem solving
•Implementation Patterns – low-level, system-specific
•Design Patterns – high-level abstraction
•Algorithm Strategy Patterns – algorithm skeletons
  •Semantics – the pattern as a building block: task arrangement, data dependencies; abstract
  •Implementation – for a real machine: granularity, cache
  •Ideally, the two are treated separately, but that is not always possible

Page 38:

Figure 1.1 – Overview of Parallel Patterns

Page 39: Parallel Programming Models (1.5) / Desired Properties (1.5.1)

•Contemporary, popular languages were not designed to be parallel
  •They need a transition to parallel
•Desired properties of a parallel language (p. 21):
  •Performance – achievable, scalable, predictable, tunable
  •Productivity – expressive, composable, debuggable, maintainable
  •Portability – functionality & performance across systems (compilers & OSes)

Page 40: Programming – C, C++

The textbook focuses on:
•C++ & its parallel support
•Intel support:
  •Intel Threading Building Blocks (TBB) – Appendix C
    •C++ template library; open source & commercial versions
  •Intel Cilk Plus – Appendix B
    •Compiler extensions for C & C++; open source & commercial versions
•Other products available – Figure 1.2

Page 41: Abstractions vs. Mechanisms (1.5.2)

•Avoid HW mechanisms
  •Particularly vectors & threads
•Focus on:
  •TASKS – opportunities for parallelism
  •DECOMPOSITION – breaking the problem down
  •DESIGN of the ALGORITHM – the overall strategy

Page 42: Abstractions, not Mechanisms

"(Parallel) programming should focus on decomposition of the problem & design of the algorithm rather than the specific mechanisms by which it will be parallelized." (p. 23)

Reasons to avoid HW specifics (mechanisms):
•Reduced portability
•Difficult to manage nested parallelism
•Mechanisms vary by machine

Page 43: Regular Data Parallelism (1.5.3)

•Key to scalability – Data Parallelism
•Divide up the DATA, not the CODE!
•Data Parallelism: any form of parallelism in which the amount of work grows with the size of the problem
•Regular Data Parallelism: the subcategory of data parallelism that maps efficiently to vector instructions (see the sketch below)
•Parallel languages contain constructs for data parallelism
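As a minimal sketch of such a construct, Intel Cilk Plus (Appendix B) provides array notation: a[0:n] denotes the n elements starting at a[0], and the elementwise form below carries no serial ordering, so the compiler is free to use vector instructions. (The function name is illustrative.)

  /* Same effect as the addme() loop on page 33, minus the implied
     serial order: an elementwise sum over array sections. */
  void addme_vec(int n, double a[], double b[], double c[])
  {
      a[0:n] = b[0:n] + c[0:n];
  }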

Page 44: Composability (1.5.4)

Composability: the ability to use a feature in a program without regard to other features being used elsewhere
•Issues:
  •Incompatibility
  •Inability to support hierarchical composition (nesting)
•Oversubscription: a situation in nested parallelism in which a very large number of threads is created
  •Can lead to failure, inefficiency, inconsistency

Page 45: Process & Thread (p. 387)

[Slide diagram: one process containing shared Data and three threads – Thr1, Thr2, Thr3 – each with its own Stack]

•Thread: the smallest sequence of program instructions that can be managed independently
•Program → Processes → Threads
•"Cheap" context switch
•Multiple processors

Page 46: Portability & Performance (1.5.5 & 1.5.6)

•Portable: the ability to run on a variety of HW with little adjustment
  •Very desirable; C, C++, & Java are portable languages
•Performance Portability: the ability to maintain performance levels when run on a variety of HW
•Trade-off: general/portable vs. specific/high-performance

Page 47: Issues (1.5.7)

•Determinism vs. non-determinism
•Safety – ensuring only correct orderings occur
•Serially consistent
•Maintainability

Page 48:

Exam 1 on Chapter 1