1
TEXT: Structured Parallel Programming: Patterns for Efficient Computation
M. McCool, A. Robison, J. Reinders
Chapter 1
CMPS 5433 – Parallel Algorithms
Dr. Ranette Halverson
2
•Parallel? Sequential?
•In general, what does Sequential mean? Parallel?
•Is Parallel always better than Sequential?
•Examples??
•Parallel resources?
•How Efficient is Parallel?
•How can you measure Efficiency?
Introduction
3
Automatic Parallelization
vs. Explicit Parallel Programming
•Vector instructions
•Multithreaded cores
•Multicore processors
•Multiple processors
•Graphics engines
•Parallel co-processors
•Pipelining
•Multi-tasking
Parallel Computing Features
4
•All programmers must be able to program for parallel computers!!
•Design & Implement
• Efficient, Reliable, Maintainable
• Scalable
•Will try to avoid hardware issues…
All Modern Computers are Parallel
5
•Patterns: valuable algorithmic structures commonly seen in efficient parallel programs
•AKA: algorithm skeletons
•Examples of Patterns in Sequential Programming?
• Linear search
• Recursion ~~ Divide-and-Conquer
• Greedy Algorithm
Patterns for Efficient Computation
6
•Capture Algorithmic Intent
•Avoid mapping algorithms to particular hardware
•Focus on
• Performance
• Low-overhead implementation
• Achieving efficiency & scalability
•Think Parallel
Goals of This Book
7
•Scalable: ability to efficiently apply an algorithm to ‘any’ number of processors
•Considerations:
• Bottleneck: situation in which productive work is delayed due to limited resources relative to the amount of work; congestion occurring when work arrives more quickly than it can be processed
• Overhead: cost of the implementation of a (parallel) algorithm that is not part of the solution itself – often due to the distribution or collection of data & results
Scalable
8
•Act of putting a set of operations into a specific order
•Serial Semantics & Serial Illusion
•But were computers ever really serial?
•Have we become overly dependent on a sequential strategy?
•Pipelining, Multi-tasking
Serialization (1.1)
9
•Improved Performance
• The Past
• The Future
•Why is my new faster computer not faster??
•Deterministic algorithms
• End result vs. Intermediate results
• Timing issues – Relative vs. Absolute
Can we ignore parallel strategies?
10
• Structured Patterns for parallelism• Eliminate non-determinism as much as possible
Approach of THIS Book
11
Consider:
  A = B + 7
  Read Z
  C = X * 2
Does order matter??
  A = B + 7
  Read A
  B = X * 2
•Programmers think serial (sequential)
•List of instructions, executed in serial order
•Standard programming training
•Serial Trap: Assumption of serial ordering
Serial Traps
12
What does parallel_for imply? Are both correct? Will the results differ? What about time?
for (i = 0; i < num_web_sites; ++i)
    search(search_phrase, website[i]);
parallel_for (i = 0; i < num_web_sites; ++i)
    search(search_phrase, website[i]);
Example: Search web for a particular search phrase
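A minimal sketch of what parallel_for could expand to, using one C++ thread per site. Here `contains_phrase` and the in-memory site list are hypothetical stand-ins for the slide's `search` and `website[]`. Each iteration writes only its own result slot, so the iterations are independent and may run in any order.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the slide's search(): reports whether the phrase
// occurs in one site's (here, in-memory) text.
bool contains_phrase(const std::string& text, const std::string& phrase) {
    return text.find(phrase) != std::string::npos;
}

// parallel_for-style search: one thread per site. Each iteration writes
// only its own slot of `hits`. A vector<int> (not vector<bool>) is used
// because vector<bool> packs bits and concurrent writes would race.
std::vector<int> parallel_search(const std::vector<std::string>& sites,
                                 const std::string& phrase) {
    std::vector<int> hits(sites.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < sites.size(); ++i)
        workers.emplace_back([&, i] { hits[i] = contains_phrase(sites[i], phrase); });
    for (auto& t : workers) t.join();
    return hits;
}
```

Because no iteration reads another's output, the parallel and serial versions give the same results; only the running time differs.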
13
parallel_do {
    A = B + 7
    Read A
    B = X * 2
}

parallel_do {
    A = B + 7
    Read Z
    C = X * 2
}
Can we apply “parallel” to previous example?? Results?
14
Complexity of a Sequential Algorithm
• Time complexity – Big Oh
• What do we mean?
• Why do we measure performance?

Complexity of a Parallel Algorithm
• Time – How can we compare?
• Work – How do we split problems across processors?
Performance (1.2)
15
•Doesn’t really mean time
•Assumptions about measuring “time”
• All instructions take the same amount of time
• Computers are the same speed
•Are these assumptions true?
•So, what is Time Complexity, really?
Time Complexity
16
Suppose we want to calculate & print 1000 payroll checks.
Each employee’s pay is independent of the others.
parallel_for (i = 0; i < num_employees; ++i)
    Calc_Pay(empl[i]);
Any problems with this?
Example: Payroll Checks
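One way the payroll loop might be parallelized with plain C++ threads, splitting the employees into contiguous chunks, one per worker. `calc_pay` and the hours/rate arrays are hypothetical stand-ins for the slide's `Calc_Pay` and employee records.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical pay rule standing in for the slide's Calc_Pay().
double calc_pay(double hours, double rate) { return hours * rate; }

// Split the employee range into one contiguous chunk per worker thread.
// Each chunk writes a disjoint slice of `pay`, so no locking is needed.
std::vector<double> payroll(const std::vector<double>& hours,
                            const std::vector<double>& rate,
                            unsigned workers) {
    std::size_t n = hours.size();
    std::vector<double> pay(n);
    std::vector<std::thread> pool;
    std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t lo = w * chunk, hi = std::min(n, lo + chunk);
        pool.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                pay[i] = calc_pay(hours[i], rate[i]);
        });
    }
    for (auto& t : pool) t.join();
    return pay;
}
```

Calculating pays in parallel is safe because the records are independent; printing the checks, by contrast, serializes on the shared printer.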
17
•What if all employee data is stored in an array? Can it be accessed in parallel?
•What about printing the checks?
•What if the computer has 2000 processors?
•Is this application a good candidate for a parallel program?
•How much faster can we do the job?
Payroll Checks (cont’d)
18
•How can you possibly sum a single list of integers in parallel?
•What is the optimal number of processors to use?
•Consider the number of operations.
Example: Sum 1000 Integers
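A sketch of one answer, assuming p worker threads: each sums a disjoint slice into its own partial result, and the partials are combined at the end. Note the operation count: roughly n + p additions instead of the serial n − 1, illustrating that parallelizing adds work.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each of `p` workers sums a disjoint slice into its own partial; the
// partials are then combined serially. No two threads touch the same
// element of `partial`, so no synchronization is needed.
long parallel_sum(const std::vector<int>& v, unsigned p) {
    std::vector<long> partial(p, 0);
    std::vector<std::thread> pool;
    std::size_t chunk = (v.size() + p - 1) / p;
    for (unsigned w = 0; w < p; ++w) {
        std::size_t lo = w * chunk, hi = std::min(v.size(), lo + chunk);
        pool.emplace_back([&, w, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) partial[w] += v[i];
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

With 1000 integers, far fewer than 1000 processors is optimal: past a point, the cost of creating workers and combining partials exceeds the savings.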
19
Considerations of parallel processing
•Communication
• How is data distributed? How are results compiled?
• Shared memory vs. Local memory
• One computer vs. Network (e.g. SETI Project)
•Work
Actual Time vs. Number of Instructions
20
•The time required to perform the longest chain of tasks that must be performed sequentially
•SPAN limits the speed-up that can be accomplished via parallelism
•See Figures 1.1 & 1.2
•Provides a basis for optimization
SPAN = Critical Path
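The span's limit on speed-up can be stated numerically. The two bounds below are standard work-span results (background facts, not from this slide): with total work T1 and span T∞, speed-up is at most T1/T∞, and a greedy scheduler on p processors finishes in at most (T1 − T∞)/p + T∞ time.

```cpp
// Work-span bound: with work T1 (total operations) and span Tinf (longest
// serial chain), speedup on any number of processors is at most T1 / Tinf.
double max_speedup(double work, double span) { return work / span; }

// Greedy-scheduler bound (Brent's lemma): time on p processors satisfies
// Tp <= (T1 - Tinf) / p + Tinf.
double greedy_time_bound(double work, double span, unsigned p) {
    return (work - span) / p + span;
}
```

For example, 1000 operations with a critical path of 10 can never be sped up more than 100x, however many processors are available.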
21
•Shared memory model – book assumption
• Communication is simplified
• Causes bottlenecks ~~ Why? Locality
•Locality
• Memory (data) accesses tend to be close together
• Accesses near each other (in time & space) tend to be cheaper
•Communication – best when none
•More processors is not always better
Shared Memory & Locality (of reference)
22
•Proposed in 1967
•Provides an upper bound on the speed-up attainable via parallel processing
•Sequential vs. Parallel parts of a program
•The sequential portion limits parallelism
Amdahl’s Law (Argument)
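Amdahl's argument as a formula: if fraction f of the work is inherently serial, then p processors give speed-up 1/(f + (1 − f)/p), which can never exceed 1/f no matter how many processors are added.

```cpp
#include <cmath>

// Amdahl's argument: if fraction f of a program is serial, speedup on
// p processors is 1 / (f + (1 - f) / p). As p grows, this approaches 1 / f.
double amdahl_speedup(double serial_fraction, unsigned p) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
}
```

So a program that is only 10% serial is capped at 10x speed-up even with a million processors.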
23
Example: Consider a program with 1000 operations
• Are the ops independent?
• OFTEN, parallelizing adds to the total number of operations to be performed

3 components to consider when parallelizing
• Total Work
• Span
• Communication
Work-Span Model
24
•Communication strategy for “reducing” information from n workers down to 1
•O(log p), where p is the number of workers/processors
•Example: 16 processors hold 1 integer each & the integers need to be added together
•How can we effectively do this?
Reduction
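One effective answer is a binary tree: pair up the workers and add in rounds, halving the number of active values each time, so 16 integers are combined in log2(16) = 4 communication rounds instead of 15 serial steps. A serial simulation of the rounds (the function name and round counter are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Tree reduction simulated round by round: in each round, worker i takes
// the value held by worker i + stride, doubling the stride each time.
// All additions within one round are independent and could run in parallel.
int tree_reduce(std::vector<int> vals, int* rounds_out) {
    int rounds = 0;
    for (std::size_t stride = 1; stride < vals.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < vals.size(); i += 2 * stride)
            vals[i] += vals[i + stride];
        ++rounds;
    }
    if (rounds_out) *rounds_out = rounds;
    return vals[0];  // worker 0 ends up holding the total
}
```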
25
•The “opposite” of Reduction
•Communication strategy for distributing data from 1 worker to n workers
•Example: 1 processor holds 1 integer & the integer needs to be distributed to the 15 others
•How can we effectively do this?
Broadcast
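The same tree idea in reverse: in each round, every worker that already holds the value forwards it to one that does not, doubling the number of holders, so 16 workers are covered in log2(16) = 4 rounds rather than 15 one-at-a-time sends. A serial simulation (names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Tree broadcast simulated round by round: workers 0..covered-1 already
// hold the value and each sends to one new worker, so coverage doubles
// every round. Returns the number of rounds used.
int tree_broadcast(std::size_t workers, std::vector<int>& have) {
    have.assign(workers, 0);
    have[0] = 1;  // worker 0 starts with the value
    int rounds = 0;
    for (std::size_t covered = 1; covered < workers; covered *= 2) {
        for (std::size_t i = 0; i < covered && covered + i < workers; ++i)
            have[covered + i] = 1;  // worker i sends to worker covered + i
        ++rounds;
    }
    return rounds;
}
```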
26
•A process that ensures all processors have approximately the same amount of work/instructions
•To ensure processors are not idle
Load Balancing
27
•A measure of the degree to which a problem is broken down into smaller parts, particularly for assigning each part to a separate processor in a parallel computer
•Coarse granularity: Few parts, each relatively large
•Fine granularity: Many parts, each relatively small
Granularity
28
Moore’s Law – 1965, Gordon Moore of Intel
•The number of transistors on an integrated circuit doubles every 2 years (approximately)
•Until 2004 – increased transistor switching speeds increased clock rates, which increased performance
•Note Figures 1.3 & 1.4 in text
•BUT around 2004 – no more increase in clock rate, stalling near 3 GHz
Pervasive Parallelism & Hardware Trends (1.3.1)
29
•Power wall: power consumption grows non-linearly with clock speed
•Instruction-level parallelism wall: no new low-level (automatic) parallelism – only constant-factor increases
• Superscalar instructions
• Very Large Instruction Word (by compiler)
• Pipelining – max 10 stages
•Memory wall: growing difference between processor & memory speeds
• Latency
• Bandwidth
Walls limiting single processor performance
30
•Virtual memory
•Prefetching
•Cache
•Benchmarks
• Standard programs used to compare performance on different computers or to demonstrate a computer’s capabilities
•Early Parallelism – WW2
•Newer HW features
• Larger word size
• Superscalar
• Vector
• Multithreading
• Parallel ALUs
• Pipelines
• GPUs
Historical Trends (1.3.2)
31
•Serial Traps: unnecessary assumptions deriving from serial programming & execution
• Programmer assumes serial execution and so gives no consideration to possible parallelism
• Later, it is not possible to parallelize
•Automatic Parallelism ~ Different strategies
• Fully automatic – no help from programmer – compiler
• Parallel constructs – “optional”
Explicit Parallel Programming (1.3.3)
32
Rule 1: Parallelize a block of sequential instructions with no repeated variables.
Rule 2: Parallelize any block of sequential instructions if the repeated variables don’t change (do not appear on the left)
What about instruction 5? What can we do?
Consider the following simple program:
1. A = B + C
2. D = 5 * X + Y
3. E = P + Z / 7
4. F = B + X
5. G = A * 10
Examples of Automatic Parallelism
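The rules above suggest how this program could be parallelized: instructions 1–4 write distinct variables and read only variables that no one in the group changes, so they can run concurrently; instruction 5 reads A, so it must wait for instruction 1 to finish. A sketch with one thread per independent instruction (the `Vars` struct is purely illustrative):

```cpp
#include <thread>

// All the program's variables, gathered for the example.
struct Vars { double A, B, C, D, E, F, G, P, X, Y, Z; };

// Instructions 1-4 run in parallel; instruction 5 runs only after the
// joins, because it reads the A written by instruction 1.
void run_parallel(Vars& v) {
    std::thread t1([&] { v.A = v.B + v.C; });      // 1
    std::thread t2([&] { v.D = 5 * v.X + v.Y; });  // 2
    std::thread t3([&] { v.E = v.P + v.Z / 7; });  // 3
    std::thread t4([&] { v.F = v.B + v.X; });      // 4
    t1.join(); t2.join(); t3.join(); t4.join();
    v.G = v.A * 10;                                // 5: waits for 1
}
```

In effect, instruction 5 is handled by splitting the program into two parallel blocks separated by a synchronization point.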
33
•Pointers allow a data structure to be distributed across memory
• Parallel analysis is very difficult
•Loops can accommodate, restrict, or hide possible parallelization
void addme(int n, double a[], double b[], double c[])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
Parallelization Problems
34
double a[10];
a[0] = 1;
addme(9, a+1, a, a);

void addme(int n, double a[], double b[], double c[])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
Examples
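Executed with serial semantics, the aliased call above makes each element depend on the one just computed, a[i+1] = a[i] + a[i], so the array fills with powers of two. A parallel version reading stale values would produce something else, which is why a compiler cannot safely parallelize this loop from the pointers alone. A runnable version of the slide's example (parameter declarations completed to valid C/C++):

```cpp
// The slide's addme(): because of aliasing in the call below, the loop
// carries a dependence from each iteration to the next.
void addme(int n, double a[], double b[], double c[]) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

// Serial semantics: a[i+1] = a[i] + a[i], so a becomes 1, 2, 4, ..., 512.
void aliased_demo(double a[10]) {
    a[0] = 1;
    addme(9, a + 1, a, a);
}
```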
35
Mandatory Parallelism vs. Optional Parallelism
“Explicit parallel programming constructs allow algorithms to be expressed without specifying unintended & unnecessary serial constraints”
void callme() {
    foo();
    bar();
}

void callme() {
    cilk_spawn foo();
    bar();
}
Examples
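`cilk_spawn` only *permits* `foo()` to run in parallel with `bar()`; it does not require it. As a rough standard-C++ analogue for illustrating optional parallelism (this is not what Cilk Plus actually compiles to), `std::async` with its default launch policy lets the runtime either run `foo()` on another thread or defer it, and the program is correct either way:

```cpp
#include <future>

int foo() { return 1; }  // stand-ins for the slide's foo() and bar()
int bar() { return 2; }

// Optional parallelism: the runtime may run foo() concurrently with
// bar() or defer it until f.get() -- the result is identical either way,
// because neither call depends on the other.
int callme_parallel() {
    auto f = std::async(foo);  // may run concurrently with bar()
    int b = bar();
    return f.get() + b;        // joins with foo()'s result
}
```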
36
•Patterns: commonly recurring strategies for dealing with particular problems
•Tools for Parallel problems
•Goals: parallel scalability, good data locality, reduced overhead
•Common in CS: OOP, Natural language proc., data structures, SW engineering
Structured Pattern-based Programming (1.4)
37
•Abstractions: strategies or approaches which help to hide certain details & simplify problem solving
•Implementation Patterns – low-level, system specific
•Design Patterns – high-level abstraction
•Algorithm Strategy Patterns – Algorithm Skeletons
• Semantics – pattern = building block, task arrangement, data dependencies, abstract
• Implementation – for a real machine, granularity, cache
• Ideally, treat the two separately, but not always possible
Patterns
38
Figure 1.1 – Overview of Parallel Patterns
39
•Contemporary, popular languages – not designed for parallelism
• Need a transition to parallel
•Desired Properties of a Parallel Language (p. 21)
• Performance – achievable, scalable, predictable, tunable
• Productivity – expressive, composable, debuggable, maintainable
• Portability – functionality & performance across systems (compilers & OS)
Parallel Programming Models (1.5)
Desired Properties (1.5.1)
40
Textbook – Focus on
•C++ & parallel support
•Intel support
• Intel Threading Building Blocks (TBB) – Appendix C
• C++ Template Library, open source & commercial
• Intel Cilk Plus – Appendix B
• Compiler extensions for C & C++, open source & commercial
•Other products available – Figure 1.2
Programming – C, C++
41
•Avoid HW Mechanisms
• Particularly vectors & threads
•Focus on
• TASKS – opportunities for parallelism
• DECOMPOSITION – breaking the problem down
• DESIGN of ALGORITHM – overall strategy
Abstractions vs. Mechanisms (1.5.2)
42
“(Parallel) programming should focus on decomposition of problem & design of algorithm rather than specific mechanisms by which it will be parallelized.” p. 23
Reasons to avoid HW specifics (mechanisms)•Reduced portability•Difficult to manage nested parallelism•Mechanisms vary by machine
Abstractions, not Mechanisms
43
•Key to scalability – Data Parallelism
•Divide up DATA not CODE!
•Data Parallelism: any form of parallelism in which the amount of work grows with the size of the problem
•Regular Data Parallelism: subcategory of D.P. which maps efficiently to vector instructions
•Parallel languages contain constructs for D.P.
Regular Data Parallelism (1.5.3)
44
The ability to use a feature in a program without regard to other features being used elsewhere
•Issues:
• Incompatibility
• Inability to support hierarchical composition (nesting)
•Oversubscription: situation in nested parallelism in which a very large number of threads are created
• Can lead to failure, inefficiency, inconsistency
Composability (1.5.4)
45
[Diagram: a process’s shared Data segment, with threads Thr1, Thr2, Thr3 each holding its own Stack]
•Thread: smallest sequence of program instructions that can be managed independently
•Program → Processes → Threads
•“Cheap” Context Switch
•Multiple Processors
Thread (p. 387)
Process
46
•Portable: the ability to run on a variety of HW with few adjustments
• Very desirable; C, C++, Java are portable languages
•Performance Portability: the ability to maintain performance levels when run on a variety of HW
• Trade-off: General/Portable vs. Specific/Performance
Portability & Performance (1.5.5 & 1.5.6)
47
•Determinism vs. Non-determinism
•Safety – ensuring only correct orderings occur
•Serially Consistent
•Maintainability
Issues (1.5.7)
48
Exam 1 on Chapter 1