chap. 4 part 2 - uoguelph.ca
TRANSCRIPT
Chap. 4 Part 2
CIS*3090 Fall 2016
Fall 2016 CIS*3090 Parallel Programming 1
Highlights pp94-end
Two more Peril-L pseudocode features
Full/empty variables
Reduce/scan meta-operations
Three approaches to formulating parallelism
Fixed, scalable, unlimited
Synchronized memory
Create full/empty variable by putting apostrophe after name (always global)
int t’ = 0; //fill it initially
Like mailbox, has both value and state
State can be full or empty, duh!
Reading full → empty; filling empty → full
Reading empty or filling full blocks until the operation can complete
This is where auto inter-thread sync comes in!
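A minimal sketch of that mailbox behavior in the chapter's Peril-L notation (syntax approximate; the master/worker setup and the value 42 are illustrative, not from the book):

```
int work';                 // global full/empty variable, initially empty

forall (t in (0..1)) {
    if (t == 0) {
        work' = 42;        // master fills it; would block if already full
    } else {
        int job = work';   // worker blocks until full, reads, leaves it empty
    }
}
```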
Uses of full/empty variables
Abstracts away the explicit synchronization at pseudocode level
Like length-1 producer/consumer queue
Good way for master to control workers
Implementation
Easy to make with mutex and associated variable (shared mem)
Cluster (next slide…)
Cray has these in memory HW!
Full/empty variables for cluster?
Not so obvious…
Consider a Pilot channel as the variable
PI_Read nicely implements variable read
Blocks till other process writes
PI_Write doesn’t usually block (in practice)
The data is sitting “in the channel” and the caller of PI_Write (usually) continues
May be sufficient for full/empty semantics
Drawback: single data value → inefficient use of high-λ (high-latency) messages
Cray MTA supercomputer
Massively parallel
128 threads/CPU
No data caches → no coherency issues
Just run another thread during mem. access
Supported by high memory bandwidth
Role of full/empty bits
Ensures order of writing/reading threads
“Trap on write” reschedules waiting reader
Reduce and Scan “meta” operations
Reduction: applying same operator to elements of data set → single result
Scan: applying same operator to elements → series of intermediate results (as well as the reduced result)
Operator is associative & normally commutative!
APL notation “/” & “\” (e.g. +/1 2 3 → 6; +\1 2 3 → 1 3 6)
Online APLette interpreter http://lparis45.free.fr/apl.html
Reduce/scan
To collect/aggregate results from threads inside forall(){ … }
Case 1: each thread has a local count variable, and result is local
total = +/count;
Reduction carried out using each thread’s variable, and all threads get a copy of the sum in their local total variable
Implies effect of barrier sync since all threads’ contributions have to be available
Reduce/scan Case 2: result is global
total = +/count;
Reduction carried out as before, but result goes into global variable
Implies barrier sync and exclusive access to update global variable
Case 3: data (array) is global
total = +/countarray;
Same as for local data, but no need to sync
Each thread gets the same (local) result
Example: compare count 3s C code Fig 1.11 with
Peril-L Fig 4.1
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 10
Figure 1.11 (cf. Fig 4.1): The count3s_thread() for our third Count 3s solution using private_count array elements.
total = +/priv_count;
Three approaches to “formulating parallelism”
One of most valuable concepts in book, comes up over and over!
Means: 3 ways to think about parallel resources
Fixed, scalable, unlimited
Deciding no. of threads/processors to base the algo on
Best: Plan scalable solution from get-go
Two senses of “scalable”!
Scalable design vs. performance
At design time, program capable of automatically using all the processors you give it without having to be recoded
At run time, how close to perfect efficiency (=1) your program stays as the no. of processors is increased
You may program a scalable design, but find that in practice the extra processors don’t yield much speedup! (various reasons we know)
Third sense of “scalable”
Refers to increasing the problem size
Ideally, we would like linear scaling
4× bigger data → 4× run time
Depends on O() of algo, not only parallel design
Moral: pay attention when someone claims their program is “scalable”
Which kind are they claiming??
“Fixed” parallelism
Pre-decide how many threads to utilize in algo.
Limitation is obvious
If program gets more processors, can’t exploit them
“Unlimited” parallelism
In count 3s problem…
Assume we have as many processors (P) as array elements (n)
Possibly huge no. but maximally parallel! (p98)
Consequence: Each processor does microscopic amount of work
Just testing whether “its” element is a 3
Leaves the heavy lifting to reduce +/ function
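In Peril-L, the unlimited-parallelism count 3s collapses to almost nothing per thread (a rough sketch in the style of Fig 4.1; syntax approximate):

```
int total;                               // global result

forall (i in (0..n-1)) {                 // one logical thread per element
    int mine = (array[i] == 3) ? 1 : 0;  // microscopic local work
    total = +/ mine;                     // reduction does the heavy lifting
}
```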
A hypothetical approach
Want to expose the maximum amount of parallelism in the computation
But how to utilize on a real platform?
“Aggregate the algo’s concurrency into sequential code to match the actual available parallelism”
Means, serialize it to suit the actual no. of cores
“serializing based on unlimited parallelism is expensive”!
In real world, we won’t have P=n, usually P<<n
Go ahead and create n threads anyway → multiplexed onto P processors
Inefficient due to overheads (thread creation, scheduling)
Or, just create P threads
But won’t be feasible for compiler to reorganize the n-fold algo into P-fold
Win/lose of “unlimited” parallelism approach
Mapping n threads onto P processors
Effect of reducing parallelism you thought you had in the parallel algo
May be deceived about O() efficiency
Still, can be useful for exploiting very fine-grained parcels of work
If HW designed for large quantity of very light-weight threads → GPUs
OpenMP compiler packages loop iterations, farms out to P threads
Best advice
When we control the parallelism on typical SMP or clusters
Best off using scalable design approach!
Author’s tip (p99):
“Identifying parallelism is usually not the difficult part!”
“Difficulties lie in structuring the parallelism to manage and reduce interactions among threads” → interactions lead to performance loss
“Scalable” all-purpose approach
STEP 1: “examine how components of problem (data structures, work load, etc.) grow with n”, where:
n = size of the total problem, in terms of data set size, etc.
A1 bang queries: n = no. of records (collisions, vehicles, people)
“Scalable” all-purpose approach
STEP 2: “identify a set of substantial subproblems that can be solved independently”
A1 bang: one query on a data partition qualifies
Substantial computation due to size of partition
Partitions don’t depend on each other
Simply require some accumulation at end
Size of a subproblem s can be any amount
No. of subproblems S = n/s
Nature of subproblems
If data parallel
Subproblems are the same code operating on different data
If task parallel
Subproblems can be different code
Or can use hybrid breakdown
Pipeline with some data-parallel stages
“Scalable” all-purpose approach
STEP 3: at run time with P processors, ideally set S = P, then…
Each processor/thread works on its own subproblem
Eliminates idle (S<P) and oversubscribed (S>P) processors
If not practical, dynamically adjust the subproblem size s at run time to n/P
Scalable work distribution
Our program discovers P (PI_Configure)
If S > P, distribute the subproblems as evenly as possible (aka “load balancing”; more about this later)
If S < P, either don’t use so many processors, or go back and break down into smaller subproblems (redesign)
For A1 bang, not an issue since we know (large) size of file beforehand
But what about a very short test file with < P records?
Exploiting “Locality Principle”
Re: the design of subproblems
“substantial”
Appropriate granularity balances overhead with exploitation of available parallelism
“independent”
Reduces communication cost
“natural”
Minimizes additional computation needed to divide up data & merge back results
Case study: sorting n records in global memory
Lots of different sorting algos!
Find one with parallelism to exploit re key operation: comparing strings
Success in thinking up approaches seems to be based on:
Background knowledge of problem and applicable algos
Literature/web survey
Experience, intelligence/cleverness, luck!
Unlimited par. approach (Fig 4.4)
Since string compares are key…
Looking for max. no. of comparisons that can be done in parallel:
½ of strings vs. other ½ → n/2 processes
aka “Odd/Even Interchange Sort”
Good ‘ole “Bubble Sort” done in parallel
At O(n²), no more attractive than bubble sort, but easy to parallelize!
Fairest performance metric: compare time to fastest sort (quicksort), not serial bubble sort!
Odd/Even Interchange Sort
Note alternating phases
Identical code, just different odd vs. even starting points
Key features:
Each thread’s comparison independent of others’
Uses Boolean &&/ reduction to collect local “done” flags at end of phase 2
Implicit barrier at end of each forall
Figure 4.4 Odd/Even Interchange to
alphabetize a list L of records on field x.
Downside of algo
“Lots of copying” (the exchanges)
Will create churn on low speed global memory refs.
Would localization help?
No, since data is only used momentarily, and 50/50 chance of writeback
Fixed par. approach (Fig 4.5)
Identify a fixed parameter
No. of letters in alphabet = 26!
Plan: Let each process handle one letter
First phase (of 2): partition the records into 26 batches by starting letter
Each thread works on 1/26 portion
Count ALL letter frequencies (not just thread’s own letter at this phase)
Use +/ reduction to tally the count for its letter from other threads’ local count array element
Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.
End of phase 1
Now, knowing size of letter X’s batch…
Thread X can allocate enough storage for those records and copy them over
Second phase: thread X sorts letter X’s batch
Serious load imbalance!
Heavy lifting done by “alphabetizeInPlace()” taking different time for each thread
Purpose of +\ scan: gives each thread the starting position of its letter’s batch
Performance
Benefit:
Minimizes movement of global data to just 2 moves
Much better than odd/even exchange!
Downsides:
We can see problem of load imbalance
Anyway, it’s not scalable (past P=26)
Scalable par. approach
Batcher’s Bitonic Sort!
Based on sequential mergesort, similar to how Odd/Even is based on bubble sort
Based on a HW sorting network → naturally parallel
Parallel opportunity
Merge sort divides the work into batches, sorts them separately, then has a merge phase to recombine them
Parallel version can really sort batches independently!
Batcher’s algo
Requires t = 2^m threads
Split input into t batches, sort those independently: t-way parallel so far
Merge algo’s m phases keep all t threads busy
A phase selects different pairs of threads to merge ascending or descending order
Peril-L version (Fig. 4.7)
Threads exchange keys through buffers
Uses full/empty variables to sync threads
Lots of parallel algos out there!
Difficult to invent
Research problem
Published in journals and books (like Knuth’s famous “The Art of Computer Programming”)
Hopefully find one that will help you
Could be famous for inventing new ones!
Summary
Now have a useful pseudocode for describing parallel algorithms
Realistically takes memory ref. costs into account
Adds some useful bells/whistles:
“exclusive” blocks, full/empty vars
parallel reduce & scan operations
Have useful general approach to analyzing a problem for scalable parallel design