chap. 4 part 2 - uoguelph.ca
TRANSCRIPT
Chap. 4 Part 2
CIS*3090 Fall 2016
Fall 2016 CIS*3090 Parallel Programming 1
Highlights pp94-end
Two more Peril-L pseudocode features
Full/empty variables
Reduce/scan meta-operations
Three approaches to formulating parallelism
Fixed, scalable, unlimited
Synchronized memory
Create full/empty variable by putting apostrophe after name (always global)
int t’ = 0; //fill it initially
Like mailbox, has both value and state
State can be full or empty, duh!
Reading full → empty; filling empty → full
Reading empty or filling full blocks until the operation can complete
This is where auto inter-thread sync comes in!
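A minimal sketch of that mailbox behavior in the chapter's Peril-L notation (syntax approximate; the master/worker setup and the value 42 are illustrative, not from the book):

```
int work';                 // global full/empty variable, initially empty

forall (t in (0..1)) {
    if (t == 0) {
        work' = 42;        // master fills it; would block if already full
    } else {
        int job = work';   // worker blocks until full, reads, leaves it empty
    }
}
```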
Uses of full/empty variables
Abstracts away the explicit synchronization at pseudocode level
Like length-1 producer/consumer queue
Good way for master to control workers
Implementation
Easy to make with mutex and associated variable (shared mem)
Cluster (next slide…)
Cray has these in memory HW!
Full/empty variables for cluster?
Not so obvious…
Consider a Pilot channel as the variable
PI_Read nicely implements variable read
Blocks till other process writes
PI_Write doesn’t usually block (in practice)
The data is sitting “in the channel” and the caller of PI_Write (usually) continues
May be sufficient for full/empty semantics
Drawback: single data value → inefficient use of high-λ (high-latency) messages
Cray MTA supercomputer
Massively parallel
128 threads/CPU
No data caches → no coherency issues
Just run another thread during mem. access
Supported by high memory bandwidth
Role of full/empty bits
Ensures order of writing/reading threads
“Trap on write” reschedules waiting reader
Reduce and Scan “meta” operations
Reduction: applying same operator to elements of data set → single result
Scan: applying same operator to elements → series of intermediate results (as well as the reduced result)
Operator is associative & normally commutative!
APL notation “/” & “\” (e.g. +/1 2 3 → 6; +\1 2 3 → 1 3 6)
Online APLette interpreter http://lparis45.free.fr/apl.html
Reduce/scan
To collect/aggregate results from threads inside forall(){ … }
Case 1: each thread has a local count variable, and result is local
total = +/count;
Reduction carried out using each thread’s variable, and all threads get a copy of the sum in their local total variable
Implies effect of barrier sync since all threads’ contributions have to be available
Reduce/scan Case 2: result is global
total = +/count;
Reduction carried out as before, but result goes into global variable
Implies barrier sync and exclusive access to update global variable
Case 3: data (array) is global
total = +/countarray;
Same as for local data, but no need to sync
Each thread gets the same (local) result
Example: compare count 3s C code Fig 1.11 with
Peril-L Fig 4.1
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley 10
Figure 1.11 (cf. Fig 4.1): The count3s_thread() for our third Count 3s solution using private_count array elements.
total = +/priv_count;
Three approaches to “formulating parallelism”
One of most valuable concepts in book, comes up over and over!
Means: 3 ways to think about parallel resources
Fixed, scalable, unlimited
Deciding no. of threads/processors to base the algo on
Best: Plan scalable solution from get-go
Two senses of “scalable”!
Scalable design vs. performance
At design time, program capable of automatically using all the processors you give it without having to be recoded
At run time, how close to perfect efficiency (=1) your program stays as the no. of processors is increased
You may program a scalable design, but find that in practice the extra processors don’t yield much speedup! (various reasons we know)
Third sense of “scalable”
Refers to increasing the problem size
Ideally, we would like linear scaling
4× bigger data → 4× run time
Depends on O() of algo, not only parallel design
Moral: pay attention when someone claims their program is “scalable”
Which kind are they claiming??
“Fixed” parallelism
Pre-decide how many threads to utilize in algo.
Limitation is obvious
If program gets more processors, can’t exploit them
“Unlimited” parallelism
In count 3s problem…
Assume we have as many processors (P) as array elements (n)
Possibly huge no. but maximally parallel! (p98)
Consequence: Each processor does microscopic amount of work
Just testing whether “its” element is a 3
Leaves the heavy lifting to reduce +/ function
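In Peril-L, the unlimited-parallelism count 3s collapses to almost nothing per thread (a rough sketch in the style of Fig 4.1; syntax approximate):

```
int total;                               // global result

forall (i in (0..n-1)) {                 // one logical thread per element
    int mine = (array[i] == 3) ? 1 : 0;  // microscopic local work
    total = +/ mine;                     // reduction does the heavy lifting
}
```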
A hypothetical approach
Want to expose the maximum amount of parallelism in the computation
But how to utilize on a real platform?
“Aggregate the algo’s concurrency into sequential code to match the actual available parallelism”
Means, serialize it to suit the actual no. of cores
“serializing based on unlimited parallelism is expensive”!
In real world, we won’t have P=n, usually P<<n
Go ahead and create n threads anyway → multiplexed onto P processors
Inefficient due to overheads (thread creation, scheduling)
Or, just create P threads
But won’t be feasible for compiler to reorganize the n-fold algo into P-fold
Win/lose of “unlimited” parallelism approach
Mapping n threads onto P processors
Effect of reducing parallelism you thought you had in the parallel algo
May be deceived about O() efficiency
Still, can be useful for exploiting very fine-grained parcels of work
If HW designed for large quantity of very light-weight threads → GPUs
OpenMP compiler packages loop iterations, farms out to P threads
Best advice
When we control the parallelism on typical SMP or clusters
Best off using scalable design approach!
Author’s tip (p99):
“Identifying parallelism is usually not the difficult part!”
“Difficulties lie in structuring the parallelism to manage and reduce interactions among threads” → interactions lead to performance loss
“Scalable” all-purpose approach
STEP 1: “examine how components of problem (data structures, work load, etc.) grow with n”, where:
n = size of the total problem, in terms of data set size, etc.
A1 bang queries: n = no. of records (collisions, vehicles, people)
“Scalable” all-purpose approach
STEP 2: “identify a set of substantial subproblems that can be solved independently”
A1 bang: one query on a data partition qualifies
Substantial computation due to size of partition
Partitions don’t depend on each other
Simply require some accumulation at end
Size of a subproblem s can be any amount
No. of subproblems S = n/s
Nature of subproblems
If data parallel
Subproblems are the same code operating on different data
If task parallel
Subproblems can be different code
Or can use hybrid breakdown
Pipeline with some data-parallel stages
“Scalable” all-purpose approach
STEP 3: at run time with P processors, ideally set S = P, then…
Each processor/thread works on its own subproblem
Eliminates idle (S<P) and oversubscribed (S>P) processors
If not practical, dynamically adjust the subproblem size s at run time to n/P
Scalable work distribution
Our program discovers P (PI_Configure)
If S > P, distribute the subproblems as evenly as possible (aka “load balancing”; more about this later)
If S < P, either don’t use so many processors, or go back and break down into smaller subproblems (redesign)
For A1 bang, not an issue since we know (large) size of file beforehand
But what about a very short test file with < P records?
Exploiting “Locality Principle”
Re: the design of subproblems
“substantial”
Appropriate granularity balances overhead with exploitation of available parallelism
“independent”
Reduces communication cost
“natural”
Minimizes additional computation needed to divide up data & merge back results
Case study: sorting n records in global memory
Lots of different sorting algos!
Find one with parallelism to exploit re key operation: comparing strings
Success in thinking up approaches seems to be based on:
Background knowledge of problem and applicable algos
Literature/web survey
Experience, intelligence/cleverness, luck!
Unlimited par. approach (Fig 4.4)
Since string compares are key…
Looking for max. no. of comparisons that can be done in parallel:
½ of strings vs. other ½ → n/2 processes
aka “Odd/Even Interchange Sort”
Good ‘ole “Bubble Sort” done in parallel
At O(n²), no more attractive than bubble sort, but easy to parallelize!
Fairest performance metric: compare time to fastest sort (quicksort), not serial bubble sort!
Odd/Even Interchange Sort
Note alternating phases
Identical code, just different odd vs. even starting points
Key features:
Each thread’s comparison independent of others’
Uses Boolean &&/ reduction to collect local “done” flags at end of phase 2
Implicit barrier at end of each forall
Figure 4.4 Odd/Even Interchange to
alphabetize a list L of records on field x.
Downside of algo
“Lots of copying” (the exchanges)
Will create churn on low speed global memory refs.
Would localization help?
No, since data is only used momentarily, and 50/50 chance of writeback
Fixed par. approach (Fig 4.5)
Identify a fixed parameter
No. of letters in alphabet = 26!
Plan: Let each process handle one letter
First phase (of 2): partition the records into 26 batches by starting letter
Each thread works on 1/26 portion
Count ALL letter frequencies (not just thread’s own letter at this phase)
Use +/ reduction to tally the count for its letter from other threads’ local count array element
Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.
End of phase 1
Now, knowing size of letter X’s batch…
Thread X can allocate enough storage for those records and copy them over
Second phase: thread X sorts letter X’s batch
Serious load imbalance!
Heavy lifting done by “alphabetizeInPlace()” taking different time for each thread
Purpose of +\ scan: gives each thread the starting position of its letter’s batch
Performance
Benefit:
Minimizes movement of global data to just 2 moves
Much better than odd/even exchange!
Downsides:
We can see problem of load imbalance
Anyway, it’s not scalable (past P=26)
Scalable par. approach
Batcher’s Bitonic Sort!
Based on sequential mergesort, similar to how Odd/Even is based on bubble sort
Based on a HW sorting network → naturally parallel
Parallel opportunity
Merge sort divides the work into batches, sorts them separately, then has a merge phase to recombine them
Parallel version can really sort batches independently!
Batcher’s algo
Requires t = 2^m threads
Split input into t batches, sort those independently: t-way parallel so far
Merge algo’s m phases keep all t threads busy
A phase selects different pairs of threads to merge ascending or descending order
Peril-L version (Fig. 4.7)
Threads exchange keys through buffers
Uses full/empty variables to sync threads
Lots of parallel algos out there!
Difficult to invent
Research problem
Published in journals and books (like Knuth’s famous “The Art of Computer Programming”)
Hopefully find one that will help you
Could be famous for inventing new ones!
Summary
Now have a useful pseudocode for describing parallel algorithms
Realistically takes memory ref. costs into account
Adds some useful bells/whistles:
“exclusive” blocks, full/empty vars
parallel reduce & scan operations
Have useful general approach to analyzing a problem for scalable parallel design