
Page 1: Selections from CSE332  Data Abstractions  course at University  of  Washington

1

Selections from CSE332 Data Abstractions course at University of Washington

1. Introduction to Multithreading & Fork-Join Parallelism – hashtable, vector sum

2. Analysis of Fork-Join Parallel Programs – Maps, vector addition, linked lists versus trees for parallelism, work and span

3. Parallel Prefix, Pack, and Sorting

4. Shared-Memory Concurrency & Mutual Exclusion – Concurrent bank account, OpenMP nested locks, OpenMP critical section

5. Programming with Locks and Critical Sections – Simple-minded concurrent stack, hashtable revisited

6. Data Races and Memory Reordering, Deadlock, Reader/Writer Locks, Condition Variables

Page 2: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 1: Introduction to Multithreading & Fork-Join Parallelism

Dan Grossman

Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 3: Selections from CSE332  Data Abstractions  course at University  of  Washington

3Sophomoric Parallelism and Concurrency, Lecture 1

What to do with multiple processors?

• Next computer you buy will likely have 4 processors
  – Wait a few years and it will be 8, 16, 32, …
  – The chip companies have decided to do this (not a “law”)

• What can you do with them?
  – Run multiple totally different programs at the same time
    • Already do that? Yes, but with time-slicing
  – Do multiple things at once in one program
    • Our focus – more difficult
    • Requires rethinking everything from asymptotic complexity to how to implement data-structure operations

Page 4: Selections from CSE332  Data Abstractions  course at University  of  Washington

4Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism vs. Concurrency

Note: Terms not yet standard but the perspective is essential
  – Many programmers confuse these concepts

Parallelism: Use extra resources to solve a problem faster

Concurrency: Correctly and efficiently manage access to shared resources

There is some connection:
  – Common to use threads for both
  – If parallel computations need access to shared resources, then the concurrency needs to be managed

First 3-ish lectures on parallelism, then 3-ish lectures on concurrency

(Figure: several workers splitting work across extra resources, versus several requests contending for one shared resource.)

Page 5: Selections from CSE332  Data Abstractions  course at University  of  Washington

5Sophomoric Parallelism and Concurrency, Lecture 1

An analogy

CS1 idea: A program is like a recipe for a cook
  – One cook who does one thing at a time! (Sequential)

Parallelism:
  – Have lots of potatoes to slice?
  – Hire helpers, hand out potatoes and knives
  – But too many chefs and you spend all your time coordinating

Concurrency:
  – Lots of cooks making different things, but only 4 stove burners
  – Want to allow access to all 4 burners, but not cause spills or incorrect burner settings

Page 6: Selections from CSE332  Data Abstractions  course at University  of  Washington

6Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism Example

Parallelism: Use extra computational resources to solve a problem faster (increasing throughput via simultaneous execution)

Pseudocode for array sum
  – Bad style for reasons we’ll see, but may get roughly 4x speedup

int sum(int arr[], int len) {
  int res[4];
  FORALL(i=0; i < 4; i++) {   // parallel iterations
    res[i] = sumRange(arr, i*len/4, (i+1)*len/4);
  }
  return res[0] + res[1] + res[2] + res[3];
}

int sumRange(int arr[], int lo, int hi) {
  int result = 0;
  for (int j = lo; j < hi; j++)
    result += arr[j];
  return result;
}

Page 7: Selections from CSE332  Data Abstractions  course at University  of  Washington

7Sophomoric Parallelism and Concurrency, Lecture 1

Concurrency Example

Concurrency: Correctly and efficiently manage access to shared resources (from multiple possibly-simultaneous clients)

Pseudocode for a shared chaining hashtable
  – Prevent bad interleavings (correctness)
  – But allow some concurrent access (performance)

class Hashtable<K,V> {
  …
  void insert(K key, V value) {
    int bucket = …;
    prevent-other-inserts/lookups in table[bucket]
    do the insertion
    re-enable access to table[bucket]
  }
  V lookup(K key) {
    (like insert, but can allow concurrent lookups to same bucket)
  }
}
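The pseudocode above keeps the synchronization abstract. As a rough sketch only (not the course's code), the "prevent-other-inserts/lookups" step could be realized with one OpenMP lock per bucket; the names Node, NUM_BUCKETS, and hashOf below are assumptions made for illustration.

#include <omp.h>
#include <functional>

// Sketch: chaining hashtable with one OpenMP lock per bucket (assumed layout).
template<typename K, typename V>
class Hashtable {
  struct Node { K key; V value; Node* next; };
  static const int NUM_BUCKETS = 128;
  Node* table[NUM_BUCKETS];
  omp_lock_t locks[NUM_BUCKETS];          // one lock guards one chain
public:
  Hashtable() {
    for (int i = 0; i < NUM_BUCKETS; ++i) {
      table[i] = nullptr;
      omp_init_lock(&locks[i]);           // locks must be initialized before use
    }
  }
  void insert(const K& key, const V& value) {
    int bucket = hashOf(key) % NUM_BUCKETS;
    omp_set_lock(&locks[bucket]);         // prevent other inserts/lookups in this bucket
    table[bucket] = new Node{key, value, table[bucket]};
    omp_unset_lock(&locks[bucket]);       // re-enable access to the bucket
  }
private:
  static int hashOf(const K& key) {       // placeholder hash for the sketch
    return static_cast<int>(std::hash<K>{}(key) & 0x7fffffff);
  }
};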

Page 8: Selections from CSE332  Data Abstractions  course at University  of  Washington

8Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

The model we will assume is shared memory with explicit threads

Old story: A running program has
  – One call stack (with each stack frame holding local variables)
  – One program counter (current statement executing)
  – Static fields
  – Objects (created by new) in the heap (nothing to do with the heap data structure)

New story:
  – A set of threads, each with its own call stack & program counter
    • No access to another thread’s local variables
  – Threads can (implicitly) share static fields / objects
    • To communicate, write somewhere another thread reads

Page 9: Selections from CSE332  Data Abstractions  course at University  of  Washington

9Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

(Figure: several threads, each with its own program counter and unshared call stack of locals, all pointing into one shared pool of objects and static fields.)

Threads each have own unshared call stack and current statement
  – (pc for “program counter”)
  – local variables are numbers, null, or heap references

Any objects can be shared, but most are not

Page 10: Selections from CSE332  Data Abstractions  course at University  of  Washington

10Sophomoric Parallelism and Concurrency, Lecture 1

First attempt, Sum class - serial

class Sum {
  int ans;
  int len;
  int *arr;
public:
  Sum(int a[], int num) {     // constructor
    arr = a;
    ans = 0;
    len = num;
    for (int i = 0; i < num; ++i) {
      arr[i] = i + 1;         // initialize array
    }
  }
  int sum(int lo, int hi) {
    int ans = 0;
    for (int i = lo; i < hi; ++i) {
      ans += arr[i];
    }
    return ans;
  }
};

Page 11: Selections from CSE332  Data Abstractions  course at University  of  Washington

11Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism idea

• Example: Sum elements of a large array

• Idea: Have 4 threads simultaneously sum 1/4 of the array
  – Warning: Inferior first approach – explicitly using a fixed number of threads
  – Create 4 threads using OpenMP parallel
  – Determine thread ID, assign work based on thread ID
  – Accumulate partial sums for each thread
  – Add together their 4 answers for the final result

(Figure: partial sums ans0, ans1, ans2, ans3 combined with + into ans.)

Page 12: Selections from CSE332  Data Abstractions  course at University  of  Washington

12Sophomoric Parallelism and Concurrency, Lecture 1

OpenMP basics

First learn some basics of OpenMP

1. Pragma-based approach to parallelism
2. Create a pool of threads using #pragma omp parallel
3. Use work-sharing directives to allow each thread to do work
4. To get started, we will explore these work-sharing directives:
   1. parallel for
   2. reduction
   3. single
   4. task
   5. taskwait

Most of these constructs have an analog in Intel® Cilk Plus™ as well
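As a minimal illustration of item 2 above (not taken from the course materials): #pragma omp parallel creates a pool of threads, and every thread in the pool executes the block that follows. The thread count of 4 is an arbitrary choice for this sketch.

#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel num_threads(4)   // create a pool of 4 threads
  {
    // each thread in the pool runs this block once
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}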

Page 13: Selections from CSE332  Data Abstractions  course at University  of  Washington

13Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of parallel sum code – SPMD version

Discussion points:
  • #pragma omp parallel
  • num_threads clause
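The demo source itself is not reproduced in this transcript. The following is a minimal sketch of what an SPMD-style sum using #pragma omp parallel with a num_threads clause might look like; the array contents, LEN, and NUM_THREADS values are assumptions for illustration.

#include <omp.h>
#include <cstdio>

int main() {
  const int LEN = 1000000;
  static int arr[LEN];
  for (int i = 0; i < LEN; ++i) arr[i] = i + 1;

  const int NUM_THREADS = 4;
  long long partial[NUM_THREADS] = {0};           // one slot per thread

  #pragma omp parallel num_threads(NUM_THREADS)   // create the thread pool
  {
    int id = omp_get_thread_num();                // this thread's ID: 0..NUM_THREADS-1
    int lo = id * LEN / NUM_THREADS;
    int hi = (id + 1) * LEN / NUM_THREADS;
    for (int i = lo; i < hi; ++i)
      partial[id] += arr[i];                      // each thread sums its own quarter
  }

  long long ans = 0;
  for (int i = 0; i < NUM_THREADS; ++i)
    ans += partial[i];                            // main thread combines the partial sums
  printf("sum = %lld\n", ans);
  return 0;
}

(Adjacent partial[] slots can cause false sharing; that is acceptable for a sketch and is one of the reasons this style is called inferior.)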

Page 14: Selections from CSE332  Data Abstractions  course at University  of  Washington

14Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of OpenMP parallel for with reduction code

Discussion points:
  • #pragma omp parallel for
  • reduction(+:ans) clause
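Again the demo code is not included here; a minimal sketch of the parallel-for-with-reduction version might look like the following, with the array setup assumed for illustration. The reduction(+:ans) clause gives each thread a private copy of ans and combines the copies with + when the loop ends.

#include <omp.h>
#include <cstdio>

int main() {
  const int LEN = 1000000;
  static int arr[LEN];
  for (int i = 0; i < LEN; ++i) arr[i] = 1;

  long long ans = 0;
  // iterations are divided among the threads; each thread accumulates into a
  // private copy of ans, and the copies are summed at the end of the loop
  #pragma omp parallel for reduction(+:ans)
  for (int i = 0; i < LEN; ++i)
    ans += arr[i];

  printf("sum = %lld\n", ans);
  return 0;
}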

Page 15: Selections from CSE332  Data Abstractions  course at University  of  Washington

15Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory?

• Fork-join programs (thankfully) don’t require much focus on sharing memory among threads

• But in languages like C++, there is memory being shared. In our example:
  – lo, hi, arr fields written by “main” thread, read by helper thread
  – ans field written by helper thread, read by “main” thread

• When using shared memory, you must avoid race conditions
  – With concurrency, we’ll learn other ways to synchronize later, such as the use of atomics, locks, critical sections

Page 16: Selections from CSE332  Data Abstractions  course at University  of  Washington

16Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Want to use (only) processors “available to you now”
  – Not used by other programs or threads in your program
    • Maybe caller is also using parallelism
    • Available cores can change even while your threads run
  – If you have 3 processors available and using 3 threads would take time X, then creating 4 threads would take time 1.5X

// numThreads == numProcessors is bad
// if some are needed for other things
int sum(int arr[], int numThreads) {
  …
}

Page 17: Selections from CSE332  Data Abstractions  course at University  of  Washington

17Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Though unlikely for sum, in general sub-problems may take significantly different amounts of time
  – Example: Apply method f to every array element, but maybe f is much slower for some data items
    • Example: Is a large integer prime?
  – If we create 4 threads and all the slow data is processed by 1 of them, we won’t get nearly a 4x speedup
    • Example of a load imbalance

Page 18: Selections from CSE332  Data Abstractions  course at University  of  Washington

18Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

The counterintuitive (?) solution to all these problems is to use lots of threads, far more than the number of processors
  – But this will require changing our algorithm

(Figure: many partial results ans0, ans1, …, ansN combined into ans.)

1. Forward-portable: Lots of helpers each doing a small piece
2. Processors available: Hand out “work chunks” as you go
   • If 3 processors are available and you have 100 threads, then ignoring constant-factor overheads, the extra time is < 3%
3. Load imbalance: No problem if slow thread scheduled early enough
   • Variation probably small anyway if pieces of work are small

Page 19: Selections from CSE332  Data Abstractions  course at University  of  Washington

19Sophomoric Parallelism and Concurrency, Lecture 1

Divide-and-Conquer idea

This is straightforward to implement using divide-and-conquer
  – Parallelism for the recursive calls

(Figure: a binary tree of + operations combining pairs of partial sums level by level into one final sum.)

Page 20: Selections from CSE332  Data Abstractions  course at University  of  Washington

20Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of Divide and Conquer code

Discussion points:
  • #pragma omp task
  • omp shared
  • omp firstprivate
  • omp taskwait
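The demo code is not in this transcript; the sketch below shows one plausible shape for the divide-and-conquer sum using the listed constructs (task, shared, firstprivate, taskwait). The SEQUENTIAL_CUTOFF value is an assumption.

#include <omp.h>

static const int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff for the sketch

long long sumRange(int arr[], int lo, int hi) {
  if (hi - lo <= SEQUENTIAL_CUTOFF) {        // small piece: just do it sequentially
    long long ans = 0;
    for (int i = lo; i < hi; ++i) ans += arr[i];
    return ans;
  }
  long long left = 0, right = 0;
  int mid = lo + (hi - lo) / 2;
  #pragma omp task shared(left) firstprivate(arr, lo, mid)
  left = sumRange(arr, lo, mid);             // one half runs as a separate task
  right = sumRange(arr, mid, hi);            // this thread keeps the other half
  #pragma omp taskwait                       // wait for the child task before combining
  return left + right;
}

long long sum(int arr[], int len) {
  long long ans = 0;
  #pragma omp parallel                       // create the thread pool
  #pragma omp single                         // one thread starts the recursion; tasks spread out
  ans = sumRange(arr, 0, len);
  return ans;
}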

Page 21: Selections from CSE332  Data Abstractions  course at University  of  Washington

21Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced OpenMP parallel for

Discussion points:
  • omp schedule(dynamic)
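As above, the demo code is not reproduced; the following sketch shows how schedule(dynamic) can load-balance a map whose per-element cost varies (primality testing, echoing the earlier example). The chunk size of 64 and the function names are assumptions.

#include <omp.h>

// Deliberately uneven cost per element: trial division is much slower for primes.
bool isPrime(long long n) {
  if (n < 2) return false;
  for (long long d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

void markPrimes(const long long in[], int out[], int len) {
  // schedule(dynamic, 64): hand out iterations in chunks of 64 as threads finish,
  // instead of giving each thread one fixed block up front
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < len; ++i)
    out[i] = isPrime(in[i]) ? 1 : 0;
}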

Page 22: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 2: Analysis of Fork-Join Parallel Programs

Dan Grossman

Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 23: Selections from CSE332  Data Abstractions  course at University  of  Washington

23Sophomoric Parallelism and Concurrency, Lecture 2

Even easier: Maps (Data Parallelism)

• A map operates on each element of a collection independently to create a new collection of the same size
  – No combining results
  – For arrays, this is so trivial some hardware has direct support

• Canonical example: Vector addition

int vector_add(int lo, int hi) {
  FORALL(i=lo; i < hi; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return 1;
}

Page 24: Selections from CSE332  Data Abstractions  course at University  of  Washington

24Sophomoric Parallelism and Concurrency, Lecture 2

Maps in ForkJoin Framework

• Even though there is no result-combining, it still helps with load balancing to create many small tasks
  – Maybe not for vector-add but for more compute-intensive maps
  – The forking is O(log n) whereas theoretically other approaches to vector-add are O(1)

class VecAdd {
  int *arr1, *arr2, *res, len;
public:
  VecAdd(int _arr1[], int _arr2[], int _res[], int _num) {
    arr1 = _arr1; arr2 = _arr2; res = _res; len = _num;
  }
  int vector_add(int lo, int hi) {
    #pragma omp parallel for
    for (int i = lo; i < hi; ++i) {
      res[i] = arr1[i] + arr2[i];
    }
    return 1;
  }
}; // end class definition

Page 25: Selections from CSE332  Data Abstractions  course at University  of  Washington

25Sophomoric Parallelism and Concurrency, Lecture 2

Maps and reductions

Maps and reductions: the “workhorses” of parallel programming
  – By far the two most important and common patterns
    • Two more-advanced patterns in next lecture
  – Learn to recognize when an algorithm can be written in terms of maps and reductions
  – Use maps and reductions to describe (parallel) algorithms
  – Programming them becomes “trivial” with a little practice
    • Exactly like sequential for-loops seem second-nature

Page 26: Selections from CSE332  Data Abstractions  course at University  of  Washington

26Sophomoric Parallelism and Concurrency, Lecture 2

Trees

• Maps and reductions work just fine on balanced trees
  – Divide-and-conquer each child rather than array subranges
  – Correct for unbalanced trees, but won’t get much speed-up

• Example: minimum element in an unsorted but balanced binary tree in O(log n) time given enough processors

• How to do the sequential cut-off? (one possible sketch follows this slide)
  – Store number-of-descendants at each node (easy to maintain)
  – Or could approximate it with, e.g., AVL-tree height
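A possible shape for the tree-minimum example with a descendant-count cutoff (not the course's actual code; the node layout and cutoff value are assumptions):

#include <omp.h>
#include <algorithm>
#include <climits>

struct TreeNode {
  int value;
  int numDescendants;   // maintained by the tree implementation (assumption)
  TreeNode *left, *right;
};

static const int SEQUENTIAL_CUTOFF = 1000;

int treeMin(TreeNode* n) {
  if (n == nullptr) return INT_MAX;
  if (n->numDescendants <= SEQUENTIAL_CUTOFF) {   // small subtree: walk it sequentially
    return std::min(n->value, std::min(treeMin(n->left), treeMin(n->right)));
  }
  int leftMin = INT_MAX, rightMin = INT_MAX;
  #pragma omp task shared(leftMin) firstprivate(n)
  leftMin = treeMin(n->left);                     // left subtree as a separate task
  rightMin = treeMin(n->right);                   // right subtree in this thread
  #pragma omp taskwait
  return std::min(n->value, std::min(leftMin, rightMin));
}

int parallelTreeMin(TreeNode* root) {
  int result = INT_MAX;
  #pragma omp parallel
  #pragma omp single                              // one thread starts; tasks fan out
  result = treeMin(root);
  return result;
}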

Page 27: Selections from CSE332  Data Abstractions  course at University  of  Washington

27Sophomoric Parallelism and Concurrency, Lecture 2

Linked lists

• Can you parallelize maps or reduces over linked lists?
  – Example: Increment all elements of a linked list
  – Example: Sum all elements of a linked list

(Figure: a linked list with nodes b, c, d, e, f and front/back pointers.)

• Once again, data structures matter!

• For parallelism, balanced trees are generally better than lists so that we can get to all the data exponentially faster: O(log n) vs. O(n)
  – Trees have the same flexibility as lists compared to arrays

Page 28: Selections from CSE332  Data Abstractions  course at University  of  Washington

28Sophomoric Parallelism and Concurrency, Lecture 2

Work and Span

Let TP be the running time if there are P processors available

Two key measures of run-time:

• Work: How long it would take 1 processor = T1
  – Just “sequentialize” the recursive forking

• Span: How long it would take infinitely many processors = T∞
  – The longest dependence-chain
  – Example: O(log n) for summing an array, since more than n/2 processors is no additional help
  – Also called “critical path length” or “computational depth”

Page 29: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 4: Shared-Memory Concurrency & Mutual Exclusion

Dan Grossman

Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 30: Selections from CSE332  Data Abstractions  course at University  of  Washington

30Sophomoric Parallelism & Concurrency, Lecture 4

Canonical Bank Account Example

Correct code in a single-threaded world

class BankAccount {
  private int balance = 0;
  int getBalance() { return balance; }
  void setBalance(int x) { balance = x; }
  void withdraw(int amount) {
    int b = getBalance();
    if (amount > b)
      throw new WithdrawTooLargeException();
    setBalance(b - amount);
  }
  … // other operations like deposit, etc.
}

Page 31: Selections from CSE332  Data Abstractions  course at University  of  Washington

31Sophomoric Parallelism & Concurrency, Lecture 4

A bad interleaving

Interleaved withdraw(100) calls on the same account
  – Assume initial balance == 150

Thread 1                                  Thread 2
  int b = getBalance();
                                            int b = getBalance();
                                            if (amount > b)
                                              throw new WithdrawTooLargeException();
                                            setBalance(b - amount);
  if (amount > b)
    throw new WithdrawTooLargeException();
  setBalance(b - amount);

(Time flows downward.)

“Lost withdraw” – unhappy bank

Page 32: Selections from CSE332  Data Abstractions  course at University  of  Washington

32Sophomoric Parallelism & Concurrency, Lecture 4

What we need – an abstract data type for mutual exclusion

• There are many ways out of this conundrum, but we need help from the language

• One basic solution: Locks, or critical sections
  – OpenMP implements locks as well as critical sections. They do similar things, but there are subtle differences between them.

• #pragma omp critical
  {
    // allows only one thread access at a time in the code block
    // code block must have one entrance and one exit
  }

• omp_lock_t myLock;
  – OpenMP locks have to be initialized before use (omp_init_lock)
  – Locked with omp_set_lock and unlocked with omp_unset_lock; the nested variants use omp_nest_lock_t with omp_set_nest_lock and omp_unset_nest_lock
  – Only one thread may hold the lock
  – Allows for exception handling and non-structured jumps
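A minimal sketch contrasting the two mechanisms named above; the shared balance variable and function names are assumptions, and the lock must be initialized once before any thread uses it.

#include <omp.h>

int balance = 0;          // shared state (assumed example variable)
omp_lock_t balanceLock;

void initAccount() {
  omp_init_lock(&balanceLock);        // one-time initialization before parallel use
}

void depositWithCritical(int amount) {
  #pragma omp critical
  {                                   // only one thread at a time executes this block
    balance = balance + amount;
  }                                   // single entrance, single exit
}

void depositWithLock(int amount) {
  omp_set_lock(&balanceLock);         // blocks until the lock is available
  balance = balance + amount;
  omp_unset_lock(&balanceLock);       // release so other threads can proceed
}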

Page 33: Selections from CSE332  Data Abstractions  course at University  of  Washington

33Sophomoric Parallelism & Concurrency, Lecture 4

Almost-correct pseudocode

class BankAccount {
  private int balance = 0;
  private Lock lk = new Lock();
  …
  void withdraw(int amount) {
    lk.acquire();   // may block
    int b = getBalance();
    if (amount > b)
      throw new WithdrawTooLargeException();
    setBalance(b - amount);
    lk.release();
  }
  // deposit would also acquire/release lk
}

Page 34: Selections from CSE332  Data Abstractions  course at University  of  Washington

34Sophomoric Parallelism & Concurrency, Lecture 4

Some mistakes

• A lock & critical sections are very primitive mechanisms
  – Still up to you to use them correctly

• Incorrect: Use different locks or different named critical sections for withdraw and deposit
  – Mutual exclusion works only when using the same lock

• Poor performance: Use the same lock for every bank account
  – No simultaneous operations on different accounts

• Incorrect: Forget to release a lock (blocks other threads forever!)
  – Previous slide is wrong because of the exception possibility!

    if (amount > b) {
      lk.release();   // hard to remember!
      throw new WithdrawTooLargeException();
    }

Page 35: Selections from CSE332  Data Abstractions  course at University  of  Washington

35Sophomoric Parallelism & Concurrency, Lecture 4

Other operations

• If withdraw and deposit use the same lock, then simultaneous calls to these methods are properly synchronized

• But what about getBalance and setBalance?
  – Assume they’re public, which may be reasonable

• If they don’t acquire the same lock, then a race between setBalance and withdraw could produce a wrong result

• If they do acquire the same lock, then withdraw would block forever because it tries to acquire a lock it already has

Page 36: Selections from CSE332  Data Abstractions  course at University  of  Washington

36Sophomoric Parallelism & Concurrency, Lecture 4

Re-acquiring locks?

• Can’t let the outside world call setBalance1: it is not protected by locks

• Can’t have withdraw call setBalance2, because the locks would be nested

• We can use an intricate re-entrant locking scheme, or better yet re-structure the code. Nested locking is not recommended.

void setBalance1(int x) { balance = x; }

void setBalance2(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}

void withdraw(int amount) {
  lk.acquire();
  …
  setBalanceX(b - amount);
  lk.release();
}

Page 37: Selections from CSE332  Data Abstractions  course at University  of  Washington

37Sophomoric Parallelism & Concurrency, Lecture 4

This code is easier to lock

• You may provide a setBalance() method for external use, but do NOT call it from the withdraw() member function.

• Instead, protect the direct modification of the data as shown here – avoiding nested locks

void setBalance(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}

void withdraw(int amount) {
  lk.acquire();
  …
  balance = (b - amount);
  lk.release();
}

Page 38: Selections from CSE332  Data Abstractions  course at University  of  Washington

38Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced BankAccount code

Discussion points:
  • #pragma omp critical
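The demo's source is not included in this transcript; a minimal sketch of a BankAccount protected by a named critical section might look like this. Because a critical block must have a single entrance and exit, the sketch simply skips an overdraft rather than throwing (the exception-friendly alternative is the scoped-locking version shown later).

class BankAccount {
  int balance = 0;
public:
  void withdraw(int amount) {
    #pragma omp critical(bank)        // named critical section guarding the balance
    {
      int b = balance;
      if (amount <= b)                // can't throw out of a critical block,
        balance = b - amount;         // so an overdraft is simply ignored here
    }
  }
  void deposit(int amount) {
    #pragma omp critical(bank)        // same name => mutual exclusion with withdraw
    {
      balance = balance + amount;
    }
  }
  int getBalance() {
    int b;
    #pragma omp critical(bank)
    b = balance;
    return b;
  }
};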

Page 39: Selections from CSE332  Data Abstractions  course at University  of  Washington

39Sophomoric Parallelism & Concurrency, Lecture 4

OpenMP Critical Section method

• Works great & is simple to implement

• Can’t be used across nested function calls where each function sets a critical section. For example, foo() contains a critical section and calls bar(), while bar() also contains a critical section. This will not be allowed by the OpenMP runtime!

• Generally better to avoid nested calls such as the foo/bar example above, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()

• Can’t be used with try/catch exception handling (due to multiple exits from the code block)

• If try/catch is required, consider using scoped locking – example to follow.

Page 40: Selections from CSE332  Data Abstractions  course at University  of  Washington

40Sophomoric Parallelism & Concurrency, Lecture 4

Scoped Locking method

• Works great & is fairly simple to implement
• More complicated than the critical section method
• Can be used across nested function calls where each function acquires its lock
  – Generally it is still better to avoid nested calls such as the foo/bar example on the previous slide, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
• Can be used with try/catch exception handling
• The lock is released when the object or function goes out of scope.

Page 41: Selections from CSE332  Data Abstractions  course at University  of  Washington

41Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced BankAccount code

Discussion points:
  • scoped locking using OpenMP nested locks
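The demo's code is not reproduced here; one common way to get the "released when the object goes out of scope" behavior from the previous slide is an RAII guard around an OpenMP nested lock. The class names and the WithdrawTooLargeException type below are assumptions for the sketch, not the demo's actual code.

#include <omp.h>

struct WithdrawTooLargeException {};

// RAII guard for an OpenMP nested lock: the lock is released in the destructor,
// so it is freed on every path out of the scope, including thrown exceptions.
class ScopedNestLock {
  omp_nest_lock_t* lk;
public:
  explicit ScopedNestLock(omp_nest_lock_t* l) : lk(l) { omp_set_nest_lock(lk); }
  ~ScopedNestLock() { omp_unset_nest_lock(lk); }
  ScopedNestLock(const ScopedNestLock&) = delete;
  ScopedNestLock& operator=(const ScopedNestLock&) = delete;
};

class BankAccount {
  int balance = 0;
  omp_nest_lock_t lk;
public:
  BankAccount() { omp_init_nest_lock(&lk); }
  void setBalance(int x) {
    ScopedNestLock guard(&lk);        // safe even if withdraw already holds lk:
    balance = x;                      // a nest lock may be re-acquired by its owner
  }
  void withdraw(int amount) {
    ScopedNestLock guard(&lk);
    int b = balance;
    if (amount > b) throw WithdrawTooLargeException();
    setBalance(b - amount);           // nested acquisition is fine with a nest lock
  }                                   // guard's destructor releases on return or throw
};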

Page 42: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 5: Programming with Locks and Critical Sections

Dan Grossman

Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 43: Selections from CSE332  Data Abstractions  course at University  of  Washington

43Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced stack code

Discussion points:
  • using OpenMP nested locks
  • using OpenMP critical section

Page 44: Selections from CSE332  Data Abstractions  course at University  of  Washington

44Sophomoric Parallelism & Concurrency, Lecture 5

Example, using critical sections

class Stack<E> {
  private E[] array = (E[]) new Object[SIZE];
  int index = -1;
  bool isEmpty() {            // unsynchronized: wrong?!
    return index == -1;
  }
  void push(E val) {
    #pragma omp critical
    { array[++index] = val; }
  }
  E pop() {
    E temp;
    #pragma omp critical
    { temp = array[index--]; }
    return temp;
  }
  E peek() {                  // unsynchronized: wrong!
    return array[index];
  }
}

Page 45: Selections from CSE332  Data Abstractions  course at University  of  Washington

45Sophomoric Parallelism & Concurrency, Lecture 5

Why wrong?

• It looks like isEmpty and peek can “get away with this” since push and pop adjust the state “in one tiny step”

• But this code is still wrong and depends on language-implementation details you cannot assume
  – Even “tiny steps” may require multiple steps in the implementation: array[++index] = val probably takes at least two steps
  – Code has a data race, allowing very strange behavior
    • Important discussion in next lecture

• Moral: Don’t introduce a data race, even if every interleaving you can think of is correct

Page 46: Selections from CSE332  Data Abstractions  course at University  of  Washington

46Sophomoric Parallelism & Concurrency, Lecture 5

3 choices

For every memory location (e.g., object field) in your program, you must obey at least one of the following:
  1. Thread-local: Don’t use the location in > 1 thread
  2. Immutable: Don’t write to the memory location
  3. Synchronized: Use synchronization to control access to the location

(Figure: of all memory, the part that is neither thread-local nor immutable needs synchronization.)

Page 47: Selections from CSE332  Data Abstractions  course at University  of  Washington

47Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
  v2 = expensive(v1);
  table.remove(k);
  table.insert(k, v2);
}

Papa Bear’s critical section was too long
(table locked during expensive call)

Page 48: Selections from CSE332  Data Abstractions  course at University  of  Washington

48Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
}
v2 = expensive(v1);
#pragma omp critical
{
  table.remove(k);
  table.insert(k, v2);
}

Mama Bear’s critical section was too short
(if another thread updated the entry, we will lose an update)

Page 49: Selections from CSE332  Data Abstractions  course at University  of  Washington

49Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

done = false;
while (!done) {
  #pragma omp critical
  {
    v1 = table.lookup(k);
  }
  v2 = expensive(v1);
  #pragma omp critical
  {
    if (table.lookup(k) == v1) {
      done = true;
      table.remove(k);
      table.insert(k, v2);
    }
  }
}

Baby Bear’s critical section was just right
(if another update occurred, try our update again)

Page 50: Selections from CSE332  Data Abstractions  course at University  of  Washington

50Sophomoric Parallelism & Concurrency, Lecture 5

Don’t roll your own

• It is rare that you should write your own data structure
  – Provided in standard libraries
  – Point of these lectures is to understand the key trade-offs and abstractions

• Especially true for concurrent data structures
  – Far too difficult to provide fine-grained synchronization without race conditions
  – Standard thread-safe libraries, like TBB’s concurrent_hash_map, are written by world experts

Guideline #5: Use built-in libraries whenever they meet your needs