
Page 1: Selections from CSE332  Data Abstractions  course at University  of  Washington

1

Selections from CSE332 Data Abstractions course at University of Washington

1. Introduction to Multithreading & Fork-Join Parallelism – hashtable, vector sum

2. Analysis of Fork-Join Parallel Programs – Maps, vector addition, linked lists versus trees for parallelism, work and span

3. Parallel Prefix, Pack, and Sorting

4. Shared-Memory Concurrency & Mutual Exclusion – Concurrent bank account, OpenMP nested locks, OpenMP critical section

5. Programming with Locks and Critical Sections – Simple-minded concurrent stack, hashtable revisited

6. Data Races and Memory Reordering, Deadlock, Reader/Writer Locks, Condition Variables

Page 2: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 1: Introduction to Multithreading & Fork-Join Parallelism

Dan Grossman

Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 3: Selections from CSE332  Data Abstractions  course at University  of  Washington

3Sophomoric Parallelism and Concurrency, Lecture 1

What to do with multiple processors?

• Next computer you buy will likely have 4 processors
  – Wait a few years and it will be 8, 16, 32, …
  – The chip companies have decided to do this (not a “law”)

• What can you do with them?
  – Run multiple totally different programs at the same time
    • Already do that? Yes, but with time-slicing
  – Do multiple things at once in one program
    • Our focus – more difficult
    • Requires rethinking everything from asymptotic complexity to how to implement data-structure operations

Page 4: Selections from CSE332  Data Abstractions  course at University  of  Washington

4Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism vs. Concurrency

Note: Terms not yet standard but the perspective is essential
  – Many programmers confuse these concepts

Parallelism: Use extra resources to solve a problem faster

Concurrency: Correctly and efficiently manage access to shared resources

There is some connection:
  – Common to use threads for both
  – If parallel computations need access to shared resources, then the concurrency needs to be managed

First 3-ish lectures on parallelism, then 3-ish lectures on concurrency

(Figure: several workers splitting work across extra resources, versus several requests contending for one shared resource.)

Page 5: Selections from CSE332  Data Abstractions  course at University  of  Washington

5Sophomoric Parallelism and Concurrency, Lecture 1

An analogy

CS1 idea: A program is like a recipe for a cook
  – One cook who does one thing at a time! (Sequential)

Parallelism:
  – Have lots of potatoes to slice?
  – Hire helpers, hand out potatoes and knives
  – But too many chefs and you spend all your time coordinating

Concurrency:
  – Lots of cooks making different things, but only 4 stove burners
  – Want to allow access to all 4 burners, but not cause spills or incorrect burner settings

Page 6: Selections from CSE332  Data Abstractions  course at University  of  Washington

6Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism Example

Parallelism: Use extra computational resources to solve a problem faster (increasing throughput via simultaneous execution)

Pseudocode for array sum
  – Bad style for reasons we’ll see, but may get roughly 4x speedup

int sum(int arr[], int len) {
  int res[4];
  FORALL(i=0; i < 4; i++) {   // parallel iterations
    res[i] = sumRange(arr, i*len/4, (i+1)*len/4);
  }
  return res[0] + res[1] + res[2] + res[3];
}

int sumRange(int arr[], int lo, int hi) {
  int result = 0;
  for (int j = lo; j < hi; j++)
    result += arr[j];
  return result;
}

Page 7: Selections from CSE332  Data Abstractions  course at University  of  Washington

7Sophomoric Parallelism and Concurrency, Lecture 1

Concurrency Example

Concurrency: Correctly and efficiently manage access to shared resources (from multiple possibly-simultaneous clients)

Pseudocode for a shared chaining hashtable
  – Prevent bad interleavings (correctness)
  – But allow some concurrent access (performance)

class Hashtable<K,V> {
  …
  void insert(K key, V value) {
    int bucket = …;
    prevent-other-inserts/lookups in table[bucket]
    do the insertion
    re-enable access to table[bucket]
  }
  V lookup(K key) {
    (like insert, but can allow concurrent lookups to same bucket)
  }
}
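The pseudocode above keeps the synchronization abstract. As a rough sketch only (not the course's code), the "prevent-other-inserts/lookups" step could be realized with one OpenMP lock per bucket; the names Node, NUM_BUCKETS, and hashOf below are assumptions made for illustration.

#include <omp.h>
#include <functional>

// Sketch: chaining hashtable with one OpenMP lock per bucket (assumed layout).
template<typename K, typename V>
class Hashtable {
  struct Node { K key; V value; Node* next; };
  static const int NUM_BUCKETS = 128;
  Node* table[NUM_BUCKETS];
  omp_lock_t locks[NUM_BUCKETS];          // one lock guards one chain
public:
  Hashtable() {
    for (int i = 0; i < NUM_BUCKETS; ++i) {
      table[i] = nullptr;
      omp_init_lock(&locks[i]);           // locks must be initialized before use
    }
  }
  void insert(const K& key, const V& value) {
    int bucket = hashOf(key) % NUM_BUCKETS;
    omp_set_lock(&locks[bucket]);         // prevent other inserts/lookups in this bucket
    table[bucket] = new Node{key, value, table[bucket]};
    omp_unset_lock(&locks[bucket]);       // re-enable access to the bucket
  }
private:
  static int hashOf(const K& key) {       // placeholder hash for the sketch
    return static_cast<int>(std::hash<K>{}(key) & 0x7fffffff);
  }
};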

Page 8: Selections from CSE332  Data Abstractions  course at University  of  Washington

8Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

The model we will assume is shared memory with explicit threads

Old story: A running program has
  – One call stack (with each stack frame holding local variables)
  – One program counter (current statement executing)
  – Static fields
  – Objects (created by new) in the heap (nothing to do with the heap data structure)

New story:
  – A set of threads, each with its own call stack & program counter
    • No access to another thread’s local variables
  – Threads can (implicitly) share static fields / objects
    • To communicate, write somewhere another thread reads

Page 9: Selections from CSE332  Data Abstractions  course at University  of  Washington

9Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory

(Figure: several threads, each with its own program counter and unshared call stack of locals, all pointing into one shared pool of objects and static fields.)

Threads each have own unshared call stack and current statement
  – (pc for “program counter”)
  – local variables are numbers, null, or heap references

Any objects can be shared, but most are not

Page 10: Selections from CSE332  Data Abstractions  course at University  of  Washington

10Sophomoric Parallelism and Concurrency, Lecture 1

First attempt, Sum class - serial

class Sum {
  int ans;
  int len;
  int *arr;
public:
  Sum(int a[], int num) {     // constructor
    arr = a;
    ans = 0;
    len = num;
    for (int i = 0; i < num; ++i) {
      arr[i] = i + 1;         // initialize array
    }
  }
  int sum(int lo, int hi) {
    int ans = 0;
    for (int i = lo; i < hi; ++i) {
      ans += arr[i];
    }
    return ans;
  }
};

Page 11: Selections from CSE332  Data Abstractions  course at University  of  Washington

11Sophomoric Parallelism and Concurrency, Lecture 1

Parallelism idea

• Example: Sum elements of a large array

• Idea: Have 4 threads simultaneously sum 1/4 of the array
  – Warning: Inferior first approach – explicitly using a fixed number of threads
  – Create 4 threads using OpenMP parallel
  – Determine thread ID, assign work based on thread ID
  – Accumulate partial sums for each thread
  – Add together their 4 answers for the final result

(Figure: partial sums ans0, ans1, ans2, ans3 combined with + into ans.)

Page 12: Selections from CSE332  Data Abstractions  course at University  of  Washington

12Sophomoric Parallelism and Concurrency, Lecture 1

OpenMP basics

First learn some basics of OpenMP

1. Pragma-based approach to parallelism
2. Create a pool of threads using #pragma omp parallel
3. Use work-sharing directives to allow each thread to do work
4. To get started, we will explore these work-sharing directives:
   1. parallel for
   2. reduction
   3. single
   4. task
   5. taskwait

Most of these constructs have an analog in Intel® Cilk Plus™ as well
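As a minimal illustration of item 2 above (not taken from the course materials): #pragma omp parallel creates a pool of threads, and every thread in the pool executes the block that follows. The thread count of 4 is an arbitrary choice for this sketch.

#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel num_threads(4)   // create a pool of 4 threads
  {
    // each thread in the pool runs this block once
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}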

Page 13: Selections from CSE332  Data Abstractions  course at University  of  Washington

13Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of parallel sum code – SPMD version

Discussion points:
  • #pragma omp parallel
  • num_threads clause
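The demo source itself is not reproduced in this transcript. The following is a minimal sketch of what an SPMD-style sum using #pragma omp parallel with a num_threads clause might look like; the array contents, LEN, and NUM_THREADS values are assumptions for illustration.

#include <omp.h>
#include <cstdio>

int main() {
  const int LEN = 1000000;
  static int arr[LEN];
  for (int i = 0; i < LEN; ++i) arr[i] = i + 1;

  const int NUM_THREADS = 4;
  long long partial[NUM_THREADS] = {0};           // one slot per thread

  #pragma omp parallel num_threads(NUM_THREADS)   // create the thread pool
  {
    int id = omp_get_thread_num();                // this thread's ID: 0..NUM_THREADS-1
    int lo = id * LEN / NUM_THREADS;
    int hi = (id + 1) * LEN / NUM_THREADS;
    for (int i = lo; i < hi; ++i)
      partial[id] += arr[i];                      // each thread sums its own quarter
  }

  long long ans = 0;
  for (int i = 0; i < NUM_THREADS; ++i)
    ans += partial[i];                            // main thread combines the partial sums
  printf("sum = %lld\n", ans);
  return 0;
}

(Adjacent partial[] slots can cause false sharing; that is acceptable for a sketch and is one of the reasons this style is called inferior.)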

Page 14: Selections from CSE332  Data Abstractions  course at University  of  Washington

14Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of OpenMP parallel for with reduction code

Discussion points:
  • #pragma omp parallel for
  • reduction(+:ans) clause
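Again the demo code is not included here; a minimal sketch of the parallel-for-with-reduction version might look like the following, with the array setup assumed for illustration. The reduction(+:ans) clause gives each thread a private copy of ans and combines the copies with + when the loop ends.

#include <omp.h>
#include <cstdio>

int main() {
  const int LEN = 1000000;
  static int arr[LEN];
  for (int i = 0; i < LEN; ++i) arr[i] = 1;

  long long ans = 0;
  // iterations are divided among the threads; each thread accumulates into a
  // private copy of ans, and the copies are summed at the end of the loop
  #pragma omp parallel for reduction(+:ans)
  for (int i = 0; i < LEN; ++i)
    ans += arr[i];

  printf("sum = %lld\n", ans);
  return 0;
}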

Page 15: Selections from CSE332  Data Abstractions  course at University  of  Washington

15Sophomoric Parallelism and Concurrency, Lecture 1

Shared memory?

• Fork-join programs (thankfully) don’t require much focus on sharing memory among threads

• But in languages like C++, there is memory being shared. In our example:
  – lo, hi, arr fields written by “main” thread, read by helper thread
  – ans field written by helper thread, read by “main” thread

• When using shared memory, you must avoid race conditions
  – With concurrency, we’ll learn other ways to synchronize later, such as the use of atomics, locks, critical sections

Page 16: Selections from CSE332  Data Abstractions  course at University  of  Washington

16Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Want to use (only) processors “available to you now”
  – Not used by other programs or threads in your program
    • Maybe caller is also using parallelism
    • Available cores can change even while your threads run
  – If you have 3 processors available and using 3 threads would take time X, then creating 4 threads would take time 1.5X

// numThreads == numProcessors is bad
// if some are needed for other things
int sum(int arr[], int numThreads) {
  …
}

Page 17: Selections from CSE332  Data Abstractions  course at University  of  Washington

17Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

Though unlikely for sum, in general sub-problems may take significantly different amounts of time
  – Example: Apply method f to every array element, but maybe f is much slower for some data items
    • Example: Is a large integer prime?
  – If we create 4 threads and all the slow data is processed by 1 of them, we won’t get nearly a 4x speedup
    • Example of a load imbalance

Page 18: Selections from CSE332  Data Abstractions  course at University  of  Washington

18Sophomoric Parallelism and Concurrency, Lecture 1

Divide and Conquer Approach

The counterintuitive (?) solution to all these problems is to use lots of threads, far more than the number of processors
  – But this will require changing our algorithm

(Figure: many partial results ans0, ans1, …, ansN combined into ans.)

1. Forward-portable: Lots of helpers each doing a small piece
2. Processors available: Hand out “work chunks” as you go
   • If 3 processors are available and you have 100 threads, then ignoring constant-factor overheads, the extra time is < 3%
3. Load imbalance: No problem if slow thread scheduled early enough
   • Variation probably small anyway if pieces of work are small

Page 19: Selections from CSE332  Data Abstractions  course at University  of  Washington

19Sophomoric Parallelism and Concurrency, Lecture 1

Divide-and-Conquer idea

This is straightforward to implement using divide-and-conquer
  – Parallelism for the recursive calls

(Figure: a binary tree of + operations combining pairs of partial sums level by level into one final sum.)

Page 20: Selections from CSE332  Data Abstractions  course at University  of  Washington

20Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of Divide and Conquer code

Discussion points:
  • #pragma omp task
  • omp shared
  • omp firstprivate
  • omp taskwait
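The demo code is not in this transcript; the sketch below shows one plausible shape for the divide-and-conquer sum using the listed constructs (task, shared, firstprivate, taskwait). The SEQUENTIAL_CUTOFF value is an assumption.

#include <omp.h>

static const int SEQUENTIAL_CUTOFF = 1000;   // assumed cutoff for the sketch

long long sumRange(int arr[], int lo, int hi) {
  if (hi - lo <= SEQUENTIAL_CUTOFF) {        // small piece: just do it sequentially
    long long ans = 0;
    for (int i = lo; i < hi; ++i) ans += arr[i];
    return ans;
  }
  long long left = 0, right = 0;
  int mid = lo + (hi - lo) / 2;
  #pragma omp task shared(left) firstprivate(arr, lo, mid)
  left = sumRange(arr, lo, mid);             // one half runs as a separate task
  right = sumRange(arr, mid, hi);            // this thread keeps the other half
  #pragma omp taskwait                       // wait for the child task before combining
  return left + right;
}

long long sum(int arr[], int len) {
  long long ans = 0;
  #pragma omp parallel                       // create the thread pool
  #pragma omp single                         // one thread starts the recursion; tasks spread out
  ans = sumRange(arr, 0, len);
  return ans;
}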

Page 21: Selections from CSE332  Data Abstractions  course at University  of  Washington

21Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced OpenMP parallel for

Discussion points:
  • omp schedule(dynamic)
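As above, the demo code is not reproduced; the following sketch shows how schedule(dynamic) can load-balance a map whose per-element cost varies (primality testing, echoing the earlier example). The chunk size of 64 and the function names are assumptions.

#include <omp.h>

// Deliberately uneven cost per element: trial division is much slower for primes.
bool isPrime(long long n) {
  if (n < 2) return false;
  for (long long d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

void markPrimes(const long long in[], int out[], int len) {
  // schedule(dynamic, 64): hand out iterations in chunks of 64 as threads finish,
  // instead of giving each thread one fixed block up front
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < len; ++i)
    out[i] = isPrime(in[i]) ? 1 : 0;
}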

Page 22: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 2: Analysis of Fork-Join Parallel Programs

Dan Grossman

Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 23: Selections from CSE332  Data Abstractions  course at University  of  Washington

23Sophomoric Parallelism and Concurrency, Lecture 2

Even easier: Maps (Data Parallelism)

• A map operates on each element of a collection independently to create a new collection of the same size
  – No combining results
  – For arrays, this is so trivial some hardware has direct support

• Canonical example: Vector addition

int vector_add(int lo, int hi) {
  FORALL(i=lo; i < hi; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return 1;
}

Page 24: Selections from CSE332  Data Abstractions  course at University  of  Washington

24Sophomoric Parallelism and Concurrency, Lecture 2

Maps in ForkJoin Framework

• Even though there is no result-combining, it still helps with load balancing to create many small tasks
  – Maybe not for vector-add but for more compute-intensive maps
  – The forking is O(log n) whereas theoretically other approaches to vector-add are O(1)

class VecAdd {
  int *arr1, *arr2, *res, len;
public:
  VecAdd(int _arr1[], int _arr2[], int _res[], int _num) {
    arr1 = _arr1; arr2 = _arr2; res = _res; len = _num;
  }
  int vector_add(int lo, int hi) {
    #pragma omp parallel for
    for (int i = lo; i < hi; ++i) {
      res[i] = arr1[i] + arr2[i];
    }
    return 1;
  }
}; // end class definition

Page 25: Selections from CSE332  Data Abstractions  course at University  of  Washington

25Sophomoric Parallelism and Concurrency, Lecture 2

Maps and reductions

Maps and reductions: the “workhorses” of parallel programming
  – By far the two most important and common patterns
    • Two more-advanced patterns in next lecture
  – Learn to recognize when an algorithm can be written in terms of maps and reductions
  – Use maps and reductions to describe (parallel) algorithms
  – Programming them becomes “trivial” with a little practice
    • Exactly like sequential for-loops seem second-nature

Page 26: Selections from CSE332  Data Abstractions  course at University  of  Washington

26Sophomoric Parallelism and Concurrency, Lecture 2

Trees

• Maps and reductions work just fine on balanced trees
  – Divide-and-conquer each child rather than array subranges
  – Correct for unbalanced trees, but won’t get much speed-up

• Example: minimum element in an unsorted but balanced binary tree in O(log n) time given enough processors

• How to do the sequential cut-off? (one possible sketch follows this slide)
  – Store number-of-descendants at each node (easy to maintain)
  – Or could approximate it with, e.g., AVL-tree height
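A possible shape for the tree-minimum example with a descendant-count cutoff (not the course's actual code; the node layout and cutoff value are assumptions):

#include <omp.h>
#include <algorithm>
#include <climits>

struct TreeNode {
  int value;
  int numDescendants;   // maintained by the tree implementation (assumption)
  TreeNode *left, *right;
};

static const int SEQUENTIAL_CUTOFF = 1000;

int treeMin(TreeNode* n) {
  if (n == nullptr) return INT_MAX;
  if (n->numDescendants <= SEQUENTIAL_CUTOFF) {   // small subtree: walk it sequentially
    return std::min(n->value, std::min(treeMin(n->left), treeMin(n->right)));
  }
  int leftMin = INT_MAX, rightMin = INT_MAX;
  #pragma omp task shared(leftMin) firstprivate(n)
  leftMin = treeMin(n->left);                     // left subtree as a separate task
  rightMin = treeMin(n->right);                   // right subtree in this thread
  #pragma omp taskwait
  return std::min(n->value, std::min(leftMin, rightMin));
}

int parallelTreeMin(TreeNode* root) {
  int result = INT_MAX;
  #pragma omp parallel
  #pragma omp single                              // one thread starts; tasks fan out
  result = treeMin(root);
  return result;
}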

Page 27: Selections from CSE332  Data Abstractions  course at University  of  Washington

27Sophomoric Parallelism and Concurrency, Lecture 2

Linked lists

• Can you parallelize maps or reduces over linked lists?
  – Example: Increment all elements of a linked list
  – Example: Sum all elements of a linked list

(Figure: a linked list with nodes b, c, d, e, f and front/back pointers.)

• Once again, data structures matter!

• For parallelism, balanced trees are generally better than lists so that we can get to all the data exponentially faster: O(log n) vs. O(n)
  – Trees have the same flexibility as lists compared to arrays

Page 28: Selections from CSE332  Data Abstractions  course at University  of  Washington

28Sophomoric Parallelism and Concurrency, Lecture 2

Work and Span

Let TP be the running time if there are P processors available

Two key measures of run-time:

• Work: How long it would take 1 processor = T1
  – Just “sequentialize” the recursive forking

• Span: How long it would take infinitely many processors = T∞
  – The longest dependence-chain
  – Example: O(log n) for summing an array, since more than n/2 processors is no additional help
  – Also called “critical path length” or “computational depth”

Page 29: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 4: Shared-Memory Concurrency & Mutual Exclusion

Dan Grossman

Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 30: Selections from CSE332  Data Abstractions  course at University  of  Washington

30Sophomoric Parallelism & Concurrency, Lecture 4

Canonical Bank Account Example

Correct code in a single-threaded world

class BankAccount {
  private int balance = 0;
  int getBalance() { return balance; }
  void setBalance(int x) { balance = x; }
  void withdraw(int amount) {
    int b = getBalance();
    if (amount > b)
      throw new WithdrawTooLargeException();
    setBalance(b - amount);
  }
  … // other operations like deposit, etc.
}

Page 31: Selections from CSE332  Data Abstractions  course at University  of  Washington

31Sophomoric Parallelism & Concurrency, Lecture 4

A bad interleaving

Interleaved withdraw(100) calls on the same account
  – Assume initial balance == 150

Thread 1                                  Thread 2
  int b = getBalance();
                                            int b = getBalance();
                                            if (amount > b)
                                              throw new WithdrawTooLargeException();
                                            setBalance(b - amount);
  if (amount > b)
    throw new WithdrawTooLargeException();
  setBalance(b - amount);

(Time flows downward.)

“Lost withdraw” – unhappy bank

Page 32: Selections from CSE332  Data Abstractions  course at University  of  Washington

32Sophomoric Parallelism & Concurrency, Lecture 4

What we need – an abstract data type for mutual exclusion

• There are many ways out of this conundrum, but we need help from the language

• One basic solution: Locks, or critical sections
  – OpenMP implements locks as well as critical sections. They do similar things, but there are subtle differences between them.

• #pragma omp critical
  {
    // allows only one thread access at a time in the code block
    // code block must have one entrance and one exit
  }

• omp_lock_t myLock;
  – OpenMP locks have to be initialized before use (omp_init_lock)
  – Locked with omp_set_lock and unlocked with omp_unset_lock; the nested variants use omp_nest_lock_t with omp_set_nest_lock and omp_unset_nest_lock
  – Only one thread may hold the lock
  – Allows for exception handling and non-structured jumps
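A minimal sketch contrasting the two mechanisms named above; the shared balance variable and function names are assumptions, and the lock must be initialized once before any thread uses it.

#include <omp.h>

int balance = 0;          // shared state (assumed example variable)
omp_lock_t balanceLock;

void initAccount() {
  omp_init_lock(&balanceLock);        // one-time initialization before parallel use
}

void depositWithCritical(int amount) {
  #pragma omp critical
  {                                   // only one thread at a time executes this block
    balance = balance + amount;
  }                                   // single entrance, single exit
}

void depositWithLock(int amount) {
  omp_set_lock(&balanceLock);         // blocks until the lock is available
  balance = balance + amount;
  omp_unset_lock(&balanceLock);       // release so other threads can proceed
}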

Page 33: Selections from CSE332  Data Abstractions  course at University  of  Washington

33Sophomoric Parallelism & Concurrency, Lecture 4

Almost-correct pseudocode

class BankAccount {
  private int balance = 0;
  private Lock lk = new Lock();
  …
  void withdraw(int amount) {
    lk.acquire();   // may block
    int b = getBalance();
    if (amount > b)
      throw new WithdrawTooLargeException();
    setBalance(b - amount);
    lk.release();
  }
  // deposit would also acquire/release lk
}

Page 34: Selections from CSE332  Data Abstractions  course at University  of  Washington

34Sophomoric Parallelism & Concurrency, Lecture 4

Some mistakes

• A lock & critical sections are very primitive mechanisms
  – Still up to you to use them correctly

• Incorrect: Use different locks or different named critical sections for withdraw and deposit
  – Mutual exclusion works only when using the same lock

• Poor performance: Use the same lock for every bank account
  – No simultaneous operations on different accounts

• Incorrect: Forget to release a lock (blocks other threads forever!)
  – Previous slide is wrong because of the exception possibility!

    if (amount > b) {
      lk.release();   // hard to remember!
      throw new WithdrawTooLargeException();
    }

Page 35: Selections from CSE332  Data Abstractions  course at University  of  Washington

35Sophomoric Parallelism & Concurrency, Lecture 4

Other operations

• If withdraw and deposit use the same lock, then simultaneous calls to these methods are properly synchronized

• But what about getBalance and setBalance?
  – Assume they’re public, which may be reasonable

• If they don’t acquire the same lock, then a race between setBalance and withdraw could produce a wrong result

• If they do acquire the same lock, then withdraw would block forever because it tries to acquire a lock it already has

Page 36: Selections from CSE332  Data Abstractions  course at University  of  Washington

36Sophomoric Parallelism & Concurrency, Lecture 4

Re-acquiring locks?

• Can’t let the outside world call setBalance1: it is not protected by locks

• Can’t have withdraw call setBalance2, because the locks would be nested

• We can use an intricate re-entrant locking scheme, or better yet re-structure the code. Nested locking is not recommended.

void setBalance1(int x) { balance = x; }

void setBalance2(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}

void withdraw(int amount) {
  lk.acquire();
  …
  setBalanceX(b - amount);
  lk.release();
}

Page 37: Selections from CSE332  Data Abstractions  course at University  of  Washington

37Sophomoric Parallelism & Concurrency, Lecture 4

This code is easier to lock

• You may provide a setBalance() method for external use, but do NOT call it from the withdraw() member function.

• Instead, protect the direct modification of the data as shown here – avoiding nested locks

void setBalance(int x) {
  lk.acquire();
  balance = x;
  lk.release();
}

void withdraw(int amount) {
  lk.acquire();
  …
  balance = (b - amount);
  lk.release();
}

Page 38: Selections from CSE332  Data Abstractions  course at University  of  Washington

38Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced BankAccount code

Discussion points:
  • #pragma omp critical
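The demo's source is not included in this transcript; a minimal sketch of a BankAccount protected by a named critical section might look like this. Because a critical block must have a single entrance and exit, the sketch simply skips an overdraft rather than throwing (the exception-friendly alternative is the scoped-locking version shown later).

class BankAccount {
  int balance = 0;
public:
  void withdraw(int amount) {
    #pragma omp critical(bank)        // named critical section guarding the balance
    {
      int b = balance;
      if (amount <= b)                // can't throw out of a critical block,
        balance = b - amount;         // so an overdraft is simply ignored here
    }
  }
  void deposit(int amount) {
    #pragma omp critical(bank)        // same name => mutual exclusion with withdraw
    {
      balance = balance + amount;
    }
  }
  int getBalance() {
    int b;
    #pragma omp critical(bank)
    b = balance;
    return b;
  }
};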

Page 39: Selections from CSE332  Data Abstractions  course at University  of  Washington

39Sophomoric Parallelism & Concurrency, Lecture 4

OpenMP Critical Section method

• Works great & is simple to implement

• Can’t be used across nested function calls where each function sets a critical section. For example, foo() contains a critical section and calls bar(), while bar() also contains a critical section. This will not be allowed by the OpenMP runtime!

• Generally better to avoid nested calls such as the foo/bar example above, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()

• Can’t be used with try/catch exception handling (due to multiple exits from the code block)

• If try/catch is required, consider using scoped locking – example to follow.

Page 40: Selections from CSE332  Data Abstractions  course at University  of  Washington

40Sophomoric Parallelism & Concurrency, Lecture 4

Scoped Locking method

• Works great & is fairly simple to implement
• More complicated than the critical section method
• Can be used across nested function calls where each function acquires its lock
  – Generally it is still better to avoid nested calls such as the foo/bar example on the previous slide, and as also shown in the Canonical Bank Account Example where withdraw() calls setBalance()
• Can be used with try/catch exception handling
• The lock is released when the object or function goes out of scope.

Page 41: Selections from CSE332  Data Abstractions  course at University  of  Washington

41Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced BankAccount code

Discussion points:
  • scoped locking using OpenMP nested locks
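The demo's code is not reproduced here; one common way to get the "released when the object goes out of scope" behavior from the previous slide is an RAII guard around an OpenMP nested lock. The class names and the WithdrawTooLargeException type below are assumptions for the sketch, not the demo's actual code.

#include <omp.h>

struct WithdrawTooLargeException {};

// RAII guard for an OpenMP nested lock: the lock is released in the destructor,
// so it is freed on every path out of the scope, including thrown exceptions.
class ScopedNestLock {
  omp_nest_lock_t* lk;
public:
  explicit ScopedNestLock(omp_nest_lock_t* l) : lk(l) { omp_set_nest_lock(lk); }
  ~ScopedNestLock() { omp_unset_nest_lock(lk); }
  ScopedNestLock(const ScopedNestLock&) = delete;
  ScopedNestLock& operator=(const ScopedNestLock&) = delete;
};

class BankAccount {
  int balance = 0;
  omp_nest_lock_t lk;
public:
  BankAccount() { omp_init_nest_lock(&lk); }
  void setBalance(int x) {
    ScopedNestLock guard(&lk);        // safe even if withdraw already holds lk:
    balance = x;                      // a nest lock may be re-acquired by its owner
  }
  void withdraw(int amount) {
    ScopedNestLock guard(&lk);
    int b = balance;
    if (amount > b) throw WithdrawTooLargeException();
    setBalance(b - amount);           // nested acquisition is fine with a nest lock
  }                                   // guard's destructor releases on return or throw
};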

Page 42: Selections from CSE332  Data Abstractions  course at University  of  Washington

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency

Lecture 5: Programming with Locks and Critical Sections

Dan Grossman

Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/

Page 43: Selections from CSE332  Data Abstractions  course at University  of  Washington

43Sophomoric Parallelism and Concurrency, Lecture 1

Demo

Walk through of load balanced stack code

Discussion points:
  • using OpenMP nested locks
  • using OpenMP critical section

Page 44: Selections from CSE332  Data Abstractions  course at University  of  Washington

44Sophomoric Parallelism & Concurrency, Lecture 5

Example, using critical sections

class Stack<E> {
  private E[] array = (E[]) new Object[SIZE];
  int index = -1;
  bool isEmpty() {            // unsynchronized: wrong?!
    return index == -1;
  }
  void push(E val) {
    #pragma omp critical
    { array[++index] = val; }
  }
  E pop() {
    E temp;
    #pragma omp critical
    { temp = array[index--]; }
    return temp;
  }
  E peek() {                  // unsynchronized: wrong!
    return array[index];
  }
}

Page 45: Selections from CSE332  Data Abstractions  course at University  of  Washington

45Sophomoric Parallelism & Concurrency, Lecture 5

Why wrong?

• It looks like isEmpty and peek can “get away with this” since push and pop adjust the state “in one tiny step”

• But this code is still wrong and depends on language-implementation details you cannot assume
  – Even “tiny steps” may require multiple steps in the implementation: array[++index] = val probably takes at least two steps
  – Code has a data race, allowing very strange behavior
    • Important discussion in next lecture

• Moral: Don’t introduce a data race, even if every interleaving you can think of is correct

Page 46: Selections from CSE332  Data Abstractions  course at University  of  Washington

46Sophomoric Parallelism & Concurrency, Lecture 5

3 choices

For every memory location (e.g., object field) in your program, you must obey at least one of the following:
  1. Thread-local: Don’t use the location in > 1 thread
  2. Immutable: Don’t write to the memory location
  3. Synchronized: Use synchronization to control access to the location

(Figure: of all memory, the part that is neither thread-local nor immutable needs synchronization.)

Page 47: Selections from CSE332  Data Abstractions  course at University  of  Washington

47Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
  v2 = expensive(v1);
  table.remove(k);
  table.insert(k, v2);
}

Papa Bear’s critical section was too long
(table locked during expensive call)

Page 48: Selections from CSE332  Data Abstractions  course at University  of  Washington

48Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

#pragma omp critical
{
  v1 = table.lookup(k);
}
v2 = expensive(v1);
#pragma omp critical
{
  table.remove(k);
  table.insert(k, v2);
}

Mama Bear’s critical section was too short
(if another thread updated the entry, we will lose an update)

Page 49: Selections from CSE332  Data Abstractions  course at University  of  Washington

49Sophomoric Parallelism & Concurrency, Lecture 5

Example

Suppose we want to change the value for a key in a hashtable without removing it from the table
  – Assume a critical section guards the whole table

done = false;
while (!done) {
  #pragma omp critical
  {
    v1 = table.lookup(k);
  }
  v2 = expensive(v1);
  #pragma omp critical
  {
    if (table.lookup(k) == v1) {
      done = true;
      table.remove(k);
      table.insert(k, v2);
    }
  }
}

Baby Bear’s critical section was just right
(if another update occurred, try our update again)

Page 50: Selections from CSE332  Data Abstractions  course at University  of  Washington

50Sophomoric Parallelism & Concurrency, Lecture 5

Don’t roll your own

• It is rare that you should write your own data structure
  – Provided in standard libraries
  – Point of these lectures is to understand the key trade-offs and abstractions

• Especially true for concurrent data structures
  – Far too difficult to provide fine-grained synchronization without race conditions
  – Standard thread-safe libraries, like TBB’s concurrent_hash_map, are written by world experts

Guideline #5: Use built-in libraries whenever they meet your needs