MULTI-CORE PROGRAMMING Tim Harris


Page 1: MULTI-CORE PROGRAMMING - University of Cambridge · Suggested reading “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures,

MULTI-CORE PROGRAMMING

Tim Harris

Page 2:

Course overview

Building shared memory data structures

Lists, queues, hashtables, …

Why?

Used directly by applications (e.g., in C/C++, Java, C#, …)

Used in the language runtime system (e.g., management of work, implementations of message passing, …)

Used in traditional operating systems (e.g., synchronization between top/bottom-half code)

Why not?

Don’t think of “threads + shared data structures” as a default/good/complete/desirable programming model

It’s better to have shared memory and not need it…

Non-blocking data structures and transactional memory 2 03/11/2011

Page 3:

What do we care about?

How fast is it? Suppose I have a sequential implementation (no concurrency control at all): is the new implementation 5% slower? 5x slower? 100x slower?

How does it scale? How does performance change as we increase the number of threads? When does the implementation add or avoid synchronization?

Is it correct? What does it mean to be correct? e.g., does it matter if multiple concurrent threads are using iterators on a shared data structure at the same time?

How easy is it to write? Who is the target audience? How much effort can they put into it? Is implementing a data structure an undergrad programming exercise? …or a research paper?

When can I use it? Between threads in the same process? Between processes sharing memory? Within an interrupt handler? With/without some kind of runtime system support?

Page 5:

What should we do?

1. Be explicit about goals and trade-offs

A benefit in one dimension often has costs in another

Does a perf increase prevent a data structure from being used in some particular setting?

Does a technique to make something easier to write make the implementation slower?

Do we care? It depends on the setting

2. Remember, parallel programming is rarely a recreational activity

The ultimate goal is to increase perf (time, or resources used)

Does an implementation scale well enough to out-perform a good sequential implementation?

Page 6:

Suggested reading

“The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures, from both practical and theoretical perspectives

“Transactional memory, 2nd edition”, Harris, Larus, Rajwar – recently revamped survey of TM work, with 350+ references

“NOrec: streamlining STM by abolishing ownership records”, Dalessandro, Spear, Scott, PPoPP 2010

“Simplifying concurrent algorithms by exploiting transactional memory”, Dice, Lev, Marathe, Moir, Nussbaum, Olszewski, SPAA 2010

Page 7:

System model

[Diagram: several groups of h/w threads, each group with its own cache(s), above shared physical memory.]

Multiple h/w threads (whether separate cores, or SMT)

Shared physical memory with hardware cache coherence

S/W threads multiplexed over h/w threads under OS control

Page 8:

Multi-threaded single core

[Diagram: one core with two ALUs and several hardware thread contexts, a 64KB L1 cache, a 4MB L2 cache, and main memory.]

Multiple threads in a workload with:

- Poor spatial locality
- Frequent memory accesses

Page 9:

Multi-threaded single core

[Diagram: the same core — two ALUs, 64KB L1 cache, 4MB L2 cache, main memory — now running multiple threads whose resource needs complement each other.]

Multiple threads with synergistic resource needs

Page 10:

Multi-core h/w – common L2

[Diagram: two cores, each with two ALUs and a private L1 cache, sharing a common L2 cache above main memory.]

Pages 11–25: Multi-core h/w – common L2 (cache coherence walkthrough)

[Animation over several slides: two single-threaded cores, each with a private L1 cache, share an L2 cache above main memory. A Read by the first core fetches the line in Shared (S) state; a Read by the second core leaves S copies in both L1 caches. When one core then Writes, the other copies are invalidated (I) and the writer's copy becomes Exclusive then Modified (E/M). When the other core Writes in turn, the Modified copy is invalidated and ownership of the line migrates to the new writer.]

Page 26:

Multi-core h/w – separate L2

[Diagram: two single-threaded cores, each with its own L1 and L2 cache, above main memory.]

Page 27:

Multi-core h/w – additional L3

[Diagram: two single-threaded cores, each with a private L1 and L2 cache, sharing an L3 cache above main memory.]

Page 28:

Multi-threaded multi-core h/w

[Diagram: as before, but each core now runs several hardware threads; each core keeps a private L1 and L2 cache, with a shared L3 cache above main memory.]

Page 29:

SMP multiprocessor

[Diagram: two single-threaded cores, each with a private L1 and L2 cache, both attached to a single shared main memory.]

Page 30:

NUMA multiprocessor

[Diagram: four nodes joined by an interconnect; each node has two single-threaded cores with private L1 and L2 caches, plus its own memory & directory.]

Page 31:

Three kinds of parallel hardware

Multi-threaded cores:

- Increase utilization of a core or memory b/w
- Peak ops/cycle fixed

Multiple cores:

- Increase ops/cycle
- Don't necessarily scale caches and off-chip resources proportionately

Multi-processor machines:

- Increase ops/cycle
- Often scale cache & memory capacities and b/w proportionately

Page 32:

Course overview: structure

Building locks

Lock-free programming

Transactional memory

Page 33:

Course overview: structure

Building locks

Test-and-set locks

TATAS locks & backoff

Queue-based locks

Hierarchical locks

Reader-writer locks

Lock-free programming

Transactional memory

Page 34:

Test and set (pseudo-code)

bool testAndSet(bool *b) {
    bool result;
    atomic {
        result = *b;   // read the current contents of the location b points to…
        *b = TRUE;     // …set the contents of *b to TRUE
    }
    return result;
}

b is a pointer to a location holding a boolean value (TRUE/FALSE).
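The pseudo-code's atomic block corresponds to a single hardware atomic exchange. A minimal C11 sketch (assuming `<stdatomic.h>` is available; using an atomic_bool parameter rather than the slide's plain bool* is my adaptation):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One way to realise the atomic block on real hardware: C11's
   atomic_exchange performs the read-and-set as a single atomic
   operation (typically compiling to XCHG or LDXR/STXR). */
bool testAndSet(atomic_bool *b) {
    /* Returns the previous value and leaves *b == true. */
    return atomic_exchange(b, true);
}
```

The first caller sees false; every caller after that sees true until the flag is cleared again.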

Page 35:

Test and set

Suppose two threads use it at once: exactly one call returns false (that caller wins), and the other returns true.

Thread 1: testAndSet(b) -> false
Thread 2: testAndSet(b) -> true

Page 36:

Test and set lock

lock: FALSE

void acquireLock(bool *lock) {
    while (testAndSet(lock)) { /* Nothing */ }
}

void releaseLock(bool *lock) {
    *lock = FALSE;
}

FALSE => lock available; TRUE => lock held. Each acquireLock call tries to acquire the lock, with testAndSet returning TRUE if it is already held. NB: all this is pseudo-code, assuming SC memory.
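As a sketch of how the pseudo-code might look with explicit memory-model annotations rather than assumed SC, here is a C11 version (the acquire/release orderings are my choice, not from the slides):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set spinlock with C11 atomics. */
typedef struct { atomic_bool held; } tas_lock;

void acquireLock(tas_lock *l) {
    /* Spin while the exchange returns true (lock already held). */
    while (atomic_exchange_explicit(&l->held, true,
                                    memory_order_acquire)) {
        /* Nothing */
    }
}

void releaseLock(tas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

The acquire ordering on the exchange and release ordering on the store are the weakest pairing that still makes the critical section well-ordered.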

Page 37:

Test and set lock

[Diagram: lock now holds TRUE; Thread 1 has acquired the lock while Thread 2 spins in acquireLock, repeatedly calling testAndSet.]

Page 38:

What are the problems here?

The testAndSet implementation causes contention.

Page 39:

Contention from testAndSet

[Diagram: two single-threaded cores, each with private L1 and L2 caches, above main memory; the lock's cache line bounces between the caches as both cores run testAndSet.]

Pages 40–42: Multi-core h/w – separate L2

[Animation: each testAndSet(k) fetches the cache line holding k in exclusive mode, so the line ping-pongs between the two cores' caches even while every testAndSet fails.]

Does this still happen in practice? Do modern CPUs avoid fetching the line in exclusive mode on a failing TAS?

Page 43:

What are the problems here?

- Spinning may waste resources while waiting
- No control over locking policy
- The testAndSet implementation causes contention
- Only supports mutual exclusion: not reader-writer locking

Page 44:

General problem

There is no logical conflict between two failed lock acquires, but the cache protocol introduces a physical conflict. A good algorithm only introduces physical conflicts when a logical conflict occurs:

- In a lock: successful lock-acquire & failed lock-acquire
- In a set: successful insert(10) & failed insert(10)

Page 45:

Course overview: structure

Building locks

Test-and-set locks

TATAS locks & backoff

Queue-based locks

Hierarchical locks

Reader-writer locks

Lock-free programming

Transactional memory

Page 46:

Test and test and set lock

lock: FALSE

void acquireLock(bool *lock) {
    do {
        while (*lock) { }
    } while (testAndSet(lock));
}

void releaseLock(bool *lock) {
    *lock = FALSE;
}

FALSE => lock available; TRUE => lock held. Spin while the lock is held… only do the testAndSet when it is clear.
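A C11 sketch of the same lock (the memory orderings are my choice, not given in the slides): the waiting loop uses a plain atomic load, so once the line is cached in Shared state the spin generates no coherence traffic until the lock is released.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-test-and-set: read-only spinning, then one exchange. */
typedef struct { atomic_bool held; } tatas_lock;

void acquireLockTATAS(tatas_lock *l) {
    do {
        /* Read-only spin: no invalidations while the lock is held. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) { }
        /* Lock looked free: now attempt the atomic exchange. */
    } while (atomic_exchange_explicit(&l->held, true,
                                      memory_order_acquire));
}

void releaseLockTATAS(tatas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```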

Page 47:

Performance

[Graph: elapsed time vs # threads. An Ideal curve stays flat; TATAS degrades moderately as threads are added, while TAS degrades much more steeply.]

Based on Fig 7.4, Herlihy & Shavit, “The Art of Multiprocessor Programming”

Page 48:

Stampedes

lock: TRUE

void acquireLock(bool *lock) {
    do {
        while (*lock) { }
    } while (testAndSet(lock));
}

void releaseLock(bool *lock) {
    *lock = FALSE;
}

With a TATAS lock, every waiting thread spins on *lock. When the holder releases it, they all see FALSE at about the same time and rush to testAndSet at once: a stampede.

Page 49:

Back-off algorithms

1. Start by spinning, watching the lock, for a duration “c”

2. If the lock does not become free, spin locally for a duration “s” (without watching the lock)

What should “c” be? What should “s” be?

Page 50:

Time spent waiting “c”

Lower values:

- Less time to build up a set of threads that will stampede
- Less contention in the memory system, if remote reads incur a cost
- Risk of a delay in noticing when the lock becomes free

Higher values:

- Less likelihood of a delay between a lock being released and a waiting thread noticing

Page 51:

Local spinning time “s”

Lower values:

- More responsive to the lock becoming available

Higher values:

- If the lock doesn’t become available then the thread makes fewer accesses to the shared variable

Page 52:

Methodical approach

For a given workload and performance model:

What is the best that an oracle could do (i.e. given perfect knowledge of lock demands)?

How does a practical algorithm compare with this?

Look for an algorithm with a bound between its performance and that of the oracle

“Competitive spinning”

Page 53:

Rule of thumb

Spin for a duration that’s comparable with the shortest back-off interval

Exponentially increase the per-thread back-off interval (resetting it when the lock is acquired)

Use a maximum back-off interval that is large enough that waiting threads don’t interfere with the other threads’ performance
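One way the rule of thumb might be realised, as a hedged sketch: a TATAS lock whose per-thread back-off interval doubles on each failed testAndSet, resets on acquire, and is capped. MIN_DELAY and MAX_DELAY are illustrative tuning values, not figures from the lecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { MIN_DELAY = 16, MAX_DELAY = 4096 };  /* illustrative values */

typedef struct { atomic_bool held; } backoff_lock;

/* Local spinning: touches no shared memory while waiting. */
static void pause_for(unsigned spins) {
    for (volatile unsigned i = 0; i < spins; i++) { }
}

void acquireLockBackoff(backoff_lock *l) {
    unsigned delay = MIN_DELAY;             /* reset per acquire */
    for (;;) {
        /* Watch the lock until it looks free. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) { }
        if (!atomic_exchange_explicit(&l->held, true,
                                      memory_order_acquire))
            return;                         /* acquired */
        pause_for(delay);                   /* lost the race: back off */
        if (delay < MAX_DELAY) delay *= 2;  /* exponential, capped */
    }
}

void releaseLockBackoff(backoff_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

The cap keeps a long-waiting thread from backing off so far that it sits idle while the lock is free; the reset keeps a thread that just held the lock from being penalised on its next acquire.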

Page 54:

Course overview: structure

Building locks

Test-and-set locks

TATAS locks & backoff

Queue-based locks

Hierarchical locks

Reader-writer locks

Lock-free programming

Transactional memory

Page 55:

Queue-based locks

Lock holders queue up: immediately provides FCFS behavior

Each spins locally on a flag in their queue entry: no remote memory accesses while waiting

A lock release wakes the next thread directly: no stampede

Page 56:

MCS locks

[Diagram: the lock identifies the tail of a queue of QNodes (QNode 1 → QNode 2 → QNode 3), each holding a local flag, initially FALSE; the thread at the head holds the lock. Each flag is local: waiting threads spin on their own flag.]

Page 57:

MCS lock acquire

void acquireMCS(mcs *lock, QNode *qn) {
    QNode *prev;
    qn->flag = false;
    qn->next = NULL;
    while (true) {
        prev = lock->tail;                      // find previous tail node
        if (CAS(&lock->tail, prev, qn)) break;  // atomically replace “prev” with “qn” in the lock itself
    }
    if (prev != NULL) {
        prev->next = qn;                        // add link within the queue
        while (!qn->flag) { }                   // spin
    }
}

Page 58:

MCS lock release

void releaseMCS(mcs *lock, QNode *qn) {
    if (lock->tail == qn) {                     // if we were at the tail then remove us
        if (CAS(&lock->tail, qn, NULL)) return;
    }
    while (qn->next == NULL) { }                // wait for the next lock holder to announce themselves…
    qn->next->flag = TRUE;                      // …then signal them
}
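Putting the two halves together, a runnable C11 sketch of the MCS lock. The acquire’s CAS loop is collapsed into a single atomic exchange (a standard simplification, not the slide’s exact code), and memory orderings default to SC:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct QNode {
    _Atomic(struct QNode *) next;
    atomic_bool flag;
} QNode;

typedef struct { _Atomic(QNode *) tail; } mcs;

void acquireMCS(mcs *lock, QNode *qn) {
    atomic_store(&qn->flag, false);
    atomic_store(&qn->next, (QNode *)NULL);
    /* Atomically make qn the tail, remembering the previous tail. */
    QNode *prev = atomic_exchange(&lock->tail, qn);
    if (prev != NULL) {
        atomic_store(&prev->next, qn);       /* link into the queue */
        while (!atomic_load(&qn->flag)) { }  /* spin on our own flag */
    }
}

void releaseMCS(mcs *lock, QNode *qn) {
    if (atomic_load(&lock->tail) == qn) {
        QNode *expected = qn;
        /* No successor: clear the tail and the lock is free. */
        if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                           (QNode *)NULL))
            return;
    }
    /* A successor is enqueueing: wait for its link, then signal it. */
    QNode *next;
    while ((next = atomic_load(&qn->next)) == NULL) { }
    atomic_store(&next->flag, true);
}
```

Each thread spins only on its own QNode’s flag, so waiting generates no remote memory traffic, and release hands the lock directly to the next queued thread.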

Page 59:

Course overview: structure

Building locks

Test-and-set locks

TATAS locks & backoff

Queue-based locks

Hierarchical locks

Reader-writer locks

Lock-free programming

Transactional memory

Page 60:

Pages 60–62: Hierarchical locks

[Diagram: two clusters of four cores (1–4 and 5–8), each cluster sharing an L2 cache, connected by a memory bus to memory; successive animation frames show the lock being handed between cores.]

Pass the lock “nearby” if possible. Call this a “cluster” of cores.

Page 63:

Hierarchical TATAS with backoff

lock: -1

void acquireLock(int *lock) {
    int holder;
    do {
        holder = *lock;
        if (holder != -1) {
            if (holder == MY_CLUSTER) {
                BackOff(SHORT);
            } else {
                BackOff(LONG);
            }
        }
    } while (!CAS(lock, -1, MY_CLUSTER));
}

-1 => lock available; n => lock held by cluster n
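A C11 sketch of the same idea. SHORT_SPIN, LONG_SPIN, and the back_off helper are illustrative stand-ins for the slide’s BackOff(SHORT)/BackOff(LONG), and MY_CLUSTER becomes an explicit parameter:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { SHORT_SPIN = 64, LONG_SPIN = 1024 };  /* illustrative values */

static void back_off(unsigned spins) {
    for (volatile unsigned i = 0; i < spins; i++) { }
}

/* -1 => lock available; n => lock held by cluster n. Threads in the
   holder's cluster back off briefly, so they are more likely to grab
   the lock next and keep it "nearby". */
void acquireHierLock(atomic_int *lock, int my_cluster) {
    for (;;) {
        int holder = atomic_load_explicit(lock, memory_order_relaxed);
        if (holder == -1) {
            int expected = -1;
            if (atomic_compare_exchange_strong_explicit(
                    lock, &expected, my_cluster,
                    memory_order_acquire, memory_order_relaxed))
                return;                    /* acquired for our cluster */
        } else if (holder == my_cluster) {
            back_off(SHORT_SPIN);          /* holder is nearby */
        } else {
            back_off(LONG_SPIN);           /* holder is remote */
        }
    }
}

void releaseHierLock(atomic_int *lock) {
    atomic_store_explicit(lock, -1, memory_order_release);
}
```

Note the fairness hazard the next slide raises: the short back-off systematically favours the holder’s cluster, which can starve remote clusters.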

Page 64:

Hierarchical locks

[Diagram: the same two-cluster machine; the lock keeps being passed around within one cluster.]

Avoid this cycle repeating, starving 5 & 7…

Page 65:

Hierarchical CLH queue lock

Local queue:

Lock identifies tail

TRUE

myNode myPred

NULL

Flag => successor must wait

Non-blocking data structures and transactional memory 03/11/2011 65

Thread private variables represent links implicitly

Based on hierarchical CLH lock of Luchangco, Nussbaum, Shavit

Page 66: MULTI-CORE PROGRAMMING - University of Cambridge · Suggested reading “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures,

Hierarchical CLH queue lock

Local queue:

Lock identifies tail

TRUE

myNode myPred

TRUE

myNode myPred

NULL

Flag => successor must wait

Non-blocking data structures and transactional memory 03/11/2011 66

Page 67:

Hierarchical CLH queue lock

Local queue:

TRUE

myNode myPred

TRUE

myNode myPred

NULL

myNode myPred

TRUE

NULL Current lock

holder

Cluster master: sees lock is held, so waits a “combining delay”

Non-blocking data structures and transactional memory 03/11/2011 67

Global queue:

Page 68:

Hierarchical CLH queue lock

Local queue:

TRUE

myNode myPred

TRUE

myNode myPred

Global queue: myNode myPred

TRUE

NULL Splice whole list to tail of global queue

Set “Tail When Spliced” flag: next local queue entry will be a new cluster master

Non-blocking data structures and transactional memory 03/11/2011 68

Page 69:

Course overview: structure

Building locks

Test-and-set locks

TATAS locks & backoff

Queue-based locks

Hierarchical locks

Reader-writer locks

Lock-free programming

Transactional memory

Non-blocking data structures and transactional memory 69 03/11/2011

Page 70:

Reader-writer locks (TATAS-like)

Non-blocking data structures and transactional memory 70

lock: 0

void acquireWrite(int *lock) {
    do {
        if ((*lock == 0) &&
            (CAS(lock, 0, -1))) {
            break;
        }
    } while (1);
}

void releaseWrite(int *lock) {
    *lock = 0;
}

void acquireRead(int *lock) {
    do {
        int oldVal = *lock;
        if ((oldVal >= 0) &&
            (CAS(lock, oldVal, oldVal+1))) {
            break;
        }
    } while (1);
}

void releaseRead(int *lock) {
    FADD(lock, -1); // Atomic fetch-and-add
}

-1 => Locked for write    0 => Lock available    +n => Locked by n readers

03/11/2011

Page 71:

The problem with readers

Non-blocking data structures and transactional memory 71

int readCount() {
    acquireRead(lock);
    int result = count;
    releaseRead(lock);
    return result;
}

void incrementCount() {
    acquireWrite(lock);
    count++;
    releaseWrite(lock);
}

Each acquireRead fetches the cache line holding the lock in exclusive mode

Again: acquireRead calls are not logically conflicting, but this introduces a physical conflict

The time spent managing the lock is likely to vastly dominate the actual time looking at the counter

Many workloads are read-mostly…

03/11/2011

Page 72:

Keeping readers separate

Non-blocking data structures and transactional memory 72

Owner | Flag-1 | Flag-2 | Flag-3 | ... | Flag-N

Acquire write on core i:
CAS the owner from 0 to i...
...then spin until all of the flags are clear

Acquire read on core i:
set own flag to true...
...then check that the owner is 0
(if not then clear own flag and wait)

03/11/2011
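A minimal sketch of the flag-per-reader scheme above, assuming a fixed reader count `NREADERS` and GCC-style `__sync` builtins. This is illustrative only (it is not the TLRW byte-lock itself), and real code would pad each flag to its own cache line:

```c
#define NREADERS 4

/* One owner word plus one flag per reader core.
   Illustrative layout: a real implementation pads each flag
   to a full cache line to avoid false sharing. */
static volatile int owner = 0;
static volatile int flag[NREADERS];

void acquire_write(int core) {
    /* CAS the owner from 0 to core... */
    while (!__sync_bool_compare_and_swap(&owner, 0, core)) { }
    /* ...then spin until all of the flags are clear */
    for (int i = 0; i < NREADERS; i++) {
        while (flag[i]) { }
    }
}

void release_write(void) { owner = 0; }

void acquire_read(int core) {
    for (;;) {
        flag[core] = 1;          /* set own flag... */
        __sync_synchronize();    /* make the flag visible before checking */
        if (owner == 0) return;  /* ...then check that the owner is 0 */
        flag[core] = 0;          /* writer active: clear own flag and wait */
        while (owner != 0) { }
    }
}

void release_read(int core) { flag[core] = 0; }
```

With this structure, readers never write a shared cache line that other readers touch, which is the point of the scheme.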

Page 73:

Keeping readers separate

With care, readers do not need to synchronize with other readers

Extend the flags to be whole cache lines

Pack multiple locks’ flags for the same thread onto the same line

Exploit the cache structure in the machine: Dice & Shavit’s TLRW byte-lock

If the number of threads “N” is very large...

Dedicate the flags to specific important threads

Replace the flags with ordinary multi-reader locks

Non-blocking data structures and transactional memory 73 03/11/2011

Page 74:

Read-Copy-Update (RCU)

Non-blocking data structures and transactional memory 74

Use locking to serialize updates (typically)... but allow readers to operate concurrently with updates

Ensure that readers don’t go wrong if they access data mid-update

Have data structures reachable via a single root pointer: update the root pointer rather than updating the data structure in-place

Ensure that updates don’t affect readers – e.g., initializing nodes before splicing them into a list, and retaining “next” pointers in deleted nodes

Exact semantics offered can be subtle (ongoing research direction)

Memory management problems common with lock-free data structures

03/11/2011
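The root-pointer discipline can be sketched as follows. This is a minimal single-writer illustration assuming GCC's `__sync_synchronize`; production RCU implementations (e.g. the Linux kernel's) additionally defer freeing old nodes until a grace period has elapsed:

```c
#include <stdlib.h>

typedef struct node { int key; struct node *next; } node;

static node * volatile root;   /* single root pointer: readers start here */

/* Reader: traverses the structure reachable from a snapshot of the root.
   Because updaters publish fully-initialized nodes and old nodes are not
   freed here, a concurrent update cannot lead the reader astray. */
int rcu_find(int key) {
    for (node *n = root; n != NULL; n = n->next) {
        if (n->key == key) return 1;
    }
    return 0;
}

/* Updater (serialized by a lock in the scheme above, not shown):
   initialize the new node completely BEFORE the single root-pointer
   update that makes it visible. */
void rcu_insert_front(int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->next = root;
    __sync_synchronize();   /* ensure initialization is visible first */
    root = n;               /* publish: readers now see the new node */
}
```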

Page 75:

What do we care about?

Non-blocking data structures and transactional memory 75

How fast is it?

Is it correct? How easy is it to write?

When can I

use it?

How does it scale?

03/11/2011

Page 76:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 76 03/11/2011

Page 77:

What do people say is wrong with locks?

Ease of use    Performance    Applicability

Deadlock

Difficult to get right

Inhibit scaling

Convoy problems

Cost of some implementations

Non-composability

Priority inversion

Blocking

Non-blocking data structures and transactional memory 03/11/2011 77

Page 78:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 78 03/11/2011

Page 79:

What we’re building

A set of integers, represented by a sorted linked list

find(int) -> bool

insert(int) -> bool

delete(int) -> bool

Non-blocking data structures and transactional memory 03/11/2011 79

Page 80:

The building blocks

read(addr) -> val

write(addr, val)

cas(addr, old-val, new-val) -> bool

(I’ll assume that memory is sequentially consistent, and ignore allocation / de-allocation for the moment)

Non-blocking data structures and transactional memory 03/11/2011 80
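These building blocks map directly onto compiler intrinsics; one possible shim for experimenting with the examples that follow (the GCC/Clang `__sync` builtins are an assumption of this sketch, and C11 `<stdatomic.h>` would be the modern alternative):

```c
#include <stdbool.h>

/* read(addr) -> val and write(addr, val): under the sequential-consistency
   assumption made here, plain volatile accesses stand in for them. */
static inline int read_int(volatile int *addr) { return *addr; }
static inline void write_int(volatile int *addr, int val) { *addr = val; }

/* cas(addr, old-val, new-val) -> bool: atomically install new-val
   if the location still holds old-val. */
static inline bool cas(volatile int *addr, int old_val, int new_val) {
    return __sync_bool_compare_and_swap(addr, old_val, new_val);
}
```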

Page 81:

Searching a sorted list

find(20):

H 10 30 T

20?

find(20) -> false

Non-blocking data structures and transactional memory 03/11/2011 81

Page 82:

Inserting an item with CAS

insert(20):

H 10 30 T

20

30 20

insert(20) -> true

Non-blocking data structures and transactional memory 03/11/2011 82

Page 83:

Inserting an item with CAS

insert(20):

H 10 30 T

20

30 20

25

30 25

• insert(25):

Non-blocking data structures and transactional memory 03/11/2011 83

Page 84:

Searching and finding together

find(20)

H 10 30 T

-> false

20

20?

• insert(20) -> true

This thread saw 20 was not in the set...

...but this thread succeeded in putting it in!

• Is this a correct implementation of a set?

• Should the programmer be surprised if this happens?

• What about more complicated mixes of operations?

Non-blocking data structures and transactional memory 03/11/2011 84

Page 85:

Correctness criteria

“If it finds like a set, inserts like a set, and deletes like a set, then let’s call it a set...”

Non-blocking data structures and transactional memory 03/11/2011 85

Page 86:

Sequential specification

Ignore the list for the moment, and focus on the set:

find(int) -> bool

insert(int) -> bool

delete(int) -> bool

10, 20, 30

10, 15, 20, 30

10, 15, 30 10, 15, 20, 30

insert(15)->true

insert(20)->false delete(20)->true

Sequential: we’re only considering one operation on the set at a time

Specification: we’re saying what a set does, not what a list does, or how it looks in memory

Non-blocking data structures and transactional memory 03/11/2011 86

Page 87:

Sequential specification

deleteany() -> int 10, 20, 30

deleteany()->10

20, 30

deleteany()->20

10, 30

This is still a sequential spec... just not a deterministic one

Non-blocking data structures and transactional memory 03/11/2011 87

Page 88:

System model

Shared object (e.g. “set”)

find/insert/delete

Thread 1 ... Thread n: threads make invocations and receive responses from the set (~method calls/returns)

Primitive objects (e.g. “memory location”)

read/write/CAS: the set is implemented by making invocations and responses on memory

Non-blocking data structures and transactional memory 03/11/2011 88

Page 89:

High level: sequential history

time

T1: insert(10) -> true
T2: insert(20) -> true
T1: find(15) -> false

• No overlapping invocations:

10    10, 20    10, 20

Non-blocking data structures and transactional memory 03/11/2011 89

Page 90:

High level: concurrent history

time

• Allow overlapping invocations:

Thread 2:

Thread 1:

insert(10)->true insert(20)->true

find(20)->false

Non-blocking data structures and transactional memory 03/11/2011 90

Page 91:

Linearizability

• Is there a correct sequential history:

• Same results as the concurrent one

• Consistent with the timing of the invocations/responses?

Non-blocking data structures and transactional memory 03/11/2011 91

Page 92:

Example: linearizable

time

Thread 2:

Thread 1:

insert(10)->true insert(20)->true

find(20)->false A valid sequential

history: this concurrent execution is OK

Non-blocking data structures and transactional memory 03/11/2011 92

Page 93:

Example: linearizable

time

Thread 2:

Thread 1:

insert(10)->true delete(10)->true

find(10)->false

Non-blocking data structures and transactional memory 03/11/2011 93

A valid sequential history: this concurrent

execution is OK

Page 94:

Example: not linearizable

time

Thread 2:

Thread 1:

insert(10)->true insert(10)->false

delete(10)->true

Non-blocking data structures and transactional memory 03/11/2011 94

Page 95:

Returning to our example

• find(20)

H 10 30 T

-> false

20

20?

• insert(20) -> true

Thread 2:

Thread 1:

insert(20)->true

find(20)->false

A valid sequential history: this concurrent execution

is OK

Non-blocking data structures and transactional memory 03/11/2011 95

Page 96:

Recurring technique

For updates:

Perform an essential step of an operation by a single atomic instruction

E.g. CAS to insert an item into a list

This forms a “linearization point”

For reads:

Identify a point during the operation’s execution when the result is valid

Not always a specific instruction

Non-blocking data structures and transactional memory 03/11/2011 96
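For the sorted-list set, the essential step of insert is the single CAS that swings the predecessor's next pointer. A sketch under the stated assumptions (sequentially consistent memory, no reclamation), with `CAS` realised by a GCC builtin; deletion is omitted:

```c
#include <stdlib.h>
#include <limits.h>

typedef struct node { int key; struct node * volatile next; } node;

static node *make_node(int key, node *next) {
    node *n = malloc(sizeof *n);
    n->key = key; n->next = next;
    return n;
}

/* Head/tail sentinels so every insert has a predecessor. */
static node *tail_node, *head;

void list_init(void) {
    tail_node = make_node(INT_MAX, NULL);
    head = make_node(INT_MIN, tail_node);
}

int insert(int key) {
    for (;;) {
        node *pred = head, *curr = head->next;
        while (curr->key < key) { pred = curr; curr = curr->next; }
        if (curr->key == key) return 0;          /* already present */
        node *n = make_node(key, curr);
        /* Linearization point: the operation takes effect exactly when
           this CAS succeeds. On failure a concurrent update intervened,
           so retry from the start. */
        if (__sync_bool_compare_and_swap(&pred->next, curr, n)) return 1;
        free(n);
    }
}

int find(int key) {
    node *curr = head->next;
    while (curr->key < key) curr = curr->next;
    return curr->key == key;
}
```

For find, the result is justified by the moment the final link was read, illustrating the "not always a specific instruction" point above.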

Page 97:

Correctness (informal)

Non-blocking data structures and transactional memory 97

10, 20

H 10 20 T

15

10, 15, 20

Abstraction function maps the concrete list to the abstract set’s contents

03/11/2011

Page 98:

Correctness (informal)

Non-blocking data structures and transactional memory 98

time

Lookup(20) -> True

Insert(15) -> True

High-level operation

Primitive step (read/write/CAS)

H H->10 10->20 H H->10 New CAS

03/11/2011

Page 99:

Correctness (informal)

Non-blocking data structures and transactional memory 99

time

Lookup(20) -> True

Insert(15) -> True

H H->10 10->20 H H->10 New CAS

A left mover commutes with operations immediately before it

A right mover commutes with operations immediately after it

1. Show operations before linearization point are right movers

2. Show operations after linearization point are left movers

3. Show linearization point updates abstract state

03/11/2011

Page 100:

Correctness (informal)

Non-blocking data structures and transactional memory 100

time

Lookup(20) -> True

Insert(15) -> True

H H->10 10->20 H H->10 New CAS

A left mover commutes with operations immediately before it

A right mover commutes with operations immediately after it

Move these right over the read of the 10->20 link

03/11/2011

Page 101:

Adding “delete”

First attempt: just use CAS delete(10):

H 10 30 T

10 30

Non-blocking data structures and transactional memory 03/11/2011 101

Page 102:

Delete and insert:

delete(10) & insert(20):

H 10 30 T

10 30

20

30 20

Non-blocking data structures and transactional memory 03/11/2011 102

Page 103:

Logical vs physical deletion

H 10 30 T

20

10 30

30 30X

30 20

Non-blocking data structures and transactional memory 03/11/2011 103

Use a ‘spare’ bit to indicate logically deleted nodes:
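The 'spare' bit is typically the low-order bit of the (aligned) next pointer. A sketch of the encoding and of the logical-deletion step, assuming word-aligned node allocations and GCC `__sync` builtins:

```c
#include <stdint.h>

typedef struct node { int key; struct node * volatile next; } node;

/* Nodes are word-aligned, so the bottom bit of a next pointer is free:
   use it as the "logically deleted" mark on the node being deleted. */
static node *get_ref(node *p)   { return (node *)((uintptr_t)p & ~(uintptr_t)1); }
static int   is_marked(node *p) { return (int)((uintptr_t)p & 1); }
static node *marked(node *p)    { return (node *)((uintptr_t)p | 1); }

/* Logical deletion: atomically set the mark bit in curr->next, so that
   any concurrent CAS trying to insert after curr (or to change its
   successor) will fail. Physical unlinking (CASing the predecessor's
   next from curr to curr's unmarked successor) happens afterwards. */
int logically_delete(node *curr) {
    for (;;) {
        node *succ = curr->next;
        if (is_marked(succ)) return 0;   /* someone else already deleted it */
        if (__sync_bool_compare_and_swap(&curr->next, succ, marked(succ)))
            return 1;
    }
}
```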

Page 104:

Delete-greater-than-or-equal

DeleteGE(int x) -> int

Remove “x”, or next element above “x”

H 10 30 T

• DeleteGE(20) -> 30

H 10 T

Non-blocking data structures and transactional memory 03/11/2011 104

Page 105:

Does this work: DeleteGE(20)

H 10 30 T

1. Walk down the list, as in a normal delete, find 30 as next-after-20

2. Do the deletion as normal: set the mark bit in 30, then physically unlink

Non-blocking data structures and transactional memory 03/11/2011 105

Page 106:

Delete-greater-than-or-equal

time

Thread 2:

Thread 1:

insert(25)->true insert(30)->false

deleteGE(20)->30

A B

C

A must be after C (otherwise C should have returned 25)

C must be after B (otherwise B should have succeeded)

B must be after A (thread order)

Non-blocking data structures and transactional memory 03/11/2011 106

Page 107:

How to realise this is wrong

See operation which determines result

Consider a delay at that point

Is the result still valid?

Delayed read: is the memory still accessible (more of this next week)

Delayed write: is the write still correct to perform?

Delayed CAS: does the value checked by the CAS determine the result?

Non-blocking data structures and transactional memory 03/11/2011 107

Page 108:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 108 03/11/2011

Page 109:

Progress: is this a good “lock-free” list?

static volatile int MY_LIST = 0;

bool find(int key) {
    // Wait until list available
    while (!CAS(&MY_LIST, 0, 1)) {
    }
    ...
    // Release list
    MY_LIST = 0;
}

OK, we’re not calling pthread_mutex_lock... but we’re essentially doing the same thing

Non-blocking data structures and transactional memory 03/11/2011 109

Page 110:

“Lock-free”

A specific kind of non-blocking progress guarantee

Precludes the use of typical locks

From libraries

Or “hand rolled”

Often mis-used informally as a synonym for

Free from calls to a locking function

Fast

Scalable

Non-blocking data structures and transactional memory 03/11/2011 110

Page 111:

time

Wait-free

A thread finishes its own operation if it continues executing steps

(timeline diagram: each thread’s operation Starts and later Finishes, even while other threads’ operations are in progress)

Non-blocking data structures and transactional memory 03/11/2011 111


Page 112:

Implementing wait-free algorithms

General construction techniques exist (“universal constructions”)

In practice, often used in hybrid settings (e.g., wait-free find)

Queuing and helping strategies: everyone ensures oldest operation makes progress

Niches, e.g., bounded-wait-free in real-time systems

Non-blocking data structures and transactional memory 03/11/2011 112

Page 113:

time

Lock-free

Some thread finishes its operation if threads continue taking steps

(timeline diagram: several operations Start; some Finish, while others may Start and never finish)

Non-blocking data structures and transactional memory 03/11/2011 113

Page 114:

A (poor) lock-free counter

Non-blocking data structures and transactional memory 114

int getNext(int *counter) {
    while (true) {
        int result = *counter;
        if (CAS(counter, result, result+1)) {
            return result;
        }
    }
}

Not wait free: no guarantee that any

particular thread will succeed

03/11/2011

Page 115:

Implementing lock-free algorithms

Ensure that one thread (A) only has to repeat work if some other thread (B) has made “real progress”

e.g., insert(x) starts again if it finds that a conflicting update has occurred

Use helping to let one thread finish another’s work

e.g., physically deleting a node on its behalf

Non-blocking data structures and transactional memory 03/11/2011 115

Page 116:

time

Obstruction-free

A thread finishes its own operation if it runs in isolation

(timeline diagram: operations Start; interference here can prevent any operation finishing)

Non-blocking data structures and transactional memory 03/11/2011 116

Page 117:

A (poor) obstruction-free counter

Non-blocking data structures and transactional memory 117

int getNext(int *counter) {
    while (true) {
        int result = LL(counter);
        if (SC(counter, result+1)) {
            return result;
        }
    }
}

Weak load-linked (LL) store-conditional (SC): LL on one thread will prevent an SC on another thread

succeeding

03/11/2011

Page 118:

Building obstruction-free algorithms

Ensure that none of the low-level steps leave a data structure “broken”

On detecting a conflict:

Help the other party finish

Get the other party out of the way

Use contention management to reduce likelihood of live-lock

Non-blocking data structures and transactional memory 03/11/2011 118

Page 119:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 119 03/11/2011

Page 120:

Hash tables

0 16 24

5

3 11

Bucket array: 8 entries in example

List of items with hash val modulo 8 == 0

Non-blocking data structures and transactional memory 03/11/2011 120

Page 121:

Hash tables: Contains(16)

0 16 24

5

3 11

1. Hash 16. Use bucket 0

2. Use normal list operations

Non-blocking data structures and transactional memory 03/11/2011 121

Page 122:

Hash tables: Delete(11)

0 16 24

5

3 11

1. Hash 11. Use bucket 3

2. Use normal list operations

Non-blocking data structures and transactional memory 03/11/2011 122

Page 123:

Lessons from this hashtable

Informal correctness argument:

Operations on different buckets don’t conflict: no extra concurrency control needed

Operations appear to occur atomically at the point where the underlying list operation occurs

(Not specific to lock-free lists: could use whole-table lock, or per-list locks, etc.)

Non-blocking data structures and transactional memory 03/11/2011 123
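The bucket dispatch itself is only a hash and an index. A sketch, with a plain sequential list standing in for whichever per-bucket set implementation (locked or lock-free) is chosen; `ht_insert`/`ht_contains` are illustrative names, not from the slides:

```c
#include <stdlib.h>

#define NBUCKETS 8

typedef struct node { int key; struct node *next; } node;
static node *bucket[NBUCKETS];

/* Operations on different buckets never touch the same list, so no
   extra concurrency control is needed at this level: each bucket can
   carry its own lock-free list (a plain list stands in here). */
static int hash(int key) { return (unsigned)key % NBUCKETS; }

int ht_insert(int key) {
    node **b = &bucket[hash(key)];
    for (node *n = *b; n != NULL; n = n->next)
        if (n->key == key) return 0;       /* already present */
    node *n = malloc(sizeof *n);
    n->key = key; n->next = *b; *b = n;
    return 1;
}

int ht_contains(int key) {
    for (node *n = bucket[hash(key)]; n != NULL; n = n->next)
        if (n->key == key) return 1;
    return 0;
}
```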

Page 124:

Practical difficulties:

Key-value mapping

Population count

Iteration

Resizing the bucket array

Options to consider when implementing a “difficult” operation:

Relax the semantics (e.g., non-exact count, or non-linearizable count)

Fall back to a simple implementation if permitted (e.g., lock the whole table for resize)

Design a clever implementation (e.g., split-ordered lists)

Use a different data structure (e.g., skip lists)

Non-blocking data structures and transactional memory 03/11/2011 124

Page 125:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 125 03/11/2011

Page 126:

Skip lists

5 11 16 24 0 3

Each node is a “tower” of random size. High levels skip over lower levels

All items in a single list: this defines the set’s contents

Non-blocking data structures and transactional memory 03/11/2011 126

Page 127:

Skip lists: Delete(11)

5 11 16 24 0 3

Principle: lowest list is the truth

1. Find “11” node, mark it logically deleted

2. Link by link remove “11” from the towers

3. Finally, remove “11” from lowest list

Non-blocking data structures and transactional memory 03/11/2011 127

Page 128:

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Non-blocking data structures and transactional memory 128 03/11/2011

Page 129:

Work stealing queues

PushBottom(Item)
PopBottom() -> Item
PopTop() -> Item

PushBottom/PopBottom add and remove items at the local end; PopBottom must return an item if the queue is not empty

PopTop tries to steal an item, and may sometimes return nothing “spuriously”

1. Semantics relaxed for “PopTop”

2. Restriction: only one thread ever calls “PushBottom”/“PopBottom”

3. Implementation costs skewed toward “PopTop”, the complex case

Non-blocking data structures and transactional memory 03/11/2011 129

Page 130:

Bounded deque

(diagram: array slots 0-4, with Top / V0 and Bottom marking the two ends)

“Bottom” is a normal integer, updated only by the local end of the queue

Items between the indices are present in the queue

“Top” has a version number, updated atomically with it

Non-blocking data structures and transactional memory 03/11/2011 130

Arora, Blumofe, Plaxton

Page 131:

Bounded deque

(diagram: array slots 0-4, with Top / V0 and Bottom marking the two ends)

void pushBottom(Item i) {
    tasks[bottom] = i;
    bottom++;
}

Page 132

0

1

2

3

4

Bounded deque

Top / V0

Bottom

void pushBottom(Item i){

tasks[bottom] = i;

bottom++;

}

Item popBottom() {

if (bottom ==0) return null;

bottom--;

result = tasks[bottom];

<tmp_top,tmp_v> = <top,version>;

if (bottom > tmp_top) return result;

….

return null;

}

Page 133

Top / V1

0

1

2

3

4

Bounded deque

Top / V0

Bottom

void pushBottom(Item i){

tasks[bottom] = i;

bottom++;

}

Item popBottom() {

if (bottom ==0) return null;

bottom--;

result = tasks[bottom];

<tmp_top,tmp_v> = <top,version>;

if (bottom > tmp_top) return result;

….

return null;

}

if (bottom == tmp_top) {

bottom = 0;

if (CAS( &<top,version>,

<tmp_top,tmp_v>,

<0, tmp_v+1>)) {

return result;

}

}

<top,version> = <0, tmp_v+1>;

Item popTop() {

if (bottom <= top) return null;

<tmp_top,tmp_v> = <top, version>;

result = tasks[tmp_top];

if (CAS( &<top,version>,

<tmp_top, tmp_v>,

<tmp_top+1, tmp_v+1>)) {

return result;

}

return null;

}

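The versioned bounded deque above can be sketched as runnable Python. All names here are illustrative (this is not any library's API), and a lock stands in for the hardware CAS on the combined (top, version) word:

```python
import threading

class BoundedDeque:
    """Sketch of the Arora-Blumofe-Plaxton work-stealing deque.
    'top' carries a version number so a CAS on (top, version) detects
    interleaved steals even when the index returns to an old value."""

    def __init__(self, capacity=64):
        self.tasks = [None] * capacity
        self.bottom = 0                # written only by the owner thread
        self.top = (0, 0)              # (index, version), updated by CAS
        self._lock = threading.Lock()  # stands in for hardware CAS

    def _cas_top(self, expected, new):
        # Emulate an atomic compare-and-swap on the (top, version) pair.
        with self._lock:
            if self.top == expected:
                self.top = new
                return True
            return False

    def push_bottom(self, item):       # owner thread only
        self.tasks[self.bottom] = item
        self.bottom += 1

    def pop_bottom(self):              # owner thread only
        if self.bottom == 0:
            return None
        self.bottom -= 1
        result = self.tasks[self.bottom]
        tmp_top, tmp_v = self.top
        if self.bottom > tmp_top:      # >1 item left: no steal can race us
            return result
        last = (self.bottom == tmp_top)  # exactly one item: race stealers
        self.bottom = 0
        if last and self._cas_top((tmp_top, tmp_v), (0, tmp_v + 1)):
            return result
        self.top = (0, tmp_v + 1)      # reset; a stealer got the item
        return None

    def pop_top(self):                 # any stealer thread
        tmp_top, tmp_v = self.top
        if self.bottom <= tmp_top:
            return None
        result = self.tasks[tmp_top]
        if self._cas_top((tmp_top, tmp_v), (tmp_top + 1, tmp_v + 1)):
            return result
        return None                    # "spurious" failure is permitted
```

Note how both CAS failure paths simply give up: PopTop is allowed to fail spuriously, and PopBottom concedes the final item to a successful stealer.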

Page 135

ABA problems

0

1

2

3

4

Top

Item popTop() {

if (bottom <= top) return null;

tmp_top = top;

result = tasks[tmp_top];

if (CAS(&top, tmp_top, tmp_top+1)) {

return result;

}

return null;

}

[Diagram: the deque holds AAA, BBB, CCC; a stealer reads result = CCC, then the owner drains the deque and pushes DDD, EEE, FFF. Top returns to the same index, so the stealer’s CAS succeeds even though the item has changed.]

Page 136

General techniques

Local operations designed to avoid CAS (CAS was traditionally slower than plain reads/writes; less so now)

Costs of memory fences can be important (“Idempotent work stealing”, Michael et al)

Local operations just use read and write

Only one accessor, check for interference

Use CAS:

Resolve conflicts between stealers

Resolve local/stealer conflicts

Version number to ensure conflicts seen

Page 137

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Page 138

Reducing contention

Suppose you’re implementing a shared counter with the following sequential spec:


void increment(int *counter) {

atomic {

(*counter) ++;

}

}

How well can this scale?

void decrement(int *counter) {

atomic {

(*counter) --;

}

}

bool isZero(int *counter) {

atomic {

return (*counter) == 0;

}

}

Page 139

SNZI trees


SNZI

(10,100)

SNZI

(2,230)

SNZI

(5,250)

T2 T1 T3 T5 T4 T6

Child SNZI forwards inc/dec to parent when

the child changes to/from zero

Each node holds a value and a version number

(updated together with CAS)


SNZI: Scalable NonZero Indicators, Ellen et al

Page 140

SNZI trees, linearizability on 0->1 change


SNZI

(0,100)

SNZI

(0,230)

T2 T1

1. T1 calls increment
2. T1 increments child to 1
3. T2 calls increment
4. T2 increments child to 2
5. T2 completes
6. Tx calls isZero
7. Tx sees 0 at parent
8. T1 calls increment on parent
9. T1 completes

Tx

Page 141

SNZI trees


void increment(snzi *s) {

bool done=false;

int undo=0;

while(!done) {

<val,ver> = read(s->state);

if (val >= 1 && CAS(s->state, <val,ver>, <val+1,ver>)) { done = true; }

if (val == 0 && CAS(s->state, <val,ver>, <½, ver+1>)) {

done = true; val=½; ver=ver+1

}

if (val == ½) {

increment(s->parent);

if (!CAS(s->state, <val, ver>, <1, ver>)) { undo++; }

}

}

while (undo > 0) { undo--; decrement(s->parent); }

}
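The tree structure behind the pseudocode above can be sketched in simplified Python. This sketch omits the ½ intermediate value and the version numbers that make the real algorithm's 0→1 transition linearizable (a lock stands in for those CAS loops); the class and method names are illustrative:

```python
import threading

class SnziNode:
    """Simplified SNZI node: only the 0->1 and 1->0 transitions are
    forwarded to the parent, so the root's counter stays small and
    isZero can be answered there without touching the children."""

    def __init__(self, parent=None):
        self.parent = parent
        self.count = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:
            self.count += 1
            surfaced = (self.count == 1)
        if surfaced and self.parent:
            self.parent.increment()   # only the 0->1 change propagates

    def decrement(self):
        with self.lock:
            self.count -= 1
            vanished = (self.count == 0)
        if vanished and self.parent:
            self.parent.decrement()   # only the 1->0 change propagates

    def is_zero(self):                # queried at the root
        with self.lock:
            return self.count == 0
```

Threads sharing a child only contend on the parent when that child's count crosses zero, which is what reduces contention at the root.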

Page 142

Reducing contention: stack


A scalable lock-free stack algorithm, Hendler et al

Existing lock-free stack (e.g., Treiber’s): good performance under low contention, poor scalability

[Diagram: Push and Pop operations all contending on the stack top]

Page 143

Pairing up operations


[Diagram: Push(10), Push(20), Push(30) paired against Pop → 20 and Pop → 10]

Page 144

Back-off elimination array


Stack

Elimination array

Contention on the stack? Try the array

Don’t get eliminated? Try the stack


Operation record: Thread, Push/Pop, …
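The pairing logic of the elimination array can be sketched deterministically in Python. This is a single-threaded illustration only: real implementations use a lock-free exchanger with timeouts and multiple slots, and all names here (including the `contended` flag that stands in for "back off from the stack") are made up for the sketch:

```python
class EliminationBackoffStack:
    """Sketch of a stack with a one-slot elimination array. Under
    contention a Push parks its value in the slot; a Pop that arrives
    while the value is parked takes it directly, so the pair cancels
    out without either operation touching the stack top."""

    def __init__(self):
        self.items = []        # stands in for a Treiber stack
        self.slot = None       # elimination slot: a parked Push value

    def push(self, value, contended=False):
        if contended and self.slot is None:
            self.slot = value  # back off from the stack, try elimination
        else:
            self.items.append(value)

    def pop(self):
        if self.slot is not None:          # eliminate with a parked push
            value, self.slot = self.slot, None
            return value
        return self.items.pop() if self.items else None
```

The key property preserved by the sketch: an eliminated Push/Pop pair is linearizable (the Push immediately followed by the Pop), so the stack's LIFO semantics are not violated.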

Page 145

Course overview: structure

Building locks

Lock-free programming

What’s wrong with locks?

Lists without locks, linearizability

Lock-free progress

Hashtables

Skiplists

Queues

Reducing contention

Memory management

Transactional memory

Page 146

Lock-free data structures in C

Pseudo-code, Java, C#:

Explicit memory allocation

Deallocation by GC

C/C++:

Explicit memory allocation & deallocation

When is it safe to deallocate a piece of memory?

Page 147

Deletion revisited: Delete(10)

H 10 30 T

H 10 30 T

H 10 30 T

Page 148

De-allocate to the OS?

H 30 T 10

Search(20)

Page 149

Re-use as something else?

H 30 T 10 100 200

Search(20)

Page 150

Re-use as a list node?

H 30 T 10

H 30 T

20

Search(20)

Page 151

H 10 30 T

Reference counting

1 1 1 1

1. Decide what to access

Page 152

H 10 30 T

Reference counting

2 1 1 1

1. Decide what to access 2. Increment reference count

Page 153

H 10 30 T

Reference counting

2 1 1 1

1. Decide what to access 2. Increment reference count 3. Check access still OK

Page 154

H 10 30 T

Reference counting

2 2 1 1

1. Decide what to access 2. Increment reference count 3. Check access still OK

Page 155

H 10 30 T

Reference counting

1 2 1 1

1. Decide what to access 2. Increment reference count 3. Check access still OK

Page 156

H 10 30 T

Reference counting

1 1 1 1

1. Decide what to access 2. Increment reference count 3. Check access still OK 4. Defer deallocation until count 0
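The four-step protocol above can be sketched in Python. The classes and helper names are illustrative, and the sketch is single-threaded (a real version makes the count update and the re-check atomic with CAS):

```python
class CountedNode:
    """List node with a reference count; count starts at 1 for the
    reference held by the list itself."""
    def __init__(self, value):
        self.value = value
        self.refs = 1
        self.next = None
        self.freed = False

def acquire(prev, node):
    """Steps 1-3: decide to access `node`, increment its count, then
    re-check that `prev` still points at it (access still OK)."""
    node.refs += 1
    if prev.next is not node:   # link changed under us: back off
        release(node)
        return False
    return True

def release(node):
    """Step 4: deallocation is deferred until the count reaches 0."""
    node.refs -= 1
    if node.refs == 0:
        node.freed = True       # now safe to return memory to the OS
```

A reader that unlinks a node calls `release` for the list's own reference; the memory is only reclaimed once every in-flight reader has also released it.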

Page 157

Epoch mechanisms Global epoch: 1000 Thread 1 epoch: - Thread 2 epoch: -

H 10 30 T

Page 158

H 10 30 T

Epoch mechanisms Global epoch: 1000

Thread 1 epoch: 1000 Thread 2 epoch: -

1. Record global epoch at start of operation

Page 159

H 10 30 T

Epoch mechanisms Global epoch: 1000

Thread 1 epoch: 1000 Thread 2 epoch: 1000

1. Record global epoch at start of operation

2. Keep per-epoch deferred deallocation lists

Deallocate @ 1000

Page 160

H 10 30 T

Epoch mechanisms Global epoch: 1001

Thread 1 epoch: 1000 Thread 2 epoch: -

1. Record global epoch at start of operation

2. Keep per-epoch deferred deallocation lists

3. Increment global epoch at end of operation (or periodically)


Deallocate @ 1000

Page 161

Epoch mechanisms Global epoch: 1002 Thread 1 epoch: - Thread 2 epoch: -

1. Record global epoch at start of operation

2. Keep per-epoch deferred deallocation lists

3. Increment global epoch at end of operation (or periodically)

4. Free when everyone past epoch

10

Deallocate @ 1000


H 30 T
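The four steps of the epoch scheme can be sketched in Python. The class and method names are illustrative, and the sketch is single-threaded (a real version uses per-thread epoch slots read without locks):

```python
class EpochManager:
    """Sketch of epoch-based reclamation: threads record the global
    epoch on entry, frees are deferred onto per-epoch lists, and a
    list is reclaimed once every active thread is past its epoch."""

    def __init__(self):
        self.global_epoch = 1000
        self.thread_epochs = {}   # thread id -> epoch, while active
        self.deferred = {}        # epoch -> objects retired in it
        self.freed = []

    def enter(self, tid):
        # 1. record the global epoch at the start of the operation
        self.thread_epochs[tid] = self.global_epoch

    def retire(self, obj):
        # 2. keep per-epoch deferred deallocation lists
        self.deferred.setdefault(self.global_epoch, []).append(obj)

    def exit(self, tid):
        # 3. increment the global epoch at the end of the operation
        del self.thread_epochs[tid]
        self.global_epoch += 1
        self._collect()

    def _collect(self):
        # 4. free once everyone is past the epoch of the deferred list
        oldest = min(self.thread_epochs.values(),
                     default=self.global_epoch)
        for epoch in [e for e in self.deferred if e < oldest]:
            self.freed.extend(self.deferred.pop(epoch))
```

Note that a single slow thread stuck in an old epoch holds up reclamation for everyone, which is the classic weakness of epoch schemes relative to hazard pointers.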

Page 162

The “repeat offender problem”


Free: ready for allocation

Allocated: linked in to a data structure

Escaping: unlinked, but possibly temporarily in use

Page 163

Re-use via ROP

1. Decide what to access 2. Set guard 3. Check access still OK

Thread 1 guards


H 10 30 T

Page 169

Re-use via ROP

H 10 30 T

1. Decide what to access 2. Set guard 3. Check access still OK 4. Batch deallocations and defer on objects while guards are present

Thread 1 guards


See also: “Safe memory reclamation” & hazard pointers, Maged Michael
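The guard protocol can be sketched in Python in the style of hazard pointers. The class and method names are illustrative, and the set-then-recheck loop is shown single-threaded (a real version needs a memory fence between publishing the guard and re-reading the link):

```python
class HazardPointers:
    """Sketch of guard-based reclamation: a reader publishes the node
    it is about to access, re-checks the link it followed, and retired
    nodes are only freed when no guard covers them."""

    def __init__(self):
        self.guards = {}     # thread id -> currently guarded node
        self.retired = []    # unlinked, awaiting reclamation
        self.freed = []

    def protect(self, tid, container, attr):
        # 1. decide what to access  2. set guard  3. check still OK
        while True:
            node = getattr(container, attr)
            self.guards[tid] = node
            if getattr(container, attr) is node:
                return node  # guard published before node was unlinked

    def clear(self, tid):
        self.guards.pop(tid, None)

    def retire(self, node):
        # 4. batch deallocations; defer while guards are present
        self.retired.append(node)
        self.scan()

    def scan(self):
        in_use = set(map(id, self.guards.values()))
        still = [n for n in self.retired if id(n) in in_use]
        self.freed.extend(n for n in self.retired if id(n) not in in_use)
        self.retired = still
```

Unlike epochs, a stalled reader here only blocks reclamation of the specific nodes it guards, not of every retired node.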

Page 170

What do we care about?


How fast is it?

Is it correct? How easy is it to write?

When can I

use it?

How does it scale?

Page 171

Course overview: structure

Building locks

Lock-free programming

Transactional memory

TM and composability

STM internals

Integration into a language runtime system

Sandboxing & strong isolation

Current performance and my perspective on TM

Page 172

What we want

Hardware

Concurrency primitives

Library Library Library

Library

Library Library

Library

Libraries build layered concurrency abstractions

Page 173

Library

Locks and condition variables (a) are hard to use and (b) do not compose

Hardware

What we have

Page 174

Atomic blocks

Atomic blocks built over transactional memory. In Haskell: 3 primitives: atomic, retry, orElse

Library Library Library

Library

Library Library

Library

Hardware

Page 175

Atomic memory transactions

To a first approximation, just write the sequential code, and wrap atomic around it

All-or-nothing semantics: Atomic commit

Atomic block executes in Isolation

Cannot deadlock (there are no locks!)

Atomicity makes error recovery easy (e.g. exception thrown inside the PopLeft code)

Item PopLeft() {

atomic { ... sequential code ... }

}

Like database transactions

ACID

Page 176

Atomic blocks compose (locks do not)

Guarantees to get two consecutive items

PopLeft() is unchanged

Cannot be achieved with locks (except by breaking the PopLeft abstraction)

void GetTwo() {

atomic {

i1 = PopLeft();

i2 = PopLeft();

}

DoSomething( i1, i2 );

}

Composition is THE way we build big programs that work

Page 177

retry means “abandon execution of the atomic block and re-run it (when there is a chance it’ll complete)”

No lost wake-ups

No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue

Item PopLeft() {

atomic {

if (leftSentinel.right==rightSentinel) {

retry;

} else { ...remove item from queue... }

} }

Blocking: how does PopLeft wait for data?
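The observable behaviour of `retry` for PopLeft can be sketched in Python. This is an emulation only: a condition variable stands in for the STM's re-run-when-a-read-location-changes machinery, and the class and method names are made up for the sketch:

```python
import threading

class RetryQueue:
    """Sketch of how `retry` behaves: PopLeft blocks until the queue is
    non-empty, with no lost wake-ups, because the wait and the emptiness
    check happen under the same condition variable."""

    def __init__(self):
        self.items = []
        self.cond = threading.Condition()

    def push_right(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify_all()   # wake any operation that "retried"

    def pop_left(self):
        with self.cond:
            while not self.items:    # "retry": wait until it can complete
                self.cond.wait()
            return self.items.pop(0)
```

With real atomic blocks the programmer writes only the `retry`; the runtime tracks which locations the aborted transaction read and re-runs it when one of them changes, which is what the explicit `notify_all` emulates here.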

Page 178

do {...this...} orelse {...that...} tries to run “this”

If “this” retries, it runs “that” instead

If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue

void GetEither() {

atomic {

do { i = Q1.Get(); }

orelse { i = Q2.Get(); }

R.Put( i );

} }

Q1 Q2

R

Choice: waiting for either of two queues

Page 179

Programming with atomic blocks With locks, you think about:

Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when?

None of this is explicit in the source code

With atomic blocks you think about

What are the invariants (e.g. the tree is balanced)?

Each atomic block maintains the invariants

Purely sequential reasoning within a block, which is dramatically easier

Much easier setting for static analysis tools

Page 180

Summary so far

Atomic blocks (atomic, retry, orElse) are a real step forward

It’s like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated.

Not a silver bullet:

you can still write buggy programs;

concurrent programs are still harder to write than sequential ones;

just aimed at shared memory.

But the improvement is very substantial

Page 181

STM 5 years ago

[Chart: normalised execution time, scale 0–3]

Sequential baseline (1.00x)

Coarse-grained locking (1.13x)

Fine-grained locking (2.57x) Early STM (5.69x)

Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535

Page 182

Implementation techniques

Direct-update STM

Allow transactions to make updates in place in the heap

Avoids reads needing to search the log to see earlier writes that the transaction has made

Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts

Compiler integration

Decompose the transactional memory operations into primitives

Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop)

Runtime system integration

Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M memory accesses

Memory management system used to detect conflicts between transactional and non-transactional accesses

Page 183

Results: concurrency control overhead

[Chart: normalised execution time, scale 0–3]

Sequential baseline (1.00x)

Coarse-grained locking (1.13x)

Fine-grained locking (2.57x)

Direct-update STM (2.04x)

Direct-update STM + compiler integration

(1.46x)

Traditional STM (5.69x)

Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535

Scalable to multicore

Page 184

Course overview: structure

Building locks

Lock-free programming

Transactional memory

Atomic transactions and composability

STM internals

Integration into a language runtime system

Sandboxing & strong isolation

Current performance and my perspective on TM

Page 185

Atomic blocks

Class Q {
  QElem leftSentinel;
  QElem rightSentinel;
  void pushLeft(int item) {
    atomic {
      QElem e = new QElem(item);
      e.right = this.leftSentinel.right;
      e.left = this.leftSentinel;
      this.leftSentinel.right.left = e;
      this.leftSentinel.right = e;
    }
  }
  ...
}

Class Q {
  QElem leftSentinel;
  QElem rightSentinel;
  void pushLeft(int item) {
    do {
      tx = TxStart();
      QElem e = new QElem(item);
      TxWrite(tx, &e.right, TxRead(tx, &this.leftSentinel.right));
      TxWrite(tx, &e.left, this.leftSentinel);
      TxWrite(tx, &TxRead(tx, &this.leftSentinel.right).left, e);
      TxWrite(tx, &this.leftSentinel.right, e);
    } while (!TxCommit(tx));
  }
  ...
}

Page 186

Bartok-STM

Use per-object meta-data (“TMWs”)

Each TMW is either:

Locked, holding a pointer to the transaction that has the object open for update

Available, holding a version number indicating how many times the object has been locked

Writers eagerly lock TMWs to gain access to the object, using eager version management

Maintain an undo log in case of roll-back

Readers log the version numbers they see and perform lazy conflict detection at commit time
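The scheme above can be sketched as a miniature STM in Python. All names are illustrative, the sketch is single-threaded (no real atomicity in the lock acquisition), and it shows only the structure: eager write locking with an undo log, read-version logging, and lazy validation at commit:

```python
class TObj:
    """Transactional object: a value plus a TMW that is either a
    version number (available) or a lock owner (locked for update)."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.owner = None            # transaction holding the write lock

class Tx:
    def __init__(self):
        self.read_log = []           # (obj, version seen at first read)
        self.undo_log = []           # (obj, overwritten value)

    def read(self, obj):
        self.read_log.append((obj, obj.version))
        return obj.value

    def write(self, obj, value):
        if obj.owner is not self:
            assert obj.owner is None, "conflict: a real STM aborts here"
            obj.owner = self                    # eagerly lock the TMW
            self.undo_log.append((obj, obj.value))
        obj.value = value                       # update in place

    def commit(self):
        for obj, ver in self.read_log:          # lazy read validation
            if obj.owner is not self and obj.version != ver:
                self.abort()
                return False
        for obj, _ in self.undo_log:            # unlock with new version
            obj.version += 1
            obj.owner = None
        return True

    def abort(self):
        for obj, old in reversed(self.undo_log):  # roll back in place
            obj.value = old
            obj.version += 1
            obj.owner = None

def swap(a, b):
    while True:                    # re-execute until commit succeeds
        tx = Tx()
        va, vb = tx.read(a), tx.read(b)
        tx.write(a, vb)
        tx.write(b, va)
        if tx.commit():
            return
```

The `swap` function mirrors the slides' uncontended-swap example: both objects end up locked by the transaction, so commit's read validation trivially passes and the unlock step publishes the new versions.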

Page 187

Example: uncontended swap

a: [TMW: v150 | val: 1000]    b: [TMW: v250 | val: 2000]

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Page 188

Example: uncontended swap a:

v150

1000

v250

2000

b:

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Tx1 Objects read

Objects updated

Values

overwritten

Page 189

Example: uncontended swap a:

v150

1000

v250

2000

a: v150

b:

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Tx1 Objects read

Objects updated

Values

overwritten

Page 190

Example: uncontended swap a:

v150

1000

v250

2000

a: v150 b: v250

b:

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Tx1 Objects read

Objects updated

Values

overwritten

Page 191

Example: uncontended swap a:

Tx1

2000

v250

2000

a: v150 b: v250

a: v150

a.val = 1000

b:

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Tx1 Objects read

Objects updated

Values

overwritten

Page 192

Example: uncontended swap a:

Tx1

2000

Tx1

1000

a: v150 b: v250

a: v150 b: v250

a.val = 1000 b.val = 2000

b:

void Swap(int *a, int *b) {

do {

tx = TxStart();

va = TxRead(tx, &a);

vb = TxRead(tx, &b);

TxWrite(tx, &a, vb);

TxWrite(tx, &b, va);

} while (!TxCommit());

}

Tx1 Objects read

Objects updated

Values

overwritten

Page 193

Commit in Bartok-STM

Iterate over the read set:

1. Does the current TMW match the logged version? Yes → OK so far.
2. If not: does the current TMW show that we locked the object? No → abort.
3. If we locked it: does the logged TMW match the version in our write set? Yes → OK so far; no → abort.

Non-blocking data structures and transactional memory 03/11/2011 193

Correctness sketch

Timeline: open obj1 for read; open obj2 for update; commit: validate obj1 version; commit: unlock obj2

The lock prevents concurrent updates to obj2; validation checks that obj1 received no updates

Tx appears atomic at some point after the last "Open" and before the first validation step

Non-blocking data structures and transactional memory 03/11/2011 194

Example: uncontended swap

[Diagram: as before, a (TMW Tx1, val 2000) and b (TMW Tx1, val 1000); log: objects read a at v150 and b at v250, objects updated a and b, values overwritten a.val = 1000 and b.val = 2000. Commit-time validation: "We locked the object... and no-one else got there first!"]

void Swap(int *a, int *b) {
  do {
    tx = TxStart();
    va = TxRead(tx, a);
    vb = TxRead(tx, b);
    TxWrite(tx, a, vb);
    TxWrite(tx, b, va);
  } while (!TxCommit());
}

Non-blocking data structures and transactional memory 03/11/2011 195

Example: uncontended swap

[Diagram: commit in progress; a has been released with its TMW bumped to v151 (val 2000), while b is still locked by Tx1 (val 1000). Log unchanged: reads a at v150 and b at v250; updates a and b; overwritten values a.val = 1000 and b.val = 2000.]

void Swap(int *a, int *b) {
  do {
    tx = TxStart();
    va = TxRead(tx, a);
    vb = TxRead(tx, b);
    TxWrite(tx, a, vb);
    TxWrite(tx, b, va);
  } while (!TxCommit());
}

Non-blocking data structures and transactional memory 03/11/2011 196

Example: uncontended swap

[Diagram: commit complete; a (TMW v151, val 2000) and b (TMW v251, val 1000). Tx1's log still shows the objects read, objects updated, and values overwritten.]

void Swap(int *a, int *b) {
  do {
    tx = TxStart();
    va = TxRead(tx, a);
    vb = TxRead(tx, b);
    TxWrite(tx, a, vb);
    TxWrite(tx, b, va);
  } while (!TxCommit());
}

Non-blocking data structures and transactional memory 03/11/2011 197

Example: uncontended swap

[Diagram: final heap; a (TMW v151, val 2000) and b (TMW v251, val 1000). The values have been swapped and both version numbers incremented.]

void Swap(int *a, int *b) {
  do {
    tx = TxStart();
    va = TxRead(tx, a);
    vb = TxRead(tx, b);
    TxWrite(tx, a, vb);
    TxWrite(tx, b, va);
  } while (!TxCommit());
}

Non-blocking data structures and transactional memory 03/11/2011 198


Tx-tx interaction in Bartok-STM

Read-read: no problem, both readers see the same version number and verify it at commit time

Read-write: reader sees that the writer has the object locked. Reader always defers to writer

Write-write: competition for lock serializes writers (drop locks, then spin to avoid deadlock)

Non-blocking data structures and transactional memory 03/11/2011 199

Taxonomy: consistency during tx

Gold standard: during execution a transaction runs against a consistent view of memory

Won't be "tricked" into looping, etc.

"Opacity"

What are the advantages / disadvantages when compared with an implementation giving weaker guarantees?

Non-blocking data structures and transactional memory 03/11/2011 200

Taxonomy: lazy/eager versioning

We need some way to manage the tentative updates that a transaction is making

Where are they stored?

How does the implementation find them (so a transaction's read sees an earlier write)?

Lazy versioning: only make "real" updates when a transaction commits

Eager versioning: make updates as a transaction runs, roll them back on abort

What are the advantages, disadvantages?

Non-blocking data structures and transactional memory 03/11/2011 201
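The two strategies can be contrasted with a toy sketch. EagerTx and LazyTx are hypothetical helper classes, not any real STM's API; conflict detection is elided entirely:

```python
class EagerTx:
    """Eager versioning: write in place, keep an undo log for abort."""
    def __init__(self):
        self.undo = []
    def write(self, obj, field, value):
        self.undo.append((obj, field, getattr(obj, field)))
        setattr(obj, field, value)        # heap updated immediately
    def commit(self):
        self.undo.clear()                 # cheap: updates are already real
    def abort(self):
        for obj, field, old in reversed(self.undo):
            setattr(obj, field, old)      # roll back in reverse order

class LazyTx:
    """Lazy versioning: buffer writes, apply them only at commit."""
    def __init__(self):
        self.redo = {}
    def write(self, obj, field, value):
        self.redo[(id(obj), field)] = (obj, value)
    def read(self, obj, field):
        # a read must see this tx's own earlier buffered write
        key = (id(obj), field)
        if key in self.redo:
            return self.redo[key][1]
        return getattr(obj, field)
    def commit(self):
        for (_, field), (obj, value) in self.redo.items():
            setattr(obj, field, value)    # commit does the real work
    def abort(self):
        self.redo.clear()                 # cheap: heap never touched

class Cell:
    pass

c = Cell()
c.val = 1
t = EagerTx()
t.write(c, "val", 2)
print(c.val)            # 2: visible in the heap immediately
t.abort()
print(c.val)            # 1: rolled back from the undo log

l = LazyTx()
l.write(c, "val", 3)
print(c.val, l.read(c, "val"))   # 1 3: heap untouched, tx sees its write
l.commit()
print(c.val)            # 3
```

Note the asymmetry this exposes: eager versioning makes commits cheap and aborts expensive, lazy versioning the reverse, and lazy versioning also pays a lookup on every read.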

Taxonomy: lazy/eager conflict detection

We need to detect when two transactions conflict with one another

Lazy conflict detection: detect conflicts at commit time

Eager conflict detection: detect conflicts as transactions run

Again, what are the advantages, disadvantages?

Non-blocking data structures and transactional memory 03/11/2011 202


Taxonomy: word/object based

What granularity are conflicts detected at?

Object-based:

Access to programmer-defined structures (e.g. objects)

Word-based:

Access to words (or sets of words, e.g. cache lines)

Possibly after mapping under a hash function

What are the advantages and disadvantages of these approaches?

Non-blocking data structures and transactional memory 03/11/2011 203
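A common word-based scheme hashes each address into a fixed table of ownership records, so distinct words can share a record and produce false conflicts. The names below (orecs, orec_for, bump) are illustrative, not from any particular STM:

```python
N_OREC = 8
orecs = [0] * N_OREC          # one version number per stripe of addresses

def orec_for(addr):
    # hash an address down to a stripe; here simply modulo the table size
    return addr % N_OREC

def bump(addr):
    # a committing writer advances the version of the stripe it touched
    orecs[orec_for(addr)] += 1

# Addresses 3 and 11 collide in an 8-entry table, so a write to one
# invalidates a reader of the other: a "false conflict".
v = orecs[orec_for(3)]        # reader logs the version covering addr 3
bump(11)                      # unrelated writer commits to addr 11
print(orecs[orec_for(3)] != v)   # True: the reader must abort anyway
```

Growing the table trades memory for a lower false-conflict rate; object-based schemes avoid the hash but fix the granularity at whole objects.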


Bartok-STM

Designed to work well on low-contention workloads

Eager version management to reduce commit costs

Eager locking to support eager version management

Primitives do not guarantee that transactions see a consistent view of the heap while running

Can be sandboxed in managed code...

...harder in native code

Non-blocking data structures and transactional memory 03/11/2011 204


Course overview: structure

Building locks

Lock-free programming

Transactional memory

TM and composability

STM internals

Integration into a language runtime system

Sandboxing & strong isolation

Current performance and my perspective on TM

Non-blocking data structures and transactional memory 205 03/11/2011

Atomic blocks

Class Q {
  QElem leftSentinel;
  QElem rightSentinel;
  void pushLeft(int item) {
    atomic {
      QElem e = new QElem(item);
      e.right = this.leftSentinel.right;
      e.left = this.leftSentinel;
      this.leftSentinel.right.left = e;
      this.leftSentinel.right = e;
    }
  }
  ...
}

Class Q {
  QElem leftSentinel;
  QElem rightSentinel;
  void pushLeft(int item) {
    do {
      TxStart();
      QElem e = new QElem(item);
      TxWrite(&e.right, TxRead(&this.leftSentinel.right));
      TxWrite(&e.left, this.leftSentinel);
      TxWrite(&TxRead(&this.leftSentinel.right).left, e);
      TxWrite(&this.leftSentinel.right, e);
    } while (!TxCommit());
  }
  ...
}

Non-blocking data structures and transactional memory 03/11/2011 206

Compilation

Source code -> MSIL bytecode -> Native code

Source-to-bytecode compiler; typically "csc" for C#, "javac" for Java

Bytecode-to-native compiler; JIT or traditional compilation

Non-blocking data structures and transactional memory 03/11/2011 207

Why divide things this way?

Little information loss from source code to bytecode

Source-to-bytecode works a file at a time; bytecode-to-native can see the whole program (or, at least, all of the parts needed so far in execution)

Lower level transformations possible at bytecode-to-native

Integration between the STM and other parts of the runtime system

void Swap(Pair p) {
  try {
    va = p.a;
    vb = p.b;
    p.a = vb;
    p.b = va;
  } catch (AtomicException) { }
}

Non-blocking data structures and transactional memory 03/11/2011 208

Boilerplate around transactions

void Swap(Pair p) {
  do {
    done = true;
    try {
      try {
        tx = StartTx();
        va = p.a;
        vb = p.b;
        p.a = vb;
        p.b = va;
      } finally {
        CommitTx();
      }
    } catch (TxInvalid) {
      done = false;
    }
  } while (!done);
}

Keep running the atomic block in a fresh tx each time

Commit (on normal or exceptional exit)

Commit fails by raising a TxInvalid exception; re-execute

(I'm using source code examples for clarity; in reality this would be in the compiler's internal intermediate code)

Non-blocking data structures and transactional memory 03/11/2011 209

Naïve expansion of data accesses

void Swap(Pair p) {
  do {
    done = true;
    try {
      try {
        tx = StartTx();
        TxWrite(tx, &va, TxRead(tx, &p.a));
        TxWrite(tx, &vb, TxRead(tx, &p.b));
        TxWrite(tx, &p.a, TxRead(tx, &vb));
        TxWrite(tx, &p.b, TxRead(tx, &va));
      } finally {
        CommitTx();
      }
    } catch (TxInvalid) {
      done = false;
    }
  } while (!done);
}

Non-blocking data structures and transactional memory 03/11/2011 210


What are the problems here?

Using the STM for thread-private local variables

Repeatedly mapping from addresses to concurrency control info

Duplicating concurrency control work if it’s implemented at a per-object granularity

Non-blocking data structures and transactional memory 03/11/2011 211

Decomposed STM primitive API

OpenForRead(tx, obj)
OpenForRead(tx, addr)
Indicate intent to read from an object or from a given address

OpenForUpdate(tx, obj)
OpenForUpdate(tx, addr)
Indicate intent to update an object or a given address

LogForUndo(tx, addr)
Log the value being overwritten at a specific address (& optional size)

Non-blocking data structures and transactional memory 03/11/2011 212

Using the decomposed API

x = p.a;
  expands to:
OpenForRead(tx, p);
x = p.a;

p.b = y;
  expands to:
OpenForUpdate(tx, p);
LogForUndo(tx, &p.b);
p.b = y;

Non-blocking data structures and transactional memory 03/11/2011 213

Implementation using decomposed API

...
OpenForUpdate(tx, p);
OpenForRead(tx, p);
va = p.a;
OpenForRead(tx, p);
vb = p.b;
OpenForUpdate(tx, p);
LogForUndo(tx, &p.a);
p.a = vb;
OpenForUpdate(tx, p);
LogForUndo(tx, &p.b);
p.b = va;
...

Second OpenForRead made unnecessary by first

Second OpenForUpdate made unnecessary by first

Always need update access: get it first (the leading OpenForUpdate subsumes the OpenForReads)

Non-blocking data structures and transactional memory 03/11/2011 214

Improved expansion of data accesses

void Swap(Pair p) {
  do {
    done = true;
    try {
      try {
        tx = StartTx();
        OpenForUpdate(tx, p);
        va = p.a;
        vb = p.b;
        LogForUndo(tx, &p.a);
        p.a = vb;
        LogForUndo(tx, &p.b);
        p.b = va;
      } finally {
        CommitTx();
      }
    } catch (TxInvalid) {
      done = false;
    }
  } while (!done);
}

Non-blocking data structures and transactional memory 03/11/2011 215
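The shape of the improved expansion, one OpenForUpdate per object and one LogForUndo per field, can be mimicked in a toy Python model. Tx, open_for_update and log_for_undo are invented stand-ins for the decomposed primitives; locking and validation are elided:

```python
class Pair:
    def __init__(self, a, b):
        self.a, self.b = a, b

class Tx:
    def __init__(self):
        self.opened = set()   # objects opened for update (once each)
        self.undo = []        # (obj, field, old value) records
    def open_for_update(self, obj):
        # a real STM would lock the object / record its TMW here
        self.opened.add(id(obj))
    def log_for_undo(self, obj, field):
        self.undo.append((obj, field, getattr(obj, field)))
    def abort(self):
        # eager versioning: roll back in-place writes in reverse order
        for obj, field, old in reversed(self.undo):
            setattr(obj, field, old)

def swap_tx(p):
    tx = Tx()
    tx.open_for_update(p)     # always need update access: get it first
    va, vb = p.a, p.b         # plain reads: the object is already open
    tx.log_for_undo(p, "a"); p.a = vb
    tx.log_for_undo(p, "b"); p.b = va
    return tx

p = Pair(1000, 2000)
swap_tx(p)
print(p.a, p.b)               # 2000 1000
```

Keeping tx.abort() around shows why LogForUndo per field is enough: replaying the log backwards restores the pre-transaction state exactly.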


Are we done?

Local variables

By-ref parameters

Method calls

Keeping optimizations safe

GC integration

Finalizers

Condition synchronization

Non-blocking data structures and transactional memory 03/11/2011 216

Keeping optimizations safe

void Clear_tx(Pair p) {
  for (int i = 0; i < 10; i ++) {
    p.a = 10;
    p.b = i;
  }
}

Original (contrived) source code

Non-blocking data structures and transactional memory 03/11/2011 217

Keeping optimizations safe

void Clear_tx(Pair p) {
  for (int i = 0; i < 10; i ++) {
    OpenForUpdate(tx, p);
    LogForUndo(tx, &p.a);
    p.a = 10;
    LogForUndo(tx, &p.b);
    p.b = i;
  }
}

Expanded with decomposed API operations

Non-blocking data structures and transactional memory 03/11/2011 218

Keeping optimizations safe

void Clear_tx(Pair p) {
  p.a = 10;
  for (int i = 0; i < 10; i ++) {
    OpenForUpdate(tx, p);
    LogForUndo(tx, &p.a);
    LogForUndo(tx, &p.b);
    p.b = i;
  }
}

Hoisting loop-invariant code

Non-blocking data structures and transactional memory 03/11/2011 219

Keeping optimizations safe

void Clear_tx(Pair p) {
  for (int i = 0; i < 10; i ++) {
    tmp1 = OpenForUpdate(tx, p);
    tmp2 = LogForUndo(tx, &p.a) <tmp1>;
    p.a = 10 <tmp2>;
    tmp3 = LogForUndo(tx, &p.b) <tmp1>;
    p.b = i <tmp3>;
  }
}

Introduce dependencies

Non-blocking data structures and transactional memory 03/11/2011 220

Keeping optimizations safe

void Clear_tx(Pair p) {
  tmp1 = OpenForUpdate(tx, p);
  tmp2 = LogForUndo(tx, &p.a) <tmp1>;
  tmp3 = LogForUndo(tx, &p.b) <tmp1>;
  p.a = 10 <tmp2>;
  for (int i = 0; i < 10; i ++) {
    p.b = i <tmp3>;
  }
}

Transformations must respect dependencies

Non-blocking data structures and transactional memory 03/11/2011 221

GC integration

Pair Temp() {
  Pair result;
  atomic {
    for (int i = 0; i < 100000; i ++) {
      result = new Pair();
      result.a = i;
    }
    return result;
  }
}

Another contrived program

Lots of temporary objects are allocated as the atomic block runs

Non-blocking data structures and transactional memory 03/11/2011 222


GC integration

Abort all running tx on GC?

Not ideal: long running tx will not be able to commit

Is there a precedent for language features with this kind of perf?

Treat all the references from the logs as roots?

Not ideal: we’d keep all those temporaries

Non-blocking data structures and transactional memory 03/11/2011 223


GC integration

Principle:

Consider the possible heaps based on whether tx commit or abort

Retain an object if it is alive in any of these cases (ideally “iff”)

Do we need to consider 2^n possibilities with n running tx?

No: validate all the tx first so we know they are not conflicting

Consider the world if they all commit, consider the world if they all roll back

Non-blocking data structures and transactional memory 03/11/2011 224

Example heap

[Diagram: heap objects linked by normal heap pointers; one overwritten pointer is held in a transaction's undo log.]

Non-blocking data structures and transactional memory 03/11/2011 225

Conservative algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

Non-blocking data structures and transactional memory 03/11/2011 226

Conservative algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Grey targets of overwritten ptrs

Non-blocking data structures and transactional memory 03/11/2011 227

Conservative algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Grey targets of overwritten ptrs

4. Trace from new grey objects

Non-blocking data structures and transactional memory 03/11/2011 228

Conservative algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Grey targets of overwritten ptrs

4. Trace from new grey objects

5. Reclaim white objects

Non-blocking data structures and transactional memory 03/11/2011 229
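The five steps above amount to tracing from the roots and additionally from the targets of pointers overwritten in undo logs, so an object survives whether the transactions commit or abort. A toy sketch, with objects modelled as strings and invented helper names:

```python
def trace(roots, edges):
    """Standard mark phase: edges maps obj -> list of referenced objs."""
    grey, black = list(roots), set()
    while grey:
        o = grey.pop()
        if o in black:
            continue
        black.add(o)                          # mark reachable
        grey.extend(edges.get(o, []))
    return black

def conservative_collect(heap, roots, edges, undo_log_targets):
    live = trace(roots, edges)                # step 2: trace heap as normal
    live |= trace(undo_log_targets, edges)    # steps 3-4: grey + trace from
                                              # targets of overwritten ptrs
    return heap - live                        # step 5: white objects reclaimed

heap = {"x", "y", "z", "tmp"}
edges = {"x": ["y"]}
# an undo log holds an overwritten pointer to "z", so "z" must survive:
print(conservative_collect(heap, {"x"}, edges, {"z"}))   # {'tmp'}
```

This is the conservative variant: "z" is kept alive even if the transaction that would restore the pointer to it later commits, which is exactly the imprecision the precise algorithm on the following slides removes.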

Precise algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

Non-blocking data structures and transactional memory 03/11/2011 230

Precise algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Roll back, re-grey updated black objects

Non-blocking data structures and transactional memory 03/11/2011 231

Precise algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Roll back, re-grey updated black objects

4. Trace from grey objects

Non-blocking data structures and transactional memory 03/11/2011 232

Precise algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Roll back, re-grey updated black objects

4. Trace from grey objects

5. Reclaim white objects

Non-blocking data structures and transactional memory 03/11/2011 233

Precise algorithm

[Diagram: heap with normal pointers and an overwritten pointer in an undo log.]

1. Validate tx

2. Trace heap as normal

3. Roll back, re-grey updated black objects

4. Trace from grey objects

5. Reclaim white objects

6. Restore heap

Non-blocking data structures and transactional memory 03/11/2011 234

Finalizers

Pair p;
atomic {
  p = new Pair();
}

Class Pair {
  void Finalize() {
    Console.Out.WriteLine("Hello world\n");
  }
}

Suppose this block is attempted twice: how many times is this printed? (Or is this program wrong?)

Non-blocking data structures and transactional memory 03/11/2011 235


Finalizers

Remember the intended semantics:

Exactly once execution

Transactionally-allocated objects are only eligible for finalization when the tx commits

Tentative allocation, non-finalization, and (re-)execution remains entirely transparent

Non-blocking data structures and transactional memory 03/11/2011 236

Condition synchronization

Semantically: in STM-Haskell we required the scheduler to only run atomic blocks when they succeed without calling "retry"

// Producer
atomic {
  buffer.data = 42;
  buffer.full = true;
}

// Consumer
atomic {
  if (!buffer.full) {
    retry;
  }
  result = buffer.data;
  buffer.full = false;
}

The consumer's atomic block is only ready to run when buffer.full is true

Non-blocking data structures and transactional memory 03/11/2011 237

Primitive for synchronization

void WaitTx(tx)

Semantically equivalent to AbortTx

Implementation may assume caller will immediately re-execute a (deterministic) tx

Implementation may introduce a delay to avoid unnecessary spinning

Intuition:

No point re-executing the consumer until the producer has run

Non-blocking data structures and transactional memory 03/11/2011 238

Compiling "retry" to "WaitTx"

atomic {
  if (!buffer.full) {
    retry;
  }
  result = buffer.data;
  buffer.full = false;
}

void Consume(Buffer b) {
  do {
    done = true;
    try {
      try {
        tx = StartTx();
        OpenForRead(tx, b);
        if (!b.full) {
          WaitTx();
        }
        OpenForUpdate(tx, b);
        result = b.data;
        LogForUndo(tx, &b.full);
        b.full = false;
      } finally {
        CommitTx();
      }
    } catch (TxInvalid) {
      done = false;
    }
  } while (!done);
}

Non-blocking data structures and transactional memory 03/11/2011 239
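At its core, this scheme blocks the consumer until a committing producer wakes it and the whole block is re-executed. A minimal model using an ordinary mutex/condvar pair, with no real STM machinery; Buffer, produce and consume are invented names:

```python
import threading

class Buffer:
    def __init__(self):
        self.full = False
        self.data = None
        self.cv = threading.Condition()   # stands in for the per-tx mutex/condvar

def produce(b, value):
    with b.cv:
        b.data = value
        b.full = True
        b.cv.notify_all()        # like CommitTx waking waiters on its write set

def consume(b):
    with b.cv:
        while not b.full:        # re-execute the "transaction" after waking
            b.cv.wait()          # like WaitTx: block until a relevant commit
        b.full = False
        return b.data

b = Buffer()
t = threading.Thread(target=produce, args=(b, 42))
t.start()
print(consume(b))                # 42
t.join()
```

The while loop matches the retry semantics: waking up does not mean the condition holds, only that re-checking it is now worthwhile.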

Implementing WaitTx

[Diagram: buffer object, TMW v150, Val=0, Full=false]

Non-blocking data structures and transactional memory 03/11/2011 240

Implementing WaitTx

[Diagram: buffer (TMW v150, Val=0, Full=false) with an empty (null) waiter list in its header]

1. Extend object header with list of waiters

Non-blocking data structures and transactional memory 03/11/2011 241

Implementing WaitTx

[Diagram: buffer with a null waiter list; alongside it, the consumer's tx record with a mutex & condvar pair]

1. Extend object header with list of waiters

2. Extend tx records with a mutex & condvar pair

Non-blocking data structures and transactional memory 03/11/2011 242

Implementing WaitTx

[Diagram: the consumer's tx record (mutex & condvar) is now linked onto the buffer's waiter list]

1. Extend object header with list of waiters

2. Extend tx records with a mutex & condvar pair

3. WaitTx links the consumer to the lists in its read set

Non-blocking data structures and transactional memory 03/11/2011 243

Implementing WaitTx

[Diagram: consumer's tx record linked onto the buffer's waiter list]

1. Extend object header with list of waiters

2. Extend tx records with a mutex & condvar pair

3. WaitTx links the consumer to the lists in its read set

4. WaitTx validates, locks the mutex, updates its status, blocks

Non-blocking data structures and transactional memory 03/11/2011 244

Implementing WaitTx

[Diagram: consumer's tx record linked onto the buffer's waiter list]

1. Extend object header with list of waiters

2. Extend tx records with a mutex & condvar pair

3. WaitTx links the consumer to the lists in its read set

4. WaitTx validates, locks the mutex, updates its status, blocks

5. CommitTx wakes waiters on objects in its write set

Non-blocking data structures and transactional memory 03/11/2011 245

Implementing WaitTx

[Diagram: consumer's tx record linked onto the buffer's waiter list]

1. Extend object header with list of waiters

2. Extend tx records with a mutex & condvar pair

3. WaitTx links the consumer to the lists in its read set

4. WaitTx validates, locks the mutex, updates its status, blocks

5. CommitTx wakes waiters on objects in its write set

Use "thin locks" style tricks to avoid fixed header word allocation

NB: many-to-many relationship, so probably use a separate doubly linked list

Use a latch in the header for concurrency control on the list

Non-blocking data structures and transactional memory 03/11/2011 246


Course overview: structure

Building locks

Lock-free programming

Transactional memory

TM and composability

STM internals

Integration into a language runtime system

Sandboxing & strong isolation

Current performance and my perspective on TM

Non-blocking data structures and transactional memory 247 03/11/2011


Sandboxing zombie transactions

Those that have become invalid but don’t yet know it

May access memory

May raise exceptions

May attempt system calls etc

General principle – validate before revealing any tx’s effects outside the STM world

Non-blocking data structures and transactional memory 03/11/2011 248

Looping / slow zombies

Method2 runs between Method1's memory accesses

The transaction running Method1 becomes a zombie... but never attempts to commit

void Method1(Pair p) {
  atomic {
    ta = p.a;
    tb = p.b;
    if (ta != tb) {
      while (true) { }
    }
  }
}

void Method2(Pair p) {
  atomic {
    p.a = 100;
    p.b = 100;
  }
}

Non-blocking data structures and transactional memory 03/11/2011 249

Looping / slow zombies

Add a new API function "ValidationTick"

ValidationTick guarantees it will eventually detect if its calling transaction is invalid

Call it in any loop not otherwise calling a TM API function

Optimize ValidationTick so it only does "real" validation occasionally

(Could also optimize the placement of ValidationTick calls)

OpenForRead(p);
ta = p.a;
tb = p.b;
if (ta != tb) {
  while (true) {
    ValidationTick();
  }
}

Non-blocking data structures and transactional memory 03/11/2011 250
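The "only occasionally do real validation" optimization can be sketched with a counter that validates every K calls and aborts the transaction if it has become invalid. Tick, its validate callback, and the period are all illustrative:

```python
class Tick:
    """Cheap per-iteration hook; real validation only every `period` calls."""
    def __init__(self, validate, period=64):
        self.validate = validate     # () -> bool: is the calling tx still valid?
        self.period = period
        self.count = 0
    def __call__(self):
        self.count += 1
        if self.count % self.period == 0 and not self.validate():
            # a zombie caught mid-loop: abandon and re-execute the tx
            raise RuntimeError("tx invalid: abort and re-execute")

valid = True
tick = Tick(lambda: valid, period=4)
for _ in range(3):
    tick()                           # cheap calls: no validation yet
valid = False                        # a conflicting commit happens elsewhere
try:
    for _ in range(10):
        tick()                       # the 4th overall call validates and aborts
except RuntimeError as e:
    print(e)
```

The period trades zombie-detection latency against the cost of validating the read set on the hot path of a loop.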


Strong isolation

Add a mechanism to detect conflicts between tx and normal accesses

We would like:

No overhead on direct accesses

Predictable performance

Little overhead over weak atomicity

Non-blocking data structures and transactional memory 03/11/2011 251

Strong isolation: implementation

[Diagram: one physical address space mapped twice into the virtual address space; a Tx-heap view used by memory accesses from atomic blocks, and a Normal-heap view used by normal memory accesses.]

Non-blocking data structures and transactional memory 03/11/2011 252

Writes from atomic blocks

[Diagram: dual-mapped heap as before; Tx-heap view for accesses from atomic blocks, Normal-heap view for normal accesses.]

1. Atomic block attempts to write to a field of an object

Non-blocking data structures and transactional memory 03/11/2011 253

Page 254: MULTI-CORE PROGRAMMING - University of Cambridge · Suggested reading “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures,

Writes from atomic blocks

[Diagram: Tx-heap and Normal-heap views of the same physical pages]

2. Revoke direct access to the page holding the direct view of the object

Page 255:

Writes from atomic blocks

[Diagram: Tx-heap and Normal-heap views of the same physical pages]

3. Use underlying STM write primitives

Page 256:

Writes from atomic blocks

[Diagram: Tx-heap and Normal-heap views of the same physical pages]

4A. Restore direct access once the underlying transaction has finished

Page 257:

Conflicting normal access

[Diagram: Tx-heap and Direct-heap views of the same physical pages]

4B. Access violation (AV) delivered to a normal thread accessing that page: wait for TX
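Steps 2 to 4B can be simulated with a toy page table standing in for real OS page protection (names and structure invented; an actual implementation would use mprotect-style calls and handle access-violation signals):

```python
import threading

class PageTable:
    """Toy stand-in for per-page access revocation."""
    def __init__(self):
        self.revoked = set()
        self.cond = threading.Condition()

    def revoke(self, page):            # 2. revoke direct access to the page
        with self.cond:
            self.revoked.add(page)

    def restore(self, page):           # 4A. restore once the tx has finished
        with self.cond:
            self.revoked.discard(page)
            self.cond.notify_all()

    def normal_access(self, page):     # 4B. a normal access "faults" and waits
        with self.cond:
            while page in self.revoked:
                self.cond.wait()

pt = PageTable()
order = []

def normal_reader():
    pt.normal_access(0)                # blocks while page 0 is revoked
    order.append("normal-read")

pt.revoke(0)                           # tx starts writing to page 0
t = threading.Thread(target=normal_reader)
t.start()
order.append("tx-write")               # 3. STM write primitives run here
pt.restore(0)
t.join()
```

Because the normal thread cannot touch the page until the transaction restores access, direct accesses stay barrier-free yet can never observe speculative transactional state.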

Page 258:

Separate tx / non-tx allocations

[Diagram: normal allocations are served from the Normal-heap view, transactional allocations from the Tx-heap view of the same physical pages]

Page 259:

Make page protection changes lazily

[Diagram: as before, with a shadow table over the pages: RW RW RW RW]

1. Table shadows intended page state

Page 260:

Make page protection changes lazily

[Diagram: shadow table now RW - RW RW]

2. Atomic block attempts to write to a field of an object; revoke access, update table

Page 261:

Make page protection changes lazily

[Diagram: shadow table back to RW RW RW RW]

3. On commit, just update the table

Page 262:

Make page protection changes lazily

[Diagram: shadow table RW - RW RW]

4. Subsequent atomic block just updates the table
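The four steps can be sketched as a shadow table that only issues a real protection change when the actual state disagrees with the intended state (names invented; in practice the "real" change is an expensive mprotect-style syscall):

```python
class LazyProtect:
    """Shadow table of intended per-page protections (sketch)."""
    def __init__(self, npages):
        self.intended = ["RW"] * npages    # 1. table shadows intended state
        self.actual = ["RW"] * npages      # what the MMU actually enforces
        self.syscalls = 0

    def _sync(self, page):
        # Only pay for a real protection change when the states disagree.
        if self.actual[page] != self.intended[page]:
            self.actual[page] = self.intended[page]
            self.syscalls += 1             # models an mprotect-style call

    def tx_write(self, page):
        self.intended[page] = "-"          # 2. revoke access, update table
        self._sync(page)

    def commit(self, page):
        self.intended[page] = "RW"         # 3. on commit, just update the table

lp = LazyProtect(4)
lp.tx_write(1)     # first atomic block: one real protection change
lp.commit(1)       # commit: table update only
lp.tx_write(1)     # 4. subsequent atomic block: table update only
```

A later normal access to the page would trigger the deferred restore; the win is that back-to-back atomic blocks touching the same page pay for at most one syscall, not one per transaction.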

Page 263:

Genome

[Chart: normalized execution time, y-axis 0 to 3.5]

Page 264:

Labyrinth

[Chart: normalized execution time, y-axis 0 to 3.5]

Page 265:

Course overview: structure

Building locks

Lock-free programming

Transactional memory

TM and composability

STM internals

Integration into a language runtime system

Sandboxing & strong isolation

Combining TM with libraries, locking, and IO

Current performance and my perspective on TM

Page 266:

Performance figures depend on...

Workload : What do the atomic blocks do? How long is spent inside them?

Baseline implementation: Mature existing compiler, or prototype?

Intended semantics: Support static separation? Violation freedom (TDRF)?

STM implementation: In-place updates, deferred updates, eager/lazy conflict detection, visible/invisible readers?

STM-specific optimizations: e.g. to remove or downgrade redundant TM operations

Integration: e.g. dynamically between the GC and the STM, or inlining of STM functions during compilation

Implementation effort: low-level perf tweaks, tuning, etc.

Hardware: e.g. performance of CAS and memory system

Page 267:

Labyrinth

STAMP v0.9.10

256x256x3 grid

Routing 256 paths

Almost all execution inside atomic blocks

Atomic blocks can attempt 100K+ updates

C# version derived from original C

Compiled using Bartok, whole program mode, C# -> x86 (~80% perf of original C with VS2008)

Overhead results with Core2 Duo running Windows Vista


“STAMP: Stanford Transactional Applications for Multi-Processing” Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, Kunle Olukotun , IISWC 2008

Page 268:

Sequential overhead

[Bar chart (1-thread, normalized to seq. baseline): STM 11.86, Dynamic filtering 3.14, Dataflow opts 1.99, Filter opts 1.71, Re-use logs 1.71]

STM implementation supporting static separation
In-place updates
Lazy conflict detection
Per-object STM metadata

Addition of read/write barriers before accesses
Read: log per-object metadata word
Update: CAS on per-object metadata word
Update: log value being overwritten
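These barriers can be rendered as a minimal single-threaded sketch, assuming a version-number metadata word (a real STM would CAS the metadata word to acquire it; the names here are invented):

```python
class STMObject:
    def __init__(self, value):
        self.value = value
        self.meta = 0              # per-object STM metadata word (a version)

class Tx:
    def __init__(self):
        self.read_log = []         # read barrier: log per-object metadata word
        self.undo_log = []         # update barrier: log value being overwritten

    def read(self, obj):
        self.read_log.append((obj, obj.meta))
        return obj.value

    def update(self, obj, new_value):
        self.undo_log.append((obj, obj.value))
        obj.value = new_value      # in-place update

    def commit(self):
        # Lazy conflict detection: re-check logged metadata at commit time.
        if any(obj.meta != seen for obj, seen in self.read_log):
            for obj, old in reversed(self.undo_log):
                obj.value = old    # roll back in-place updates
            return False
        for obj, _ in self.undo_log:
            obj.meta += 1          # publish a new version
        return True

x = STMObject(10)
tx = Tx()
tx.update(x, tx.read(x) + 1)
assert tx.commit() and x.value == 11 and x.meta == 1

tx2 = Tx()
tx2.read(x)
x.meta += 1                        # a concurrent commit invalidates tx2
tx2.update(x, 99)
assert not tx2.commit() and x.value == 11   # rolled back from the undo log
```

The in-place-update design is why the undo log is needed: a doomed transaction has already written to the heap, so its old values must be restorable.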

Page 269:

Sequential overhead

[Bar chart (1-thread, normalized to seq. baseline): STM 11.86, Dynamic filtering 3.14, Dataflow opts 1.99, Filter opts 1.71, Re-use logs 1.71]

Dynamic filtering to remove redundant logging

Log size grows with #locations accessed
Consequential reduction in validation time

1st level: per-thread hashtable (1024 entries)
2nd level: per-object bitmap of updated fields
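A sketch of the first-level filter: a direct-mapped 1024-entry per-thread table in front of the STM log, so an object already logged in this transaction is not logged again (the hash used here, drop the low bits and mask, is an invented stand-in):

```python
class LogFilter:
    """First-level dynamic filter over redundant STM logging (sketch)."""
    SIZE = 1024                          # per-thread hashtable entries

    def __init__(self):
        self.table = [None] * self.SIZE
        self.log = []                    # the real (and growing) STM log

    def maybe_log(self, obj_addr):
        slot = (obj_addr >> 2) & (self.SIZE - 1)
        if self.table[slot] == obj_addr:
            return                       # hit: object already logged this tx
        self.table[slot] = obj_addr      # miss: remember it, then really log
        self.log.append(obj_addr)

f = LogFilter()
for addr in [0x1000, 0x1000, 0x1004, 0x1000]:
    f.maybe_log(addr)
```

The filter is conservative: a slot collision causes a harmless redundant log entry, never a missed one, so correctness does not depend on the hash quality.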

Page 270:

Sequential overhead

[Bar chart (1-thread, normalized to seq. baseline): STM 11.86, Dynamic filtering 3.14, Dataflow opts 1.99, Filter opts 1.71, Re-use logs 1.71]

Data-flow optimizations

Remove repeated log operations
Open-for-read/update on a per-object basis
Log-old-value on a per-field basis
Remove concurrency control on newly-allocated objects

Page 271:

Sequential overhead

[Bar chart (1-thread, normalized to seq. baseline): STM 11.86, Dynamic filtering 3.14, Dataflow opts 1.99, Filter opts 1.71, Re-use logs 1.71]

Inline optimized filter operations

Re-use table_base between filter operations
Avoids caller save/restore on filter hits

mov eax <- obj_addr
and eax <- eax, 0xffc
mov ebx <- [table_base + eax]
cmp ebx, obj_addr
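The four instructions probe a direct-mapped table: mask the object address down to a word-aligned offset within a 4KB table, load the entry, and compare it against the address. A Python rendering of that probe (the indexing scheme is inferred from the 0xffc mask, so treat it as an approximation):

```python
def filter_hit(table, obj_addr):
    """Probe corresponding to:
         mov eax <- obj_addr
         and eax <- eax, 0xffc        # word-aligned offset into a 4KB table
         mov ebx <- [table_base + eax]
         cmp ebx, obj_addr
    Bits 2..11 of the address select one of 1024 word-sized slots."""
    offset = obj_addr & 0xffc
    return table[offset >> 2] == obj_addr

table = [0] * 1024                   # per-thread filter table (table_base)
addr = 0x12345678
hit_before = filter_hit(table, addr)     # miss: fall back to the logging path
table[(addr & 0xffc) >> 2] = addr        # logging path records the address
hit_after = filter_hit(table, addr)      # later accesses hit in 4 instructions
```

Keeping table_base pinned in a register across probes is what makes the hit path this short: a hit needs no call, and therefore no caller save/restore.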

Page 272:

Sequential overhead

[Bar chart (1-thread, normalized to seq. baseline): STM 11.86, Dynamic filtering 3.14, Dataflow opts 1.99, Filter opts 1.71, Re-use logs 1.71]

Re-use STM logs between transactions

Reduces pressure on per-page allocation lock
Reduces time spent in GC
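A sketch of log re-use via a simple free list (names invented): log buffers are recycled across transactions instead of freshly allocated, so the allocator and the GC see far fewer objects.

```python
class LogPool:
    """Recycle transaction log buffers across transactions (sketch)."""
    def __init__(self):
        self.free = []
        self.allocations = 0

    def acquire(self):
        if self.free:
            buf = self.free.pop()
            buf.clear()            # re-use: reset instead of reallocating
            return buf
        self.allocations += 1      # models pressure on the allocation lock
        return []

    def release(self, buf):
        self.free.append(buf)      # return the buffer at commit/abort

pool = LogPool()
for _ in range(100):               # 100 transactions on one thread
    log = pool.acquire()
    log.append("log entry")
    pool.release(log)
```

One buffer serves all 100 transactions, so allocation happens once rather than per transaction.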

Page 273:

Scaling – Labyrinth

[Line chart: execution time / seq. baseline vs. #threads (1 to 8), comparing static separation and strong isolation]

1.0 = wall-clock execution time of sequential code without concurrency control

Page 274:

Scaling – Genome

[Line chart: execution time / seq. baseline vs. #threads (1 to 8), comparing static separation and strong isolation]

Page 275:

Granularity

Distributed, large-scale atomic actions

Composable shared memory data structures

“Leaf” shared memory data structures

General purpose atomic actions in a program

Page 276:

Programming abstraction

Lock elision

The program’s semantics is defined using locks. TM is used as an implementation mechanism.

Speculation

Semantics defined by speculative execution, commit, etc. (either implicitly, or explicitly)

Atomic

Semantics defined by atomic execution (e.g. “atomic { X }”). Speculation, if used, is abstracted by the implementation.

Page 277:

Purpose

Makes software easier to develop / verify / maintain / …

Faster: better than alternatives, irrespective of complexity

Page 278:

Design points that I like

HW DCAS / 3-CAS / …
Granularity: leaf data structures
Abstraction: atomic multi-word CAS
Purpose: faster

HTM with limited guarantees (~ASF)
Granularity: leaf data structures
Abstraction: short transactions
Purpose: faster

Static separation (e.g., STM-Haskell)
Granularity: composable data structures
Abstraction: atomic actions
Purpose: easier, decent perf

Page 279:

Design points I am sceptical about

Speculative lock elision on general-purpose s/w

“atomic” blocks over normal data in a high-level language (C#/Java)

(prove me wrong, I would like either of these to work!)

Page 280:

What do we care about?


How fast is it?

Is it correct? How easy is it to write?

When can I

use it?

How does it scale?

Page 281:

©2011 Microsoft Corporation. All rights reserved. This material is provided for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.

www.research.microsoft.com