Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

John M. Mellor-Crummey, Michael L. Scott

Joseph Garvey & Joshua San Miguel

Dance Hall Machines?

Atomic Instructions

• Various instructions, known collectively as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
• Some can be used to simulate others, but often with overhead
• Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently
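As a rough illustration of the primitives named on this slide, the sketch below maps each one onto C++ std::atomic operations and then simulates fetch_and_add with compare_and_swap. The wrapper names are chosen here to match the slide and are not a library API; the retry loop in the last function is the kind of overhead the second bullet refers to. Default (sequentially consistent) memory ordering is assumed throughout for simplicity.

```cpp
#include <atomic>
#include <cassert>

// test_and_set: set the flag and return its previous value.
inline bool test_and_set(std::atomic<bool>& flag) {
    return flag.exchange(true);
}

// fetch_and_store: write a new value and return the old one.
inline int fetch_and_store(std::atomic<int>& x, int value) {
    return x.exchange(value);
}

// fetch_and_add: add delta and return the old value.
inline int fetch_and_add(std::atomic<int>& x, int delta) {
    return x.fetch_add(delta);
}

// compare_and_swap: store desired only if x currently equals expected.
inline bool compare_and_swap(std::atomic<int>& x, int expected, int desired) {
    return x.compare_exchange_strong(expected, desired);
}

// fetch_and_add simulated with compare_and_swap. Under contention the
// loop may retry several times, which is the simulation overhead.
inline int fetch_and_add_via_cas(std::atomic<int>& x, int delta) {
    int old_value = x.load();
    while (!x.compare_exchange_weak(old_value, old_value + delta)) {
        // On failure, old_value has been refreshed with the current value.
    }
    return old_value;
}
```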

Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;    // spin until the old value was unlocked

procedure release_lock (lock *L)
    *L = unlocked
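A minimal C++ sketch of the basic test_and_set lock, assuming std::atomic<bool>::exchange stands in for the hardware test_and_set. The class name TasLock and the helper tas_demo are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TasLock {
    std::atomic<bool> locked_{false};
public:
    void acquire() {
        // exchange(true) returns the previous value: true means someone
        // else holds the lock, so keep retrying the atomic operation.
        while (locked_.exchange(true)) { }
    }
    void release() { locked_.store(false); }
};

// Hypothetical demo helper: several threads increment a plain counter
// under the lock; the result is exact only if mutual exclusion holds.
inline long tas_demo(int nthreads, int iters) {
    TasLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;
                lock.release();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Every failed attempt here is still an atomic read-modify-write on the shared lock word, so all waiters keep generating traffic on the same location.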

Test_and_set: Basic (figure: three processors P, each with its own cache $, attached to a shared Memory)



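Test_and_set: test_and_test_and_set

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked                        // ordinary read first
            if test_and_set (L) == unlocked      // only then try the atomic instruction
                return

procedure release_lock (lock *L)
    *L = unlocked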
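The same idea can be sketched in C++, again assuming std::atomic<bool> in place of the hardware primitive. TtasLock and ttas_demo are illustrative names, not part of the original slides.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void acquire() {
        for (;;) {
            // Test: spin with plain loads, which a cache-coherent
            // machine can satisfy from the local cache.
            while (locked_.load()) { }
            // Test_and_set: attempt the atomic operation only once
            // the lock has been observed free.
            if (!locked_.exchange(true)) return;
        }
    }
    void release() { locked_.store(false); }
};

// Hypothetical demo helper: checks mutual exclusion with a shared counter.
inline long ttas_demo(int nthreads, int iters) {
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;
                lock.release();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```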

Test_and_set: test_and_test_and_set (figure: three processors P, each with its own cache $, attached to a shared Memory)



Test_and_set: test_and_set with backoff

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2        // exponential backoff

procedure release_lock (lock *L)
    *L = unlocked
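A hedged C++ sketch of the backoff variant follows. Two details are assumptions not found on the slide: pause(delay) is modeled with std::this_thread::sleep_for in microseconds, and the delay is capped at an arbitrary kMaxDelayUs so that it cannot grow without bound.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

constexpr int kMaxDelayUs = 1024;  // assumed cap, not from the slide

class BackoffLock {
    std::atomic<bool> locked_{false};
public:
    void acquire() {
        int delay_us = 1;
        while (locked_.exchange(true)) {
            // Lock was held: wait before retrying, then double the wait.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            if (delay_us < kMaxDelayUs) delay_us *= 2;
        }
    }
    void release() { locked_.store(false); }
};

// Hypothetical demo helper: checks mutual exclusion with a shared counter.
inline long backoff_demo(int nthreads, int iters) {
    BackoffLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;
                lock.release();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```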

Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
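One way to sketch the ticket lock in C++, assuming fetch_add plays the role of fetch_and_increment. The yield call in the wait loop is an addition for hosts with fewer cores than threads and is not part of the slide's pseudocode; TicketLock and ticket_demo are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
public:
    void acquire() {
        // Take a ticket; unsigned wraparound is well defined.
        unsigned my_ticket = next_ticket_.fetch_add(1);
        // Wait until this ticket is being served (FIFO order).
        while (now_serving_.load() != my_ticket) {
            std::this_thread::yield();  // assumption: be polite when oversubscribed
        }
    }
    void release() {
        // Only the lock holder writes now_serving_, so load-then-store is safe.
        now_serving_.store(now_serving_.load() + 1);
    }
};

// Hypothetical demo helper: checks mutual exclusion with a shared counter.
inline long ticket_demo(int nthreads, int iters) {
    TicketLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;
                lock.release();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```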

Array-Based Queuing Locks

type lock = record
    slots = array [0..numprocs - 1] of (has_lock, must_wait)
        // initially slots[0] = has_lock, all others = must_wait
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // Various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;
    L->slots[my_place] = must_wait

procedure release_lock (lock *L)
    L->slots[(my_place + 1) mod numprocs] = has_lock
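A simplified C++ sketch of an array-based queuing lock is given below. It assumes that at most nprocs threads contend at once and that nprocs divides 2^32, so a plain modulo covers the counter wraparound that the slide summarizes as "various modulo work". ArrayLock and array_demo are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class ArrayLock {
    std::vector<std::atomic<bool>> slots_;  // true = has_lock, false = must_wait
    std::atomic<unsigned> next_slot_{0};
    unsigned nprocs_;
public:
    explicit ArrayLock(unsigned nprocs) : slots_(nprocs), nprocs_(nprocs) {
        for (auto& s : slots_) s.store(false);
        slots_[0].store(true);  // the first arrival gets the lock immediately
    }
    // Returns my_place, which the caller passes back to release().
    unsigned acquire() {
        unsigned my_place = next_slot_.fetch_add(1) % nprocs_;
        // Each waiter spins on its own slot rather than a shared word.
        while (!slots_[my_place].load()) {
            std::this_thread::yield();  // assumption: be polite when oversubscribed
        }
        slots_[my_place].store(false);  // reset the slot for its next user
        return my_place;
    }
    void release(unsigned my_place) {
        slots_[(my_place + 1) % nprocs_].store(true);  // hand off to the next slot
    }
};

// Hypothetical demo helper: checks mutual exclusion with a shared counter.
inline long array_demo(unsigned nthreads, int iters) {
    ArrayLock lock(nthreads);
    long counter = 0;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                unsigned place = lock.acquire();
                ++counter;
                lock.release(place);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

A production version would likely pad each slot to its own cache line so that neighbouring waiters do not share a line; that is omitted here to keep the sketch short.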

Array-Based Queuing Locks (figure: next_slot and slots held in shared Memory; three processors P, each with its own cache $ and its own my_place)

MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked ;

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false
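The MCS pseudocode translates fairly directly into C++ atomics. The sketch below assumes exchange for fetch_and_store and compare_exchange_strong for compare_and_swap, uses default sequentially consistent ordering for simplicity, and adds a yield in the wait loops that the slide does not have. McsNode, McsLock, and mcs_demo are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
    std::atomic<McsNode*> tail_{nullptr};  // the slide's lock L
public:
    void acquire(McsNode* me) {
        me->next.store(nullptr);
        McsNode* predecessor = tail_.exchange(me);  // fetch_and_store
        if (predecessor != nullptr) {
            me->locked.store(true);
            predecessor->next.store(me);  // link in behind the predecessor
            while (me->locked.load()) {   // spin only on our own node
                std::this_thread::yield();
            }
        }
    }
    void release(McsNode* me) {
        McsNode* successor = me->next.load();
        if (successor == nullptr) {
            McsNode* expected = me;
            // No known successor: try to swing the tail back to empty.
            if (tail_.compare_exchange_strong(expected, nullptr)) return;
            // A successor is mid-enqueue; wait for it to link itself in.
            while ((successor = me->next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        successor->locked.store(false);  // pass the lock on
    }
};

// Hypothetical demo helper: each thread brings its own queue node.
inline long mcs_demo(int nthreads, int iters) {
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            McsNode node;
            for (int i = 0; i < iters; ++i) {
                lock.acquire(&node);
                ++counter;
                lock.release(&node);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```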

MCS Locks (figure: queue of processors 1 to 5 behind lock L, with state labels R, B, and E, stepping through release_lock)

Results: Scalability – Distributed Memory Architecture

Results: Scalability – Cache Coherent Architecture

Results: Single Processor Lock/Release Time

Times are in μs

                             Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (Distributed)      34.9           38.7     65.7               71.3
Symmetry (Cache coherent)    7.0            NA       10.6               9.2

• Butterfly's atomic instructions are very expensive
• Butterfly's atomic instructions can't handle 24-bit pointers

Results: Network Congestion

Increase in network latency, measured from the lock node and from an idle node:

Busy-wait Lock                      Lock Node   Idle Node
test_and_set                        1420%       96%
test_and_set w/ linear backoff      882%        67%
test_and_set w/ exp. backoff        32%         4%
ticket                              992%        97%
ticket w/ prop. backoff             53%         8%
Anderson                            75%         67%
MCS                                 4%          2%

Which lock should I use?

• If atomic instructions are much more expensive than normal instructions and single-processor latency is very important: don't use MCS
• If processes might be preempted: test_and_set with exponential backoff
• Otherwise:

fetch_and_store supported?
    Yes: MCS
    No: fetch_and_increment supported?
        Yes: Ticket
        No: test_and_set w/ exp. backoff
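The decision slide can be read as a small function. The sketch below is one interpretation: it covers the preemption bullet and the two primitive-support questions, and deliberately leaves out the bullet about expensive atomic instructions, since the slide does not say which lock to use in that case. LockKind and choose_lock are hypothetical names.

```cpp
#include <cassert>

enum class LockKind { Mcs, Ticket, TestAndSetExpBackoff };

// Interpretation of the slide's decision tree; the parameter names
// are assumptions made for this sketch.
inline LockKind choose_lock(bool may_be_preempted,
                            bool has_fetch_and_store,
                            bool has_fetch_and_increment) {
    if (may_be_preempted) return LockKind::TestAndSetExpBackoff;
    if (has_fetch_and_store) return LockKind::Mcs;
    if (has_fetch_and_increment) return LockKind::Ticket;
    return LockKind::TestAndSetExpBackoff;
}
```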

Centralized Barrier (figure: processors P0 to P3 and a single shared counter shown with values 0, 1, 2, 3, 4)

Software Combining Tree Barrier (figure: processors P0 to P3, grouped as P0, P1 and P2, P3 at the leaves of a combining tree with a counter at each node)

Tournament Barrier (figure: processors P0 to P3 paired off over successive rounds, with nodes labeled W, L, and C for winner, loser, and champion)

Dissemination Barrier (figure: processors P0 to P3 signaling one another in rounds)

New Tree-Based Barrier (figure: processors P0 to P3 arranged in a tree, with a count at each node)

Summary

Barrier                    Space         Wakeup      Local Spinning   Network Txns
Centralized                O(1)          broadcast   no               O(p) or O(∞)
Software Combining Tree    O(p)          tree        no               O(p × fan-in) or O(∞)
Tournament                 O(p log p)    tree        yes              O(p)
Dissemination              O(p log p)    none        yes              O(p log p)
New Tree-Based             O(p)          tree        yes              2p - 2

Results – Distributed Shared Memory



Results – Broadcast-Based Cache-Coherent



Results – Local vs. Remote Spinning

Barrier           Network Latency (local)   Network Latency (remote)
New Tree-Based    10% increase              124% increase
Dissemination     18% increase              117% increase

Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory: Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent: Centralized Barrier, or New Tree-Based Barrier (central wakeup)

Architectural Recommendations

• No dance hall
• No need for complicated hardware synchronization
• Need a full set of fetch_and_Φ
