Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
John M. Mellor-Crummey and Michael L. Scott
Presented by Joseph Garvey & Joshua San Miguel
Dance Hall Machines?
Atomic Instructions
•Various instructions known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
•Some can be used to simulate others, but often with overhead
•Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently
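As a concrete illustration of simulating one primitive with another, here is a C11 sketch (the function name and use of <stdatomic.h> are our choices, not from the slides) that builds fetch_and_add out of compare_and_swap; the retry loop is the overhead the bullet above refers to:

    #include <stdatomic.h>

    int fetch_and_add (atomic_int *addr, int delta)
    {
        int old = atomic_load (addr);
        /* On failure, compare_exchange reloads `old` with the current value,
           so we simply retry until no other processor intervened. */
        while (!atomic_compare_exchange_weak (addr, &old, old + delta))
            ;
        return old;
    }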
Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;   // spin until we observe unlocked

procedure release_lock (lock *L)
    *L = unlocked
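A minimal C11 rendering of the same lock, for concreteness (names are our assumptions):

    #include <stdatomic.h>

    typedef atomic_flag lock_t;               /* clear == unlocked */

    void acquire_lock (lock_t *L)
    {
        /* test_and_set returns the previous value; keep trying while it was set.
           Every iteration is an atomic read-modify-write on the shared lock word. */
        while (atomic_flag_test_and_set (L))
            ;
    }

    void release_lock (lock_t *L)
    {
        atomic_flag_clear (L);
    }

Every waiting processor hammers the lock word with atomic operations, which is exactly the traffic the diagram below illustrates.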
[Figure: three processors (P), each with a cache ($), all contending for a single lock word in shared memory]
Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked               // spin with ordinary reads
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
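The same idea as a C11 sketch (again under our naming assumptions): spin on a plain load, which can be satisfied from the local cache, and attempt the expensive atomic operation only when the lock looks free.

    #include <stdatomic.h>

    typedef atomic_int lock_t;                /* 0 == unlocked, 1 == locked */

    void acquire_lock (lock_t *L)
    {
        for (;;) {
            /* read-only spin: hits in the local cache once the line is loaded */
            while (atomic_load_explicit (L, memory_order_relaxed) != 0)
                ;
            /* the lock looked free; now try the atomic read-modify-write */
            if (atomic_exchange (L, 1) == 0)
                return;
        }
    }

    void release_lock (lock_t *L)
    {
        atomic_store (L, 0);
    }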
Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
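A C11 sketch of the exponential-backoff variant; the initial delay, the cap, and the busy-wait pause loop are assumptions of ours, since the slide leaves them unspecified:

    #include <stdatomic.h>

    typedef atomic_int lock_t;                /* 0 == unlocked, 1 == locked */

    void acquire_lock (lock_t *L)
    {
        unsigned delay = 1;                   /* assumed starting delay */
        while (atomic_exchange (L, 1) != 0) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                             /* crude stand-in for pause(delay) */
            if (delay < 1024)                 /* assumed cap to keep backoff bounded */
                delay *= 2;
        }
    }

    void release_lock (lock_t *L)
    {
        atomic_store (L, 0);
    }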
Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
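A C11 sketch (field and function names ours). Waiters spin with ordinary reads on now_serving, and the ticket order grants the lock FIFO, so a release wakes exactly one waiter logically even though all of them reread the counter:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;              /* both fields start at 0 */
        atomic_uint now_serving;
    } ticket_lock_t;

    void acquire_lock (ticket_lock_t *L)
    {
        unsigned my_ticket = atomic_fetch_add (&L->next_ticket, 1);
        /* read-only spin; unsigned wraparound keeps the comparison sound */
        while (atomic_load (&L->now_serving) != my_ticket)
            ;
    }

    void release_lock (ticket_lock_t *L)
    {
        /* only the lock holder writes now_serving, so load-then-store is safe */
        atomic_store (&L->now_serving, atomic_load (&L->now_serving) + 1);
    }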
Array-Based Queuing Locks

type lock = record
    slots = array [0…numprocs – 1] of (has_lock, must_wait)
        // initially slots[0] = has_lock, all others = must_wait
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
        // various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;   // spin on our own slot
    L->slots[my_place] = must_wait            // reset the slot for later reuse

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock         // my_place remembered from acquire
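A C11 sketch of this lock (Anderson's) under two assumptions of ours: a fixed bound NUMPROCS, kept a power of two so counter wraparound preserves the modulo, and my_place handed back to the caller rather than kept in per-processor storage. Padding each slot to its own cache line would avoid false sharing; we omit that for brevity.

    #include <stdatomic.h>

    #define NUMPROCS 64                       /* assumed power-of-two bound on contenders */

    typedef struct {
        atomic_int  slots[NUMPROCS];          /* 1 == has_lock, 0 == must_wait */
        atomic_uint next_slot;
    } array_lock_t;

    /* initialize with slots[0] = 1, everything else 0 */

    unsigned acquire_lock (array_lock_t *L)
    {
        unsigned my_place = atomic_fetch_add (&L->next_slot, 1) % NUMPROCS;
        while (atomic_load (&L->slots[my_place]) == 0)
            ;                                 /* each waiter spins on its own slot */
        atomic_store (&L->slots[my_place], 0);    /* reset for the next trip around */
        return my_place;
    }

    void release_lock (array_lock_t *L, unsigned my_place)
    {
        atomic_store (&L->slots[(my_place + 1) % NUMPROCS], 1);
    }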
[Figure: next_slot and the slots array live in memory; each processor's cache ($) holds only its own my_place slot, so spinning stays local]
MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*     // tail of the queue; Null when the lock is free

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null            // queue was non-empty; we must wait
        I->locked = true
        predecessor->next = I
        while I->locked ;             // spin on our own qnode

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        // no known successor: try to swing the tail back to Null
        if compare_and_swap (L, I, Null)
            return
        // a new waiter won the race; wait for it to link itself in
        while I->next == Null ;
    I->next->locked = false
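A C11 sketch of the MCS lock (the _Atomic qualifiers and names are our rendering). Each waiter spins only on the locked flag in its own qnode, which is what makes the lock local-spinning:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        struct qnode *_Atomic next;
        atomic_bool locked;
    } qnode;

    typedef struct qnode *_Atomic mcs_lock_t;     /* queue tail; NULL == free */

    void acquire_lock (mcs_lock_t *L, qnode *I)
    {
        atomic_store (&I->next, NULL);
        qnode *predecessor = atomic_exchange (L, I);   /* fetch_and_store */
        if (predecessor != NULL) {                     /* queue was non-empty */
            atomic_store (&I->locked, true);
            atomic_store (&predecessor->next, I);
            while (atomic_load (&I->locked))
                ;                                      /* spin on our own node */
        }
    }

    void release_lock (mcs_lock_t *L, qnode *I)
    {
        qnode *successor = atomic_load (&I->next);
        if (successor == NULL) {
            qnode *expected = I;
            /* no known successor: try to swing the tail back to NULL */
            if (atomic_compare_exchange_strong (L, &expected, NULL))
                return;
            /* a new waiter won the race; wait for it to link itself in */
            while ((successor = atomic_load (&I->next)) == NULL)
                ;
        }
        atomic_store (&successor->locked, false);
    }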
Results: Scalability – Distributed Memory Architecture
Results: Scalability – Cache Coherent Architecture
•The Butterfly's atomic instructions are very expensive
•The Butterfly can't handle 24-bit pointers
Results: Single Processor Lock/Release Time

Times are in μs.

Machine                   | Test_and_set | Ticket | Anderson (Queue) | MCS
Butterfly (Distributed)   | 34.9         | 38.7   | 65.7             | 71.3
Symmetry (Cache coherent) | 7.0          | NA     | 10.6             | 9.2
Results: Network Congestion

Increase in network latency, measured from the lock node and from an idle node:

Busy-wait Lock                 | Lock Node | Idle Node
test_and_set                   | 1420%     | 96%
test_and_set w/ linear backoff | 882%      | 67%
test_and_set w/ exp. backoff   | 32%       | 4%
ticket                         | 992%      | 97%
ticket w/ prop. backoff        | 53%       | 8%
Anderson                       | 75%       | 67%
MCS                            | 4%        | 2%
Which lock should I use?

•If atomic instructions are much more expensive than normal instructions and single-processor latency is very important → don't use MCS
•If processes might be preempted → test_and_set with exponential backoff
•Otherwise:
    fetch_and_store supported?
        Yes → MCS
        No → fetch_and_increment supported?
            Yes → Ticket
            No → test_and_set w/ exp. backoff
Centralized Barrier
[Figure: P0–P3 each increment a shared counter, which counts 0 through 4; the last arrival releases everyone]
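A sense-reversing centralized barrier in C11, as a sketch (P, the names, and the sense-reversal detail are our assumptions; the slide shows only the counter). The last thread to arrive resets the count and flips a shared flag that everyone else is spinning on:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 4                               /* assumed number of participants */

    static atomic_int  count = P;
    static atomic_bool sense = false;

    void barrier (bool *local_sense)          /* per-thread, starts false */
    {
        *local_sense = !*local_sense;         /* flag value we await this episode */
        if (atomic_fetch_sub (&count, 1) == 1) {
            atomic_store (&count, P);         /* last arrival: reset for reuse... */
            atomic_store (&sense, *local_sense);  /* ...then release the others */
        } else {
            while (atomic_load (&sense) != *local_sense)
                ;                             /* everyone spins on one shared flag */
        }
    }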
Software Combining Tree Barrier
[Figure: P0–P3 arrive at leaf counters and combine pairwise up a tree of counters; the last arrival at each node continues upward]
Tournament Barrier
[Figure: P0–P3 play statically paired rounds; in each round the loser (L) spins while the winner (W) advances, until the champion (C) sees all arrivals and starts the wakeup]
Dissemination Barrier
[Figure: in round k, each of P0–P3 signals the processor 2^k positions away and waits for the matching signal, for ⌈log p⌉ rounds]
New Tree-Based Barrier
[Figure: P0–P3 form a static arrival tree rooted at P0; each node spins locally until all of its children have arrived, then signals its parent]
Summary

Barrier                 | Space      | Wakeup    | Local Spinning | Network Txns
Centralized             | O(1)       | broadcast | no             | O(p) or O(∞)
Software Combining Tree | O(p)       | tree      | no             | O(p × fan-in) or O(∞)
Tournament              | O(p log p) | tree      | yes            | O(p)
Dissemination           | O(p log p) | none      | yes            | O(p log p)
New Tree-Based          | O(p)       | tree      | yes            | 2p − 2
Results – Local vs. Remote Spinning

Barrier        | Network Latency (local spinning) | Network Latency (remote spinning)
New Tree-Based | 10% increase                     | 124% increase
Dissemination  | 18% increase                     | 117% increase
Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory → Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent → Centralized Barrier, or New Tree-Based Barrier (central wakeup)
Architectural Recommendations

•No dance hall machines
•No need for complicated hardware synchronization support
•Do provide a full set of fetch_and_Φ instructions