Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
John M. Mellor-Crummey and Michael L. Scott
Presented by Joseph Garvey & Joshua San Miguel
Dance Hall Machines?
Atomic Instructions
•Various instructions known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
•Some can be used to simulate others, but often with overhead
•Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently
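As a concrete illustration of simulating one primitive with another, here is a C11 sketch (the function name and use of <stdatomic.h> are our choices, not from the slides) that builds fetch_and_add out of compare_and_swap; the retry loop is the overhead the bullet above refers to:

    #include <stdatomic.h>

    int fetch_and_add (atomic_int *addr, int delta)
    {
        int old = atomic_load (addr);
        /* On failure, compare_exchange reloads `old` with the current value,
           so we simply retry until no other processor intervened. */
        while (!atomic_compare_exchange_weak (addr, &old, old + delta))
            ;
        return old;
    }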
Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;   // spin until we observe unlocked

procedure release_lock (lock *L)
    *L = unlocked
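A minimal C11 rendering of the same lock, for concreteness (names are our assumptions):

    #include <stdatomic.h>

    typedef atomic_flag lock_t;               /* clear == unlocked */

    void acquire_lock (lock_t *L)
    {
        /* test_and_set returns the previous value; keep trying while it was set.
           Every iteration is an atomic read-modify-write on the shared lock word. */
        while (atomic_flag_test_and_set (L))
            ;
    }

    void release_lock (lock_t *L)
    {
        atomic_flag_clear (L);
    }

Every waiting processor hammers the lock word with atomic operations, which is exactly the traffic the diagram below illustrates.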
[Figure: three processors (P), each with a cache ($), all contending for a single lock word in shared memory]
Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked               // spin with ordinary reads
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
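The same idea as a C11 sketch (again under our naming assumptions): spin on a plain load, which can be satisfied from the local cache, and attempt the expensive atomic operation only when the lock looks free.

    #include <stdatomic.h>

    typedef atomic_int lock_t;                /* 0 == unlocked, 1 == locked */

    void acquire_lock (lock_t *L)
    {
        for (;;) {
            /* read-only spin: hits in the local cache once the line is loaded */
            while (atomic_load_explicit (L, memory_order_relaxed) != 0)
                ;
            /* the lock looked free; now try the atomic read-modify-write */
            if (atomic_exchange (L, 1) == 0)
                return;
        }
    }

    void release_lock (lock_t *L)
    {
        atomic_store (L, 0);
    }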
Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
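A C11 sketch of the exponential-backoff variant; the initial delay, the cap, and the busy-wait pause loop are assumptions of ours, since the slide leaves them unspecified:

    #include <stdatomic.h>

    typedef atomic_int lock_t;                /* 0 == unlocked, 1 == locked */

    void acquire_lock (lock_t *L)
    {
        unsigned delay = 1;                   /* assumed starting delay */
        while (atomic_exchange (L, 1) != 0) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                             /* crude stand-in for pause(delay) */
            if (delay < 1024)                 /* assumed cap to keep backoff bounded */
                delay *= 2;
        }
    }

    void release_lock (lock_t *L)
    {
        atomic_store (L, 0);
    }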
Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
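A C11 sketch (field and function names ours). Waiters spin with ordinary reads on now_serving, and the ticket order grants the lock FIFO, so a release wakes exactly one waiter logically even though all of them reread the counter:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;              /* both fields start at 0 */
        atomic_uint now_serving;
    } ticket_lock_t;

    void acquire_lock (ticket_lock_t *L)
    {
        unsigned my_ticket = atomic_fetch_add (&L->next_ticket, 1);
        /* read-only spin; unsigned wraparound keeps the comparison sound */
        while (atomic_load (&L->now_serving) != my_ticket)
            ;
    }

    void release_lock (ticket_lock_t *L)
    {
        /* only the lock holder writes now_serving, so load-then-store is safe */
        atomic_store (&L->now_serving, atomic_load (&L->now_serving) + 1);
    }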
Array-Based Queuing Locks

type lock = record
    slots = array [0…numprocs – 1] of (has_lock, must_wait)
        // initially slots[0] = has_lock, all others = must_wait
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
        // various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;   // spin on our own slot
    L->slots[my_place] = must_wait            // reset the slot for later reuse

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock         // my_place remembered from acquire
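A C11 sketch of this lock (Anderson's) under two assumptions of ours: a fixed bound NUMPROCS, kept a power of two so counter wraparound preserves the modulo, and my_place handed back to the caller rather than kept in per-processor storage. Padding each slot to its own cache line would avoid false sharing; we omit that for brevity.

    #include <stdatomic.h>

    #define NUMPROCS 64                       /* assumed power-of-two bound on contenders */

    typedef struct {
        atomic_int  slots[NUMPROCS];          /* 1 == has_lock, 0 == must_wait */
        atomic_uint next_slot;
    } array_lock_t;

    /* initialize with slots[0] = 1, everything else 0 */

    unsigned acquire_lock (array_lock_t *L)
    {
        unsigned my_place = atomic_fetch_add (&L->next_slot, 1) % NUMPROCS;
        while (atomic_load (&L->slots[my_place]) == 0)
            ;                                 /* each waiter spins on its own slot */
        atomic_store (&L->slots[my_place], 0);    /* reset for the next trip around */
        return my_place;
    }

    void release_lock (array_lock_t *L, unsigned my_place)
    {
        atomic_store (&L->slots[(my_place + 1) % NUMPROCS], 1);
    }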
[Figure: next_slot and the slots array live in memory; each processor's cache ($) holds only its own my_place slot, so spinning stays local]
MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*     // tail of the queue; Null when the lock is free

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null            // queue was non-empty; we must wait
        I->locked = true
        predecessor->next = I
        while I->locked ;             // spin on our own qnode

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        // no known successor: try to swing the tail back to Null
        if compare_and_swap (L, I, Null)
            return
        // a new waiter won the race; wait for it to link itself in
        while I->next == Null ;
    I->next->locked = false
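A C11 sketch of the MCS lock (the _Atomic qualifiers and names are our rendering). Each waiter spins only on the locked flag in its own qnode, which is what makes the lock local-spinning:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        struct qnode *_Atomic next;
        atomic_bool locked;
    } qnode;

    typedef struct qnode *_Atomic mcs_lock_t;     /* queue tail; NULL == free */

    void acquire_lock (mcs_lock_t *L, qnode *I)
    {
        atomic_store (&I->next, NULL);
        qnode *predecessor = atomic_exchange (L, I);   /* fetch_and_store */
        if (predecessor != NULL) {                     /* queue was non-empty */
            atomic_store (&I->locked, true);
            atomic_store (&predecessor->next, I);
            while (atomic_load (&I->locked))
                ;                                      /* spin on our own node */
        }
    }

    void release_lock (mcs_lock_t *L, qnode *I)
    {
        qnode *successor = atomic_load (&I->next);
        if (successor == NULL) {
            qnode *expected = I;
            /* no known successor: try to swing the tail back to NULL */
            if (atomic_compare_exchange_strong (L, &expected, NULL))
                return;
            /* a new waiter won the race; wait for it to link itself in */
            while ((successor = atomic_load (&I->next)) == NULL)
                ;
        }
        atomic_store (&successor->locked, false);
    }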
Results: Scalability – Distributed Memory Architecture
Results: Scalability – Cache Coherent Architecture
•The Butterfly's atomic instructions are very expensive
•The Butterfly can't handle 24-bit pointers
Results: Single Processor Lock/Release Time

Times are in μs.

Machine                   | Test_and_set | Ticket | Anderson (Queue) | MCS
Butterfly (Distributed)   | 34.9         | 38.7   | 65.7             | 71.3
Symmetry (Cache coherent) | 7.0          | NA     | 10.6             | 9.2
Results: Network Congestion

Increase in network latency, measured from the lock node and from an idle node:

Busy-wait Lock                 | Lock Node | Idle Node
test_and_set                   | 1420%     | 96%
test_and_set w/ linear backoff | 882%      | 67%
test_and_set w/ exp. backoff   | 32%       | 4%
ticket                         | 992%      | 97%
ticket w/ prop. backoff        | 53%       | 8%
Anderson                       | 75%       | 67%
MCS                            | 4%        | 2%
Which lock should I use?

•If atomic instructions are much more expensive than normal instructions and single-processor latency is very important → don't use MCS
•If processes might be preempted → test_and_set with exponential backoff
•Otherwise:
    fetch_and_store supported?
        Yes → MCS
        No → fetch_and_increment supported?
            Yes → Ticket
            No → test_and_set w/ exp. backoff
Centralized Barrier
[Figure: P0–P3 each increment a shared counter, which counts 0 through 4; the last arrival releases everyone]
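A sense-reversing centralized barrier in C11, as a sketch (P, the names, and the sense-reversal detail are our assumptions; the slide shows only the counter). The last thread to arrive resets the count and flips a shared flag that everyone else is spinning on:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 4                               /* assumed number of participants */

    static atomic_int  count = P;
    static atomic_bool sense = false;

    void barrier (bool *local_sense)          /* per-thread, starts false */
    {
        *local_sense = !*local_sense;         /* flag value we await this episode */
        if (atomic_fetch_sub (&count, 1) == 1) {
            atomic_store (&count, P);         /* last arrival: reset for reuse... */
            atomic_store (&sense, *local_sense);  /* ...then release the others */
        } else {
            while (atomic_load (&sense) != *local_sense)
                ;                             /* everyone spins on one shared flag */
        }
    }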
Software Combining Tree Barrier
[Figure: P0–P3 arrive at leaf counters and combine pairwise up a tree of counters; the last arrival at each node continues upward]
Tournament Barrier
[Figure: P0–P3 play statically paired rounds; in each round the loser (L) spins while the winner (W) advances, until the champion (C) sees all arrivals and starts the wakeup]
Dissemination Barrier
[Figure: in round k, each of P0–P3 signals the processor 2^k positions away and waits for the matching signal, for ⌈log p⌉ rounds]
New Tree-Based Barrier
[Figure: P0–P3 form a static arrival tree rooted at P0; each node spins locally until all of its children have arrived, then signals its parent]
Summary

Barrier                 | Space      | Wakeup    | Local Spinning | Network Txns
Centralized             | O(1)       | broadcast | no             | O(p) or O(∞)
Software Combining Tree | O(p)       | tree      | no             | O(p × fan-in) or O(∞)
Tournament              | O(p log p) | tree      | yes            | O(p)
Dissemination           | O(p log p) | none      | yes            | O(p log p)
New Tree-Based          | O(p)       | tree      | yes            | 2p − 2
Results – Local vs. Remote Spinning

Barrier        | Network Latency (local spinning) | Network Latency (remote spinning)
New Tree-Based | 10% increase                     | 124% increase
Dissemination  | 18% increase                     | 117% increase
Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory → Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent → Centralized Barrier, or New Tree-Based Barrier (central wakeup)
Architectural Recommendations

•No dance hall machines
•No need for complicated hardware synchronization support
•Do provide a full set of fetch_and_Φ instructions