Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

John M. Mellor-Crummey, Michael L. Scott; Joseph Garvey & Joshua San Miguel


Post on 23-Feb-2016



Page 1: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

John M. Mellor-Crummey, Michael L. Scott

Joseph Garvey & Joshua San Miguel

Page 2: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Dance Hall Machines?

Page 3: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Atomic Instructions

•Various instructions known as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap
•Some can be used to simulate others, but often with overhead
•Some lock types require a particular primitive to be implemented at all, or to be implemented efficiently
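As a rough illustration of one primitive simulating another, the sketch below emulates fetch_and_add with a compare_and_swap retry loop using C11 atomics. The function names are hypothetical and not from the slides. The retry loop is where the extra overhead comes from under contention.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: fetch_and_add emulated with compare_and_swap. */
static int fetch_and_add_via_cas(atomic_int *addr, int delta)
{
    int old = atomic_load(addr);
    /* A failed compare-exchange writes the current value back into `old`,
       so each retry works with fresh data. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta))
        ;
    return old;   /* like a native fetch_and_add, return the prior value */
}

/* Single-threaded usage check. */
static int fetch_and_add_demo(void)
{
    atomic_int counter;
    atomic_init(&counter, 5);
    int before = fetch_and_add_via_cas(&counter, 3);
    assert(before == 5);
    return atomic_load(&counter);   /* 5 + 3 */
}
```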

Page 4: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;

procedure release_lock (lock *L)
    *L = unlocked
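A minimal C11 sketch of the basic test_and_set lock, assuming atomic_flag as the underlying primitive. The tas_* names are illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_flag held; } tas_lock;   /* illustrative name */

static void tas_init(tas_lock *l) { atomic_flag_clear(&l->held); }

static void tas_acquire(tas_lock *l)
{
    /* test_and_set returns the old value; nonzero means the lock was taken.
       Every iteration is an atomic read-modify-write on the shared flag. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

static void tas_release(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Single-threaded usage check. */
static int tas_demo(void)
{
    tas_lock l;
    int shared = 0;
    tas_init(&l);
    tas_acquire(&l);
    shared++;                 /* critical section */
    tas_release(&l);
    /* After release the flag must be clear again. */
    assert(!atomic_flag_test_and_set(&l.held));
    return shared;
}
```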

Page 5: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Test_and_set: Basic

[Diagram: three processors (P), each with its own cache ($), connected to a shared Memory]

Page 6: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
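The same idea as a C11 sketch, under the assumption that an atomic_bool plus atomic_exchange stands in for test_and_set. The ttas_* names are hypothetical. Waiters spin on a plain read and only attempt the atomic exchange once the lock looks free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;   /* illustrative name */

static void ttas_init(ttas_lock *l) { atomic_init(&l->locked, false); }

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* Read-only spin: no atomic read-modify-write while the lock is held. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Lock looked free: now attempt the actual test_and_set. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Single-threaded usage check. */
static bool ttas_demo(void)
{
    ttas_lock l;
    ttas_init(&l);
    ttas_acquire(&l);
    bool held = atomic_load(&l.locked);
    ttas_release(&l);
    assert(held);
    return atomic_load(&l.locked);   /* false once released */
}
```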

Page 7: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Test_and_set: test_and_test_and_set

[Diagram: three processors (P), each with its own cache ($), connected to a shared Memory]

Page 8: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
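A hedged C11 sketch of exponential backoff. Two things here are assumptions and not on the slide: MAX_DELAY caps the delay, and pause_for is a crude busy-wait standing in for pause. The acquire function returns its failure count only so that the demo can inspect it.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_DELAY 1024u   /* assumed cap; the slide's pseudocode has none */

typedef struct { atomic_flag held; } backoff_lock;   /* illustrative name */

static void pause_for(unsigned iterations)
{
    /* Crude busy-wait; volatile keeps the loop from being optimized away. */
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* Returns the number of failed attempts before the lock was obtained. */
static unsigned backoff_acquire(backoff_lock *l)
{
    unsigned delay = 1, failures = 0;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        pause_for(delay);
        if (delay < MAX_DELAY)
            delay *= 2;        /* double the wait after every failure */
        failures++;
    }
    return failures;
}

static void backoff_release(backoff_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Single-threaded usage check: an uncontended acquire never backs off. */
static unsigned backoff_demo(void)
{
    backoff_lock l;
    atomic_flag_clear(&l.held);
    unsigned failures = backoff_acquire(&l);
    backoff_release(&l);
    assert(failures == 0);
    return failures;
}
```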

Page 9: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (&L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
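A minimal C11 sketch of the ticket lock, assuming atomic_fetch_add plays the role of fetch_and_increment. The ticket_* names are illustrative. ticket_acquire returns the ticket only so the demo can show FIFO ordering.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

static void ticket_init(ticket_lock *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static unsigned ticket_acquire(ticket_lock *l)
{
    unsigned my_ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    /* Spin with ordinary reads until our number comes up. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket)
        ;
    return my_ticket;
}

static void ticket_release(ticket_lock *l)
{
    /* Only the lock holder writes now_serving, so load + store is enough. */
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}

/* Single-threaded usage check: tickets are served in order. */
static unsigned ticket_demo(void)
{
    ticket_lock l;
    ticket_init(&l);
    unsigned first = ticket_acquire(&l);
    ticket_release(&l);
    unsigned second = ticket_acquire(&l);
    ticket_release(&l);
    assert(first == 0);
    return second;
}
```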

Page 10: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Array-Based Queuing Locks

type lock = record
    slots = array [0 … numprocs - 1] of (has_lock, must_wait)
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (&L->next_slot)
    // Various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;
    L->slots[my_place] = must_wait

procedure release_lock (lock *L)
    // my_place is the value this processor obtained in acquire_lock
    L->slots[(my_place + 1) mod numprocs] = has_lock
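A hedged C11 sketch of the array-based queuing lock. Assumptions not on the slide: NUMPROCS is fixed at 4 (a power of two, so the modulo stays consistent when the unsigned counter wraps), slot 0 starts out holding the lock, and the caller carries my_place from acquire to release. Names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUMPROCS 4u   /* assumed maximum number of contenders, power of two */

typedef struct {
    /* Production code would pad each slot to its own cache line. */
    atomic_bool has_lock[NUMPROCS];
    atomic_uint next_slot;
} array_lock;

static void array_init(array_lock *l)
{
    for (unsigned i = 0; i < NUMPROCS; i++)
        atomic_init(&l->has_lock[i], i == 0);   /* slot 0 starts with the lock */
    atomic_init(&l->next_slot, 0);
}

/* Returns the caller's slot, which must be passed back to array_release. */
static unsigned array_acquire(array_lock *l)
{
    unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % NUMPROCS;
    /* Each waiter spins on its own slot. */
    while (!atomic_load_explicit(&l->has_lock[my_place], memory_order_acquire))
        ;
    /* Reset the slot to "must wait" so it can be reused on the next lap. */
    atomic_store_explicit(&l->has_lock[my_place], false, memory_order_relaxed);
    return my_place;
}

static void array_release(array_lock *l, unsigned my_place)
{
    atomic_store_explicit(&l->has_lock[(my_place + 1) % NUMPROCS], true,
                          memory_order_release);
}

/* Single-threaded usage check: slots are used round-robin. */
static unsigned array_demo(void)
{
    array_lock l;
    unsigned slot = 0;
    array_init(&l);
    for (unsigned i = 0; i < NUMPROCS + 1; i++) {
        slot = array_acquire(&l);
        assert(slot == i % NUMPROCS);
        array_release(&l, slot);
    }
    return slot;   /* fifth acquisition wraps around to slot 0 */
}
```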

Page 11: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Array-Based Queuing Locks

[Diagram: Memory holds next_slot and slots; each of three processors (P) has its own cache ($) and its own my_place]

Page 12: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked ;

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false
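A hedged C11 sketch of the MCS lock, assuming atomic_exchange for fetch_and_store and atomic_compare_exchange_strong for compare_and_swap. The mcs_* names are illustrative. All operations use the default sequentially consistent ordering for simplicity, which is stronger than strictly required.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;   /* tail pointer; NULL means free */

static void mcs_acquire(mcs_lock *L, mcs_node *I)
{
    atomic_store(&I->next, NULL);
    mcs_node *pred = atomic_exchange(L, I);      /* fetch_and_store */
    if (pred != NULL) {
        atomic_store(&I->locked, true);
        atomic_store(&pred->next, I);            /* link behind predecessor */
        while (atomic_load(&I->locked))          /* spin on our own node */
            ;
    }
}

static void mcs_release(mcs_lock *L, mcs_node *I)
{
    mcs_node *succ = atomic_load(&I->next);
    if (succ == NULL) {
        mcs_node *expected = I;
        if (atomic_compare_exchange_strong(L, &expected, NULL))
            return;                              /* no one was waiting */
        /* A successor swapped itself in but has not linked yet. */
        while ((succ = atomic_load(&I->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);          /* hand the lock over */
}

/* Single-threaded usage check. */
static bool mcs_demo(void)
{
    mcs_lock lock;
    mcs_node me;
    atomic_init(&lock, NULL);
    mcs_acquire(&lock, &me);
    assert(atomic_load(&lock) == &me);   /* we are the tail of the queue */
    mcs_release(&lock, &me);
    return atomic_load(&lock) == NULL;   /* queue is empty again */
}
```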

Page 13: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

MCS Locks

[Diagram: MCS queue snapshots showing the lock pointer L and queue nodes labeled 1-R, 2-B, 3-B, 2-R, 3-R, 3-E, 4-B, 5-B, 4-R]

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false

Page 14: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results: Scalability – Distributed Memory Architecture

Page 15: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results: Scalability – Cache Coherent Architecture

Page 16: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results: Single Processor Lock/Release Time

•Butterfly’s atomic instructions are very expensive
•Butterfly can’t handle 24-bit pointers

Times are in μs              Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (Distributed)      34.9           38.7     65.7               71.3
Symmetry (Cache coherent)    7.0            NA       10.6               9.2

Page 17: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results: Network Congestion

Increase in network latency, measured from:

Busy-wait Lock                    Lock Node   Idle Node
test_and_set                      1420%       96%
test_and_set w/ linear backoff    882%        67%
test_and_set w/ exp. backoff      32%         4%
ticket                            992%        97%
ticket w/ prop backoff            53%         8%
Anderson                          75%         67%
MCS                               4%          2%

Page 18: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Which lock should I use?

•Atomic instructions >> normal instructions && 1-processor latency is very important: don’t use MCS
•If processes might be preempted: test_and_set with exponential backoff

fetch_and_store supported?
    Yes: MCS
    No: fetch_and_increment supported?
        Yes: Ticket
        No: test_and_set w/ exp backoff

Page 19: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Centralized Barrier

[Diagram: processors P0, P1, P2, P3 and a shared counter shown with the values 0, 1, 2, 3, 4]
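A hedged C11 sketch of one common form of centralized barrier, the sense-reversing variant. This is an assumption about the design the diagram depicts, and the central_* names are illustrative. Every arrival updates one shared counter, and every waiter spins on one shared flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;      /* arrivals still expected in this episode */
    atomic_bool sense;     /* flips once per barrier episode */
    int nprocs;
} central_barrier;

static void central_init(central_barrier *b, int nprocs)
{
    atomic_init(&b->count, nprocs);
    atomic_init(&b->sense, false);
    b->nprocs = nprocs;
}

/* Each thread keeps its own local_sense, initially false. */
static void central_wait(central_barrier *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last arriver: reset the counter, then release everyone. */
        atomic_store(&b->count, b->nprocs);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;   /* all waiters spin on the same shared flag */
    }
}

/* Single-threaded usage check with one participant and two episodes. */
static bool central_demo(void)
{
    central_barrier b;
    bool local_sense = false;
    central_init(&b, 1);
    central_wait(&b, &local_sense);
    assert(atomic_load(&b.sense) == true);   /* flipped by first episode */
    central_wait(&b, &local_sense);
    return atomic_load(&b.sense);            /* flipped back by the second */
}
```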

Page 20: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Software Combining Tree Barrier

[Diagram: processors P0, P1, P2, P3 grouped as (P0, P1) and (P2, P3) under a tree of counters]

Page 21: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Tournament Barrier

[Diagram: tournament among processors P0, P1, P2, P3 with nodes labeled W, L, and C]

Page 22: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Dissemination Barrier

[Diagram: signaling pattern among processors P0, P1, P2, P3]
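The signaling pattern of a dissemination barrier can be sketched as follows: in round k, processor i signals processor (i + 2^k) mod p, and ceil(log2 p) rounds are needed. The helper names below are hypothetical. The sketch computes only the partners, not the flag handling.

```c
#include <assert.h>

/* Number of rounds needed for p processors: ceil(log2(p)). */
static int dissemination_rounds(int p)
{
    int rounds = 0;
    for (int reach = 1; reach < p; reach *= 2)
        rounds++;
    return rounds;
}

/* Processor i signals this partner in round k (0-based). */
static int dissemination_partner(int i, int k, int p)
{
    return (i + (1 << k)) % p;
}

/* Usage check for the four-processor case shown on the slide. */
static int dissemination_demo(void)
{
    int p = 4;
    assert(dissemination_rounds(p) == 2);
    assert(dissemination_partner(0, 0, p) == 1);   /* round 0: distance 1 */
    assert(dissemination_partner(3, 1, p) == 1);   /* round 1: distance 2, wraps */
    return dissemination_partner(2, 1, p);         /* (2 + 2) mod 4 */
}
```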

Page 23: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

New Tree-Based Barrier

[Diagram: processors P0, P1, P2, P3 arranged in a tree with per-node counters]

Page 24: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Summary

Barrier                   Space        Wakeup      Local Spinning   Network Txns
Centralized               O(1)         broadcast   no               O(p) or O(∞)
Software Combining Tree   O(p)         tree        no               O(p × fan-in) or O(∞)
Tournament                O(p log p)   tree        yes              O(p)
Dissemination             O(p log p)   none        yes              O(p log p)
New Tree-Based            O(p)         tree        yes              2p - 2

Page 25: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results – Distributed Shared Memory

Page 26: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results – Broadcast-Based Cache-Coherent

Page 27: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Results – Local vs. Remote Spinning

Barrier          Network Latency (local)   Network Latency (remote)
New Tree-Based   10% increase              124% increase
Dissemination    18% increase              117% increase

Page 28: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory: Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent: Centralized Barrier, or New Tree-Based Barrier (central wakeup)

Page 29: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Architectural Recommendations

•No dance hall
•No need for complicated hardware synch
•Need a full set of fetch_and_Φ