TRANSCRIPT
Slide 1 (of 23)
MULTIVIEW
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Paper: Thomas E. Anderson
Presentation: Emerson Murphy-Hill
Slide 2 (of 23)
Introduction
• Shared-memory multiprocessors
• Mutual exclusion required
• Hardware primitives almost always provided
  – Direct mutual exclusion
  – Mutual exclusion through locking
• Interest here: short critical regions, spin locks
• The problem: spinning processors consume communication bandwidth – how can we cut it?
Slide 3 (of 23)
Range of Architectures
• Two dimensions:
  – Interconnect type (multistage network or bus)
  – Cache type
• So six architectures considered:
  – Multistage network without private caches
  – Multistage network, invalidation-based cache coherence using RD
  – Bus without coherent private caches
  – Bus with snoopy write-through invalidation-based cache coherence
  – Bus with snoopy write-back invalidation-based cache coherence
  – Bus with snoopy distributed-write cache coherence
• Architectures generally read, modify, and write atomically
Slide 4 (of 23)
Why Spinlocks are Slow
• Tradeoff: frequent polling gets you the lock faster, but slows everyone else down
• Latency is also an issue: a more complicated spin-lock algorithm adds overhead to each lock acquisition
Slide 5 (of 23)
A Spin-Waiting Algorithm
• Spin on Test-and-Set:

    while (TestAndSet(lock) = BUSY);
    <critical section>
    lock := CLEAR;

• Slow, because:
  – The lock holder must contend with non-lock-holders
  – Spinning requests slow down other requests
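The pseudocode above maps almost directly onto C11 atomics; here is a minimal sketch (a single global lock; the function names are mine, not from the paper):

```c
#include <stdatomic.h>

/* Spin on Test-and-Set, as in the slide's pseudocode.
   atomic_flag_test_and_set atomically sets the flag and returns its
   previous value, so a "true" result means the lock was already BUSY. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock))
        ;  /* every failed attempt still costs a bus/network transaction */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock);  /* lock := CLEAR */
}
```

Note that the spin loop issues an atomic read-modify-write on every iteration, which is exactly the traffic the slide complains about.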
Slide 6 (of 23)
Another Spin-Waiting Algorithm
• Spin on Read (Test-and-Test-and-Set):

    while (lock = BUSY or TestAndSet(lock) = BUSY);
    <critical section>
    lock := CLEAR;
• For architectures with per-processor cache
• Like previous, but no network/bus communication on read
• For short critical sections, this is slow, because the time to quiesce (all processors resume spinning) dominates
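In C11 atomics, spin-on-read looks roughly like this (a sketch; the plain load in the inner loop is what stays in the local cache):

```c
#include <stdatomic.h>

static _Atomic int lock_word = 0;  /* 0 = CLEAR, 1 = BUSY */

/* Test-and-Test-and-Set: spin on an ordinary read (cache-local once
   the line is cached) and only issue the expensive atomic exchange
   when the lock appears free. */
void tts_lock(void) {
    for (;;) {
        while (atomic_load(&lock_word) == 1)
            ;  /* read-only spin: no bus traffic after the first miss */
        if (atomic_exchange(&lock_word, 1) == 0)
            return;  /* our Test-and-Set succeeded */
    }
}

void tts_unlock(void) {
    atomic_store(&lock_word, 0);  /* lock := CLEAR invalidates all copies */
}
```

The release store is what triggers the invalidation storm described on the next slide: every spinner misses, re-reads, and many then attempt the exchange at once.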
Slide 7 (of 23)
Reasons Why Quiescence is Slow
• Elapsed time between Read and Test-and-Set
• All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails
• Invalidation-based cache coherence requires O(P) bus/network cycles, because the written value has to be propagated to every processor (even though it is the same value!)
Slide 8 (of 23)
Validation
Slide 9 (of 23)
Validation (a bit more)
Slide 10 (of 23)
Now, Speed it Up…
• The author presents five alternative approaches
• Interesting approach – four are based on the observation that communication during spin-waiting resembles contention in CSMA (Ethernet) networking protocols
Slide 11 (of 23)
1/5: Static Delay on Lock Release
• When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-And-Set
• Each processor is assigned a static delay (slot)
• Good performance when:
  – There are few spinning processors and few slots, or
  – There are many spinning processors and many slots
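The slot scheme can be sketched in a few lines (SLOT_CYCLES is an illustrative constant, not a value from the paper): each processor owns a fixed slot, so at most one processor attempts the Test-and-Set in any given slot.

```c
/* Static delay on lock release: processor in slot i waits
   i * SLOT_CYCLES after noticing the release before trying
   Test-and-Set. */
enum { SLOT_CYCLES = 10 };  /* illustrative slot length */

unsigned release_delay(unsigned slot) {
    return slot * SLOT_CYCLES;
}
```

Processors in low-numbered slots acquire a free lock quickly; the cost is that with many slots and few spinners, the winning slot may be far in the future.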
Slide 12 (of 23)
2/5: Backoff on Lock Release
• Like Ethernet backoff
• Wait a small amount of time between Read and Test-and-Set
• If a processor collides with another processor, it backs off for a greater random interval
• Indirectly, processors base their backoff interval on the number of spinning processors
• But…
Slide 13 (of 23)
More on Backoff…
• Processors should not change their mean delay if another processor acquires the lock
• Maximum time to delay should be bounded
• Initial delay on arrival should be a fraction of the last delay
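The last two rules can be sketched as a capped exponential backoff (the constants, the halving fraction, and the function names are illustrative, not from the paper; the first rule is a constraint on what *not* to do, so it has no code of its own):

```c
enum { MIN_DELAY = 1, MAX_DELAY = 1024 };  /* illustrative bounds */

/* Rule 2: double the delay after each failed attempt, but bound it. */
unsigned next_delay(unsigned d) {
    d *= 2;
    return d > MAX_DELAY ? MAX_DELAY : d;
}

/* Rule 3: a newly arriving processor starts at a fraction (here half)
   of the last delay used, rather than starting from scratch. */
unsigned initial_delay(unsigned last_delay) {
    unsigned d = last_delay / 2;
    return d < MIN_DELAY ? MIN_DELAY : d;
}
```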
Slide 14 (of 23)
3/5: Static Delay before Reference
    while (lock = BUSY or TestAndSet(lock) = BUSY)
        delay();
    <critical section>

• Here you just check the lock less often
• Good when:
  – Checking frequently and there are few other spinners, or
  – Checking infrequently and there are many spinners
Slide 15 (of 23)
4/5: Backoff before Reference
    while (lock = BUSY or TestAndSet(lock) = BUSY) {
        delay();
        delay += randomBackoff();
    }
    <critical section>

• Analogous to backoff on lock release
• Both dynamic and static backoff are bad when the critical section is long: processors just keep backing off while the lock is held
Slide 16 (of 23)
5/5: Queue
• Can't estimate backoff by the number of waiting processors; can't keep a software queue of waiting processes (accessing it would be just as slow as the lock itself!)
• The author's contribution (finally):

    Init:   flags[0] := HAS_LOCK;
            flags[1..P-1] := MUST_WAIT;
            queueLast := 0;

    Lock:   myPlace := ReadAndIncrement(queueLast);
            while (flags[myPlace mod P] = MUST_WAIT);
            <critical section>

    Unlock: flags[myPlace mod P] := MUST_WAIT;
            flags[(myPlace+1) mod P] := HAS_LOCK;
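The same algorithm in C11 atomics (a sketch: P and the flag encoding are mine, and a real implementation would pad each flag to its own cache line so the slots don't share coherence units):

```c
#include <stdatomic.h>

#define P 4  /* number of processors (illustrative) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

/* Anderson's array-based queue lock: each waiter spins on its own
   slot, so a release touches only the next waiter's location. */
static _Atomic int flags[P] = { HAS_LOCK };  /* rest default to MUST_WAIT */
static _Atomic unsigned queueLast = 0;

unsigned queue_lock(void) {
    unsigned myPlace = atomic_fetch_add(&queueLast, 1);  /* ReadAndIncrement */
    while (atomic_load(&flags[myPlace % P]) == MUST_WAIT)
        ;  /* spin on a private slot */
    return myPlace;
}

void queue_unlock(unsigned myPlace) {
    atomic_store(&flags[myPlace % P], MUST_WAIT);
    atomic_store(&flags[(myPlace + 1) % P], HAS_LOCK);  /* hand off */
}
```

`atomic_fetch_add` plays the role of ReadAndIncrement: it hands each arriving processor a unique queue position in FIFO order.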
Slide 17 (of 23)
More on Queuing
• Works especially well for multistage networks – each flag can be on a separate module, so a single memory location isn’t saturated with requests
• Works less well if there’s a bus without cache coherence, because we still have the problem that each process has to poll for a single value in one place
• Lock latency is increased (overhead), so poor performance when there’s no contention
Slide 18 (of 23)
Benchmark Spin-lock Alternatives
Slide 19 (of 23)
Overhead vs. Number of Slots
Slide 20 (of 23)
Spin-waiting Overhead for a Burst
Slide 21 (of 23)
Network Hardware Solutions
• Combining Networks
  – Multiple paths to the same memory location
• Hardware Queuing
  – Eliminates polling across the network
• Goodman's Queue Links
  – Stores the name of the next processor in the queue directly in each processor's cache
  – Eliminates the need for a memory access for queuing
Slide 22 (of 23)
Bus Hardware Solutions
• Invalidate cached copies ONLY when the Test-and-Set succeeds
• Read broadcast
  – Whenever some other processor reads a value that I know is invalid, I get a copy of that value too (piggyback)
  – Eliminates the cascade of read misses
• Special handling of Test-and-Set
  – Cache and bus controllers stay off the bus while the lock is busy
  – Essentially, don't issue a Test-and-Set as long as there is a possibility it might fail
Slide 23 (of 23)
Conclusions
• Spin-lock performance doesn't scale
• A variant of Ethernet backoff gives good results when there is little lock contention
• Queuing (parallelizing lock handoff) gives good results when there are waiting processors
• A little supportive hardware goes a long way toward a healthy multiprocessor relationship