Jeremy Denham
April 7, 2008
- Motivation
- Background / Previous work
- Experimentation
- Results
- Questions
Modern processor design trends are primarily concerned with the multi-core design paradigm.
- Still figuring out what to do with them
- Different way of thinking about “shared-memory multiprocessors”
- Distributed apps?

Synchronization will be important.
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott, 1991.
- Scalable, busy-wait synchronization algorithms
- No memory or interconnect contention
- O(1) remote references per mechanism utilization
- Spin locks and barriers
“Spin” on lock by busy-waiting until available.
Typically involves “fetch-and-Φ” operations
Must be atomic!
- “Test-and-set”
  - Needs processor support to make it atomic
  - “fetch-and-store”: xchg works on x86
- Loop until lock is possessed
- Expensive!
  - Frequently accessed, too
  - Networking issues
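
The test-and-set loop above can be sketched with C11 atomics. This is a minimal illustration, not the code used in the talk: the names `tas_lock`, `tas_try_acquire`, `tas_acquire`, and `tas_release` are hypothetical, and `atomic_exchange` stands in for the fetch-and-store (`xchg`) primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; a minimal test-and-set spin lock sketch. */
typedef struct {
    atomic_bool held;   /* true while some thread owns the lock */
} tas_lock;

static void tas_init(tas_lock *l) { atomic_init(&l->held, false); }

/* One atomic fetch-and-store: write "held" and inspect the old value.
   Returns true if the lock was free and is now ours. */
static bool tas_try_acquire(tas_lock *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

/* Busy-wait until the lock is possessed. Every failed attempt is another
   atomic read-modify-write on the shared flag, which is the cost the
   slide calls out. */
static void tas_acquire(tas_lock *l) {
    while (!tas_try_acquire(l)) { /* spin */ }
}

static void tas_release(tas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Because every waiter hammers the same flag, the traffic grows with the number of spinning processors.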
Ticket lock:
- Can reduce fetch-and-Φ ops to one per lock acquisition
- FIFO service guarantee
- Two counters
  - Requests
  - Releases
- fetch_and_increment request counter
- Wait until release counter reflects turn
- Still problematic…
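
The two-counter scheme above (a ticket lock) might look like the following in C11. This is a sketch under assumptions: the type and function names are hypothetical, and `ticket_acquire` returns the ticket number only to make the FIFO order visible.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names; a minimal ticket lock sketch. */
typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock;

static void ticket_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* A single fetch_and_increment takes a ticket; after that the waiter
   only reads the release counter until it shows its turn. */
static unsigned ticket_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        /* spin */
    }
    return my;
}

/* Advance the release counter so the next ticket holder may enter. */
static void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

All waiters still poll the same `now_serving` word, which is one reason the approach remains problematic at scale.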
Array-based queuing lock (T.E. Anderson):
- Incoming processes put themselves in the queue
- Lock holder hands off the lock to next in queue
- Faster than ticket, but more space
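
A rough sketch of an array-based queue lock in the spirit of Anderson's design is shown below. The names and the fixed `MAX_PROCS` bound are assumptions for illustration, and cache-line padding of the slots is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bound on concurrent waiters. A power of two keeps the slot
   sequence consistent when the unsigned counter wraps around. */
#define MAX_PROCS 8

typedef struct {
    atomic_bool can_enter[MAX_PROCS];  /* one slot per waiter: the extra space */
    atomic_uint next_slot;             /* where the next arrival queues up */
} array_lock;

static void array_init(array_lock *l) {
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->can_enter[i], i == 0);  /* slot 0 starts open */
    atomic_init(&l->next_slot, 0);
}

/* Join the queue with one fetch_and_increment, then spin on our own slot.
   Returns the slot index, which the caller passes to array_release. */
static unsigned array_acquire(array_lock *l) {
    unsigned slot = atomic_fetch_add(&l->next_slot, 1) % MAX_PROCS;
    while (!atomic_load_explicit(&l->can_enter[slot], memory_order_acquire)) {
        /* spin */
    }
    /* Close our slot again so it can be reused on a later lap. */
    atomic_store_explicit(&l->can_enter[slot], false, memory_order_relaxed);
    return slot;
}

/* Hand the lock to whoever waits on the next slot. */
static void array_release(array_lock *l, unsigned slot) {
    atomic_store_explicit(&l->can_enter[(slot + 1) % MAX_PROCS], true,
                          memory_order_release);
}
```

Each waiter spins on a different array element, trading O(P) space per lock for less contention on any single word.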
MCS lock:
- FIFO guarantee
- Local spinning!
- Small constant amount of space
- Cache coherence a non-issue
- Each processor allocates a record
  - next link
  - boolean flag
- Adds to queue
- Spins locally
- Owner passes lock to next user in queue as necessary
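
The record-per-processor queue described above corresponds to the MCS list-based lock. The following C11 sketch is one possible rendering, with hypothetical names (`mcs_node`, `mcs_lock`, `mcs_acquire`, `mcs_release`); it uses fetch-and-store to enqueue and compare-and-swap on release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-processor record: a next link and a boolean flag. */
typedef struct mcs_node {
    struct mcs_node *_Atomic next;   /* successor in the queue, if any */
    atomic_bool locked;              /* the flag this processor spins on */
} mcs_node;

typedef struct {
    mcs_node *_Atomic tail;          /* last record in the queue, or NULL */
} mcs_lock;

static void mcs_init(mcs_lock *l) { atomic_init(&l->tail, NULL); }

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Add ourselves to the queue with one fetch-and-store on the tail. */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        atomic_store_explicit(&pred->next, me, memory_order_release);
        /* Spin locally on our own flag until the owner hands over. */
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, (mcs_node *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* Someone is mid-enqueue; wait for the link to appear. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) { }
    }
    /* Pass the lock to the next user in the queue. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each caller supplies its own `mcs_node`, so the lock itself stays a single pointer and every waiter spins on memory it owns.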
Mechanism for “phase separation”
Block processes from proceeding until all others have reached a checkpoint
Designed for repetitive use
Centralized barrier:
- “Local” and “global” sense
- As processor arrives
  - Reverse local sense
  - Signal its arrival
  - If last, reverse global sense
  - Else spin
- Lots of spinning…
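
The arrival steps above describe a sense-reversing centralized barrier. A minimal C11 sketch follows, assuming a shared arrival counter; the names `central_barrier`, `barrier_init`, and `barrier_wait` are hypothetical, and each thread is assumed to keep its own `local_sense`, initially `false`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint remaining;     /* processors yet to arrive in this episode */
    atomic_bool global_sense;  /* flipped by the last arrival */
    unsigned    nprocs;
} central_barrier;

static void barrier_init(central_barrier *b, unsigned nprocs) {
    atomic_init(&b->remaining, nprocs);
    atomic_init(&b->global_sense, false);
    b->nprocs = nprocs;
}

static void barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;                 /* reverse local sense */
    /* Signal arrival; the old value tells us whether we are last. */
    if (atomic_fetch_sub_explicit(&b->remaining, 1,
                                  memory_order_acq_rel) == 1) {
        /* Last arrival: reset the count for reuse, then reverse the
           global sense to release everyone. */
        atomic_store_explicit(&b->remaining, b->nprocs, memory_order_relaxed);
        atomic_store_explicit(&b->global_sense, *local_sense,
                              memory_order_release);
    } else {
        /* Everyone else spins on the single shared flag. */
        while (atomic_load_explicit(&b->global_sense,
                                    memory_order_acquire) != *local_sense) { }
    }
}
```

Reversing the sense each episode is what makes the barrier safe for repetitive use without a separate reset phase.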
Barrier information is “disseminated” algorithmically
At each synchronization stage k, processor i signals processor $(i + 2^k) \bmod P$, where P is the number of processors
Similarly, processor i continues when it is signaled by processor $(i - 2^k) \bmod P$
log(P) operations on critical path, P log(P) remote operations
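
The signalling pattern of the dissemination barrier can be checked with a few index helpers. This sketch covers only the partner arithmetic, not the flag handling; the function names are hypothetical, and stages are assumed to be numbered from k = 0.

```c
#include <assert.h>

/* Number of stages: the smallest r with 2^r >= P, i.e. ceil(log2 P). */
static unsigned dissemination_rounds(unsigned nprocs) {
    unsigned rounds = 0;
    for (unsigned span = 1; span < nprocs; span *= 2)
        rounds++;
    return rounds;
}

/* Processor that i signals in stage k: (i + 2^k) mod P. */
static unsigned signal_target(unsigned i, unsigned k, unsigned nprocs) {
    return (i + (1u << k)) % nprocs;
}

/* Processor whose signal i waits for in stage k: (i - 2^k) mod P,
   written without negative intermediates. */
static unsigned signal_source(unsigned i, unsigned k, unsigned nprocs) {
    unsigned step = (1u << k) % nprocs;
    return (i + nprocs - step) % nprocs;
}
```

For P = 8 this gives three stages with strides 1, 2, and 4, so after the last stage every processor has transitively heard from every other one.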
Tournament barrier:
- Tree-based approach
- Outcome statically determined
- “Roles” for each round
  - “loser” notifies “winner,” then drops out
  - “winner” waits to be notified, participates in next round
  - “champion” sets global flag when over
- log(P) rounds
- Heavy interconnect traffic…
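
Because the outcome of each round is statically determined, the roles can be computed from the processor index alone. The sketch below shows one plausible assignment, assuming rounds are numbered from k = 1 and processor 0 ends up as champion; the enum, the `ROLE_BYE` case for processors without a partner, and the function name are illustrative assumptions.

```c
#include <assert.h>

typedef enum {
    ROLE_WINNER,    /* waits to be notified, plays the next round */
    ROLE_LOSER,     /* notifies its winner, then drops out */
    ROLE_BYE,       /* no partner this round; advances without waiting */
    ROLE_CHAMPION,  /* wins the final round and sets the global flag */
    ROLE_UNUSED     /* already dropped out in an earlier round */
} role_t;

/* Role of processor i in round k (k >= 1) among nprocs processors. */
static role_t tournament_role(unsigned i, unsigned k, unsigned nprocs) {
    unsigned span = 1u << k;      /* group size in this round */
    unsigned half = span / 2;     /* distance between the two opponents */
    if (i % span == half) return ROLE_LOSER;
    if (i % span != 0)    return ROLE_UNUSED;
    if (i + half >= nprocs) return ROLE_BYE;
    return (span >= nprocs) ? ROLE_CHAMPION : ROLE_WINNER;
}
```

With four processors, processors 1 and 3 lose round 1, processor 2 loses round 2, and processor 0 is the champion.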
MCS tree barrier:
- Also tree-based
- Local spinning
- O(P) space for P processors
- (2P - 2) network transactions
- O(log P) network transactions on critical path
- Use two P-node trees
- “child-not-ready” flag for each child present in parent
- When all children have signaled arrival, parent signals its parent
- When root detects all children have arrived, signals to the group that it can proceed to next barrier.
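
The “child-not-ready” bookkeeping of the arrival tree can be sketched sequentially. This is an illustration of the flag layout only: it assumes a fan-in of 4 with the children of node i at 4i+1 through 4i+4, uses plain `bool` fields rather than atomics, and omits the wakeup tree; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define FANIN 4   /* assumed arrival-tree fan-in */

typedef struct {
    bool have_child[FANIN];       /* which child slots exist for this node */
    bool child_not_ready[FANIN];  /* set until that child signals arrival */
} tree_node;

/* Node i's children are 4i+1 .. 4i+4, when those indices are < nprocs. */
static void tree_init(tree_node *nodes, unsigned nprocs) {
    for (unsigned i = 0; i < nprocs; i++)
        for (unsigned j = 0; j < FANIN; j++) {
            nodes[i].have_child[j] = (FANIN * i + j + 1 < nprocs);
            nodes[i].child_not_ready[j] = nodes[i].have_child[j];
        }
}

/* A node may signal its own parent once no child is outstanding. */
static bool all_children_arrived(const tree_node *n) {
    for (unsigned j = 0; j < FANIN; j++)
        if (n->child_not_ready[j]) return false;
    return true;
}

/* Node i (i > 0) reports arrival by clearing its slot in its parent. */
static void signal_parent(tree_node *nodes, unsigned i) {
    nodes[(i - 1) / FANIN].child_not_ready[(i - 1) % FANIN] = false;
}
```

In a real implementation each processor would spin on its own node's flags, and the root would start the wakeup phase once `all_children_arrived` holds for node 0.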
Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines
- BBN
  - Supports up to 256 processor nodes
  - 8 MHz MC68000
- Sequent
  - Supports up to 30 processor nodes
  - 16 MHz Intel 80386
- Most concerned with Sequent
Want to extend to multi-core machines
- Scalability of limited usefulness (not that many cores)
- Shared resources
- Core load
- Intel Centrino Duo T5200 Processor
  - Two cores
  - 1.60 GHz per core
  - 2MB L2 Cache
- Windows Vista
- 2GB DDR2 Memory
Evaluate basic and MCS approaches
- Simple and complex evaluations
- Core pinning
- Load ramping
- Code porting
  - Lots of Linux-specific code
- Win32 Thread API
  - Esoteric…
  - How to pin a thread to a core?
- Timing
  - Win32 μsec-granularity measurement
- Surprisingly archaic C code
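
One way to approach the pinning and timing questions above is with the Win32 calls `SetThreadAffinityMask` and `QueryPerformanceCounter`. The sketch below is an assumption about how this could be done, not the code used for these experiments; the helper names are hypothetical, and the Win32-specific part is compiled only when `_WIN32` is defined.

```c
#include <assert.h>
#include <stdint.h>

/* Convert performance-counter ticks to microseconds, given the counter
   frequency in ticks per second. Splitting into whole seconds and a
   remainder avoids overflow for large tick counts. */
static uint64_t ticks_to_usec(uint64_t ticks, uint64_t ticks_per_sec) {
    return (ticks / ticks_per_sec) * 1000000u
         + (ticks % ticks_per_sec) * 1000000u / ticks_per_sec;
}

#ifdef _WIN32
#include <windows.h>

/* Restrict the calling thread to a single core (0-based index).
   SetThreadAffinityMask returns the previous mask, or 0 on failure. */
static int pin_current_thread(unsigned core) {
    return SetThreadAffinityMask(GetCurrentThread(),
                                 (DWORD_PTR)1 << core) != 0;
}

/* Current time in microseconds from the high-resolution counter. */
static uint64_t now_usec(void) {
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return ticks_to_usec((uint64_t)count.QuadPart, (uint64_t)freq.QuadPart);
}
#endif
```

A measurement would then call `pin_current_thread` once per worker and take the difference of two `now_usec` readings around the critical-section loop.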
- Spin lock base code ported
- Barriers nearly done
- Simple experiments for spin locks done
- More complex on the way
- Simple spin lock tests
  - Simple lock outperforms MCS on:
    - Empty Critical Section
    - Simple FP Critical Section
    - Single core
    - Dual core
- More procedural overhead for MCS on small scale
- Next steps:
  - More threads!
  - More critical section complexity