Post on 19-Dec-2015
![Page 1: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/1.jpg)
A Scalable Approach to Thread-Level Speculation
J. Gregory Steffan, Christopher B. Colohan,
Antonia Zhai, and Todd C. Mowry
Carnegie Mellon University
![Page 2: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/2.jpg)
Outline
- Motivation
- Thread-level speculation (TLS)
- Coherence scheme
- Optimizations
- Methodology
- Results
- Conclusion
![Page 3: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/3.jpg)
Motivation
- Leading chip manufacturers are moving to multi-core architectures
- Multiple cores are usually used to increase throughput
- To exploit these parallel resources to increase performance, programs need to be parallelized
- Integer programs are hard to parallelize
- Use speculation: thread-level speculation (TLS)!
![Page 4: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/4.jpg)
Thread-level speculation (TLS)
![Page 5: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/5.jpg)
Scalable Approach
- The paper aims to design a scalable approach that applies to a wide variety of multiprocessor-like architectures
- The only limitation is that the architecture must be shared-memory based
- TLS is implemented on top of an invalidation-based cache coherence protocol
![Page 6: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/6.jpg)
Example
- Each cache line has special bits
  - SL: a speculative load has accessed the line
  - SM: the line is speculatively modified
- A thread is squashed if
  - the line is present
  - SL is set
  - the epoch number indicates an earlier thread
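
The squash condition above can be sketched as a single predicate. This is a minimal illustration, not the paper's hardware logic: the `CacheLine` class, the `must_squash` name, and the use of plain integers for epoch numbers (smaller means logically earlier) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    present: bool = False  # the line is valid in this cache
    sl: bool = False       # SL: a speculative load has accessed the line
    sm: bool = False       # SM: the line is speculatively modified


def must_squash(line: CacheLine, local_epoch: int, sender_epoch: int) -> bool:
    """Return True if an incoming invalidation signals a violation.

    Assumption: epochs are plain integers and a smaller value means a
    logically earlier thread.
    """
    return line.present and line.sl and sender_epoch < local_epoch
```

For instance, `must_squash(CacheLine(present=True, sl=True), local_epoch=5, sender_epoch=3)` evaluates to `True`, while the same call with `sl=False` does not trigger a squash.
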
![Page 7: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/7.jpg)
Speculation level
- We are concerned only with the speculation level: the level in the cache hierarchy where the cache protocol begins
- We can ignore all the other levels
![Page 8: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/8.jpg)
Cache line states
- Apart from the cache state bits, we need the SL and SM bits
- A cache line with speculative bits set cannot be replaced
- If such a replacement is required, the thread is either squashed or the operation is delayed
![Page 9: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/9.jpg)
Basic cache coherence protocol
- When a processor wants to load a value, it needs at least shared access to the line
- When it wants to write, it needs exclusive access
- The coherence mechanism issues invalidation messages when it receives a request for exclusive access
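
The three rules above can be illustrated with a toy directory for a single cache line. This is a simplified sketch under stated assumptions: the `LineDirectory` class and its method names are hypothetical, data transfer and write-back are ignored, and only access rights and invalidations are modeled.

```python
class LineDirectory:
    """Toy directory entry for one cache line (illustrative only)."""

    def __init__(self):
        self.holders = set()    # caches that hold a valid copy
        self.exclusive = False  # True when the single holder has exclusive access

    def load(self, cpu):
        """Grant at least shared access to cpu."""
        if cpu in self.holders:
            return  # already has shared or exclusive access
        self.holders.add(cpu)
        self.exclusive = False  # a previous exclusive holder is downgraded to shared

    def store(self, cpu):
        """Grant exclusive access; return the caches that receive an invalidation."""
        invalidated = sorted(self.holders - {cpu})
        self.holders = {cpu}
        self.exclusive = True
        return invalidated
```

If caches 0 and 1 have loaded the line and cache 2 then stores to it, `store(2)` returns `[0, 1]`, the caches that must be invalidated.
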
![Page 10: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/10.jpg)
Coherence mechanism
![Page 11: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/11.jpg)
Commit
- When the homefree token arrives, there is no possibility of further squashes
- SpE is changed to E and SpS to S
- Lines with the SM bit set must have the D bit set
- If a line is speculatively modified and shared, we have to get exclusive access for that line
- The ownership required buffer (ORB) is used to track such lines
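
The commit steps listed above can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the cache is modeled as a dictionary of per-line dictionaries, the ORB as a set of addresses, and `get_exclusive` is a hypothetical callback standing in for the coherence request.

```python
def commit(cache, orb, get_exclusive):
    """Commit an epoch's speculative state once the homefree token has arrived.

    cache: dict mapping address -> {"state", "sl", "sm", "dirty"}
    orb:   set of addresses that were speculatively modified while only shared
    get_exclusive: callback that requests exclusive access for an address
    """
    # Lines tracked by the ORB need ownership before their data can be committed.
    for addr in sorted(orb):
        get_exclusive(addr)
        cache[addr]["state"] = "SpE"
    orb.clear()

    for line in cache.values():
        if line["sm"]:
            line["dirty"] = True  # SM set implies the D bit must be set
        # SpE -> E and SpS -> S; non-speculative states are left unchanged.
        line["state"] = {"SpE": "E", "SpS": "S"}.get(line["state"], line["state"])
        line["sl"] = line["sm"] = False
```

Passing `requested.append` as `get_exclusive` with an initially empty list `requested` is a simple way to observe which addresses needed an ownership upgrade.
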
![Page 12: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/12.jpg)
Squash
- All speculatively modified lines have to be invalidated
- SpE is changed to E and SpS to S
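
A squash can be sketched in the same style. The dictionary-based cache model, the `"I"` marker for an invalidated line, and the function name are assumptions made for illustration; the sketch applies the SpE/SpS transitions only to lines that were not speculatively modified, since modified lines are invalidated.

```python
def squash(cache):
    """Discard an epoch's speculative state; return the invalidated addresses.

    cache: dict mapping address -> {"state", "sl", "sm"}
    """
    invalidated = []
    for addr, line in cache.items():
        if line["sm"]:
            line["state"] = "I"  # speculative modifications are thrown away
            invalidated.append(addr)
        elif line["state"] == "SpE":
            line["state"] = "E"
        elif line["state"] == "SpS":
            line["state"] = "S"
        line["sl"] = line["sm"] = False
    return invalidated
```
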
![Page 13: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/13.jpg)
Performance Optimizations
- Forwarding data between epochs: predictable data dependences are synchronized
- Dirty and speculatively loaded state: usually, if a dirty line is speculatively loaded, it is flushed; this can be avoided
- Suspending violations: when we have to evict a speculative line, we do not need to squash
![Page 14: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/14.jpg)
Multiple writers
- If two epochs write to the same line, we have to squash one of them to avoid the multiple-writer problem
- This can be avoided by maintaining fine-grained disambiguation bits
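
One way to picture fine-grained disambiguation bits is a per-word SM mask that lets several epochs' versions of a line be combined. The sketch below is an assumption-laden illustration: the word granularity, the `WORDS_PER_LINE` constant, and merging in epoch order are choices made for the example, not details taken from the slides.

```python
WORDS_PER_LINE = 4  # assumed line size in words, for illustration only


def merge_line(committed, versions):
    """Combine several epochs' versions of one cache line.

    committed: list of word values currently in memory
    versions:  list of (data, sm_mask) pairs in epoch order, earliest first;
               bit w of sm_mask is set if that epoch modified word w
    """
    merged = list(committed)
    for data, sm_mask in versions:
        for w in range(WORDS_PER_LINE):
            if sm_mask & (1 << w):
                merged[w] = data[w]  # a later epoch overwrites an earlier one
    return merged
```

With these bits, two epochs that write different words of the same line no longer conflict: each contributes only the words it actually modified.
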
![Page 15: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/15.jpg)
Implementation
![Page 16: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/16.jpg)
Epoch numbers
- An epoch number has two parts: a TID and a sequence number
- To avoid costly comparisons on every access, the difference is precomputed and a logically-later mask is formed
- Epoch numbers are maintained in one place per chip
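
The comparison and the precomputed mask can be sketched as below. Several details are assumptions made for the example rather than facts from the slides: the sequence number width `SEQ_BITS`, the wraparound-tolerant comparison, treating epochs with different TIDs as unordered, and the function names.

```python
SEQ_BITS = 8  # assumed width of the sequence number


def is_earlier(a, b):
    """True if epoch a = (tid, seq) logically precedes epoch b.

    Assumption: epochs with different TIDs are treated as unordered, and
    sequence numbers are compared modulo 2**SEQ_BITS to tolerate wraparound.
    """
    (tid_a, seq_a), (tid_b, seq_b) = a, b
    if tid_a != tid_b:
        return False
    diff = (seq_b - seq_a) % (1 << SEQ_BITS)
    return 0 < diff < (1 << (SEQ_BITS - 1))


def logically_later_mask(own, others):
    """Precompute one bit per other epoch: set if that epoch is later than own."""
    mask = 0
    for i, epoch in enumerate(others):
        if is_earlier(own, epoch):
            mask |= 1 << i
    return mask
```

Once the mask is formed, checking whether a message comes from a logically later epoch reduces to testing a single bit instead of comparing full epoch numbers on every access.
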
![Page 17: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/17.jpg)
Speculative state implementation
![Page 18: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/18.jpg)
Multiple writers: implementation
- False violations are also handled in the same way
![Page 19: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/19.jpg)
Correctness considerations
- Speculation fails if the speculative state is lost
- Exceptions are handled only once the homefree token has been received
- System calls are also postponed
![Page 20: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/20.jpg)
Methodology
- Detailed out-of-order simulation based on the MIPS R10000
- Fork and other synchronization overhead is 10 cycles
![Page 21: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/21.jpg)
Results
- Normalized execution cycles
![Page 22: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/22.jpg)
Results
- Buk and equake: memory performance is a bottleneck
- Beyond 4 processors, ijpeg performance degrades
  - The number of available threads is limited
  - Some conflicts in the cache
![Page 23: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/23.jpg)
Overheads
- Violations
- Cache locality is important
- ORB size can be further reduced: early release of ORB
![Page 24: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/24.jpg)
Communication overhead
- Buk is insensitive
![Page 25: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/25.jpg)
Multiprocessor performance
- Advantage: more cache storage
- Disadvantage: increased communication latency
![Page 26: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/26.jpg)
Conclusion
- By using TLS, even integer programs can be parallelized to get speedup
- The approach is scalable and can be applied to various other architectures that support multiple threads
- Some applications are insensitive to communication latency, so large-scale parallel architectures using TLS are possible
![Page 27: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University](https://reader035.vdocument.in/reader035/viewer/2022062714/56649d385503460f94a11a2e/html5/thumbnails/27.jpg)
Thanks!