A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry

Carnegie Mellon University

Outline

Motivation
Thread level speculation (TLS)
Coherence scheme
Optimizations
Methodology
Results
Conclusion

Motivation

Leading chip manufacturers are moving to multi-core architectures
Multiple cores are usually used to increase throughput
To exploit these parallel resources to increase performance, programs need to be parallelized
Integer programs are hard to parallelize
Use speculation: thread level speculation (TLS)!

Thread level speculation (TLS)

Scalable Approach

The paper aims to design a scalable approach that applies to a wide variety of multiprocessor-like architectures
The only limitation is that the architecture must be shared-memory based
TLS is implemented on top of an invalidation-based cache coherence protocol

Example

Each cache line has special bits
  SL: a speculative load has accessed the line
  SM: the line is speculatively modified
When an invalidation arrives, the thread is squashed if all of the following hold
  The line is present
  SL is set
  The epoch number indicates an earlier thread
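The squash condition can be written as a small predicate. This is only a sketch: it assumes epoch numbers are plain integers where a smaller value means a logically earlier thread, and the names `CacheLine` and `must_squash` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    present: bool = False  # line is currently in this cache
    sl: bool = False       # SL bit: a speculative load has accessed the line
    sm: bool = False       # SM bit: the line is speculatively modified


def must_squash(line: CacheLine, local_epoch: int, sender_epoch: int) -> bool:
    """Return True if an incoming invalidation should squash the local thread.

    Assumption: a smaller epoch number means a logically earlier thread.
    """
    return line.present and line.sl and sender_epoch < local_epoch
```

For example, a line that was speculatively loaded by epoch 5 and is then invalidated by epoch 3 triggers a squash, while the same invalidation coming from epoch 7 does not.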

Speculation level

We are concerned only with the speculation level: the level in the cache hierarchy where the cache coherence protocol begins
We can ignore all the other levels

Cache line states

Apart from the usual cache state bits, we need SL and SM bits
A cache line with speculative bits set cannot be replaced
If such a line would have to be replaced, the thread is either squashed or the operation is delayed

Basic cache coherence protocol

When a processor wants to load a value, it needs at least shared access to the line
When it wants to write, it needs exclusive access
The coherence mechanism issues invalidation messages when it receives a request for exclusive access
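The baseline (non-speculative) behaviour can be illustrated with a toy model. This is a sketch under assumptions: the `Directory` class, its method names, and the idea of tracking sharers in a dictionary are hypothetical illustrations of an invalidation-based protocol, not the mechanism described in the paper.

```python
class Directory:
    """Toy tracker of which processors hold a copy of each cache line."""

    def __init__(self):
        self.sharers = {}  # line address -> set of processor ids

    def request_shared(self, addr, cpu):
        """A load needs at least shared access: record the new sharer."""
        self.sharers.setdefault(addr, set()).add(cpu)
        return []  # no invalidations are needed for a shared request

    def request_exclusive(self, addr, cpu):
        """A store needs exclusive access: invalidate every other copy.

        Returns the sorted list of processors that receive an invalidation.
        """
        holders = self.sharers.get(addr, set())
        invalidated = sorted(holders - {cpu})
        self.sharers[addr] = {cpu}
        return invalidated
```

If processors 0 and 1 have loaded a line and processor 2 then requests exclusive access, the model sends invalidations to 0 and 1 and leaves 2 as the only holder.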

Coherence mechanism

Commit

When the homefree token arrives, there is no possibility of further squashes
SpE is changed to E and SpS to S
Lines with the SM bit set must have the D bit set
If a line is speculatively modified and shared, we have to get exclusive access for that line
The ownership required buffer (ORB) is used to track such lines
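The commit steps can be sketched as a single pass over the speculative lines. This is a simplified illustration: the `Line` record, the string state names, and the `acquire_exclusive` callback are assumptions, and the ORB is rebuilt at commit time here purely for brevity, whereas real hardware would presumably fill it as such lines appear.

```python
from dataclasses import dataclass


@dataclass
class Line:
    addr: int
    state: str           # one of "I", "S", "E", "SpS", "SpE"
    sl: bool = False     # speculatively loaded
    sm: bool = False     # speculatively modified
    dirty: bool = False  # D bit


def commit(lines, acquire_exclusive):
    """Commit an epoch once the homefree token has arrived.

    Returns the ORB contents: addresses that were speculatively modified
    while only held in a shared state.
    """
    orb = [ln.addr for ln in lines if ln.sm and ln.state == "SpS"]
    for addr in orb:
        acquire_exclusive(addr)  # hypothetical hook into the coherence protocol
    for ln in lines:
        if ln.state == "SpE" or ln.addr in orb:
            ln.state = "E"
        elif ln.state == "SpS":
            ln.state = "S"
        if ln.sm:
            ln.dirty = True  # lines with SM set must end up with D set
        ln.sl = ln.sm = False
    return orb
```

Passing `list.append` of an empty list as `acquire_exclusive` is a convenient way to observe which addresses needed an ownership upgrade.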

Squash

All speculatively modified lines have to be invalidated
SpE is changed to E and SpS to S
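A squash can be sketched in the same style. The `Line` record and the string state names are assumptions made for illustration; the logic only mirrors the two bullets above.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str        # one of "I", "S", "E", "SpS", "SpE"
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


def squash(lines):
    """Discard an epoch's speculative state."""
    for ln in lines:
        if ln.sm:
            ln.state = "I"  # speculatively modified data must not survive
        elif ln.state == "SpE":
            ln.state = "E"
        elif ln.state == "SpS":
            ln.state = "S"
        ln.sl = ln.sm = False  # clear the speculative bits
```

After a squash, lines that were only speculatively loaded remain usable in their non-speculative state, so the restarted epoch does not have to refetch them.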

Performance Optimizations

Forwarding Data Between Epochs:
  Predictable data dependences are synchronized
Dirty and Speculatively Loaded State:
  Usually if a dirty line is speculatively loaded, it is flushed; this can be avoided
Suspending Violations:
  When we have to evict a speculative line, we don't need to squash; the epoch can be suspended instead

Multiple writers

If two epochs write to the same line, we have to squash one to avoid the multiple-writer problem
It is possible to avoid this by maintaining fine-grained disambiguation bits
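One way to picture fine-grained disambiguation is a per-word modified mask that lets several versions of a line be combined. This is a sketch under assumptions: the per-word granularity, integer epoch numbers, and the `merge_line` helper are hypothetical, and the slides do not specify how the hardware combines versions.

```python
def merge_line(committed, versions):
    """Combine several epochs' versions of one cache line.

    committed: list of word values currently committed for the line
    versions:  list of (epoch, sm_mask, words) tuples, where bit i of
               sm_mask is set if that epoch modified word i
    Versions are applied in epoch order, so a later epoch wins on overlap.
    """
    result = list(committed)
    for _, sm_mask, words in sorted(versions, key=lambda v: v[0]):
        for i in range(len(result)):
            if (sm_mask >> i) & 1:
                result[i] = words[i]
    return result
```

With a four-word line, an epoch that wrote only word 0 and a later epoch that wrote only word 2 can both keep their updates, which is the case a single line-wide SM bit cannot distinguish.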

Implementation

Epoch numbers

An epoch number has two parts: a TID and a sequence number
To avoid costly comparisons on every access, the difference is precomputed and a logically-later mask is formed
Epoch numbers are maintained in one place per chip
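The precomputed mask can be sketched as follows. Several things here are assumptions rather than details from the slides: that epochs are ordered only when they share a TID, that sequence numbers never wrap around, and the names `Epoch`, `is_later`, and `logically_later_mask`.

```python
from typing import NamedTuple


class Epoch(NamedTuple):
    tid: int  # thread identifier part
    seq: int  # sequence number part


def is_later(a: Epoch, b: Epoch) -> bool:
    """True if a is logically later than b.

    Assumption: only epochs with the same TID are ordered, and sequence
    numbers do not wrap around.
    """
    return a.tid == b.tid and a.seq > b.seq


def logically_later_mask(mine: Epoch, others) -> int:
    """Precompute one bit per other epoch: set if that epoch is later than mine."""
    mask = 0
    for i, other in enumerate(others):
        if is_later(other, mine):
            mask |= 1 << i
    return mask
```

Once the mask is formed, checking whether a given epoch is logically later becomes a single bit test instead of a full comparison on every access.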

Speculative state implementation

Multiple writers - implementation

False violations are also handled in the same way

Correctness considerations

Speculation fails if the speculative state is lost
Exceptions are handled only once the homefree token has been received
System calls are also postponed

Methodology

Detailed out-of-order simulation based on the MIPS R10000
Fork and other synchronization overhead is 10 cycles

Results

Normalized execution cycles

Results

Buk and equake: memory performance is a bottleneck
With more than 4 processors, ijpeg performance degrades
  The number of threads available is limited
  Some conflicts in the cache

Overheads

Violations
Cache locality is important
ORB size can be further reduced: early release of ORB

Communication overhead

Buk is insensitive

Multiprocessor performance

Advantages
  More cache storage
Disadvantage
  Increased communication latency

Conclusion

By using TLS, even integer programs can be parallelized to get speedup
The approach is scalable and can be applied to various other architectures that support multiple threads
There are applications that are insensitive to communication latency, so large-scale parallel architectures using TLS are possible

Thanks!
