Implicitly-Multithreaded Processors

Il Park, Babak Falsafi, and T. N. Vijaykumar

Presented by: Ashay Rane

Published in: SIGARCH Computer Architecture News, 2003

Agenda

Overview (IMT, state of the art)

IMT enhancements

Key results

Critique

Relation to Term Project

Implicitly Multithreaded Processor (IMT)

SMT with speculation

Optimizations to basic SMT support

Average performance improvement of 24% (max: 69%)

State-of-the-art

Pentium 4 HT

IBM POWER5

MIPS MT

Speculative SMT operation

When a branch is encountered, start executing the likely path “speculatively”

i.e., allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation)

Speculation adds cost and overhead, which must be offset by savings in execution time and power (but worth the effort)

Complication: independent threads commit separately (a buffer for each thread). Issue, register renaming, and cache and TLB conflicts also need handling.

On a dependence violation, squash the thread and restart execution

How to buffer speculative data?

Load/Store Queue (LSQ): buffers data (along with its address); helps enforce dependence checks; makes rollback possible

Cache-based approaches
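The LSQ's role in buffering speculative data can be illustrated with a minimal Python sketch. The class and method names (`SpeculativeLSQ`, `store`, `load`, `violates`, `commit`, `squash`) are hypothetical, and the model covers a single thread's queue rather than the hardware described in the paper: stores are held with their addresses until commit, loads are recorded so a later check can detect a dependence violation, and a squash simply discards the buffered state.

```python
class SpeculativeLSQ:
    """Toy per-thread load/store queue (illustrative model, not the paper's design)."""

    def __init__(self, memory):
        self.memory = memory   # committed state: address -> value
        self.stores = {}       # buffered speculative stores: address -> value
        self.loaded = set()    # addresses this thread read from committed memory

    def store(self, addr, value):
        # Buffer the value with its address; memory is untouched until commit.
        self.stores[addr] = value

    def load(self, addr):
        # Forward from the thread's own buffered store if there is one.
        if addr in self.stores:
            return self.stores[addr]
        self.loaded.add(addr)
        return self.memory.get(addr, 0)

    def violates(self, addr):
        # An earlier thread writing an address this thread already read
        # means this thread consumed a stale value.
        return addr in self.loaded

    def commit(self):
        # The thread is no longer speculative: make buffered stores visible.
        self.memory.update(self.stores)
        self.squash()

    def squash(self):
        # Rollback: drop all speculative state; memory was never modified.
        self.stores.clear()
        self.loaded.clear()


memory = {0x10: 1}
lsq = SpeculativeLSQ(memory)
lsq.store(0x20, 7)
value = lsq.load(0x10)              # read from committed memory, recorded
squash_needed = lsq.violates(0x10)  # an earlier thread now writes 0x10
lsq.squash()                        # memory is unchanged by the squashed thread
```

Because speculative stores never reach `memory` before `commit`, rollback needs no undo log in this model; that is the property the slide refers to as making rollback possible.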

IMT: Most significant improvements

Assistance from Multiscalar compiler

Resource- and dependence-aware fetch policy

Multiplexing threads on a single hardware context

Overlapping thread startup operations with the previous thread’s execution

What does the compiler do?

Extracts threads from program (loops)

Generates thread descriptor data about registers read and written and control flow exits (for rename tables)

Annotates instructions with special codes (“forward” & “release”) for dependence checking

Fetch Policy

Hardware keeps track of resource utilization

Resource requirement prediction from past four execution instances

When dependencies exist (detected from compiler-generated data), bias towards non-speculative threads

Goal is to reduce number of thread squashes
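The resource-requirement prediction can be sketched in Python as follows. This is an assumption-laden illustration: the names `ResourcePredictor` and `can_fetch` are hypothetical, threads are identified by a start address, and the prediction is taken as the maximum over the last four instances (a worst-case estimate); the slides say only that the past four execution instances are used.

```python
from collections import defaultdict, deque

HISTORY = 4  # the slides mention the past four execution instances


class ResourcePredictor:
    """Predicts a thread's resource need from its recent executions (sketch)."""

    def __init__(self):
        # thread start address -> resource usage of the last HISTORY instances
        self.history = defaultdict(lambda: deque(maxlen=HISTORY))

    def record(self, thread_pc, entries_used):
        # Called when an instance of the thread finishes.
        self.history[thread_pc].append(entries_used)

    def predict(self, thread_pc):
        past = self.history[thread_pc]
        # Assumption: use the worst case seen; 0 when there is no history.
        return max(past) if past else 0


def can_fetch(predictor, thread_pc, free_entries):
    # Fetch a new thread only if its predicted need fits in what is free,
    # which avoids squashing it later for lack of resources.
    return predictor.predict(thread_pc) <= free_entries


predictor = ResourcePredictor()
for used in (5, 9, 6, 7, 3):
    predictor.record(0x400, used)   # the oldest sample (5) falls out
estimate = predictor.predict(0x400)
```

A bounded `deque` keeps the history window fixed at four entries, so an unusually large instance stops influencing the estimate after four further executions.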

Multiplexing threads on a single hardware context

Observations: threads are usually short; the number of contexts is small (2-8)

Hence frequent switching, less overlap

Multiplexing (contd.)

Larger threads can lead to:

Speculation buffer overflow

Increased dependence mis-speculation

Hence thread squashing

Each execution context can further support multiple threads (3-6)

Multiplexing: Required Hardware

Per context, per thread: program counter and register rename table

LSQ shared among threads running on one execution context

Multiplexing: Implementation Issues

LSQ shared but it needs to maintain loads and stores for each thread separately

Therefore, create “gaps” for yet-to-be-fetched instructions / data

If space falls short, squash subsequent thread

What if threads from one program are mapped to different contexts?

IMT searches through other contexts

A separate LSQ per thread in each context would be easier, but is costly in hardware and power consumption
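The “gaps” idea for a shared LSQ can be sketched in Python as below. The class `SharedLSQ` and its methods are hypothetical, and the policy of squashing the youngest successor first is an assumption; the slides state only that a subsequent thread is squashed when space falls short.

```python
class SharedLSQ:
    """One LSQ shared by the threads multiplexed on a context (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.regions = []   # program order: [thread_id, reserved, used]

    def free(self):
        return self.capacity - sum(region[1] for region in self.regions)

    def reserve(self, thread_id, entries):
        # Leave a gap of `entries` slots for instructions not yet fetched.
        if entries > self.free():
            return False
        self.regions.append([thread_id, entries, 0])
        return True

    def insert(self, thread_id):
        """Place one load/store; return the ids of any threads squashed."""
        idx = next(i for i, r in enumerate(self.regions) if r[0] == thread_id)
        region = self.regions[idx]
        squashed = []
        if region[2] == region[1]:          # the thread's gap is exhausted
            while self.free() == 0 and len(self.regions) > idx + 1:
                # Assumption: reclaim space from the youngest successor.
                squashed.append(self.regions.pop()[0])
            if self.free() == 0:
                raise RuntimeError("LSQ full and no successor to squash")
            region[1] += 1
        region[2] += 1
        return squashed


lsq = SharedLSQ(capacity=4)
lsq.reserve(0, 2)
lsq.reserve(1, 2)
lsq.insert(0)
lsq.insert(0)
squashed = lsq.insert(0)   # thread 0 outgrows its gap; thread 1 gives way
```

Keeping the regions in program order means the earlier thread always wins a conflict, which matches the intent of not delaying less speculative work.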

Register renaming

Required because multiple threads may use the same registers

Separate rename tables

Master Rename Table (global)

Local Rename Table (per thread)

Pre-assign table (per thread)

Register renaming: Flow

Thread invocation:

Copy from Master table into Local table (to reflect current status)

Also use the “create” and “use” masks of the thread descriptor (for dependence checks)

Before every subsequent thread invocation:

Pre-assign rename maps into the Pre-assign table

Copy from the Pre-assign table to the Master table and mark the registers as “busy”, so no successor thread can use them before the current thread writes to them.
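The renaming flow above can be sketched in Python. All names here (`RenameState`, `invoke`, `write`, `ready`) are hypothetical, the create mask is modelled as a set of architectural register numbers, and physical registers come from a simple free list; the actual hardware organization is not specified in the slides.

```python
class RenameState:
    """Master, local, and pre-assign rename tables (illustrative sketch)."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Master table: architectural register -> physical register.
        self.master = {r: r for r in range(num_arch_regs)}
        self.free = list(range(num_arch_regs, num_phys_regs))
        self.busy = set()   # physical registers whose value is not yet written

    def invoke(self, create_mask):
        # Thread invocation: the local table is a snapshot of the master table.
        local = dict(self.master)
        # Before the next invocation: pre-assign a physical register to every
        # register this thread will write, publish the mapping in the master
        # table, and mark it busy so successors wait for the value.
        pre_assign = {r: self.free.pop(0) for r in sorted(create_mask)}
        self.master.update(pre_assign)
        self.busy.update(pre_assign.values())
        return local, pre_assign

    def write(self, pre_assign, arch_reg):
        # The producing thread writes the register: its value is now ready.
        self.busy.discard(pre_assign[arch_reg])

    def ready(self, local, arch_reg):
        return local[arch_reg] not in self.busy


state = RenameState(num_arch_regs=4, num_phys_regs=8)
t0_local, t0_pre = state.invoke(create_mask={1})   # thread 0 will write r1
t1_local, t1_pre = state.invoke(create_mask={2})   # thread 1 may read r1
before = state.ready(t1_local, 1)                  # r1 still busy
state.write(t0_pre, 1)
after = state.ready(t1_local, 1)                   # r1 produced by thread 0
```

In this model thread 1's local table already points at the physical register thread 0 will fill, so the two threads synchronize on the busy bit rather than on a copy of the value.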

Hiding thread startup delay

Rename tables must be set up before execution begins

Setup occupies table bandwidth, hence it cannot be done for a number of threads in parallel

Hence overlap setting up of rename tables with the previous thread’s execution
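The benefit of overlapping startup with execution can be seen in a small timing model. This is a sketch under simplifying assumptions that are not from the paper: threads run back to back on one context, one rename-table setup can be in progress at a time, and the cycle counts are made-up illustrative values.

```python
def serial_cycles(n_threads, setup, execute):
    # No overlap: every thread pays setup and then executes.
    return n_threads * (setup + execute)


def overlapped_cycles(n_threads, setup, execute):
    setup_done = 0   # when the rename-table setup logic becomes free
    exec_start = 0   # when the previous thread started executing
    exec_done = 0    # when the previous thread finished executing
    for _ in range(n_threads):
        # Setup for a thread begins as soon as its predecessor starts
        # executing and the setup logic is free.
        setup_start = max(setup_done, exec_start)
        setup_done = setup_start + setup
        # Execution needs both the tables and the context to be available.
        exec_start = max(setup_done, exec_done)
        exec_done = exec_start + execute
    return exec_done


baseline = serial_cycles(3, setup=4, execute=10)
overlapped = overlapped_cycles(3, setup=4, execute=10)
```

When setup is shorter than execution, only the first thread's setup remains on the critical path; when setup is longer, it becomes the bottleneck and overlap helps less.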

Load/Store Queue

Per context

Speculative load / store: Search through current and other contexts for dependence

No searching for non-speculative loads

Searching can take time, so load-dependent instructions are scheduled accordingly

Key Results

Average improvement: 24%

Reduction in data dependence stalls

Little overhead of optimizations

Not all benchmark programs benefit

• Assuming 2-3 threads per context, 6-8 LSQ entries per thread.

• Performance relative to IMT with unlimited resources

• ICOUNT: Favor the thread with the fewest instructions in the pipeline still waiting to be executed

• Biased-ICOUNT: Favor non-speculative threads

• Worst-case resource estimation

• Reduced thread squashing
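The difference between the two fetch policies can be sketched in Python. The `Thread` record, the function names, and the `bias` value of 8 are hypothetical; the slides say only that the biased policy favors non-speculative threads, not how the bias is applied.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    in_flight: int      # instructions fetched but not yet executed
    speculative: bool


def icount_pick(threads):
    # Plain ICOUNT: fetch for the thread with the fewest in-flight instructions.
    return min(threads, key=lambda t: t.in_flight)


def biased_icount_pick(threads, bias=8):
    # Assumption: the bias is modelled as a fixed head start subtracted from
    # the count of non-speculative threads.
    return min(
        threads,
        key=lambda t: t.in_flight - (0 if t.speculative else bias),
    )


threads = [
    Thread("non-speculative", in_flight=12, speculative=False),
    Thread("speculative", in_flight=6, speculative=True),
]
plain_choice = icount_pick(threads).name
biased_choice = biased_icount_pick(threads).name
```

With these numbers plain ICOUNT fetches for the speculative thread, while the biased variant keeps feeding the non-speculative one, whose progress is what later threads depend on.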

• TME: Executes both paths of an unpredictable branch (but such branches uncommon)

• DMT:

– Hardware selection of threads, so it spawns threads on a backward branch or function call instead of loops.

– Also spawns threads out of order, so branch prediction accuracy is lower.

Critique

Compiler Support

Improvement in applications compiled using Multiscalar compiler

Scientific computing applications, not for desktop applications

LSQ Limitations

LSQ size decides the size of a speculative thread

Pentium 4 (without SMT): 48 loads, 24 stores

Pentium 4 HT: 24 loads, 12 stores per thread

IBM POWER5: 32 loads, 32 stores per thread

LSQ Limitations: Alternative

Cache-based approach

i.e. Partition the cache to support different versions

Extra support required, but scalable

Register file size

IMT considers register file sizes of 128 and up.

Pentium 4 (as well as HT): register file size = 128

IBM POWER5: register file size = 80

Searching LSQ

Since loads and stores are organized per thread, a search involves all entries of other threads.

If loads and stores were organized by address, fewer entries would need to be searched.

Can make use of associativity of cache
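The two organizations can be compared with a small Python sketch that counts how many entries each search examines. The function names are hypothetical, and a dictionary keyed by address stands in for the set-indexed lookup a cache would provide; it is an analogy, not a hardware model.

```python
from collections import defaultdict


def search_per_thread(queues, addr):
    """queues: {thread_id: [addresses]}. Scans every entry of every thread."""
    hits, examined = [], 0
    for thread_id, entries in queues.items():
        for entry_addr in entries:
            examined += 1
            if entry_addr == addr:
                hits.append(thread_id)
    return hits, examined


def build_address_index(queues):
    # Organize the same entries by address instead of by thread.
    index = defaultdict(list)
    for thread_id, entries in queues.items():
        for entry_addr in entries:
            index[entry_addr].append(thread_id)
    return index


def search_by_address(index, addr):
    # Only the entries that share the address are examined.
    hits = index.get(addr, [])
    return list(hits), len(hits)


queues = {0: [0x10, 0x20, 0x30], 1: [0x40, 0x20], 2: [0x50]}
per_thread_hits, per_thread_cost = search_per_thread(queues, 0x20)
index = build_address_index(queues)
indexed_hits, indexed_cost = search_by_address(index, 0x20)
```

Both searches find the same threads, but the per-thread scan touches every buffered entry while the address-organized lookup touches only the matching ones.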

Searching LSQ (contd.)

So how is performance still high?

Assistance from the compiler

Resource and dependency-aware fetching

Multiple threads on an execution context

Overlapping rename table creation with execution

Term project

“Cache-based throughput improvement techniques for Speculative SMT processors”

Optimizations from IMT

Increasing granularity to reduce number of thread squashes

Thank you