Multiscalar processors
Gurindar S. Sohi, Scott E. Breach, T.N. Vijaykumar
University of Wisconsin-Madison
Outline
Motivation
Multiscalar paradigm
Multiscalar architecture
Software and hardware support
Distribution of cycles
Results
Conclusion
Motivation
Current architectural techniques are reaching their limits
The amount of ILP that a superscalar processor can extract is limited (Kunle Olukotun, Stanford University)
Limits of ILP
The parallelism that can be extracted from a single program is very limited: about 4 or 5 in integer programs
Limits of Instruction-Level Parallelism, David W. Wall (1990)
Limitations of superscalar
Branch prediction accuracy limits ILP
Roughly every fifth instruction is a branch
Executing an instruction across 5 branches yields a useful result only about 60% of the time with 90% branch prediction accuracy, since 0.9^5 ≈ 0.59 (see the sketch below)
Some branches are hard to predict, so increasing the window size does not always mean executing useful instructions
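A quick check of the 60% figure on this slide (a minimal arithmetic sketch; the 90% accuracy and the one-branch-in-five ratio are the slide's assumptions):

```c
#include <stdio.h>

int main(void) {
    /* With 90% prediction accuracy, an instruction fetched across k
       branches is on the correct path only if all k predictions hit. */
    double p = 1.0;
    for (int k = 1; k <= 5; k++) {
        p *= 0.9;
        printf("across %d branches: %.1f%% useful\n", k, p * 100.0);
    }
    return 0;   /* across 5 branches: 59.0%, the ~60% on the slide */
}
```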
Limitations of superscalar (contd.)
Large window size
Issuing more instructions per cycle requires a large window of instructions
Each cycle, the whole window must be searched for instructions to issue
This increases the pipeline length
Issue complexity
To issue an instruction, dependence checks must be performed against the other issuing instructions
To issue n instructions, the issue complexity is O(n^2) (see the sketch below)
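A minimal sketch of why issue logic grows quadratically. The Inst fields and the sequential loop are illustrative stand-ins for what hardware does with parallel comparators:

```c
/* Each of the n instructions issuing this cycle must be checked
   against every earlier one in the group: n*(n-1)/2 comparisons,
   i.e. O(n^2) comparators in hardware. */
typedef struct { int dest, src1, src2; } Inst;

int group_has_dependence(const Inst *g, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (g[j].src1 == g[i].dest || g[j].src2 == g[i].dest)
                return 1;   /* g[j] reads what g[i] writes (RAW) */
    return 0;
}
```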
Limitations of superscalar (contd.)
Load and store queue limitations
Loads and stores cannot be reordered before their addresses are known
One load or store waiting for its address can block the entire processor
Superscalar limitation example
Consider the following hypothetical loop:
  Iter 1: inst 1, inst 2, ..., inst n
  Iter 2: inst 1, inst 2, ...
If the window size is less than n, a superscalar processor considers only one iteration at a time
Possible improvement: execute the iterations side by side (see the sketch below):
  Iter 1: inst 1, inst 2, ..., inst n
  Iter 2: inst 1, inst 2, ..., inst n
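A hypothetical loop with the shape above. The iterations are fully independent, yet a window smaller than one iteration never exposes that:

```c
#include <stddef.h>

void scale(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* "inst 1 .. inst n" of iteration i: no value flows from
           iteration i into iteration i+1, so two windows, one per
           iteration, could execute the iterations side by side. */
        a[i] = 2.0f * b[i] + 1.0f;
    }
}
```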
Multiscalar paradigm
Divide the program (its CFG) into multiple tasks (not necessarily parallel)
Execute the tasks on different processing elements residing on the same die, so communication cost is low
Sequential semantics are preserved by hardware and software mechanisms
Tasks are re-executed if there are any violations
Crossing the limits of superscalar
Branch prediction
Each task executes independently
Each task is still limited by branch prediction, but the number of useful instructions available is much larger than in a superscalar processor
Window size
Each processing element has its own window
The total window size on the die can be very large, while each individual window stays moderate
Crossing the limits of superscalar (contd.)
Issue complexity
Each processing element issues only a few instructions per cycle, which simplifies the issue logic
Loads and stores
Loads and stores can execute without waiting for the previous task's loads and stores
Multiscalar architecture
A possible microarchitecture (figure not included in the transcript)
Multiscalar execution
The sequencer walks over the CFG
According to hints inserted in the code, it assigns tasks to PEs
The PEs execute the tasks in parallel
Sequential semantics are maintained for register dependencies and memory dependencies
Tasks are assigned in ring order and committed in ring order (see the sketch below)
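A minimal sketch of ring-order task assignment, assuming hypothetical Task, PE, and next_task types (the real sequencer follows compiler-inserted hints and task descriptors):

```c
#define NUM_PES 4

typedef struct { int id; /* ... task state ... */ } Task;
typedef struct { Task *task; int busy; } PE;

/* Tasks enter at 'tail' and retire at 'head'; both move in the same
   direction around the ring, which keeps commit order sequential. */
void assign_tasks(PE ring[NUM_PES], int *tail, Task *(*next_task)(void)) {
    while (!ring[*tail].busy) {
        Task *t = next_task();          /* sequencer walks the CFG */
        if (!t) return;
        ring[*tail].task = t;
        ring[*tail].busy = 1;
        *tail = (*tail + 1) % NUM_PES;  /* ring order */
    }
}
```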
Register Dependencies
Register dependencies can be identified easily by the compiler
Dependencies are always synchronized
The registers that a task may write are recorded in a create mask
Reservations are created in successor tasks using the accum mask
If a reservation exists (the value has not yet arrived), an instruction reading that register waits (see the sketch below)
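A sketch of the masks as register bitmaps; the field and function names here are assumptions, not the paper's encoding:

```c
#include <stdint.h>

typedef uint32_t RegMask;          /* bit r set = register r */

/* A task's accum mask is the union of the create masks of all
   earlier, still-active tasks: those registers may still be
   written upstream, so reads of them carry a reservation. */
RegMask accum_mask(const RegMask *create_masks, int n_predecessors) {
    RegMask m = 0;
    for (int i = 0; i < n_predecessors; i++)
        m |= create_masks[i];
    return m;
}

/* A read of register r waits while a reservation is outstanding,
   i.e. r is in the accum mask but its value has not arrived yet. */
int must_wait(RegMask accum, RegMask arrived, int r) {
    return ((accum & ~arrived) >> r) & 1;
}
```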
Memory dependencies
Memory dependencies cannot be found statically
Multiscalar takes an aggressive approach: always speculate
Loads do not wait for stores in predecessor tasks
Hardware checks for violations, and a task is re-executed if it violates any memory dependency
Task commit
Speculative tasks are not allowed to modify memory
Store values are buffered in hardware
When a processing element becomes the head, it retires its values to memory
To maintain sequential semantics, tasks retire in order, which the ring arrangement of processing elements provides (see the sketch below)
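A minimal sketch of head-first, in-order retirement, with hypothetical types and a flat byte-addressed memory:

```c
typedef struct { unsigned addr; unsigned char value; } BufferedStore;

typedef struct {
    BufferedStore stores[64];
    int n_stores;
    int done;                     /* task finished executing */
} PEState;

/* Only the head PE may write architectural memory; the others keep
   their stores buffered.  Retiring in ring order preserves the
   sequential order of memory updates. */
void try_commit(PEState pes[], int n_pes, int *head, unsigned char *mem) {
    while (pes[*head].done) {
        PEState *p = &pes[*head];
        for (int i = 0; i < p->n_stores; i++)
            mem[p->stores[i].addr] = p->stores[i].value;
        p->n_stores = 0;
        p->done = 0;                    /* PE is free for a new task */
        *head = (*head + 1) % n_pes;    /* pass headship around ring */
    }
}
```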
Compiler support
Structure of the CFG
The sequencer needs information about tasks
The compiler (or an assembly-code analyzer) marks the structure of the CFG, i.e. the task boundaries
The sequencer walks through this information
Compiler support (contd.)
Communication information
The create mask is given as part of the task header
The compiler sets the forward and stop bits
A register value is forwarded when its forward bit is set
A task is done when it reaches an instruction with its stop bit set
Release information must also be given (see the sketch below)
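One plausible shape for the metadata the compiler attaches to a task; the names and layout are assumptions for illustration, not the actual Multiscalar binary format:

```c
#include <stdint.h>

/* Hypothetical task header: what the sequencer and a PE need to know
   before and while running the task. */
typedef struct {
    uint32_t create_mask;   /* registers this task may write        */
    uint32_t targets[4];    /* possible successor-task entry points */
} TaskHeader;

/* Per-instruction bits the compiler sets in the task body:
   FORWARD: this is the task's last write of the register, so its
            value can be sent to successor tasks immediately.
   STOP:    this instruction ends the task. */
enum { FORWARD_BIT = 1u << 0, STOP_BIT = 1u << 1 };
```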
Hardware support
Speculative values need to be buffered
Memory dependence violations need to be detected
When a speculative task loads a value, the address is recorded in the ARB (Address Resolution Buffer)
When a task stores to a location, the ARB is checked to see whether a later task has already loaded from that location (see the sketch below)
The speculative store values are buffered in the ARB as well
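A minimal sketch of the ARB violation check, assuming tasks are numbered by ring stage and each entry keeps a per-stage load bitmap; the real ARB is a banked hardware structure that also buffers the speculative store values:

```c
#include <stdint.h>

#define MAX_TASKS 8

typedef struct {
    unsigned addr;
    uint8_t  loaded_by;   /* bit t set: task at stage t loaded addr */
    int      valid;
} ArbEntry;

/* On a store by the task at stage 's', any load of the same address
   by a LATER task used a stale value: that task (and its successors,
   which may have consumed its results) must be squashed. */
int store_violates(const ArbEntry *arb, int n, unsigned addr, int s) {
    for (int i = 0; i < n; i++)
        if (arb[i].valid && arb[i].addr == addr &&
            (arb[i].loaded_by >> (s + 1)) != 0)
            return 1;     /* memory dependence violation detected */
    return 0;
}
```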
Cycle distribution
Best scenario: every processing element does useful work all the time; this never happens
Possible wastage:
Non-useful computation: the task is squashed later due to an incorrect value or an incorrect prediction
No computation: the PE waits for some dependency to be resolved, or waits to commit its results
Remains idle: no task is assigned
Non-useful computation
Synchronization of memory values
Squashes usually occur on global or static data values
These dependences are easy to predict
Explicit synchronization can be inserted to eliminate squashes due to these dependences
Early validation of prediction
For example, the loop exit test can be done at the beginning of the iteration (see the sketch below)
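A sketch of that transformation on a hypothetical loop: both functions compute the same thing, but the second resolves the exit branch at the top of each iteration:

```c
/* Typical codegen puts the backward-branch test at the bottom of the
   loop body, so the "is there another iteration?" prediction is only
   checked after a full iteration of work: */
void late_test(int i, int n) {
    while (i < n) {         /* compiled as: body; i++; branch if i<n */
        /* ... long loop body ... */
        i++;
    }
}

/* Scheduling the exit test at the top of each task body validates the
   sequencer's next-task prediction early; a wrongly spawned final
   iteration is squashed before wasting much work: */
void early_test(int i, int n) {
    for (;;) {
        if (i >= n) break;  /* early validation of the loop exit */
        /* ... long loop body ... */
        i++;
    }
}
```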
No computation
Intra-task dependences
These can be eliminated through a variety of hardware and software techniques
Inter-task dependences
There is scope for scheduling to reduce the wait time
Load balancing
Tasks retire in order, so a task that finishes quickly may wait a long time to become the head task
Differences with other paradigms
A major improvement over superscalar
VLIW is limited by what static optimization can achieve
Multiprocessor: very similar, but the communication cost here is much lower, which enables fine-grained thread parallelism
Methodology
A simulator that runs MIPS code
5-stage pipeline
The sequencer has a 1024-entry direct-mapped cache of task descriptors
Results
(result figures not included in the transcript)
Compress: long critical path
Eqntott and cmppt: parallel loops with good coverage
Espresso: one loop has a load-balancing issue
Sc: also has load imbalance
Tomcatv: good parallel loops
Cmp and wc: intra-task dependences
Conclusion
The multiscalar paradigm has very good potential
It tackles the major limits of superscalar
There is a lot of scope for compiler and hardware optimizations
The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities
Discussion
BREAK!