Multiscalar processors
Gurindar S. Sohi, Scott E. Breach, T.N. Vijaykumar
University of Wisconsin-Madison
Outline
Motivation
Multiscalar paradigm
Multiscalar architecture
Software and hardware support
Distribution of cycles
Results
Conclusion
Motivation
Current architectural techniques are reaching their limits
The amount of ILP that a superscalar processor can extract is limited (Kunle Olukotun, Stanford University)
Limits of ILP
The parallelism that can be extracted from a single program is very limited: about 4 or 5 in integer programs
Limits of Instruction-Level Parallelism, David W. Wall (1990)
Limitations of superscalar
Branch prediction accuracy limits ILP
Roughly every fifth instruction is a branch
Executing an instruction across 5 branches yields a useful result only about 60% of the time with 90% branch prediction accuracy, since 0.9^5 ≈ 0.59 (see the sketch below)
Some branches are hard to predict, so increasing the window size does not always mean executing useful instructions
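A quick check of the 60% figure on this slide (a minimal arithmetic sketch; the 90% accuracy and the one-branch-in-five ratio are the slide's assumptions):

```c
#include <stdio.h>

int main(void) {
    /* With 90% prediction accuracy, an instruction fetched across k
       branches is on the correct path only if all k predictions hit. */
    double p = 1.0;
    for (int k = 1; k <= 5; k++) {
        p *= 0.9;
        printf("across %d branches: %.1f%% useful\n", k, p * 100.0);
    }
    return 0;   /* across 5 branches: 59.0%, the ~60% on the slide */
}
```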
Limitations of superscalar (contd.)
Large window size
Issuing more instructions per cycle requires a large window of instructions
Each cycle, the whole window must be searched for instructions to issue
This increases the pipeline length
Issue complexity
To issue an instruction, dependence checks must be performed against the other issuing instructions
To issue n instructions, the issue complexity is O(n^2) (see the sketch below)
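A minimal sketch of why issue logic grows quadratically. The Inst fields and the sequential loop are illustrative stand-ins for what hardware does with parallel comparators:

```c
/* Each of the n instructions issuing this cycle must be checked
   against every earlier one in the group: n*(n-1)/2 comparisons,
   i.e. O(n^2) comparators in hardware. */
typedef struct { int dest, src1, src2; } Inst;

int group_has_dependence(const Inst *g, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (g[j].src1 == g[i].dest || g[j].src2 == g[i].dest)
                return 1;   /* g[j] reads what g[i] writes (RAW) */
    return 0;
}
```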
Limitations of superscalar (contd.)
Load and store queue limitations
Loads and stores cannot be reordered before their addresses are known
One load or store waiting for its address can block the entire processor
Superscalar limitation example
Consider the following hypothetical loop:
  Iter 1: inst 1, inst 2, ..., inst n
  Iter 2: inst 1, inst 2, ...
If the window size is less than n, a superscalar processor considers only one iteration at a time
Possible improvement: execute the iterations side by side (see the sketch below):
  Iter 1: inst 1, inst 2, ..., inst n
  Iter 2: inst 1, inst 2, ..., inst n
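A hypothetical loop with the shape above. The iterations are fully independent, yet a window smaller than one iteration never exposes that:

```c
#include <stddef.h>

void scale(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* "inst 1 .. inst n" of iteration i: no value flows from
           iteration i into iteration i+1, so two windows, one per
           iteration, could execute the iterations side by side. */
        a[i] = 2.0f * b[i] + 1.0f;
    }
}
```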
Multiscalar paradigm
Divide the program (its CFG) into multiple tasks (not necessarily parallel)
Execute the tasks on different processing elements residing on the same die, so communication cost is low
Sequential semantics are preserved by hardware and software mechanisms
Tasks are re-executed if there are any violations
Crossing the limits of superscalar
Branch prediction
Each task executes independently
Each task is still limited by branch prediction, but the number of useful instructions available is much larger than in a superscalar processor
Window size
Each processing element has its own window
The total window size on the die can be very large, while each individual window stays moderate
Crossing the limits of superscalar (contd.)
Issue complexity
Each processing element issues only a few instructions per cycle, which simplifies the issue logic
Loads and stores
Loads and stores can execute without waiting for the previous task's loads and stores
Multiscalar architecture
A possible microarchitecture (figure not included in the transcript)
Multiscalar execution
The sequencer walks over the CFG
According to hints inserted in the code, it assigns tasks to PEs
The PEs execute the tasks in parallel
Sequential semantics are maintained for register dependencies and memory dependencies
Tasks are assigned in ring order and committed in ring order (see the sketch below)
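A minimal sketch of ring-order task assignment, assuming hypothetical Task, PE, and next_task types (the real sequencer follows compiler-inserted hints and task descriptors):

```c
#define NUM_PES 4

typedef struct { int id; /* ... task state ... */ } Task;
typedef struct { Task *task; int busy; } PE;

/* Tasks enter at 'tail' and retire at 'head'; both move in the same
   direction around the ring, which keeps commit order sequential. */
void assign_tasks(PE ring[NUM_PES], int *tail, Task *(*next_task)(void)) {
    while (!ring[*tail].busy) {
        Task *t = next_task();          /* sequencer walks the CFG */
        if (!t) return;
        ring[*tail].task = t;
        ring[*tail].busy = 1;
        *tail = (*tail + 1) % NUM_PES;  /* ring order */
    }
}
```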
Register Dependencies
Register dependencies can be identified easily by the compiler
Dependencies are always synchronized
The registers that a task may write are recorded in a create mask
Reservations are created in successor tasks using the accum mask
If a reservation exists (the value has not yet arrived), an instruction reading that register waits (see the sketch below)
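A sketch of the masks as register bitmaps; the field and function names here are assumptions, not the paper's encoding:

```c
#include <stdint.h>

typedef uint32_t RegMask;          /* bit r set = register r */

/* A task's accum mask is the union of the create masks of all
   earlier, still-active tasks: those registers may still be
   written upstream, so reads of them carry a reservation. */
RegMask accum_mask(const RegMask *create_masks, int n_predecessors) {
    RegMask m = 0;
    for (int i = 0; i < n_predecessors; i++)
        m |= create_masks[i];
    return m;
}

/* A read of register r waits while a reservation is outstanding,
   i.e. r is in the accum mask but its value has not arrived yet. */
int must_wait(RegMask accum, RegMask arrived, int r) {
    return ((accum & ~arrived) >> r) & 1;
}
```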
Memory dependencies
Memory dependencies cannot be found statically
Multiscalar takes an aggressive approach: always speculate
Loads do not wait for stores in predecessor tasks
Hardware checks for violations, and a task is re-executed if it violates any memory dependency
Task commit
Speculative tasks are not allowed to modify memory
Store values are buffered in hardware
When a processing element becomes the head, it retires its values to memory
To maintain sequential semantics, tasks retire in order, which the ring arrangement of processing elements provides (see the sketch below)
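A minimal sketch of head-first, in-order retirement, with hypothetical types and a flat byte-addressed memory:

```c
typedef struct { unsigned addr; unsigned char value; } BufferedStore;

typedef struct {
    BufferedStore stores[64];
    int n_stores;
    int done;                     /* task finished executing */
} PEState;

/* Only the head PE may write architectural memory; the others keep
   their stores buffered.  Retiring in ring order preserves the
   sequential order of memory updates. */
void try_commit(PEState pes[], int n_pes, int *head, unsigned char *mem) {
    while (pes[*head].done) {
        PEState *p = &pes[*head];
        for (int i = 0; i < p->n_stores; i++)
            mem[p->stores[i].addr] = p->stores[i].value;
        p->n_stores = 0;
        p->done = 0;                    /* PE is free for a new task */
        *head = (*head + 1) % n_pes;    /* pass headship around ring */
    }
}
```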
Compiler support
Structure of the CFG
The sequencer needs information about tasks
The compiler (or an assembly-code analyzer) marks the structure of the CFG, i.e. the task boundaries
The sequencer walks through this information
Compiler support (contd.)
Communication information
The create mask is given as part of the task header
The compiler sets the forward and stop bits
A register value is forwarded when its forward bit is set
A task is done when it reaches an instruction with its stop bit set
Release information must also be given (see the sketch below)
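One plausible shape for the metadata the compiler attaches to a task; the names and layout are assumptions for illustration, not the actual Multiscalar binary format:

```c
#include <stdint.h>

/* Hypothetical task header: what the sequencer and a PE need to know
   before and while running the task. */
typedef struct {
    uint32_t create_mask;   /* registers this task may write        */
    uint32_t targets[4];    /* possible successor-task entry points */
} TaskHeader;

/* Per-instruction bits the compiler sets in the task body:
   FORWARD: this is the task's last write of the register, so its
            value can be sent to successor tasks immediately.
   STOP:    this instruction ends the task. */
enum { FORWARD_BIT = 1u << 0, STOP_BIT = 1u << 1 };
```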
Hardware support
Speculative values need to be buffered
Memory dependence violations need to be detected
When a speculative task loads a value, the address is recorded in the ARB (Address Resolution Buffer)
When a task stores to a location, the ARB is checked to see whether a later task has already loaded from that location (see the sketch below)
The speculative store values are buffered in the ARB as well
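A minimal sketch of the ARB violation check, assuming tasks are numbered by ring stage and each entry keeps a per-stage load bitmap; the real ARB is a banked hardware structure that also buffers the speculative store values:

```c
#include <stdint.h>

#define MAX_TASKS 8

typedef struct {
    unsigned addr;
    uint8_t  loaded_by;   /* bit t set: task at stage t loaded addr */
    int      valid;
} ArbEntry;

/* On a store by the task at stage 's', any load of the same address
   by a LATER task used a stale value: that task (and its successors,
   which may have consumed its results) must be squashed. */
int store_violates(const ArbEntry *arb, int n, unsigned addr, int s) {
    for (int i = 0; i < n; i++)
        if (arb[i].valid && arb[i].addr == addr &&
            (arb[i].loaded_by >> (s + 1)) != 0)
            return 1;     /* memory dependence violation detected */
    return 0;
}
```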
Cycle distribution
Best scenario: every processing element does useful work all the time; this never happens
Possible wastage:
Non-useful computation: the task is squashed later due to an incorrect value or an incorrect prediction
No computation: the PE waits for some dependency to be resolved, or waits to commit its results
Remains idle: no task is assigned
Non-useful computation
Synchronization of memory values
Squashes usually occur on global or static data values
These dependences are easy to predict
Explicit synchronization can be inserted to eliminate squashes due to these dependences
Early validation of prediction
For example, the loop exit test can be done at the beginning of the iteration (see the sketch below)
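A sketch of that transformation on a hypothetical loop: both functions compute the same thing, but the second resolves the exit branch at the top of each iteration:

```c
/* Typical codegen puts the backward-branch test at the bottom of the
   loop body, so the "is there another iteration?" prediction is only
   checked after a full iteration of work: */
void late_test(int i, int n) {
    while (i < n) {         /* compiled as: body; i++; branch if i<n */
        /* ... long loop body ... */
        i++;
    }
}

/* Scheduling the exit test at the top of each task body validates the
   sequencer's next-task prediction early; a wrongly spawned final
   iteration is squashed before wasting much work: */
void early_test(int i, int n) {
    for (;;) {
        if (i >= n) break;  /* early validation of the loop exit */
        /* ... long loop body ... */
        i++;
    }
}
```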
No computation
Intra-task dependences
These can be eliminated through a variety of hardware and software techniques
Inter-task dependences
There is scope for scheduling to reduce the wait time
Load balancing
Tasks retire in order, so a task that finishes quickly may wait a long time to become the head task
Differences with other paradigms
A major improvement over superscalar
VLIW is limited by what static optimization can achieve
Multiprocessor: very similar, but the communication cost here is much lower, which enables fine-grained thread parallelism
Methodology
A simulator that runs MIPS code
5-stage pipeline
The sequencer has a 1024-entry direct-mapped cache of task descriptors
Results
(result figures not included in the transcript)
Compress: long critical path
Eqntott and cmppt: parallel loops with good coverage
Espresso: one loop has a load-balancing issue
Sc: also has load imbalance
Tomcatv: good parallel loops
Cmp and wc: intra-task dependences
Conclusion
The multiscalar paradigm has very good potential
It tackles the major limits of superscalar
There is a lot of scope for compiler and hardware optimizations
The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities
Discussion
BREAK!