Instruction Level Parallelism ILP
Advanced Computer Architecture CSE 8383
Spring 2004, 2/19/2004
Presented by: Sa’ad Al-Harbi, Saeed Abu Nimeh
Outline
  What’s ILP
  ILP vs Parallel Processing
  Sequential execution vs ILP execution
  Limitations of ILP
  ILP Architectures
    Sequential Architecture
    Dependence Architecture
    Independence Architecture
  ILP Scheduling
  Open Problems
  References
What’s ILP
  An architectural technique that allows individual machine operations (add, mul, load, store, …) to overlap
  Multiple operations execute in parallel (simultaneously)
  Goal: speed up execution
  Example:
    load  R1 ← R2
    add   R3 ← R3, “1”
    add   R3 ← R3, “1”
    add   R4 ← R3, R2
    add   R4 ← R4, R2
    store [R4] ← R0
Example: Sequential vs ILP
  Sequential execution (without ILP)
    Add r1, r2 → r8    4 cycles
    Add r3, r4 → r7    4 cycles
    Total of 8 cycles
  ILP execution (overlapped execution)
    Add r1, r2 → r8
    Add r3, r4 → r7
    Total of 5 cycles
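The two totals in this example can be reproduced with a small calculation. The sketch below assumes that each add takes 4 cycles and that, with overlap, a new operation can start one cycle after the previous one; that one-cycle issue interval is an assumption chosen because it yields the 5 cycles quoted, and the function names are illustrative only.

```python
def sequential_cycles(latencies):
    """Each operation starts only after the previous one has finished."""
    return sum(latencies)


def overlapped_cycles(latencies, issue_interval=1):
    """Operation k starts at cycle k * issue_interval (assumed);
    the total is the finish time of the last operation to complete."""
    return max(k * issue_interval + lat for k, lat in enumerate(latencies))


adds = [4, 4]  # two independent adds, 4 cycles each
print(sequential_cycles(adds))  # 8
print(overlapped_cycles(adds))  # 5
```

The overlap only helps because the two adds are independent; the dependence slides that follow describe when this is not the case.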
ILP vs Parallel Processing
  ILP
    Overlaps individual machine operations (add, mul, load, …) so that they execute in parallel
    Transparent to the user
    Goal: speed up execution
  Parallel Processing
    Separate processors work on separate chunks of the program (the processors are programmed to do so)
    Not transparent to the user
    Goal: speed up and quality up
ILP Challenges
  To execute instructions in parallel, there must be no dependences among the instructions that execute at the same time
    H/W terminology: data hazards (RAW, WAR, WAW)
    S/W terminology: data dependences
Dependences and Hazards
  Dependences are a property of programs
  If two instructions are data dependent, they cannot execute simultaneously
  A dependence can result in a hazard, and a hazard can cause a stall; whether it does depends on the pipeline organization
  Data dependences may occur through registers or memory
Types of Dependences
  Name dependences
    Output dependence
    Anti-dependence
  Data (true) dependence
  Control dependence
  Resource dependence
Name dependences
  Output dependence
    Occurs when instructions i and j write the same register or memory location. Their order must be preserved so that the value finally left in that location is the one written by j.
      i: add r7,r4,r3
      j: div r7,r2,r8
  Anti-dependence
    Occurs when instruction j writes a register or memory location that instruction i reads. Their order must be preserved so that i reads the old value.
      i: add r6,r5,r4
      j: sub r5,r8,r11
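The two register examples above can be checked mechanically by comparing the read and write sets of an instruction pair. This is a minimal sketch, assuming the three-operand form `op dest,src1,src2` with the destination written first; `Instr`, `parse`, and `classify` are hypothetical helper names, not part of any real tool.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset
    reads: frozenset


def parse(text):
    """Parse 'op dest,src1,src2' (assumed operand order: destination first)."""
    _, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(text, frozenset(regs[:1]), frozenset(regs[1:]))


def classify(i, j):
    """Return the dependences of j on i, where i precedes j in program order."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")  # true data dependence
    if i.reads & j.writes:
        kinds.add("WAR")  # anti-dependence
    if i.writes & j.writes:
        kinds.add("WAW")  # output dependence
    return kinds


print(classify(parse("add r7,r4,r3"), parse("div r7,r2,r8")))   # {'WAW'}
print(classify(parse("add r6,r5,r4"), parse("sub r5,r8,r11")))  # {'WAR'}
```

The sketch only looks at registers; dependences through memory locations would need address information that is not visible in the instruction text.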
Data Dependences
  An instruction j is data dependent on instruction i if either of the following holds:
    instruction i produces a result that may be used by instruction j, or
    instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
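  Example:
    LOOP: LD   F0, 0(R1)      ; load an element into F0
          ADD  F4, F0, F2     ; data dependent on LD through F0
          SD   F4, 0(R1)      ; data dependent on ADD through F4
          SUB  R1, R1, 8      ; decrement the pointer by 8 bytes
          BNE  R1, R2, LOOP   ; data dependent on SUB through R1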
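The recursive definition of data dependence (direct use of a result, or a chain through an intermediate instruction k) can be traced on the loop body. The sketch below is an illustration under stated assumptions: it tracks register dependences within a single iteration only, ignores memory and loop-carried dependences, and encodes each instruction by hand as a hypothetical `(name, destination, sources)` tuple.

```python
def direct_deps(program):
    """Map each instruction index to the indices whose results it reads directly."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    deps = {}
    for idx, (_, dest, srcs) in enumerate(program):
        deps[idx] = {last_writer[r] for r in srcs if r in last_writer}
        if dest is not None:
            last_writer[dest] = idx
    return deps


def all_deps(direct):
    """Add the chained case: j depends on i if j depends on k and k depends on i."""
    closed = {}
    for j in sorted(direct):  # producers always have a smaller index than consumers
        closed[j] = set(direct[j])
        for k in direct[j]:
            closed[j] |= closed[k]
    return closed


loop_body = [
    ("LD", "F0", ["R1"]),
    ("ADD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),  # writes memory, no register destination
    ("SUB", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]

closed = all_deps(direct_deps(loop_body))
for idx, (name, _, _) in enumerate(loop_body):
    print(name, "depends on", sorted(loop_body[k][0] for k in closed[idx]))
# SD is reported as depending on both ADD (directly) and LD (through ADD)
```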
Control Dependences
  A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order.
  Example:
    if p1 { S1; };
    if p2 { S2; };
  S1 is control dependent on p1; S2 is control dependent on p2 but not on p1.
  Two constraints are imposed by control dependences:
    1. An instruction that is control dependent on a branch cannot be moved before the branch
    2. An instruction that is not control dependent on a branch cannot be moved after the branch
Resource dependences
  An instruction is resource-dependent on a previously issued instruction if it requires a hardware resource that the earlier instruction is still using, e.g.
    div r1, r2, r3
    div r4, r2, r5
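The effect of a resource dependence can be illustrated with a toy issue model. This sketch assumes in-order issue of at most one operation per cycle, non-pipelined functional units, and a made-up divide latency of 10 cycles; none of these numbers come from the slides, and `issue_times` is a hypothetical helper.

```python
def issue_times(ops, unit_count, latency):
    """ops: required unit type per operation, in program order.
    Returns the cycle at which each operation starts on a free unit."""
    free_at = {unit: [0] * n for unit, n in unit_count.items()}
    times = []
    cycle = 0
    for unit in ops:
        slots = free_at[unit]
        k = min(range(len(slots)), key=slots.__getitem__)  # unit that frees up first
        start = max(cycle, slots[k])
        slots[k] = start + latency[unit]  # non-pipelined: busy for the whole latency
        times.append(start)
        cycle = start + 1  # in-order, one issue per cycle
    return times


two_divs = ["div", "div"]
print(issue_times(two_divs, {"div": 1}, {"div": 10}))  # [0, 10]
print(issue_times(two_divs, {"div": 2}, {"div": 10}))  # [0, 1]
```

With a single divider the second div waits even though the two operations share no data dependence; adding a second divider removes the stall.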
ILP Architectures
  Computer architecture: a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs written for the architecture and the set of processor implementations of that architecture.
  In ILP architectures, the contract additionally includes information embedded in the program about the parallelism available between its instructions and operations.
ILP Architectures Classifications
  Sequential architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors)
  Dependence architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors)
  Independence architectures: the program provides information as to which operations are independent of one another (VLIW processors)
Sequential architecture and superscalar processors
  The program contains no explicit information regarding the dependencies that exist between instructions
  Dependencies between instructions must be determined by the hardware
    It is only necessary to determine dependencies on sequentially preceding instructions that have been issued but not yet completed
  The compiler may re-order instructions to facilitate the hardware’s task of extracting parallelism
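The hardware check against issued-but-not-completed instructions can be sketched as a tiny in-order issue model. The program, the latencies, and the issue width of 2 are all hypothetical values chosen for illustration; operands are assumed to be read at issue, so only pending results (RAW and WAW) can delay an instruction.

```python
def issue_schedule(program, latency, width=2):
    """In-order issue of up to `width` instructions per cycle.
    program: list of (op, dest, sources). Returns the issue cycle of each instruction."""
    ready_at = {}  # register -> cycle at which its pending result becomes available
    schedule = []
    cycle, issued = 0, 0
    for op, dest, srcs in program:
        # Wait for pending results in the sources (RAW) and in the destination (WAW).
        earliest = max(ready_at.get(r, 0) for r in list(srcs) + [dest])
        if issued == width:  # issue slots of this cycle are used up
            cycle, issued = cycle + 1, 0
        if earliest > cycle:  # stall until the operands are ready
            cycle, issued = earliest, 0
        schedule.append(cycle)
        issued += 1
        ready_at[dest] = cycle + latency[op]
    return schedule


program = [
    ("add", "r1", ["r2", "r3"]),
    ("add", "r4", ["r5", "r6"]),
    ("mul", "r7", ["r1", "r4"]),
    ("add", "r8", ["r7", "r2"]),
]
print(issue_schedule(program, {"add": 1, "mul": 3}))  # [0, 0, 1, 4]
```

The two independent adds issue together, while the mul and the final add each wait for a result that is still in flight.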
Superscalar Processors
  Superscalar processors attempt to issue multiple instructions per cycle
  However, essential dependencies are specified only by the sequential ordering, so operations must be processed in sequential order
  This proves to be a performance bottleneck that is very expensive to overcome
Dependence architecture and data flow processors
  The compiler (or programmer) identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations
  The hardware determines at run time when each operation is independent of the others and performs the scheduling
  The hardware does not need to scan the sequential program to determine dependences
  Objective: execute each instruction at the earliest possible time (as soon as its input operands and a functional unit are available)
Dependence architectures: Dataflow processors
  Dataflow processors are representative of dependence architectures
  They execute each instruction at the earliest possible time, subject to the availability of input operands and functional units
  Dependencies are communicated by providing with each instruction a list of all its successor instructions
  As soon as all input operands of an instruction are available, the hardware fetches the instruction
  The instruction is executed as soon as a functional unit is available
  Few dataflow processors currently exist
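The firing rule described above (an instruction becomes ready once all its operands have arrived, then runs when a functional unit is free) can be sketched as a small simulation. It assumes unit latency for every instruction and interchangeable functional units; the four-node graph and the name `dataflow_run` are made up for illustration.

```python
from collections import deque


def dataflow_run(nodes, num_units):
    """nodes: name -> (number of operands produced by other instructions, successor list).
    Returns name -> cycle in which the instruction executes (unit latency assumed)."""
    missing = {name: count for name, (count, _) in nodes.items()}
    ready = deque(name for name in nodes if missing[name] == 0)
    fired = {}
    cycle = 0
    while ready:
        # At most one ready instruction per functional unit executes this cycle.
        batch = [ready.popleft() for _ in range(min(num_units, len(ready)))]
        for name in batch:
            fired[name] = cycle
        for name in batch:
            for succ in nodes[name][1]:  # deliver the result to each successor
                missing[succ] -= 1
                if missing[succ] == 0:
                    ready.append(succ)
        cycle += 1
    return fired


graph = {
    "a": (0, ["c", "d"]),
    "b": (0, ["c"]),
    "c": (2, ["d"]),
    "d": (2, []),
}
print(dataflow_run(graph, num_units=2))  # {'a': 0, 'b': 0, 'c': 1, 'd': 2}
print(dataflow_run(graph, num_units=1))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```

No program order is consulted anywhere: the successor lists alone decide when each instruction can run.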
Dataflow strengths and limitations
  Dataflow processors use control parallelism alone to fully utilize the functional units (FUs)
  A dataflow processor is more successful than others at looking far down the execution path to find control parallelism
  When successful, it is better than speculative execution:
    Every instruction that is executed is useful
    The processor does not have to deal with error conditions caused by speculative operations
Independence architecture and VLIW processors
  Because the program states which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
  The set of independent operations is far larger than the set of dependent operations, so only a subset of the independent operations is specified
  The compiler may additionally specify on which functional unit and in which cycle an operation is executed
  The hardware then needs to make no run-time decisions
VLIW processors
  Operation vs instruction
    Operation: a unit of computation (add, load, branch; the equivalent of an instruction in a sequential architecture)
    Instruction: a set of operations that are intended to be issued simultaneously
  The compiler decides which operations go into each instruction (scheduling)
  All operations that are supposed to begin at the same time are packaged into a single VLIW instruction
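The compiler's job of packaging operations into instructions can be sketched with a greedy packer. This is a simplified illustration, not a production scheduler: it assumes unit latency, identical slots that can hold any operation, and dependences that always point to earlier operations; the operation names and `pack_vliw` are hypothetical.

```python
def pack_vliw(ops, deps, width):
    """ops: operation names in program order; deps: name -> names it depends on.
    Returns a list of VLIW instructions, each a list of at most `width` operations."""
    slot_of = {}
    instructions = []
    for op in ops:
        # An operation must be placed after every operation it depends on.
        slot = max((slot_of[d] + 1 for d in deps.get(op, ())), default=0)
        while slot < len(instructions) and len(instructions[slot]) >= width:
            slot += 1  # that instruction is full, try the next one
        while len(instructions) <= slot:
            instructions.append([])
        instructions[slot].append(op)
        slot_of[op] = slot
    return instructions


ops = ["ld1", "ld2", "add", "mul", "st"]
deps = {"add": {"ld1", "ld2"}, "mul": {"ld1"}, "st": {"add"}}
print(pack_vliw(ops, deps, width=2))  # [['ld1', 'ld2'], ['add', 'mul'], ['st']]
```

In the last instruction only one of the two slots is used; in a real VLIW encoding such empty slots still occupy space, which is one source of the code-size growth mentioned under VLIW limitations.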
VLIW strengths
  The hardware is very simple: a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
  More silicon goes to the actual processing (rather than being spent on branch prediction, for example)
  It should run fast, as the only limit is the latency of the functional units themselves
  Programming a VLIW chip is very much like writing microcode
VLIW limitations
  The need for a powerful compiler
  Increased code size arising from aggressive scheduling policies
  Larger memory bandwidth and register-file bandwidth
  Limitations due to the lock-step operation
  Binary compatibility across implementations with varying numbers of functional units and latencies
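Summary: ILP Architectures
  Additional info required in the program
    Sequential architecture: none
    Dependence architecture: specification of dependences between operations
    Independence architecture: minimally, a partial list of independences; possibly also a complete specification of when and where each operation is to be executed
  Typical kind of ILP processor
    Sequential architecture: superscalar
    Dependence architecture: dataflow
    Independence architecture: VLIW
  Dependences analysis
    Sequential architecture: performed by HW
    Dependence architecture: performed by compiler
    Independence architecture: performed by compiler
  Independences analysis
    Sequential architecture: performed by HW
    Dependence architecture: performed by HW
    Independence architecture: performed by compiler
  Scheduling
    Sequential architecture: performed by HW
    Dependence architecture: performed by HW
    Independence architecture: performed by compiler
  Role of compiler
    Sequential architecture: rearranges the code to make the analysis and scheduling HW more successful
    Dependence architecture: replaces some analysis HW
    Independence architecture: replaces virtually all the analysis and scheduling HW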
ILP Scheduling
  Static scheduling boosted by parallel code optimization
    Done by the compiler
    The processor receives dependency-free and optimized code for parallel execution
    Typical for VLIWs and a few pipelined processors (e.g. MIPS)
  Dynamic scheduling without static parallel code optimization
    Done by the processor
    The code is not optimized for parallel execution; the processor detects and resolves dependencies on its own
    Early ILP processors (e.g. CDC 6600, IBM 360/91)
  Dynamic scheduling boosted by static parallel code optimization
    Done by the processor in conjunction with a parallel optimizing compiler
    The processor receives optimized code for parallel execution, but it detects and resolves dependencies on its own
    Usual practice for pipelined and superscalar processors (e.g. RS6000)
ILP Scheduling: Trace scheduling
  An optimization technique that has been widely used for VLIW, superscalar, and pipelined processors
  It selects a sequence of basic blocks as a trace and schedules the operations from the trace together
  Example:
    Instr1
    Instr2
    Branch x
    Instr3
Trace Scheduling
  Extracts more ILP
  Increases machine fetch bandwidth by storing logically consecutive blocks in physically contiguous cache locations (making it possible to fetch multiple basic blocks in one cycle)
  Trace scheduling can be implemented in hardware or software
Trace Scheduling in HW
  The hardware technique uses the large amount of information available during dynamic execution to form traces dynamically and to schedule the instructions in a trace more efficiently
  Since dependencies and memory access addresses have already been resolved during dynamic execution, the instructions in a trace can be reordered more easily and efficiently
  Example: the trace cache approach
Trace scheduling in SW
  A supplement for machines without hardware trace scheduling support
  Forms traces based on static profile data, and schedules instructions using traditional compiler scheduling and optimization techniques
  It faces some difficulties, such as code explosion and exception handling
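The trace-forming step of software trace scheduling can be sketched as following the most frequently executed successor of each basic block. This covers only trace selection from profile data; scheduling the operations within the trace and inserting compensation code are not shown. The block names, the edge counts, and `select_trace` are hypothetical.

```python
def select_trace(start, successors):
    """successors: block -> {next_block: profiled execution count}.
    Follows the most frequently taken edge until the path ends or revisits a block."""
    trace, seen = [], set()
    block = start
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        edges = successors.get(block, {})
        block = max(edges, key=edges.get) if edges else None
    return trace


profile = {
    "B1": {"B2": 90, "B3": 10},  # the branch in B1 usually falls through to B2
    "B2": {"B4": 90},
    "B3": {"B4": 10},
    "B4": {"B1": 99},  # loop back edge; the trace stops here
}
print(select_trace("B1", profile))  # ['B1', 'B2', 'B4']
```

The rarely executed block B3 is left out of the trace; code moved across the branch in B1 is what creates the need for compensation code and contributes to the code explosion noted above.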
ILP open problems
  Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions. There are two simple types of pipelining (structural and functional).
  Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling.
  Area constraints: resource-constrained algorithms could benefit from better interaction between scheduling and floorplanning.
  Realism:
    Scheduling realistic design descriptions that contain several special language constructs
    Using more realistic libraries and cost functions
    Scheduling algorithms must also be expanded to incorporate different target architectures
References
  Instruction-Level Parallel Processing: History, Overview and Perspective. B. Ramakrishna Rau, Joseph A. Fisher. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
  Limits of Control Flow on Parallelism. Monica S. Lam, Robert P. Wilson. 19th ISCA, May 1992, pages 19-21.
  Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Joseph A. Fisher. Technical Report, HP Labs HPL-93-43, Jun. 1993.
  VLIW at IBM Research. http://www.research.ibm.com/vliw
  Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. Dick Pountain. http://www.byte.com/art/9604/sec8/art3.htm
  Hardware and Software Trace Scheduling. http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
  ILP open problems. http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html
  Computer Architecture: A Quantitative Approach. Hennessy & Patterson. 3rd edition, Morgan Kaufmann.