chapter 3 & appendix c - cs.sjtu.edu.cnshen-yy/cs359/slides/4_pipelining_d.pdf · pipelining...

18
1 CS359: Computer Architecture Chapter 3 & Appendix C Part B: ILP and Its Exploitation Yanyan Shen Department of Computer Science and Engineering Shanghai Jiao Tong University

Upload: others

Post on 20-Oct-2019

16 views

Category:

Documents


0 download

TRANSCRIPT

1

CS359: Computer Architecture

Chapter 3 & Appendix CPart B: ILP and Its Exploitation

Yanyan ShenDepartment of Computer Science and EngineeringShanghai Jiao Tong University

2

Chapter 3: ILP and Its Exploitation

Outline

3.1 Concepts and Challenges 3.2 Basic Compiler Techniques 3.3 Reducing Branch Costs with Advanced

Branch Prediction 3.4 Overcoming Data Hazards with Dynamic

Scheduling 3.5 Hardware-based Speculation

3

Chapter 3: ILP and Its Exploitation

Background

“Instruction Level Parallelism (ILP)” refers to overlap execution of instructions Pipelining become universal technique in 1985

There are two main approaches to exploit ILPHardware-based dynamic approaches

Used in server and desktop processors Compiler-based static approaches

Not as successful outside of scientific applications

The goal of this chapter: 1) to look at the limitation imposed by data and control hazards; 2) Investigate the topic of increasing the ability of the compiler

(software) and the processor (hardware) to exploit ILP

4

Chapter 3: ILP and Its Exploitation

Pipeline CPI

When exploiting instruction-level parallelism, the goal is to minimize CPI

Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data hazard stalls + Control Stalls

Ideal pipeline CPI is a measure of the maximum performance attainable by the implementation

By reducing each of the terms in the right-hand side, we decrease the overall pipeline CPI or improve IPC (instructions per clock)

5

Chapter 3: ILP and Its Exploitation

Major Techniques for Improving Pipeline CPI

6

Chapter 3: ILP and Its Exploitation

What is Instruction-Level Parallelism

Basic block: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit

Parallelism with a basic block is limited The average dynamic branch frequency is often between

15% and 25% Thus, typical size of basic block = 3-6 instructions

Conclusion: Must optimize across branches

7

Chapter 3: ILP and Its Exploitation

Dependences and Hazards

Dependences:

Determining how one instruction depends on another is critical to determining 1) how much parallelism exists in a program and 2) how that parallelism can be exploited!

Hazards:

When there are hazards, the pipeline may be stopped in order to resolve the hazards

8

Chapter 3: ILP and Its Exploitation

Three Types of Dependences

Data dependences (also called true data dependences) Data flow between instructions

Name dependences (anti-dependence and output dependence)

Control dependence

9

Chapter 3: ILP and Its Exploitation

Data Dependences

Dependencies are a property of programs

Pipeline organization determines if a dependenceresults in an actual hazard being detected and if the hazard causes a stall

Data dependence conveys: Possibility of a hazard Order in which results must be calculatedUpper bound on exploitable instruction level parallelism

10

Chapter 3: ILP and Its Exploitation

Name Dependences

Two instructions use the same register or memory location but no flow of informationNot a true data dependence, but is a problem when

reordering instructionsAntidependence: instruction j writes a register or

memory location that instruction i reads Initial ordering (i before j) must be preserved

Output dependence: instruction i and instruction jwrite the same register or memory location Ordering must be preserved

To resolve, use renaming techniques

11

Chapter 3: ILP and Its Exploitation

Control Dependence

Ordering of instruction i with respect to a branch instruction

Instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controller by the branch

An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

12

Chapter 3: ILP and Its Exploitation

Data Hazards

• j tries to read a source before i writes it• Program order must be preserved!

Read after write (RAW)

• j tries to write an operand before it is written by i

Write after write (WAW)

• j tries to write a destination before it is read by i, so iincorrectly gets the new value

Write after read (WAR)

13

Chapter 3: ILP and Its Exploitation

Outline

3.1 Concepts and Challenges 3.2 Basic Compiler Techniques 3.3 Reducing Branch Costs with Advanced

Branch Prediction 3.4 Overcoming Data Hazards with Dynamic

Scheduling 3.5 Hardware-based Speculation

14

Chapter 3: ILP and Its Exploitation

Latencies of FP operations

A compiler’s ability to perform pipeline scheduling is dependent both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline

To avoid a pipeline stall, the execution of a dependent instruction must be separated from

the source instruction by a distance in clock cycles equal to the pipeline latency of that

source instruction

15

Chapter 3: ILP and Its Exploitation

Pipeline Stalls

Loop: L.D F0,0(R1)stallADD.D F4,F0,F2stallstallS.D F4,0(R1)DADDUI R1,R1,#-8stall (assume integer load latency is 1)BNE R1,R2,Loop

Example:for (i=999; i>=0; i=i-1)x[i] = x[i] + s;

16

Chapter 3: ILP and Its Exploitation

Pipeline Scheduling

Scheduled code:

Loop:L.D F0,0(R1)DADDUI R1,R1,#-8

ADD.D F4,F0,F2stallstallS.D F4,8(R1)BNE R1,R2,Loop

17

Chapter 3: ILP and Its Exploitation

Loop Unrolling

Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1) ;drop DADDUI & BNEL.D F6,-8(R1)ADD.D F8,F6,F2S.D F8,-8(R1) ;drop DADDUI & BNEL.D F10,-16(R1)ADD.D F12,F10,F2S.D F12,-16(R1) ;drop DADDUI & BNEL.D F14,-24(R1)ADD.D F16,F14,F2S.D F16,-24(R1)DADDUI R1,R1,#-32BNE R1,R2,Loop

Unroll by a factor of 4 (assume # elements is divisible by 4)Eliminate unnecessary instructions

18

Chapter 3: ILP and Its Exploitation

Loop Unrolling/Pipeline Scheduling

Loop: L.D F0,0(R1)L.D F6,-8(R1)L.D F10,-16(R1)L.D F14,-24(R1)ADD.D F4,F0,F2ADD.D F8,F6,F2ADD.D F12,F10,F2ADD.D F16,F14,F2S.D F4,0(R1)S.D F8,-8(R1)DADDUI R1,R1,#-32S.D F12,16(R1)S.D F16,8(R1)BNE R1,R2,Loop