csc 252: computer organization spring 2020: lecture 13 · 2020-02-28 · carnegie mellon 3...

CSC 252: Computer Organization, Spring 2020: Lecture 13

Substitute Instructor: Sandhya Dwarkadas

Department of Computer Science, University of Rochester

Carnegie Mellon

Announcement
• Programming assignment 3 is due soon
• Details: https://www.cs.rochester.edu/courses/252/spring2020/labs/assignment3.html
• Due on Feb. 28, 11:59 PM
• You (may still) have 3 slip days

Announcement
• Programming assignment 3 is in x86 assembly language. Seek help from the TAs.
• TAs are best positioned to answer your questions about programming assignments!
• Programming assignments do NOT repeat the lecture materials. They ask you to synthesize what you have learned from the lectures and work out something new.

CS:APP3e, 2/28/20

Building Blocks

Combinational Logic
■ Compute Boolean functions of inputs
■ Continuously respond to input changes
■ Operate on data and implement control

Storage Elements
■ Store bits
■ Addressable memories
■ Non-addressable registers
■ Loaded only as clock rises

[Figure: a register file with read ports A and B (srcA, valA, srcB, valB), write port W (dstW, valW), and a Clock input; an ALU with inputs A and B and a fun select; a MUX with inputs 0 and 1; an equality (=) block.]

Microarchitecture (So far): Single Cycle

[Figure: a combinational logic block connected to the clocked state elements: register file, flags (Z, S, O), PC, and memory. Signals into the logic: the instruction, current register values, current flag values, current PC. Signals out of the logic: read/write register IDs, new register values, new flag values, new PC, and the enable signals. Inside the block: the ALU, logic for generating the ALU select signal, logic for generating the new PC value, logic for generating the new flag values, and logic for deciding all the enable signal values.]

Read current_states;
next_states = f(current_states);
When clock rises, current_states = next_states;
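The three-line recipe above can be sketched in Python. This is only an illustration: the `State` fields and the body of `next_state` are hypothetical stand-ins for the real register file, flags, PC, and combinational logic.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    # Hypothetical, tiny architectural state: a PC and one register.
    pc: int
    rbx: int

def next_state(cur: State) -> State:
    """Stand-in for the combinational logic f: a pure function of the current state."""
    return replace(cur, pc=cur.pc + 2, rbx=cur.rbx + 1)

def run(cur: State, cycles: int) -> State:
    for _ in range(cycles):
        nxt = next_state(cur)  # logic settles during the cycle
        cur = nxt              # rising clock edge: all state is replaced at once
    return cur

final = run(State(pc=0, rbx=0), cycles=3)
print(final)
```

Because `next_state` never modifies `cur` in place, the old state stays stable for the whole cycle, which mirrors how the clocked storage elements hold their values until the clock rises.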

Executing a MOV instruction

• How do we modify the hardware to execute a move instruction?

rmmovq rA, D(rB)      encoding: 4 0 rA rB D
move rA to the memory address rB + D

Example: rmmovq %rsi,0x41c(%rsp)      bytes: 40 64 1c 04 00 00 00 00 00 00

[Figure: the single-cycle datapath so far: ALU (Select), register file (Write Reg. ID, Read Reg. ID 1, Read Reg. ID 2, RA, RB, newData, Enable), flags (Z, S, O; EnableF), PC (oPC, nPC), Memory, and control blocks Logic 1 through Logic 5 driven by instruction bits s0, s1, s2, ...]
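The example encoding on this slide can be checked with a small decoder. This is a sketch, not part of the lecture: the function name and the register-name table are illustrative, and the displacement is treated as an unsigned little-endian 8-byte value for simplicity.

```python
# Register IDs 0-7; consistent with the slide's example (6 = %rsi, 4 = %rsp).
REG_NAMES = ["%rax", "%rcx", "%rdx", "%rbx", "%rsp", "%rbp", "%rsi", "%rdi"]

def decode_rmmovq(code: bytes):
    """Split a 10-byte rmmovq encoding into (rA, rB, D)."""
    icode, ifun = code[0] >> 4, code[0] & 0xF
    if (icode, ifun) != (4, 0):
        raise ValueError("not an rmmovq instruction")
    ra, rb = code[1] >> 4, code[1] & 0xF
    disp = int.from_bytes(code[2:10], "little")  # D is stored least significant byte first
    return ra, rb, disp

ra, rb, disp = decode_rmmovq(bytes.fromhex("40641c04000000000000"))
text = f"rmmovq {REG_NAMES[ra]},{hex(disp)}({REG_NAMES[rb]})"
print(text)
```

The byte order matters: the displacement 0x41c appears in the instruction as 1c 04 followed by six zero bytes.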

• Need new logic (Logic 6) to select the input to the ALU for Enable.
• How about the other logic blocks?

rmmovq rA, D(rB)      encoding: 4 0 rA rB D
move rA to the memory address rB + D

[Figure: the datapath extended for rmmovq (instruction bytes 40 64 1c …): a MUX fed by instruction bits [s4…s19], a new Logic 6 block, and memory ports Address, Data to write, and Enable.]


How About Memory to Register MOV?

mrmovq D(rB), rA      encoding: 5 0 rA rB D
move data at memory address rB + D to rA

[Figure: the rmmovq datapath extended for mrmovq: a second MUX, a Data read back path from memory, and a new Logic 7 block, alongside the existing Address, Data to write, Enable, and Logic 6.]

Microarchitecture (with MOV)

[Figure: the combinational logic block connected to the register file, flags (Z, S, O), PC, and memory as before, now also with Data, New Data, and Addr. connections to memory.]

Read current_states;
next_states = f(current_states);
When clock rises, current_states = next_states;

next_states has to be ready before the clock rises again

Microarchitecture Overview

Think of it as a state machine. Every cycle, one instruction gets executed. At the end of the cycle, the architectural states get modified.

States (all updated as clock rises)
■ PC register
■ Cond. Code register
■ Data memory
■ Register file

[Figure: combinational logic reading from and writing to the state elements through read ports and write ports: data memory, register file (%rbx = 0x100), PC (0x014), and CC (100).]

Cycle 1: 0x000: irmovq $0x100,%rbx    # %rbx <-- 0x100
Cycle 2: 0x00a: irmovq $0x200,%rdx    # %rdx <-- 0x200
Cycle 3: 0x014: addq %rdx,%rbx        # %rbx <-- 0x300 CC <-- 000
Cycle 4: 0x016: je dest               # Not taken
Cycle 5: 0x01f: rmmovq %rbx,0(%rdx)   # M[0x200] <-- 0x300

[Figure: clock waveform over cycles 1 to 4 with four marked points ① ② ③ ④, next to the state-machine diagram. State shown: %rbx = 0x100, PC = 0x014, CC = 100.]

• state set according to second irmovq instruction
• combinational logic starting to react to state changes

• state set according to second irmovq instruction
• combinational logic generates results for addq instruction

[Figure: same state (%rbx = 0x100, PC = 0x014, CC = 100); the write ports now carry the pending values 0x016 for the PC, 000 for CC, and %rbx <-- 0x300.]

• state set according to addq instruction
• combinational logic starting to react to state changes

[Figure: state after the clock rises: %rbx = 0x300, PC = 0x016, CC = 000.]

• state set according to addq instruction
• combinational logic generates results for je instruction

[Figure: same state (%rbx = 0x300, PC = 0x016, CC = 000); the write port for the PC now carries the pending value 0x01f.]

Another Way to Look At the Microarchitecture

Principles:
• Execute each instruction one at a time, one after another
• Express every instruction as a series of simple steps
• Dedicated hardware structure for completing each step
• Follow the same general flow for each instruction type

Fetch: Read instruction from instruction memory
Decode: Read program registers
Execute: Compute value or address
Memory: Read or write data
Write Back: Write program registers
PC: Update program counter

[Figure: the sequential datapath drawn by stage. Fetch: PC, instruction memory, and PC increment, producing icode, ifun, rA, rB, valC, and valP. Decode: register file (ports A, B, M, E) with srcA, srcB, dstA, dstB, producing valA and valB. Execute: ALU with aluA and aluB, producing valE; CC producing Cnd. Memory: data memory with Addr and Data, producing valM. Write back: valE and valM go to the register file. PC: newPC.]

Stage Computation: Arith/Log. Ops

OPq rA, rB      encoding: 6 fn rA rB

Fetch
■ icode:ifun ← M1[PC]      Read instruction byte
■ rA:rB ← M1[PC+1]         Read register byte
■ valP ← PC+2              Compute next PC
Decode
■ valA ← R[rA]             Read operand A
■ valB ← R[rB]             Read operand B
Execute
■ valE ← valB OP valA      Perform ALU operation
■ Set CC                   Set condition code register
Memory
■ (no operation)
Write back
■ R[rB] ← valE             Write back result
PC update
■ PC ← valP                Update PC
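The OPq stage computation can be walked through in Python for the addq from the earlier trace (addq %rdx,%rbx at 0x014). This is a sketch under assumptions: only ifun 0 (addq) is modeled, memory is a plain dict of bytes, and the function name and the (Z, S, O) tuple are illustrative choices.

```python
MASK = (1 << 64) - 1  # keep results to 64 bits

def step_addq(pc, regs, mem):
    """One pass through the stages for addq; returns (new PC, new registers, CC)."""
    # Fetch
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    if (icode, ifun) != (6, 0):
        raise ValueError("this sketch only models addq")
    # Decode
    valA, valB = regs[rA], regs[rB]
    # Execute
    valE = (valB + valA) & MASK
    zf = valE == 0
    sf = (valE >> 63) == 1
    of = (valA >> 63) == (valB >> 63) and (valE >> 63) != (valA >> 63)
    # Memory: nothing to do
    # Write back (returned as a new dict, applied "as the clock rises")
    new_regs = dict(regs)
    new_regs[rB] = valE
    # PC update
    return valP, new_regs, (int(zf), int(sf), int(of))

mem = {0x014: 0x60, 0x015: 0x23}   # addq %rdx,%rbx (rA = 2, rB = 3)
regs = {2: 0x200, 3: 0x100}        # %rdx = 0x200, %rbx = 0x100
new_pc, new_regs, cc = step_addq(0x014, regs, mem)
print(hex(new_pc), hex(new_regs[3]), cc)
```

The result matches cycle 3 of the trace: %rbx becomes 0x300, CC becomes 000, and the PC moves to 0x016.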

Stage Computation: rmmovq

rmmovq rA, D(rB)      encoding: 4 0 rA rB D

Fetch
■ icode:ifun ← M1[PC]      Read instruction byte
■ rA:rB ← M1[PC+1]         Read register byte
■ valC ← M8[PC+2]          Read displacement D
■ valP ← PC+10             Compute next PC
Decode
■ valA ← R[rA]             Read operand A
■ valB ← R[rB]             Read operand B
Execute
■ valE ← valB + valC       Compute effective address
Memory
■ M8[valE] ← valA          Write value to memory
Write back
■ (no operation)
PC update
■ PC ← valP                Update PC
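The same walk-through for rmmovq, using the last instruction of the earlier trace (rmmovq %rbx,0(%rdx) at 0x01f). Again a sketch under assumptions: memory is a small `bytearray`, instruction and data share it for simplicity, and the function name is illustrative.

```python
MASK = (1 << 64) - 1

def step_rmmovq(pc, regs, mem):
    """One pass through the stages for rmmovq; returns (new PC, new memory)."""
    # Fetch
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    if (icode, ifun) != (4, 0):
        raise ValueError("not an rmmovq instruction")
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valC = int.from_bytes(mem[pc + 2:pc + 10], "little")
    valP = pc + 10
    # Decode
    valA, valB = regs[rA], regs[rB]
    # Execute: effective address
    valE = (valB + valC) & MASK
    # Memory: write 8 bytes, least significant byte first
    new_mem = bytearray(mem)
    new_mem[valE:valE + 8] = valA.to_bytes(8, "little")
    # Write back: nothing to do
    # PC update
    return valP, new_mem

mem = bytearray(0x300)
mem[0x01f:0x029] = bytes.fromhex("4032" + "00" * 8)  # rmmovq %rbx,0(%rdx)
regs = {2: 0x200, 3: 0x300}                          # %rdx = 0x200, %rbx = 0x300
new_pc, new_mem = step_rmmovq(0x01f, regs, mem)
stored = int.from_bytes(new_mem[0x200:0x208], "little")
print(hex(new_pc), hex(stored))
```

This reproduces the M[0x200] <-- 0x300 update from cycle 5 of the trace.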

Stage Computation: Jumps

jXX Dest

Fetch
■ icode:ifun ← M1[PC]      Read instruction byte
■ valC ← M8[PC+1]          Read destination address
■ valP ← PC+9              Fall through address
Decode
■ (no operation)
Execute
■ Cnd ← Cond(CC,ifun)      Take branch?
Memory
■ (no operation)
Write back
■ (no operation)
PC update
■ PC ← Cnd ? valC : valP   Update PC

• Compute both addresses
• Choose based on setting of condition codes and branch condition
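A short Python sketch of the jump computation, applied to the je from the earlier trace. The `cond` table follows the standard Y86-64 jump function codes (jmp 0, jle 1, jl 2, je 3, jne 4, jge 5, jg 6); the destination 0x040 is a made-up value, since the trace does not give the address of `dest`.

```python
def cond(cc, ifun):
    """Cond(CC, ifun) for jXX, with CC given as (ZF, SF, OF)."""
    zf, sf, of = (bool(b) for b in cc)
    table = {
        0: True,                    # jmp
        1: (sf != of) or zf,        # jle
        2: sf != of,                # jl
        3: zf,                      # je
        4: not zf,                  # jne
        5: sf == of,                # jge
        6: (sf == of) and not zf,   # jg
    }
    return table[ifun]

def step_jxx(pc, cc, mem):
    # Fetch
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    if icode != 7:
        raise ValueError("not a jump instruction")
    valC = int.from_bytes(mem[pc + 1:pc + 9], "little")  # destination
    valP = pc + 9                                        # fall-through address
    # Execute
    cnd = cond(cc, ifun)
    # PC update
    return valC if cnd else valP

mem = bytearray(0x100)
dest = 0x040  # hypothetical destination address
mem[0x016:0x01f] = bytes([0x73]) + dest.to_bytes(8, "little")  # je dest
not_taken_pc = step_jxx(0x016, (0, 0, 0), mem)  # CC = 000 after the addq
taken_pc = step_jxx(0x016, (1, 0, 0), mem)      # what would happen with ZF set
print(hex(not_taken_pc), hex(taken_pc))
```

With CC = 000 the branch is not taken and the new PC is the fall-through address 0x01f, as in cycle 4 of the trace.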

Processor Microarchitecture
• Sequential, single-cycle microarchitecture implementation
  • Basic idea
  • Hardware implementation
• Pipelined microarchitecture implementation
  • Basic Principles
  • Difficulties: Control Dependency
  • Difficulties: Data Dependency

Real-World Pipelines: Car Washes

Idea
• Divide process into independent stages
• Move objects through stages in sequence
• At any given time, multiple objects are being processed

[Figure: a sequential car wash next to a pipelined car wash.]

Computational Example

System
• Computation requires a total of 300 picoseconds
• Additional 20 picoseconds to save the result in a register
• Must have a clock cycle time of at least 320 ps

[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by Clock.]

Delay = 320 ps
Throughput = 3.12 Giga Instructions Per Second (GIPS)

3-Stage Pipelined Version

System
• Divide combinational logic into 3 blocks of 100 ps each
• Can begin a new operation as soon as the previous one passes through stage A
  • Begin a new operation every 120 ps
• Overall latency increases
  • 360 ps from start to finish

[Figure: three stages in series, each with combinational logic (A, B, C; 100 ps) followed by a register (20 ps), all driven by Clock.]

Delay = 360 ps
Throughput = 8.33 GIPS
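The delay and throughput figures for the unpipelined and 3-stage designs follow from one small calculation, sketched below. The function name is illustrative; it assumes every stage ends in a register with the same delay and that one operation completes per cycle.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps=20):
    """Return (cycle time in ps, total delay in ps, throughput in GIPS)."""
    cycle = max(stage_delays_ps) + reg_delay_ps   # clock must fit the slowest stage
    delay = cycle * len(stage_delays_ps)          # one cycle per stage
    throughput_gips = 1000.0 / cycle              # 1 op per cycle; 1000 ps = 1 ns
    return cycle, delay, throughput_gips

unpipelined = pipeline_timing([300])
pipelined = pipeline_timing([100, 100, 100])
print(unpipelined)
print(pipelined)
```

For the unpipelined design this gives a 320 ps cycle and 3.125 GIPS (shown as 3.12 on the slide); for the 3-stage design it gives a 120 ps cycle, 360 ps delay, and about 8.33 GIPS.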

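Pipeline Diagrams

Unpipelined
• Cannot start a new operation until the previous one completes

3-Stage Pipelined
• Up to 3 operations in process simultaneously

[Figure: timelines of OP1, OP2, OP3. Unpipelined: the operations run back to back. Pipelined: each operation passes through stages A, B, C, starting one stage after the previous operation.]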

Operating a Pipeline

[Figure: animation of OP1, OP2, OP3 moving through stages A, B, C of the 3-stage pipeline (100 ps of logic plus a 20 ps register per stage) along a time axis in ps, with snapshots at t = 239, 241, 300, and 359.]

Pipeline Trade-offs

Pros: Increased throughput. Can process more instructions in a given time span.
Cons: Increased latency, as new registers are needed between pipeline stages.

[Figure: the 3-stage pipeline (100 ps of logic plus a 20 ps register per stage) compared with the unpipelined design (300 ps of logic plus a 20 ps register).]

Unbalanced Pipeline

A pipeline’s delay is limited by the slowest stage. This limits the cycle time and the throughput.

[Figure: balanced pipeline with stages A, B, C of 100 ps each, plus a 20 ps register per stage.]
Cycle time: 120 ps
Delay: 360 ps

[Figure: unbalanced pipeline with stages A = 50 ps, B = 150 ps, C = 100 ps, plus a 20 ps register per stage.]
Cycle time: 170 ps
Delay: 510 ps

[Figure: timeline of OP1, OP2, OP3 in the unbalanced pipeline (A = 50 ps, B = 150 ps, C = 100 ps, plus 20 ps registers). Cycle time: 170 ps. Delay: 510 ps. Each stage slot on the timeline is 170 ps wide.]
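The timeline for the unbalanced pipeline can be reproduced with a short schedule calculation. This is a sketch that assumes no stalls, so operation i enters stage s exactly (i + s) cycles after time 0; the function name is illustrative.

```python
def stage_start_times(stage_delays_ps, reg_delay_ps, num_ops):
    """Return (cycle time, {(op, stage): start time in ps}), both indices 0-based."""
    cycle = max(stage_delays_ps) + reg_delay_ps  # set by the slowest stage (B here)
    starts = {
        (op, stage): (op + stage) * cycle
        for op in range(num_ops)
        for stage in range(len(stage_delays_ps))
    }
    return cycle, starts

cycle, starts = stage_start_times([50, 150, 100], reg_delay_ps=20, num_ops=3)
op1_finish = starts[(0, 2)] + cycle  # OP1 leaves stage C
op3_finish = starts[(2, 2)] + cycle  # OP3 leaves stage C
print(cycle, op1_finish, op3_finish)
```

Even though stage A needs only 70 ps including its register, OP2 cannot start until 170 ps, because all stages advance on the same clock.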

Mitigating Unbalanced Pipeline

Solution 1: Further pipeline the slow stages
■ Not always possible. What to do if we can’t further pipeline a stage?

Solution 2: Use multiple copies of the slow component

[Figure: comb. logic A (50 ps) feeds two copies of comb. logic B (100 ps each, labeled Copy 1 and Copy 2); a MUX with a select signal picks one copy's result for comb. logic C (50 ps). Each register adds 20 ps. A block labeled "What Logic?" is driven by Clock.]

What logic do you need there?
Hint: it needs to control the clock signals of the two registers and the select signal of the MUX.

Mitigating Unbalanced Pipeline

Data is sent to copy 1 in odd cycles and to copy 2 in even cycles. This is called 2-way interleaving. It is effectively the same as pipelining comb. logic B into two sub-stages.

The cycle time is reduced to 70 ps (as opposed to 120 ps) at the cost of extra hardware.

[Figure: the two-copy design: comb. logic A (50 ps), Copy 1 and Copy 2 of comb. logic B (100 ps each), a MUX with a select signal, comb. logic C (50 ps), 20 ps registers, and the "What Logic?" block driven by Clock.]
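The 70 ps figure and the odd/even steering can be sketched as follows. The timing model is a simplification taken from the slide's own framing (an n-way interleaved stage behaves like that stage split into n equal sub-stages); both function names are illustrative.

```python
def copy_for_cycle(cycle_number):
    """2-way interleaving: odd cycles go to copy 1, even cycles to copy 2."""
    return 1 if cycle_number % 2 == 1 else 2

def interleaved_cycle_time(stage_delays_ps, copies, reg_delay_ps=20):
    """Cycle time when stage i is replicated copies[i] times (simplified model)."""
    effective = [delay / n for delay, n in zip(stage_delays_ps, copies)]
    return max(effective) + reg_delay_ps

before = interleaved_cycle_time([50, 100, 50], copies=[1, 1, 1])
after = interleaved_cycle_time([50, 100, 50], copies=[1, 2, 1])
schedule = [copy_for_cycle(c) for c in range(1, 7)]
print(before, after, schedule)
```

Each copy of B still takes its full 100 ps per operation; the gain comes from the two copies working on different operations at the same time.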

Adding Pipeline Registers

[Figure: the sequential datapath (Fetch, Decode, Execute, Memory, Write back) drawn twice. First version: the PC computation uses saved state (pState: icode, valC, valP, Bch, valM). Second version: pipeline registers F, D, E, M, W are inserted between the stages. F holds predPC; fetch produces f_PC, icode, ifun, rA, rB, valC, valP; decode uses d_srcA, d_srcB and produces valA, valB; execute uses aluA, aluB and produces Bch and valE; memory uses Addr, Data and produces valM. Signals fed back: M_icode, M_Bch, M_valA and W_icode, W_valM, W_valE, W_dstE, W_dstM.]

Pipeline Stages

Fetch
■ Select current PC
■ Read instruction
■ Compute incremented PC

Decode
■ Read program registers

Execute
■ Operate ALU

Memory
■ Read or write data memory

Write Back
■ Update register file

Predicting the PC

■ Start fetch of new instruction after current one has completed fetch stage
  • Not enough time to reliably determine next instruction
■ Guess which instruction will follow
  • Recover if prediction was incorrect

[Figure: fetch-stage logic between pipeline registers F and D. Blocks: Select PC (inputs predPC, M_icode, M_Bch, M_valA, W_icode, W_valM), Instruction memory, Split (Byte 0), Align (Bytes 1-5), Instr valid, Need regids, Need valC, PC increment, and Predict PC. Register D receives icode, ifun, rA, rB, valC, valP.]

One Prediction Strategy

Instructions that Don’t Transfer Control
■ Predict next PC to be valP
■ Always reliable

Call and Unconditional Jumps
■ Predict next PC to be valC (destination)
■ Always reliable

Conditional Jumps
■ Predict next PC to be valC (destination)
■ Only correct if branch is taken
  • Typically right 60% of time

Return Instruction
■ Don’t try to predict
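The strategy above reduces to a three-way choice, sketched here in Python. The icode values 7, 8, and 9 for jXX, call, and ret are the standard Y86-64 codes; the function name and the use of `None` for "no prediction" are illustrative choices.

```python
IJXX, ICALL, IRET = 0x7, 0x8, 0x9  # Y86-64 instruction codes for jXX, call, ret

def predict_pc(icode, valC, valP):
    """Predicted next PC, or None when no prediction is attempted."""
    if icode == IRET:
        return None          # return: don't try to predict
    if icode in (IJXX, ICALL):
        return valC          # jumps (conditional or not) and calls: predict the destination
    return valP              # everything else: next sequential instruction

jump_guess = predict_pc(IJXX, valC=0x040, valP=0x01f)
addq_guess = predict_pc(0x6, valC=None, valP=0x016)
ret_guess = predict_pc(IRET, valC=None, valP=0x021)
print(jump_guess, addq_guess, ret_guess)
```

Only the conditional-jump case can be wrong: for a jump that turns out not to be taken, the instructions fetched from valC have to be discarded and fetch restarted at valP.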

Today: Making the Pipeline Really Work

Control Dependencies
■ What is it?
■ Software mitigation: Inserting Nops
■ Software mitigation: Delay Slots

Data Dependencies
■ What is it?
■ Software mitigation: Inserting Nops

Control Dependency

Definition: The outcome of instruction A determines whether or not instruction B should be executed.

Jump instruction example below:
■ jne L1 determines whether irmovq $1, %rax should be executed
■ But jne doesn’t know its outcome until after its Execute stage

    xorq %rax, %rax
    jne L1               # Not taken
    irmovq $1, %rax      # Fall Through
L1: irmovq $4, %rcx      # Target
    irmovq $3, %rax      # Target + 1

[Figure: pipeline diagram over cycles 1 to 9 showing each instruction moving through stages F, D, E, M, W; nop instructions also appear in the figure.]
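The timing problem can be made concrete with a small calculation of which instructions are fetched before the jump's outcome is known. This is a sketch under assumptions: one instruction enters Fetch per cycle with no stalls, and an instruction fetched in the same cycle as the jump's Execute stage cannot yet use the outcome. The function names and the "slot" labels are illustrative.

```python
STAGES = ["F", "D", "E", "M", "W"]

def timeline(num_instrs):
    """Cycle (1-based) in which instruction i occupies each stage, with no stalls."""
    return [
        {stage: i + 1 + s for s, stage in enumerate(STAGES)}
        for i in range(num_instrs)
    ]

def fetched_before_outcome(num_instrs, branch_index):
    """Indices of instructions fetched no later than the branch's Execute cycle."""
    tl = timeline(num_instrs)
    resolve_cycle = tl[branch_index]["E"]
    return [
        i for i in range(branch_index + 1, num_instrs)
        if tl[i]["F"] <= resolve_cycle
    ]

program = ["xorq %rax, %rax", "jne L1", "slot 1", "slot 2", "slot 3"]
at_risk = fetched_before_outcome(len(program), branch_index=1)
print([program[i] for i in at_risk])
```

Under these assumptions the two instructions right after jne enter the pipeline before the outcome exists, which is why filling those two positions with nops is one software mitigation.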
