intel and faer’s reach to teach

Intel and FAER’s Reach to Teach: A Program on Computer Architecture

Part 1: PIPELINED PROCESSORS

R. Govindarajan and Matthew Jacob
SERC, Indian Institute of Science, Bangalore

Pipelined Processor Architecture

1. Terminology and assumptions

2. Review: Computer organization; Data representation

3. Pipelined processor architecture

4. ILP (Instruction Level Parallelism) processor architecture

What is Computer Architecture?

"Architecture" in the English dictionary:

• The art and science of designing and building habitable structures
  (structures → computer systems; inhabitants → computer programs)
• A structure, or structures collectively
• A style and method of design and construction (e.g., Moghul architecture)

Computer architecture: the study of computer structures; their design, evaluation, and description

Computer Architect vs Computer Designer vs Logic Designer

• Computer Architect: develops the Instruction Set Architecture (ISA: a description of the instructions that are allowed and the semantics of what each instruction does when executed) and the computer system architecture
• Computer Designer: develops the detailed machine organization (blocks, specifications, testing)
• Logic Designer: implements these blocks

Basics: Computer Organization

[Figure: block diagram of a computer. Components shown: CPU (containing CU, ALU, and registers), MMU, cache, memory, bus, and I/O devices.]

Registers:

• General purpose: integer registers, FP registers
• Special purpose: Program Counter, Stack Pointer, Link Register, Instruction Register

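Basics: Laws, Principles, Rules

Amdahl’s Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the slower mode is used

$$\text{Speedup} = \frac{\text{Time}_{\text{before}}}{\text{Time}_{\text{after}}} = \frac{1}{(1 - f) + \dfrac{f}{p}}$$

where $f$ is the fraction of time in which the faster mode can be used and $p$ is the speedup of that mode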
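As a rough illustration of Amdahl's Law, the sketch below evaluates the overall speedup for a couple of cases. It assumes that f denotes the fraction of the original execution time that can use the faster mode and p the speedup of that mode; the function name and the sample numbers are illustrative and do not come from the lecture.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup when a fraction f of the time is sped up by a factor p."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if p <= 0.0:
        raise ValueError("p must be positive")
    # Time_after / Time_before = (1 - f) + f / p
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    # Hypothetical case: 80% of the time benefits from a 10x faster mode.
    print(round(amdahl_speedup(0.8, 10.0), 2))
    # Even an enormously fast mode is capped near 1 / (1 - f).
    print(round(amdahl_speedup(0.8, 1e12), 2))
```

The second call shows the point of the law: however large p becomes, the part that still runs in the slower mode bounds the overall speedup.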

Principle of Locality of Reference

• A program property; programs tend to reuse instructions and data
• 90-10 rule: 90% of execution time is spent in 10% of the code
• Temporal locality: recently accessed things are likely to be accessed in the near future
• Spatial locality: things whose addresses are close in space tend to be accessed close together in time

General Principle of Locality

• Denning, SJCC 1972; Blevin & Ramamurthy, IEEE Trans Comp 1976
• During any interval of time, resource demands are non-uniformly distributed
• Correlation between immediate past and immediate future resource demand patterns tends to be high, and correlation between disjoint resource demand patterns tends to 0 as the distance between them tends to infinity

Correlation: direction and strength of the linear relationship between 2 random variables

‘Moore’s Law’

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM.]

• µProc (CPU): 60%/yr. (2X/1.5 yr)
• Memory (DRAM): 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: grows 50% / year
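The growth figures quoted on this slide can be related to each other with a little arithmetic. The sketch below is only an illustration: it assumes constant compound annual growth, and the function names are made up for this example.

```python
import math


def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (0.60 means 60%/yr)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


def gap_growth(cpu_growth: float, dram_growth: float) -> float:
    """Annual growth of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_growth) / (1.0 + dram_growth) - 1.0


if __name__ == "__main__":
    print(f"CPU doubling time:  {doubling_time(0.60):.2f} years")
    print(f"DRAM doubling time: {doubling_time(0.09):.2f} years")
    print(f"Gap growth:         {gap_growth(0.60, 0.09):.0%} per year")
```

Under these assumptions the ratio grows by a little under 50% per year, which agrees with the rounded figure on the slide; the doubling times on the slide appear to be rounded in the same way.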

Background: Data Representation

Binary, bit, byte

Commonly used representations are:

• Character data: ASCII code
• Signed integer data: 2s complement (others: 1s complement, sign-magnitude)
• Real data: floating point (example: IEEE single precision floating point standard)

2s Complement Representation

The n bit quantity

$$x_{n-1}\, x_{n-2} \ldots x_0$$

(where $x_0$ is the least significant bit) represents the signed integer value

$$-x_{n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} x_i\, 2^i$$
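The 2s complement value can be computed directly from the bits. The following is a minimal sketch, assuming the bits are given as a string with the most significant bit first; the function name is hypothetical.

```python
def twos_complement_value(bits: str) -> int:
    """Value of an n-bit 2s complement string, most significant bit first."""
    if not bits or any(b not in "01" for b in bits):
        raise ValueError("expected a non-empty string of 0s and 1s")
    n = len(bits)
    # bits[0] is x_{n-1}; it carries the negative weight -2^(n-1).
    value = -int(bits[0]) * 2 ** (n - 1)
    # The remaining bits x_{n-2} ... x_0 carry the positive weights 2^i.
    for i in range(n - 1):
        x_i = int(bits[n - 1 - i])
        value += x_i * 2 ** i
    return value


if __name__ == "__main__":
    for pattern in ("0111", "1000", "1111"):
        print(pattern, twos_complement_value(pattern))
```

For 4-bit patterns this covers the range from -8 (1000) to 7 (0111), with 1111 representing -1.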

IEEE Floating Point Representation

A 32 bit value (s, f, e), where s is the sign bit, f is a 23 bit fraction and e an 8 bit exponent, evaluates to

$$(-1)^s \times 1.f \times 2^{e - 127}$$

• Normalized form (the expression above)
• Special forms (zero, infinity, NaN, denormals)
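To make the normalized form concrete, the sketch below pulls the s, e and f fields out of a 32 bit pattern and evaluates them by hand. It handles only normalized values and rejects the special forms; the helper names are illustrative. Python's standard `struct` module is used only to obtain the bit pattern of a float for comparison.

```python
import struct


def decode_single(word: int) -> float:
    """Evaluate a normalized IEEE single precision bit pattern by hand."""
    s = (word >> 31) & 0x1           # 1 sign bit
    e = (word >> 23) & 0xFF          # 8 exponent bits (biased by 127)
    f = word & 0x7FFFFF              # 23 fraction bits
    if e == 0 or e == 0xFF:
        raise ValueError("special form (zero, denormal, infinity or NaN)")
    significand = 1.0 + f / 2 ** 23  # the implicit leading 1 gives 1.f
    return (-1) ** s * significand * 2.0 ** (e - 127)


def bits_of(x: float) -> int:
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


if __name__ == "__main__":
    print(hex(bits_of(-6.25)), decode_single(bits_of(-6.25)))
```

For example, -6.25 is -1.5625 × 2^2, so it has s = 1, a biased exponent e = 129, and a fraction encoding .5625.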

Instruction Set Architecture

• Description of the machine from the view of the programmer/compiler
• Example: Intel x86 ISA
• Includes specification of:
  1. The different kinds of instructions available (instruction set)
  2. How operands are specified (addressing modes)
  3. What each instruction looks like (instruction format)

Kinds of Instructions

1. Arithmetic/logical instructions
   • Add, subtract, multiply, divide, compare (int/fp)
   • Or, and, not, xor
   • Shift (left/right, arithmetic/logical), rotate
2. Data transfer instructions
   • Load (move data value to a register from memory)
   • Store (move data value to memory location from register)
   • Move
3. Control transfer instructions
   • Jump, conditional branch, function call, return
4. Other instructions
   • Example: halt

Operand Addressing Modes

• Operands to an instruction
  • Source: input value to instruction
  • Destination: where result is to go
• Addressing Mode
  • How the location of operand is specified
• An operand can be either
  • in a memory location
  • in a register

Addressing Modes

How the location of operands is specified

• Register Direct: operand is in a register, e.g. add R1, R2, R3
• Immediate: operand is part of the instruction, e.g. add R1, R1, #4
• Register Indirect: operand is in memory, with a register specifying the memory address, e.g. add R1, R2, (R3)
• Base-Displacement: memory address is the sum of a base (register) and an offset, e.g. add R1, 8(R3)
• Absolute: memory address is specified in the instruction
• Indexed: address is the sum of base + index
• Others (auto increment/decrement, PC relative)
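The addressing modes listed above differ only in how the operand's location is computed. The sketch below models registers and memory as Python dictionaries to show that computation; the function, its parameter names, and the sample register and memory contents are all hypothetical, and the auto increment/decrement and PC relative modes are left out.

```python
def fetch_operand(mode, regs, mem, reg=None, imm=None, index=None):
    """Return the operand value selected by a (simplified) addressing mode."""
    if mode == "register_direct":      # operand is in a register
        return regs[reg]
    if mode == "immediate":            # operand is part of the instruction
        return imm
    if mode == "register_indirect":    # register holds the memory address
        return mem[regs[reg]]
    if mode == "base_displacement":    # address = base register + offset
        return mem[regs[reg] + imm]
    if mode == "absolute":             # address is given in the instruction
        return mem[imm]
    if mode == "indexed":              # address = base register + index register
        return mem[regs[reg] + regs[index]]
    raise ValueError(f"unknown addressing mode: {mode}")


# Hypothetical machine state used only for this illustration.
regs = {"R2": 5, "R3": 400, "R4": 8}
mem = {400: 11, 408: 22, 500: 33}

if __name__ == "__main__":
    print(fetch_operand("base_displacement", regs, mem, reg="R3", imm=8))
```

With this state, the base-displacement operand 8(R3) refers to memory location 400 + 8 = 408.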

Terms: Byte addressable

• Memory: a sequence of locations, each containing some information and referenced by an address
• Address space: memory address space, register address space
• Addressability: how much data is in a location? Example: in byte-addressable memory, each location contains 8 bits (1 byte)
• Word: data in a set of contiguous locations
• Word length: maximum data accessed in a single fetch

Terms: Byte ordering, Alignment

Address (in dec):   400      402      404      406
Data (in hex):      1A C8    B2 46    F0 8C    1E DF

Word at 400:

• Big Endian byte ordering: 1AC8 B246
  Binary: 0001 1010 1100 1000 1011 0010 0100 0110
  Decimal: 449,360,454
• Little Endian byte ordering: 46B2 C81A
  Binary: 0100 0110 1011 0010 1100 1000 0001 1010
  Decimal: 1,186,121,754

Word aligned: at a word boundary

• Word at 400 is word aligned
• Word at 402 is not, but it is short-word aligned
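The two decimal values on this slide can be checked with Python's built-in `int.from_bytes`. This is a small sketch, assuming a 4-byte word and a 2-byte short word; the `is_aligned` helper is a hypothetical name.

```python
# The four bytes stored at addresses 400..403, in address order.
word_bytes = bytes.fromhex("1AC8B246")

# Big endian: the byte at the lowest address is the most significant.
big = int.from_bytes(word_bytes, byteorder="big")
# Little endian: the byte at the lowest address is the least significant.
little = int.from_bytes(word_bytes, byteorder="little")


def is_aligned(address: int, size: int) -> bool:
    """True if address is a multiple of the access size in bytes."""
    return address % size == 0


if __name__ == "__main__":
    print(f"{big:,}")
    print(f"{little:,}")
    print(is_aligned(400, 4), is_aligned(402, 4), is_aligned(402, 2))
```

The same four bytes yield two different integers, which is why the byte ordering must be agreed upon whenever multi-byte data moves between machines.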

ISA Example: MIPS32 ISA

• Registers: 32 integer GPRs (R0, R1, …, R31)
  • R0 is hardwired to 0
  • R31 is implicitly used by the jal instruction
  • HI and LO: special purpose registers used implicitly by multiply and divide instructions
• Addressing modes
  • Register direct
  • Base displacement (by loads and stores)
  • Immediate
  • Absolute (by jump instructions)
  • PC relative (by branch instructions)

MIPS32 ISA

Instruction   Mnemonic                             Example              Meaning

Data Transfer Instructions
Load          LB, LBU, LH, LHU, LUI, LW            lw R2, 4(R3)         R2 ← Mem[R3 + 4]
Store         SB, SH, SW                           sb R2, -8(R4)        Mem[R4 - 8] ← R2

Int. ALU Instructions
Add           ADD, ADDI, ADDIU                     add R1, R2, R3       R1 ← R2 + R3
Subtract      SUB, SUBU                            sub R1, R2, R3       R1 ← R2 - R3
Multiply      MULT, MULTU                          mult R1, R2          LO ← LSW(R1 * R2); HI ← MSW(R1 * R2)
Divide        DIV, DIVU                            div R1, R2           LO ← R1 div R2; HI ← R1 mod R2
Logical       AND, ANDI, OR, ORI, NOR, XOR, XORI   ori R1, R2, #0xF0    R1 ← R2 | ZE(0xF0)
Shift         SLL, SLLV, SRA, SRL                  srl R1, R2, #4       R1 ← 0000 || (R2)31-4
Comparison    SLT, SLTI, SLTU                      slti R1, R2, #16     R1 ← 1 if R2 < SE(16); 0 otherwise

(LSW/MSW: least/most significant word; SE: sign-extended; ZE: zero-extended)

MIPS32 ISA (contd.)

Instruction          Mnemonic                           Example        Meaning

Control Transfer Instructions
Conditional Branch   BEQ, BGEZ, BLTZ, BLEZ, BGTZ, BNE   bltz R2, -16   PC ← PC - 12 if R2 < 0
Jump                 J, JR                              j <target>     PC ← (PC)31-28 || target || 00
Jump & Link          JAL, JALR                          jalr R2        R31 ← PC + 8; PC ← R2
System Call          SYSCALL                            syscall

Notation we will use for instructions:

Opcode Destination, Source1, Source2

Example: ADD R1, R2, R3, also written as ADD R1 ← R2, R3
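The "Meaning" entries for the branch and jump examples on this slide can be reproduced with simple address arithmetic. The sketch below follows the slide's formulas and rests on two assumptions: the branch offset is a byte offset applied to the address of the next instruction (PC + 4), and the jump target is a 26 bit word address. The function names are hypothetical.

```python
def branch_target(pc: int, byte_offset: int) -> int:
    """PC-relative target: the offset is added to the next instruction's address."""
    return (pc + 4 + byte_offset) & 0xFFFFFFFF


def jump_target(pc: int, target26: int) -> int:
    """Absolute target: upper 4 bits of the PC, then the 26 bit target, then 00."""
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


if __name__ == "__main__":
    pc = 0x00001000
    # bltz R2, -16 with the branch taken: PC + 4 - 16 = PC - 12
    print(hex(branch_target(pc, -16)))
    # j <target> with a hypothetical 26 bit target field of 0x100
    print(hex(jump_target(0x40001000, 0x100)))
```

The first call shows why an offset of -16 appears as PC - 12 in the table: the offset is measured from the instruction after the branch.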

Steps in Instruction Processing

1. Fetch the instruction from memory
   • Get the instruction whose address is in the PC (Program Counter) from memory into the IR (Instruction Register)
   • Increment PC
2. Decode the instruction
   • Understand instruction, addressing modes, etc
   • Calculate effective addresses of the operands to the instruction and fetch the operand values
3. Execute the instruction
   • Do the required operation
4. Write back the result of the instruction

Timeline of events

1. PC to memory
2. Instruction in IR
3. PC++; Decode
4. Op1 eff add calc
5. Op1 fetched
6. Op2 eff add calc
7. Op2 fetched
8. Op done
9. Write result

Processor/Memory speed disparity: 2-3 orders of magnitude

Assumptions

• Activity is overlapped in time where possible
  • PC increment and instruction fetch?
  • Instruction decode and effective address calc?
• Load-store ISA: the only instructions that take operands from memory are loads & stores
• Main memory delays are not typically seen by the instruction processor
  • Cache memories (more on this in a later lecture)
• Register file with 2 read ports and 1 write port

• Processor cycle time: time required to do any one of
  • Cache memory access
  • Register access + some logic (like decode)
  • ALU operation
• Instruction can be processed in 3-5 cycles
  • Jump: IFetch, Decode/OpFetch, DoOp
  • ALU: IFetch, Decode/OpFetch, DoOp, WriteReg
  • Load: IFetch, Decode, EffAddr, Cache, WriteReg

Performance of Processor

• Which is more important?
  • execution time of an instruction, or
  • throughput of instruction execution (number of instructions executed per unit time)
• Cycles per instruction (CPI)
  • In our example, CPI is between 3 and 5
• Objective of pipelining
  • To improve CPI; make it close to 1

Steps in Instruction Processing

1. Instruction Fetch (IF): instruction is fetched from memory and PC is incremented
2. Instruction Decode (ID): instruction is decoded and register operands fetched
3. Execute (EX): execute if arithmetic operation; else, calculate effective address
4. Memory operation (MEM): if Load/Store, do memory access
5. Write back (WB): write back computed value to destination register

Pipelining

time →
IF   ID   EX   MEM  WB
     IF   ID   EX   MEM  WB
          IF   ID   EX   MEM  WB
               IF   ID   EX   MEM  WB

• Instruction execution time: 5 cycles
• Instruction execution throughput: 1 instruction per cycle
• It may not always be possible for instructions to progress through the pipeline in this way
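The effect of overlapping can be quantified with a simple cycle count. The sketch below assumes an ideal 5 stage pipeline with no hazards and, for the unpipelined comparison, that every instruction uses all 5 stages; the function names are illustrative.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction finishes all its stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """The first instruction takes n_stages cycles; each later one adds 1 cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (4, 1000):
        cycles = pipelined_cycles(n)
        print(n, unpipelined_cycles(n), cycles, cycles / n)
```

For the four instructions drawn in the diagram, the pipeline needs 8 cycles instead of 20. As the number of instructions grows, the cycles per instruction approach 1 even though each individual instruction still takes 5 cycles.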


Pipeline Hazards

Hazard: a situation that prevents the next instruction of the program from executing during its designated clock cycle

1. Structural hazard: Happens due to request for the same hardware resource by 2 or more instructions at the same time

2. Data hazard: Happens when one instruction depends on the result of previous instruction that is still in the pipeline

3. Control hazard: Happens due to control transfer instructions

1. Structural Hazards

MEM & IF need to use memory

i      LW R3 ← mem[8(R2)]   IF   ID   EX   MEM  WB
i + 1                            IF   ID   EX   MEM  WB
i + 2                                 IF   ID   EX   MEM  WB
i + 3                                      B    IF   ID   EX   MEM  WB

(B: bubble. The IF of instruction i + 3 would coincide with the MEM of the load, so it is delayed.)

2. Data Hazards

i      add R3 ← R1, R2   IF   ID   EX   MEM  WB
i + 1  sub R4 ← R3, R8        IF   B    B    ID   EX   MEM  WB

(B: bubble. sub needs R3, which add has not yet written back, so instruction i + 1 is held until the value is available.)

A Data Hazard Solution

Interlock: hardware that detects data dependency and stalls dependent instructions

        time →
instr   0      1      2      3      4      5      6
ADD     IF     ID     EX     MEM    WB
SUB            IF     stall  stall  ID     EX     MEM
OR                    stall  stall  IF     ID     EX
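The interlock timing in this table can be reproduced with a small model. The sketch below is a simplification resting on stated assumptions: there is no forwarding, a register written in WB can be read by an ID stage in the same cycle, and an instruction can be fetched only once its predecessor has moved from IF to ID. The ADD and SUB operands follow the earlier data hazard example; the OR operands are hypothetical, chosen to be independent of the other two instructions.

```python
def schedule(instrs):
    """instrs: list of (name, dest, sources). Returns per-instruction cycle numbers."""
    timeline = []
    writeback = {}   # register -> cycle in which its latest producer is in WB
    prev_id = None   # cycle in which the previous instruction entered ID
    for name, dest, sources in instrs:
        # The IF stage frees up when the previous instruction enters ID.
        fetch = 0 if prev_id is None else prev_id
        decode = fetch + 1
        # The interlock holds the instruction until every source has been written.
        for reg in sources:
            if reg in writeback:
                decode = max(decode, writeback[reg])
        wb = decode + 3  # EX, MEM and WB follow ID
        timeline.append({"name": name, "IF": fetch, "ID": decode,
                         "WB": wb, "stalls": decode - fetch - 1})
        if dest is not None:
            writeback[dest] = wb
        prev_id = decode
    return timeline


program = [
    ("ADD", "R3", ["R1", "R2"]),
    ("SUB", "R4", ["R3", "R8"]),
    ("OR", "R9", ["R10", "R11"]),   # hypothetical, independent operands
]

if __name__ == "__main__":
    for row in schedule(program):
        print(row)
```

In this model SUB decodes in cycle 4 after two stall cycles, and OR, although independent, cannot be fetched until cycle 4 because SUB occupies the front of the pipeline.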

Another Data Hazard Solution

Forwarding or Bypassing: forward the result to EX as soon as it is available

add R3 ← R1, R2   IF   ID   EX   MEM  WB
sub R5 ← R3, R4        IF   ID   EX   MEM
or  R7 ← R3, R6             IF   ID   EX

Other Data Hazards Solutions

• Delayed loads
  • Require that the instruction that uses the load value be separated from the load instruction
• Instruction Scheduling
  • Reorder instructions so that dependent instructions are far enough apart
  • Compile time vs run time instruction scheduling

Instruction Scheduling

Before Scheduling:

LW   R3  ← 0(R1)
ADDI R5  ← R3, #1     (1 stall)
ADD  R2  ← R2, R3
LW   R13 ← 0(R11)
ADD  R12 ← R13, R3    (1 stall)

Total: 2 stalls (following load)

After Scheduling:

LW   R3  ← 0(R1)
LW   R13 ← 0(R11)
ADDI R5  ← R3, #1
ADD  R2  ← R2, R3
ADD  R12 ← R13, R3

Total: 0 stalls
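The stall counts for the two instruction orders can be checked mechanically. The sketch below assumes a pipeline with forwarding, in which the only remaining stall is a single cycle when an instruction uses the result of the load immediately before it; this is an assumption chosen to match the counts on the slide, and the tuple encoding and function name are hypothetical.

```python
def load_use_stalls(program):
    """Count 1 stall for each instruction that uses the result of the load just before it."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


# Each instruction is encoded as (opcode, destination, source registers).
before = [
    ("LW", "R3", ["R1"]),
    ("ADDI", "R5", ["R3"]),
    ("ADD", "R2", ["R2", "R3"]),
    ("LW", "R13", ["R11"]),
    ("ADD", "R12", ["R13", "R3"]),
]
after = [
    ("LW", "R3", ["R1"]),
    ("LW", "R13", ["R11"]),
    ("ADDI", "R5", ["R3"]),
    ("ADD", "R2", ["R2", "R3"]),
    ("ADD", "R12", ["R13", "R3"]),
]

if __name__ == "__main__":
    print(load_use_stalls(before), load_use_stalls(after))
```

Moving the second load up places an independent instruction between each load and its first use, which removes both stalls without changing the result of the program.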

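3. Control Hazards

[Figure: pipeline diagram for BEQZ R3, out and the instructions that follow it.]

• While the branch is in the pipeline, it is not yet known whether to fetch instruction (i + 1) or the instruction at the target, so the following pipeline slots become bubbles (B)
• Branch condition & target are resolved at a later stage of the branch instruction
• Once the branch is resolved, the appropriate instruction is correctly fetched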

Lecture Summary

• Computer architecture is the study of computer structures; design, evaluation, description
• It builds on a background of computer organization, the study of how data can be represented and manipulated
• Pipelined processors improve program execution time (instruction execution throughput) by overlapping in time the execution of many instructions

Next Week

• Instruction Level Parallelism (ILP) and how it is exploited by current processors to improve program execution time even more
