Post on 29-Jan-2016

Page 1: 1 Lecture 3 MIPS ISA and Performance Peng Liu liupeng@zju.edu.cn

Lecture 3 MIPS ISA and Performance

Peng Liu

liupeng@zju.edu.cn

MIPS Instruction Encoding

• MIPS instructions are encoded in different forms, depending upon the arguments

– R-format, I-format, J-format

• MIPS architecture has three instruction formats, all 32 bits in length

– Regularity is simpler and improves performance

• A 6 bit opcode appears at the beginning of each instruction

– Control logic decodes the instruction type from the opcode

Instruction Formats

• I-format: used for instructions with immediates, lw and sw (since the offset counts as an immediate), and the branches (beq and bne),

– (but not the shift instructions; later)

• J-format: used for j and jal

• R-format: used for all other instructions

• It will soon become clear why the instructions have been partitioned in this way.
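The partition into formats can be sketched as a small decoder. This is a minimal illustration in Python, assuming only the core integer instructions named on this slide; the function name `instruction_format` is hypothetical, and coprocessor opcodes are not handled.

```python
def instruction_format(word: int) -> str:
    """Classify a 32-bit MIPS instruction word by its 6-bit opcode."""
    opcode = (word >> 26) & 0x3F  # bits 31..26
    if opcode == 0:
        return "R"  # register format; the funct field selects the operation
    if opcode in (2, 3):
        return "J"  # j (opcode 2) and jal (opcode 3)
    return "I"      # immediates, lw/sw, beq/bne (core integer set only)


print(instruction_format(0x012A4020))  # the word for add $8,$9,$10
```

Because the opcode always sits in the same six bits, the decoder needs one shift and one mask regardless of format.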

I-Format

I-Format Example

I-Format Example: Load/Store

J-Format (1/2)

• Define “fields” of the following number of bits each:

6 bits | 26 bits

• As usual, each field has a name:

opcode | target address

• Key Concepts

– Keep the opcode field identical to R-format and I-format for consistency.

– Combine all other fields to make room for a large target address.

J-Format (2/2)

• Summary:

– New PC = { PC[31..28], target address, 00 }

• Understand where each part came from!

• Note: In Verilog, { , , } means concatenation: { 4 bits, 26 bits, 2 bits } = 32-bit address

– { 1010, 11111111111111111111111111, 00 } = 10101111111111111111111111111100

– We use Verilog in this class
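The concatenation can be checked with a few lines of Python. This is a sketch of the formula exactly as written on the slide; the function name `jump_target` and the sample PC value are illustrative assumptions.

```python
def jump_target(pc: int, target26: int) -> int:
    """New PC = { PC[31..28], target address, 00 } as a 32-bit value."""
    upper4 = pc & 0xF0000000               # keep PC[31..28] in place
    middle = (target26 & 0x03FFFFFF) << 2  # 26-bit field followed by two zero bits
    return upper4 | middle


# The slide's example: upper bits 1010 and a target field of all ones.
new_pc = jump_target(0xA0000000, 0x03FFFFFF)
print(f"{new_pc:032b}")
```

The two appended zero bits reflect that instructions are word aligned, so a 26-bit field reaches a 28-bit byte range.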

R-Format (1/2)

• Define “fields” of the following number of bits each: 6 + 5 + 5 + 5 + 5 + 6 = 32

6 | 5 | 5 | 5 | 5 | 6

• For simplicity, each field has a name:

opcode | rs | rt | rd | shamt | funct

R-Format (2/2)

• More fields:

– rs (Source Register): generally used to specify register containing first operand

– rt (Target Register): generally used to specify register containing second operand (note that name is misleading)

– rd (Destination Register): generally used to specify register which will receive result of computation

R-Format Example

• MIPS Instruction: add $8,$9,$10

Decimal number per field representation: 0 | 9 | 10 | 8 | 0 | 32

Binary number per field representation: 000000 | 01001 | 01010 | 01000 | 00000 | 100000

hex representation: 012A 4020hex

decimal representation: 19,546,144ten

On Green Card: Format in column 1, opcodes in column 3
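The field packing for this example can be reproduced in Python. This is a small sketch; the helper name `encode_r` is hypothetical and performs no range checking on its arguments.

```python
def encode_r(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack the six R-format fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $8,$9,$10: rd = 8, rs = 9, rt = 10, opcode = 0, funct = 32
word = encode_r(opcode=0, rs=9, rt=10, rd=8, shamt=0, funct=32)
print(f"{word:08X}", word)
```

Note that the assembly operand order (rd, rs, rt) differs from the field order in the encoded word (rs, rt, rd).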

MIPS I Operation Overview

• Arithmetic Logical:

– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU

– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI

– SLL, SRL, SRA, SLLV, SRLV, SRAV

• Memory Access:

– LB, LBU, LH, LHU, LW, LWL, LWR

– SB, SH, SW, SWL, SWR

MIPS logical instructions

Instruction | Example | Meaning | Comment

and | and $1,$2,$3 | $1 = $2 & $3 | 3 reg. operands; Logical AND

or | or $1,$2,$3 | $1 = $2 | $3 | 3 reg. operands; Logical OR

xor | xor $1,$2,$3 | $1 = $2 ^ $3 | 3 reg. operands; Logical XOR

nor | nor $1,$2,$3 | $1 = ~($2 | $3) | 3 reg. operands; Logical NOR

and immediate | andi $1,$2,10 | $1 = $2 & 10 | Logical AND reg, constant

or immediate | ori $1,$2,10 | $1 = $2 | 10 | Logical OR reg, constant

xor immediate | xori $1,$2,10 | $1 = $2 ^ 10 | Logical XOR reg, constant

shift left logical | sll $1,$2,10 | $1 = $2 << 10 | Shift left by constant

shift right logical | srl $1,$2,10 | $1 = $2 >> 10 | Shift right by constant

shift right arithm. | sra $1,$2,10 | $1 = $2 >> 10 | Shift right (sign extend)

shift left logical | sllv $1,$2,$3 | $1 = $2 << $3 | Shift left by variable

shift right logical | srlv $1,$2,$3 | $1 = $2 >> $3 | Shift right by variable

shift right arithm. | srav $1,$2,$3 | $1 = $2 >> $3 | Shift right arith. by variable

Q: Can some of these multiply by 2^i? Divide by 2^i? Invert?
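One way to explore the question is to model the shifts and NOR on 32-bit values. This is a Python sketch, not a MIPS simulator; the helper names mirror the mnemonics but are otherwise illustrative, and shift amounts are assumed to lie in 0..31.

```python
MASK32 = 0xFFFFFFFF  # registers hold 32 bits


def sll(x: int, n: int) -> int:
    """Shift left logical: multiplies by 2**n, discarding bits above bit 31."""
    return (x << n) & MASK32


def srl(x: int, n: int) -> int:
    """Shift right logical: unsigned divide by 2**n (zeros shifted in)."""
    return (x & MASK32) >> n


def sra(x: int, n: int) -> int:
    """Shift right arithmetic: signed divide by 2**n, rounding toward minus infinity."""
    x &= MASK32
    if x & 0x80000000:  # negative in two's complement
        x -= 1 << 32    # reinterpret as a negative Python int
    return (x >> n) & MASK32


def nor(a: int, b: int) -> int:
    """Bitwise NOR truncated to 32 bits."""
    return ~(a | b) & MASK32


def invert(x: int) -> int:
    """Bitwise NOT, computed the way 'nor $d, $s, $zero' would."""
    return nor(x, 0)


print(sll(3, 4), srl(0x80000000, 4), hex(sra(0x80000000, 4)), hex(invert(0)))
```

So sll multiplies by 2^i, srl and sra divide unsigned and signed values by 2^i, and nor against register zero inverts.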

MIPS Reference Data: CORE INSTRUCTION SET (1)

NAME | MNEMONIC | FORMAT | OPERATION (in Verilog) | OPCODE/FUNCT (hex)

Add | add | R | R[rd] = R[rs] + R[rt] (1) | 0 / 20hex

Add Immediate | addi | I | R[rt] = R[rs] + SignExtImm (1)(2) | 8hex

Branch On Equal | beq | I | if(R[rs]==R[rt]) PC=PC+4+BranchAddr (4) | 4hex

(1) May cause overflow exception

(2) SignExtImm = { 16{immediate[15]}, immediate }

(3) ZeroExtImm = { 16{1'b0}, immediate }

(4) BranchAddr = { 14{immediate[15]}, immediate, 2'b0 }
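Footnotes (2) and (4) translate directly into bit operations. The following Python sketch mirrors the Verilog concatenations; the function names and the sample PC values are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_ext_imm(imm16: int) -> int:
    """SignExtImm = { 16{immediate[15]}, immediate }."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def branch_addr(imm16: int) -> int:
    """BranchAddr = { 14{immediate[15]}, immediate, 2'b0 }: sign extend, then times 4."""
    return (sign_ext_imm(imm16) << 2) & MASK32


def beq_target(pc: int, imm16: int) -> int:
    """Taken-branch target: PC + 4 + BranchAddr, wrapped to 32 bits."""
    return (pc + 4 + branch_addr(imm16)) & MASK32


print(hex(beq_target(0x00400000, 3)))       # forward by three instructions
print(hex(beq_target(0x00400010, 0xFFFF)))  # immediate -1: branch to itself
```

The immediate counts instructions rather than bytes, which is why the two low zero bits are appended before the addition.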

MIPS Data Transfer Instructions

Instruction | Comment

sw $3, 500($4) | Store word

sh $3, 502($2) | Store half

sb $2, 41($3) | Store byte

lw $1, 30($2) | Load word

lh $1, 40($3) | Load halfword

lhu $1, 40($3) | Load halfword unsigned

lb $1, 40($3) | Load byte

lbu $1, 40($3) | Load byte unsigned

lui $1, 40 | Load Upper Immediate (16 bits shifted left by 16)

Q: Why do we need lui?

[Figure: LUI R5 places the 16-bit immediate in the upper half of R5; the lower 16 bits are 0000 … 0000]
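One answer to the question: a 16-bit immediate cannot hold a full 32-bit constant, so lui supplies the upper half and ori fills in the lower half. A minimal Python sketch, with the constant 0x12345678 chosen arbitrarily for illustration:

```python
MASK32 = 0xFFFFFFFF


def lui(imm16: int) -> int:
    """Load upper immediate: immediate in bits 31..16, zeros in bits 15..0."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(reg: int, imm16: int) -> int:
    """OR with a zero-extended 16-bit immediate."""
    return (reg | (imm16 & 0xFFFF)) & MASK32


# Equivalent of:  lui $5, 0x1234   followed by   ori $5, $5, 0x5678
r5 = ori(lui(0x1234), 0x5678)
print(hex(r5))
```

ori is used rather than addi here because logical immediates are zero extended, so the lower half cannot disturb the upper half.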

Multiply / Divide

• Start multiply, divide

– MULT rs, rt

– MULTU rs, rt

– DIV rs, rt

– DIVU rs, rt

• Move result from multiply, divide

– MFHI rd

– MFLO rd

• Move to HI or LO

– MTHI rs

– MTLO rs

Registers: HI, LO

MIPS Arithmetic Instructions

Instruction | Example | Meaning | Comments

add | add $1,$2,$3 | $1 = $2 + $3 | 3 operands; exception possible

subtract | sub $1,$2,$3 | $1 = $2 – $3 | 3 operands; exception possible

add immediate | addi $1,$2,100 | $1 = $2 + 100 | + constant; exception possible

add unsigned | addu $1,$2,$3 | $1 = $2 + $3 | 3 operands; no exceptions

subtract unsigned | subu $1,$2,$3 | $1 = $2 – $3 | 3 operands; no exceptions

add imm. unsign. | addiu $1,$2,100 | $1 = $2 + 100 | + constant; no exceptions

multiply | mult $2,$3 | Hi, Lo = $2 x $3 | 64-bit signed product

multiply unsigned | multu $2,$3 | Hi, Lo = $2 x $3 | 64-bit unsigned product

divide | div $2,$3 | Lo = $2 ÷ $3, Hi = $2 mod $3 | Lo = quotient, Hi = remainder

divide unsigned | divu $2,$3 | Lo = $2 ÷ $3, Hi = $2 mod $3 | Unsigned quotient & remainder

Move from Hi | mfhi $1 | $1 = Hi | Used to get copy of Hi

Move from Lo | mflo $1 | $1 = Lo | Used to get copy of Lo

Q: Which add for address arithmetic? Which add for integers?
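How the Hi/Lo pair holds multiply and divide results can be sketched for the unsigned variants. This Python model is illustrative only: it covers multu and divu, and a zero divisor simply raises a Python exception, which is an assumption of the sketch rather than hardware behavior.

```python
MASK32 = 0xFFFFFFFF


def multu(rs: int, rt: int) -> tuple[int, int]:
    """Return (Hi, Lo): the upper and lower 32 bits of the 64-bit unsigned product."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32


def divu(rs: int, rt: int) -> tuple[int, int]:
    """Return (Hi, Lo) = (remainder, quotient) for unsigned operands."""
    rs &= MASK32
    rt &= MASK32
    return rs % rt, rs // rt


hi, lo = multu(0xFFFFFFFF, 2)
print(hi, hex(lo))
print(divu(17, 5))
```

After either operation, mfhi and mflo would copy the two halves into general-purpose registers.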

Green Card: ARITHMETIC CORE INSTRUCTION SET (2)

NAME | MNEMONIC | FORMAT | OPERATION (in Verilog) | OPCODE/FMT/FT/FUNCT (hex)

Branch On FP True | bc1t | FI | if (FPcond) PC=PC+4+BranchAddr (4) | 11/8/1/--

Divide | div | R | Lo=R[rs]/R[rt]; Hi=R[rs]%R[rt] | 0/--/--/1a

Load FP Single | lwc1 | I | F[rt] = M[R[rs] + SignExtImm] (2) | 31/--/--/--

When does MIPS Sign Extend?

• When a value is sign extended, copy the upper bit into every added bit position

Examples of sign extending 8 bits to 16 bits:

00001010 → 00000000 00001010

10001100 → 11111111 10001100

• When is an immediate operand sign extended?

– Arithmetic instructions (add, sub, etc.) always sign extend immediates, even for the unsigned versions of the instructions!

– Logical instructions do not sign extend immediates (they are zero extended)

– Load/Store address computations always sign extend immediates

• Multiply/Divide have no immediate operands; however:

– “unsigned” versions treat operands as unsigned

• The data loaded by the instructions lb and lh are extended as follows (the “unsigned” versions don’t sign extend):

– lbu, lhu are zero extended

– lb, lh are sign extended

Q: Then what does add unsigned (addu) mean, since it has no immediate?
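The 8-bit to 16-bit examples, and the lb versus lbu distinction, can be reproduced with a generic helper. This is a Python sketch; `sign_extend`, `zero_extend`, `lb` and `lbu` are illustrative functions that model only the extension step, not memory access.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Copy the top bit of a from_bits-wide value into all added upper bits."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    if value & (1 << (from_bits - 1)):          # top bit set: fill with ones
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value


def zero_extend(value: int, from_bits: int) -> int:
    """Added upper bits are all zero."""
    return value & ((1 << from_bits) - 1)


def lb(byte: int) -> int:
    """Register contents after loading a byte with lb (sign extended to 32 bits)."""
    return sign_extend(byte, 8, 32)


def lbu(byte: int) -> int:
    """Register contents after loading a byte with lbu (zero extended)."""
    return zero_extend(byte, 8)


print(f"{sign_extend(0b00001010, 8, 16):016b}")
print(f"{sign_extend(0b10001100, 8, 16):016b}")
print(hex(lb(0x8C)), hex(lbu(0x8C)))
```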

MIPS Compare and Branch

• Compare and Branch

– BEQ rs, rt, offset: if R[rs] == R[rt] then PC-relative branch

– BNE rs, rt, offset: <>

• Compare to zero and Branch

– BLEZ rs, offset: if R[rs] <= 0 then PC-relative branch

– BGTZ rs, offset: >

– BLTZ rs, offset: <

– BGEZ rs, offset: >=

– BLTZAL rs, offset: if R[rs] < 0 then branch and link (into R31)

– BGEZAL rs, offset: >=

• Remaining set of compare and branch ops take two instructions

• Almost all comparisons are against zero!

MIPS jump, branch, compare Instructions

Instruction | Example | Meaning | Comment

branch on equal | beq $1,$2,100 | if ($1 == $2) go to PC+4+100 | Equal test; PC relative branch

branch on not eq. | bne $1,$2,100 | if ($1 != $2) go to PC+4+100 | Not equal test; PC relative

set on less than | slt $1,$2,$3 | if ($2 < $3) $1=1; else $1=0 | Compare less than; 2’s comp.

set less than imm. | slti $1,$2,100 | if ($2 < 100) $1=1; else $1=0 | Compare < constant; 2’s comp.

set less than uns. | sltu $1,$2,$3 | if ($2 < $3) $1=1; else $1=0 | Compare less than; natural numbers

set l. t. imm. uns. | sltiu $1,$2,100 | if ($2 < 100) $1=1; else $1=0 | Compare < constant; natural numbers

jump | j 10000 | go to 10000 | Jump to target address

jump register | jr $31 | go to $31 | For switch, procedure return

jump and link | jal 10000 | $31 = PC + 4; go to 10000 | For procedure call

Signed vs. Unsigned Comparison

$1= 0…00 0000 0000 0000 0001

$2= 0…00 0000 0000 0000 0010

$3= 1…11 1111 1111 1111 1111

• After executing these instructions:

slt $4,$2,$1 ; if ($2 < $1) $4=1; else $4=0

slt $5,$3,$1 ; if ($3 < $1) $5=1; else $5=0

sltu $6,$2,$1 ; if ($2 < $1) $6=1; else $6=0

sltu $7,$3,$1 ; if ($3 < $1) $7=1; else $7=0

• What are values of registers $4 - $7? Why?

$4 = ; $5 = ; $6 = ; $7 = ;
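The exercise can be checked by modeling slt and sltu on 32-bit values. This Python sketch follows the semantics in the comments above; the helper `to_signed` is an assumption added for the two's complement interpretation.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def slt(a: int, b: int) -> int:
    """Set on less than, signed comparison."""
    return 1 if to_signed(a) < to_signed(b) else 0


def sltu(a: int, b: int) -> int:
    """Set on less than, unsigned comparison."""
    return 1 if (a & MASK32) < (b & MASK32) else 0


r1, r2, r3 = 0x00000001, 0x00000002, 0xFFFFFFFF
r4 = slt(r2, r1)   # 2 < 1 ?
r5 = slt(r3, r1)   # -1 < 1 ?
r6 = sltu(r2, r1)  # 2 < 1 ?
r7 = sltu(r3, r1)  # 4294967295 < 1 ?
print(r4, r5, r6, r7)
```

Only the signed comparison of $3 against $1 yields 1, because the all-ones pattern is -1 when signed but the largest value when unsigned.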

MIPS Assembler Register Convention

Name | Number | Usage | Preserved across a call?

$zero | 0 | the value 0 | n/a

$v0-$v1 | 2-3 | return values | no

$a0-$a3 | 4-7 | arguments | no

$t0-$t7 | 8-15 | temporaries | no

$s0-$s7 | 16-23 | saved | yes

$t8-$t9 | 24-25 | temporaries | no

$sp | 29 | stack pointer | yes

$ra | 31 | return address | yes

• “caller saved”: registers not preserved across a call

• “callee saved”: registers preserved across a call

• On Green Card in Column #2 at bottom

Peer Instruction: $s3=i, $s4=j, $s5=@A

What C code properly fills in the blank in the loop below?

do j = j + 1 while (______);

Loop: addiu $s4,$s4,1 # j = j + 1

sll $t1,$s3,2 # $t1 = 4 * i

addu $t1,$t1,$s5 # $t1 = @ A[i]

lw $t0,0($t1) # $t0 = A[i]

addiu $s3,$s3,1 # i = i + 1

slti $t1,$t0,10 # $t1 = $t0 < 10

beq $t1,$0, Loop # goto Loop

slti $t1,$t0, 0 # $t1 = $t0 < 0

bne $t1,$0, Loop # goto Loop

1: A[i++] >= 10

2: A[i++] >= 10 | A[i] < 0

3: A[i] >= 10 || A[i++] < 0

4: A[i++] >= 10 || A[i] < 0

5: A[i] >= 10 && A[i++] < 0

6: None of the above

Green Card: OPCODES, BASE CONVERSION, ASCII (3)

MIPS opcode (31:26) | (1) MIPS funct (5:0) | (2) MIPS funct (5:0) | Binary | Decimal | Hexadecimal | ASCII

(1) | sll | add.f | 00 0000 | 0 | 0 | NUL

j | srl | mul.f | 00 0010 | 2 | 2 | STX

lui | sync | floor.w.f | 00 1111 | 15 | f | SI

lbu | and | cvt.w.f | 10 0100 | 36 | 24 | $

(1) opcode(31:26) == 0

(2) opcode(31:26) == 17 ten (11 hex); if fmt(25:21)==16 ten (10 hex) f = s (single); if fmt(25:21)==17 ten (11 hex) f = d (double)

Note: 3-in-1 - Opcodes, base conversion, ASCII!

Green Card

• green card /n./ [after the "IBM System/360 Reference Data" card] A summary of an assembly language, even if the color is not green. For example,

"I'll go get my green card so I can check the addressing mode for that instruction."

www.jargon.net

Image from Dave's Green Card Collection: http://www.planetmvs.com/greencard/

Peer Instruction

Which instruction has the same representation as 35ten?

A. add $0, $0, $0

B. subu $s0,$s0,$s0

C. lw $0, 0($0)

D. addi $0, $0, 35

E. subu $0, $0, $0

F. Trick question! Instructions are not numbers

Register numbers and names: 0: $0, 8: $t0, 9: $t1, .. 15: $t7, 16: $s0, 17: $s1, .. 23: $s7

Opcodes and function fields (if necessary):

add: opcode = 0, funct = 32

subu: opcode = 0, funct = 35

addi: opcode = 8

lw: opcode = 35

Field layouts:

add, subu (R-format): opcode | rs | rt | rd | shamt | funct

lw (I-format): opcode | rs | rt | offset

addi (I-format): opcode | rs | rt | immediate

Branch & Pipelines

By the end of Branch instruction, the CPU knows whether or not the branch will take place.

However, it will have fetched the next instruction by then, regardless of whether or not a branch will be taken.

Why not execute it?

li $3, #7

sub $4, $4, 1

bz $4, LL (Branch)

addi $5, $3, 1 (Delay Slot)

LL: slt $1, $3, $5 (Branch Target)

[Figure: pipeline timing over time; each instruction's ifetch overlaps the execute of the instruction before it]

Delayed Branches

• In the “Raw” MIPS, the instruction after the branch is executed even when the branch is taken

– This is hidden by the assembler for the MIPS “virtual machine”

– allows the compiler to better utilize the instruction pipeline (???)

• Jump and link (jal inst):

– Put the return address into the link register ($31):

• PC+4 (logical architecture)

• PC+8 (physical “Raw” architecture, because the delay slot is executed)

– Then jump to destination address

li $3, #7

sub $4, $4, 1

bz $4, LL

addi $5, $3, 1 (Delay Slot Instruction)

subi $6, $6, 2

LL: slt $1, $3, $5

Filling Delayed Branches

[Figure: pipeline timing. Branch: Inst Fetch, Dcd & Op Fetch, Execute. Successor: Inst Fetch, Dcd & Op Fetch, Execute (execute successor even if branch taken!). Then branch target or continue: Inst Fetch.]

• Single delay slot impacts the critical path

• Compiler can fill a single delay slot with a useful instruction 50% of the time.

– try to move down from above jump

– move up from target, if safe

add $3, $1, $2

sub $4, $4, 1

bz $4, LL

NOP

...

LL: add rd, ...

Is this violating the ISA abstraction?

Summary: Salient features of MIPS I

• 32-bit fixed format inst (3 formats)

• 32 32-bit GPR (R0 contains zero) and 32 FP registers (and HI LO)

– partitioned by software convention

• 3-address, reg-reg arithmetic instr.

• Single address mode for load/store: base+displacement

– no indirect or scaled addressing

• 16-bit immediate plus LUI

• Simple branch conditions

– compare against zero, or two registers for = and ≠

– no integer condition codes

• Delayed branch

– execute instruction after a branch (or jump) even if the branch is taken (Compiler can fill a delayed branch with useful work about 50% of the time)

And in Conclusion...

• Continued rapid improvement in Computing

– 2X every 1.5 years in processor speed; every 2.0 years in memory size; every 1.0 year in disk capacity

– Moore’s Law enables processor, memory (2X transistors/chip every ~1.5 to 2.0 yrs)

• 5 classic components of all computers

– Control and Datapath (together: the Processor)

– Memory

– Input

– Output

MIPS Machine Instruction Review: Instruction Format Summary

Addressing Modes Summary

• Register addressing

– Operand is a register (e.g. ALU)

• Base/displacement addressing (e.g. load/store)

– Operand is at the memory location that is the sum of a base register + a constant

• Immediate addressing (e.g. constants)

– Operand is a constant within the instruction itself

• PC-relative addressing (e.g. branch)

– Address is the sum of PC and constant in instruction

• Pseudo-direct addressing (e.g. jump)

– Target address is concatenation of field in instruction and the PC

Addressing Modes Summary

Break

Computer Performance Metrics

• Response Time (latency)

– How long does it take for my job to run?

– How long does it take to execute a job?

– How long must I wait for the database query?

• Throughput

– How many jobs can the machine run at once?

– What is the average execution rate?

– How many queries per minute?

• If we upgrade a machine with a new processor, what do we increase?

• If we add a new machine to the lab, what do we increase?

Performance = Execution Time

• Elapsed Time

– Counts everything (disk and memory accesses, I/O, etc.)

– A useful number, but often not good for comparison purposes

• E.g., OS & multiprogramming time make it difficult to compare CPUs

• CPU time (CPU = Central Processing Unit = processor)

– Doesn’t count I/O or time spent running other programs

– Can be broken up into system time, and user time

• Our focus: user CPU time

– Time spent executing the lines of code that are “in” our program

– Includes arithmetic, memory, and control instructions, …

Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles

• Clock “ticks” indicate when to start activities

• Cycle time = time between ticks = seconds per cycle

• Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

– A 2 GHz clock has a 500 picosecond (ps) cycle time.

Performance and Execution Time

• The program should be something real people care about

– Desktop: MS office, edit, compile

– Server: web, e-commerce, database

– Scientific: physics, weather forecasting

Measuring Clock Cycles

• Clock cycles/program is not an intuitive or easily determined value, so

– Clock Cycles = Instructions x Clock Cycles Per Instruction

• Cycles Per Instruction (CPI) used often

• CPI is an average since the number of cycles per instruction varies from instruction to instruction

– Average depends on instruction mix, latency of each instruction type, etc.

• CPI can be used to compare two implementations of the same ISA, but is not useful alone for comparing different ISAs

– An X86 add is different from a MIPS add

Using CPI

• Drawing on the previous equation:

$$\text{ExecutionTime} = \text{Instructions} \times \text{CPI} \times \text{ClockCycleTime}$$

$$\text{ExecutionTime} = \frac{\text{Instructions} \times \text{CPI}}{\text{ClockRate}}$$

• To improve performance (i.e. reduce execution time)

– Increase clock rate (decrease clock cycle time) OR

– Decrease CPI OR

– Reduce the number of instructions

• Designers balance cycle time against the number of cycles required

– Improving one factor may make the other one worse
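The execution time relation is easy to evaluate numerically. In this Python sketch the instruction count and CPI values are made-up inputs for illustration; only the 2 GHz clock and its 500 ps cycle time come from the slides.

```python
def cycle_time_ps(clock_rate_hz: float) -> float:
    """Clock cycle time in picoseconds for a given clock rate in Hz."""
    return 1e12 / clock_rate_hz


def execution_time_s(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """ExecutionTime = Instructions x CPI / ClockRate."""
    return instructions * cpi / clock_rate_hz


print(cycle_time_ps(2e9))               # cycle time of a 2 GHz clock, in ps
print(execution_time_s(1e9, 1.5, 2e9))  # hypothetical: 1e9 instructions at CPI 1.5
```

Halving CPI or doubling the clock rate halves the result, which is the trade-off the designers' bullet above refers to.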

Clock Rate ≠ Performance

• Mobile Intel Pentium 4 vs. Intel Pentium M

– 2.4 GHz vs. 1.6 GHz

– P4 is 50% faster?

• Performance on Mobilemark with same memory and disk

– Word, Excel, Photoshop, PowerPoint, etc.

– Mobile Pentium 4 is only 15% faster

• What is the relative CPI?

– ExecTime = IC x CPI / ClockRate

– IC x CPI_M / 1.6 = 1.15 x IC x CPI_4 / 2.4

– CPI_4 / CPI_M = 2.4 / (1.15 x 1.6) = 1.3
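The relative CPI follows from the execution time equation with equal instruction counts. A short Python check; the function name `relative_cpi` and its parameter names are illustrative.

```python
def relative_cpi(fast_rate_ghz: float, slow_rate_ghz: float, measured_speedup: float) -> float:
    """CPI of the higher-clocked chip divided by CPI of the lower-clocked chip.

    From IC*CPI_slow/slow_rate = measured_speedup * IC*CPI_fast/fast_rate,
    assuming both chips execute the same instruction count.
    """
    return fast_rate_ghz / (measured_speedup * slow_rate_ghz)


ratio = relative_cpi(2.4, 1.6, 1.15)  # Pentium 4 vs. Pentium M
print(round(ratio, 2))
```

A ratio above 1 means the Pentium 4 needs more cycles per instruction, which absorbs most of its clock rate advantage.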

CPI Calculation

• Different instruction types require different numbers of cycles

• CPI is often reported for types of instructions

$$\text{ClockCycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{IC}_i \right)$$

• where CPI_i is the CPI for the type of instruction and IC_i is the count of that type of instruction

CPI Calculation

• To compute the overall average CPI use

$$\text{CPI} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{InstructionCount}_i}{\text{InstructionCount}} \right)$$

Computing CPI Example

• Given this machine, the CPI is the sum of CPI x Frequency

• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5

• What fraction of the time for data transfer?
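The machine's table did not survive in this transcript, so the sketch below uses a hypothetical instruction mix, chosen only so that the weighted terms add up to the 1.5 stated above. The frequencies and per-type CPIs in `MIX` are assumptions, not the original data.

```python
# Hypothetical mix: type -> (fraction of instructions, cycles per instruction).
MIX = {
    "ALU":    (0.5, 1),
    "Load":   (0.2, 2),
    "Store":  (0.1, 2),
    "Branch": (0.2, 2),
}


def average_cpi(mix: dict) -> float:
    """Sum of CPI_i x frequency_i over all instruction types."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_fraction(mix: dict, kinds: list) -> float:
    """Share of total cycles spent in the given instruction types."""
    cycles = sum(mix[k][0] * mix[k][1] for k in kinds)
    return cycles / average_cpi(mix)


print(average_cpi(MIX))
print(time_fraction(MIX, ["Load", "Store"]))  # data transfer share under this assumed mix
```

Under these assumed numbers, loads and stores are 30% of the instructions but a larger share of the time, because each one costs more cycles than an ALU operation.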

What is the Impact of Displacement Based Memory Addressing Mode?

• Assume 50% of MIPS loads and stores have a zero displacement.

Speedup

• Speedup allows us to compare different CPUs or optimizations

$$\text{Speedup} = \frac{\text{CPUtimeOld}}{\text{CPUtimeNew}}$$

• Example

– Original CPU takes 2 sec to run a program

– New CPU takes 1.5 sec to run a program

– Speedup = 2 / 1.5 = 1.333, i.e. a speedup of 33%

Amdahl’s Law

• If an optimization improves a fraction f of execution time by a factor of a

$$\text{Speedup} = \frac{T_{old}}{\left[(1-f) + f/a\right] \times T_{old}} = \frac{1}{(1-f) + f/a}$$

• This formula is known as Amdahl’s Law

• Lessons

– If f->100%, then speedup = a

– If a->∞, then speedup = 1/(1-f)

• Summary

– Make the common case fast

– Watch out for the non-optimized component
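Both limiting cases can be checked numerically. A minimal Python sketch of the formula, with sample values of f and a chosen only for illustration:

```python
def amdahl_speedup(f: float, a: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up by a factor a."""
    return 1.0 / ((1.0 - f) + f / a)


print(amdahl_speedup(0.5, 2.0))   # half the time made 2x faster
print(amdahl_speedup(1.0, 4.0))   # f -> 100%: speedup equals a
print(amdahl_speedup(0.9, 1e12))  # a -> infinity: speedup approaches 1/(1-f)
```

Even an arbitrarily large improvement to 90% of the run time cannot beat a factor of about 10, which is the point of the second lesson.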

Evaluating Performance

• Performance best determined by running a real application

– Use programs typical of expected workload

• e.g. compiler/editors, scientific applications, graphics, etc.

• Microbenchmarks

– Small programs, synthetic or kernels from larger applications

– Nice for architects and designers

– Can be misleading

• Benchmarks

– Collection of real programs that companies have agreed on

– Components: programs, inputs & outputs, measurement rules, metrics

– Can still be abused

Benchmarks

• System Performance Evaluation Cooperative (SPEC)

• Scientific computing: Linpack, SpecOMP, SpecHPC,…

• Embedded benchmarks: EEMBC, Dhrystone, …

• Enterprise computing

– TPC-C, TPC-W, TPC-H

– SpecJbb, SpecSFS, SpecMail, Streams,

• Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)

• Other

– 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …

• Watch out: your results will be only as good as your benchmarks

– Make sure you know what the benchmark is designed to measure

– Performance is not the only metric for computing systems

• Cost, power consumption, reliability, real-time performance, …

Summarizing Performance

• Combining results from multiple programs into 1 benchmark score

– Sometimes misleading, always controversial, … and inevitable

– We all like quoting a single number

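• 3 types of means

– Arithmetic: for times

$$AM = \frac{1}{n} \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$

– Harmonic: for rates

$$HM = \frac{n}{\sum_{i=1}^{n} \dfrac{\text{Weight}_i}{\text{Rate}_i}}$$

– Geometric: for ratios

$$GM = \left( \prod_{i=1}^{n} \text{Ratio}_i \right)^{1/n}$$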

Using the Means

• Arithmetic mean:

– When you have individual performance scores in latency

• Harmonic mean:

– When you have individual performance scores in throughput

• Geometric mean:

– Nice property: GM(X/Y) = GM(X)/GM(Y)

– But difficult to relate back to execution times

• Note

– Always look at the results for individual programs
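The three means, and the ratio property of the geometric mean, can be tried on small inputs. This Python sketch assumes equal weights (every weight equal to 1) and uses made-up sample values.

```python
import math


def arithmetic_mean(times):
    """For latencies (times), assuming equal weights."""
    return sum(times) / len(times)


def harmonic_mean(rates):
    """For throughputs (rates), assuming equal weights."""
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(ratios):
    """For ratios, such as times normalized to a reference machine."""
    return math.prod(ratios) ** (1.0 / len(ratios))


x = [2.0, 8.0]  # hypothetical per-program scores on machine X
y = [1.0, 4.0]  # hypothetical per-program scores on machine Y
print(arithmetic_mean(x), harmonic_mean(x), geometric_mean(x))

# GM(X/Y) equals GM(X)/GM(Y)
per_program = [a / b for a, b in zip(x, y)]
print(geometric_mean(per_program), geometric_mean(x) / geometric_mean(y))
```

The ratio property is why the geometric mean gives the same ranking no matter which machine is used as the reference.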

Performance Summary

• Performance is specific to a particular program

– Total execution time is a consistent summary of performance

• For a given architecture performance increases come from:

– Increase in clock rate (without adverse CPI effects)

– Improvements in processor organization that lower CPI

– Compiler enhancements that lower CPI and/or instruction count

– Algorithm/Language choices that affect instruction count

• Pitfall: expecting an improvement in one aspect of a machine’s performance to improve the total performance proportionally

Break

Translation Hierarchy

• High-level->Assembly->Machine

Compiler

Converts High-level Language to Machine Code

Intermediate Representation (IR)

• Three-address code (TAC)

Source:

j = 2 * i + 1;

if (j >= n)

j = 2 * i + 3;

return a[j];

Three-address code (single assignment to each temporary):

t1 = 2 * i

t2 = t1 + 1

j = t2

t3 = j < n

if t3 goto L0

t4 = 2 * i

t5 = t4 + 3

j = t5

L0: t6 = a[j]

return t6

Optimizing Compilers

• Provide efficient mapping of program to machine

– Code selection and ordering

– Eliminating minor inefficiencies

– Register allocation

• Don’t (usually) improve asymptotic efficiency

– Up to programmer to select best overall algorithm

– Big-O savings are (often) more important than constant factors

• But constant factors also matter

• Optimization types

– Local: inside basic blocks

– Global: across basic blocks e.g. loops

Limitations of Optimizing Compilers

• Operate Under Fundamental Constraints

– Improved code has same output

– Improved code has same side effects

– Improved code is as correct as original code

• Most analysis is performed only within procedures

– Whole-program analysis is too expensive in most cases

• Most analysis is based only on static information

– Compiler has difficulty anticipating run-time inputs

– Profile driven dynamic compilation becoming more prevalent

– Dynamic compilation (e.g. Java JIT)

• When in doubt, the compiler must be conservative

Preview: Code Optimization

• Goal: Improve performance by:

– Removing redundant work

• Unreachable code

• Common-subexpression elimination

• Induction variable elimination

– Using simpler and faster operations

• Dealing with constants in the compiler

• Strength reduction

– Managing registers well

ExecutionTime = Instructions x CPI x ClockCycleTime

Constant Folding

• Detect & combine values that will be constant.

• Respect laws of arithmetic on target machine.

a = 10 * 5 + 6 - b;

Constant Propagation

• When x gets assigned a constant c, replace uses of x with uses of c, as long as x remains unchanged.

• Very useful for turning constants into immediates, for code generator.

Constant Propagation

Reduction in Strength

• Some operations are faster (lower CPI) than others:

– Multiplication is slower than addition

– Multiply or divide by 2^n is slower than shift left or right

– Multiplies take 12 cycles on MIPS R2000

– ALU instructions take 1 cycle

• Can replace complex instruction with equivalent simple instructions

• Correctness:

– Overflow behavior preserved?

– Other side-effects of operation? (interrupts)

Reduction in Strength

• How to replace multiplication with addition?

• Often present in for loops & array walks.

while (i<100) arr[i++] = 0;
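For the array walk, the address of arr[i] is base + 4*i, and the multiply can be replaced by adding 4 on each iteration. The Python sketch below compares both address sequences; the 4-byte element size and the base address are assumptions for illustration.

```python
ELEMENT_SIZE = 4  # assumed: 32-bit array elements


def addresses_with_multiply(base: int, count: int) -> list:
    """Recompute base + i * 4 on every iteration (one multiply per element)."""
    return [base + i * ELEMENT_SIZE for i in range(count)]


def addresses_with_add(base: int, count: int) -> list:
    """Strength-reduced form: keep a running address and add 4 each time."""
    result = []
    address = base
    for _ in range(count):
        result.append(address)
        address += ELEMENT_SIZE  # replaces the multiply
    return result


same = addresses_with_multiply(0x1000, 100) == addresses_with_add(0x1000, 100)
print(same)
```

Since the multiplier here is a power of two, a shift would also work; the running add removes even that per-iteration operation.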

Common Sub-Expression Elimination

Common Sub-Expression Elimination

Induction Variable Elimination

while (i<100) arr [i++] = 0;

• Remember replacing multiply with add in loop?

• Why not get rid of induction variable i altogether?

• Induction variable is a variable that is changed by a constant amount on every iteration of a loop.

Register Allocation & Assignment

• Registers provide the fastest data access in the processor.

• Two terms:

– What variables belong in registers? (allocation)

– What register will hold this value? (assignment)

• Once a value is in a register, we’d like to avoid spilling and restoring it in order to use the same value again.

Assembly Code Generation

Compiler Performance on Bubble Sort

Assembler

• Expands macros and pseudo instructions as well as converts constants.

• Primary purpose is to produce an object file

– Machine language instructions

– Application data

– Information for memory organization

Object File

Linker

• Linker combines multiple object modules

– Identify where code/data will be placed in memory

– Resolve code/data cross references

– Produces executable if all references found

• Steps

– Place code and data modules in memory

– Determine the address of data and instruction labels

– Patch both the internal and external references

• Separation between compiler and linker makes standard libraries an efficient solution to maintaining modular code.

Loader

• Loader used at run-time

– Reads executable file header for size of text/data segments

– Create address space sufficiently large

– Copy program from executable on disk into memory

– Copy arguments to main program’s stack

– Initialize machine registers and set stack pointer

– Jump to start-up routine

– Terminate program when execution completes

SPIM System Calls

• SPIM provides a small number of OS “system calls”

• Set system call code in $v0 and pass arguments as needed

• See web site for details and examples.

Service | Code | Argument | Result

print_int | 1 | $a0 = integer |

print_float | 2 | $f12 = float |

print_double | 3 | $f12 = double |

print_string | 4 | $a0 = string |

read_int | 5 | | Integer in $v0

read_float | 6 | | Float in $f0

read_double | 7 | | Double in $f0

read_string | 8 | $a0 = buffer, $a1 = length | Data in buffer

sbrk | 9 | $a0 = amount | Address in $v0

exit | 10 | |

MIPS Assembler Directives

• SPIM supports a subset of the MIPS assembler directives

• Some of the directives include

– .asciiz – Store a null-terminated string in memory

– .data – Start of data segment

– .globl – Identify an exported symbol

– .text –Start of text segment

– .word –Store words in memory

HomeWork

• HW1

– http://mypage.zju.edu.cn/liupeng

Acknowledgements

• These slides contain material from courses:

– UCB CS152

– Stanford EE108B