1 lecture 3 mips isa and performance peng liu [email protected]

1

Lecture 3 MIPS ISA and Performance

Peng Liu

[email protected]

2

MIPS Instruction Encoding

• MIPS instructions are encoded in different forms, depending upon the arguments

– R-format, I-format, J-format

• MIPS architecture has three instruction formats, all 32 bits in length

– Regularity is simpler and improves performance

• A 6 bit opcode appears at the beginning of each instruction

– Control logic based on decode instruction type

3

Instruction Formats

• I-format: used for instructions with immediates, lw and sw (since the offset counts as an immediate), and the branches (beq and bne),

– (but not the shift instructions; later)

• J-format: used for j and jal

• R-format: used for all other instructions

• It will soon become clear why the instructions have been partitioned in this way.

4

I-Format

5

I-Format Example

6

I-Format Example: Load/Store

7

J-Format (1/2)

• Define “fields” of the following number of bits each:

6 bits 26 bits

opcode target address

• As usual, each field has a name:

• Key Concepts– Keep opcode field identical to R-format and I-format for

consistency.

– Combine all other fields to make room for large target address.

8

J-Format (2/2)

• Summary:– New PC = { PC[31..28], target address, 00 }

• Understand where each part came from!

• Note: In Verilog, { , , } means concatenation { 4 bits , 26 bits , 2 bits } = 32 bit address

– { 1010, 11111111111111111111111111, 00 } = 10101111111111111111111111111100

– We use Verilog in this class

9

R-Format (1/2)• Define “fields” of the following number of bits each: 6 + 5 + 5

+ 5 + 5 + 6 = 32

6 5 5 5 65

opcode rs rt rd functshamt

• For simplicity, each field has a name:

10

R-Format (2/2)

• More fields:– rs (Source Register): generally used to specify register

containing first operand– rt (Target Register): generally used to specify register

containing second operand (note that name is misleading)– rd (Destination Register): generally used to specify register

which will receive result of computation

11

R-Format Example

• MIPS Instruction:add $8,$9,$10

0 9 10 8 320Binary number per field representation:

Decimal number per field representation:

hex representation: 012A 4020hex

decimal representation: 19,546,144ten

000000 01001 01010 01000 10000000000hex

On Green Card: Format in column 1, opcodes in column 3

12

MIPS I Operation Overview

• Arithmetic Logical:– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU

– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI

– SLL, SRL, SRA, SLLV, SRLV, SRAV

• Memory Access:– LB, LBU, LH, LHU, LW, LWL,LWR

– SB, SH, SW, SWL, SWR

13

MIPS logical instructionsInstruction Example Meaning Comment

and and $1,$2,$3 $1 = $2 & $3 3 reg. operands; Logical AND

or or $1,$2,$3 $1 = $2 | $3 3 reg. operands; Logical OR

xor xor $1,$2,$3 $1 = $2 ^$3 3 reg. operands; Logical XOR

nor nor $1,$2,$3 $1 = ~($2 |$3) 3 reg. operands; Logical NOR

and immediate andi $1,$2,10 $1 = $2 & 10 Logical AND reg, constant

or immediate ori $1,$2,10 $1 = $2 | 10 Logical OR reg, constant

xor immediate xori $1, $2,10 $1 = ~$2 &~10 Logical XOR reg, constant

shift left logical sll $1,$2,10 $1 = $2 << 10 Shift left by constant

shift right logical srl $1,$2,10 $1 = $2 >> 10 Shift right by constant

shift right arithm. sra $1,$2,10 $1 = $2 >> 10 Shift right (sign extend)

shift left logical sllv $1,$2,$3 $1 = $2 << $3 Shift left by variable

shift right logical srlv $1,$2, $3 $1 = $2 >> $3 Shift right by variable

shift right arithm. srav $1,$2, $3 $1 = $2 >> $3 Shift right arith. by variable

Q: Can some multiply by 2i ? Divide by 2i ? Invert?

14

M I P S Reference Data :CORE INSTRUCTION SET (1)

NAME MNE-MON-IC

FOR-MAT

OPERATION (in Verilog)

OPCODE/FUNCT (hex)

Add add R R[rd] = R[rs] + R[rt] (1)

0 / 20hex

Add Immediate

addi I R[rt] = R[rs] + SignExtImm (1)(2)

8hex

Branch On Equal

beq I if(R[rs]==R[rt]) PC=PC+4+ BranchAddr (4)

4hex

(1) May cause overflow exception(2) SignExtImm = { 16{immediate[15]}, immediate }(3) ZeroExtImm = { 16{1b’0}, immediate }(4) BranchAddr = { 14{immediate[15]}, immediate, 2’b0}

15

MIPS Data Transfer InstructionsInstruction Comment

sw 500($4), $3 Store word

sh 502($2), $3 Store half

sb 41($3), $2 Store byte

lw $1, 30($2) Load word

lh $1, 40($3) Load halfword

lhu $1, 40($3) Load halfword unsigned

lb $1, 40($3) Load byte

lbu $1, 40($3) Load byte unsigned

lui $1, 40 Load Upper Immediate (16 bits shifted left by 16)

Q: Why need lui?

0000 … 0000

LUI R5

R5

16

Multiply / Divide• Start multiply, divide

– MULT rs, rt

– MULTU rs, rt

– DIV rs, rt

– DIVU rs, rt

• Move result from multiply, divide– MFHI rd

– MFLO rd

• Move to HI or LO– MTHI rd

– MTLO rd

Registers

HI LO

17

MIPS Arithmetic InstructionsInstruction Example Meaning Comments

add add $1,$2,$3 $1 = $2 + $3 3 operands; exception possible

subtract sub $1,$2,$3 $1 = $2 – $3 3 operands; exception possible

add immediate addi $1,$2,100 $1 = $2 + 100 + constant; exception possible

add unsigned addu $1,$2,$3 $1 = $2 + $3 3 operands; no exceptions

subtract unsigned subu $1,$2,$3 $1 = $2 – $3 3 operands; no exceptions

add imm. unsign. addiu $1,$2,100 $1 = $2 + 100 + constant; no exceptions

multiply mult $2,$3 Hi, Lo = $2 x $3 64-bit signed product

multiply unsigned multu$2,$3 Hi, Lo = $2 x $3 64-bit unsigned product

divide div $2,$3 Lo = $2 ÷ $3, Lo = quotient, Hi = remainder

Hi = $2 mod $3

divide unsigned divu $2,$3 Lo = $2 ÷ $3, Unsigned quotient & remainder

Hi = $2 mod $3

Move from Hi mfhi $1 $1 = Hi Used to get copy of Hi

Move from Lo mflo $1 $1 = Lo Used to get copy of Lo

Q: Which add for address arithmetic? Which add for integers?

18

Green Card: ARITHMETIC CORE INSTRUCTION SET (2)

NAME MNE-MON-IC

FOR-MAT

OPERATION (in Verilog)

OPCODE/FMT / FT/ FUNCT

(hex)

Branch On FP True

bc1t FI if (FPcond) PC=PC + 4 +BranchAddr (4)

11/8/1/--

Load FP Single lwc1 I F[rt] = M[R[rs] + SignExtImm] (2)

11/8/1/--

Divide div R Lo=R[rs]/R[rt]; Hi=R[rs]%R[rt]

31/--/--/--

19

When does MIPS Sign Extend?• When value is sign extended, copy upper bit to full value:

Examples of sign extending 8 bits to 16 bits:

00001010 00000000 0000101010001100 11111111 10001100

• When is an immediate operand sign extended?– Arithmetic instructions (add, sub, etc.) always sign extend immediates even for the unsigned versions of the

instructions!

– Logical instructions do not sign extend immediates (They are zero extended)

– Load/Store address computations always sign extend immediates

• Multiply/Divide have no immediate operands however:– “unsigned” treat operands as unsigned

• The data loaded by the instructions lb and lh are extended as follows (“unsigned” don’t extend):– lbu, lhu are zero extended

– lb, lh are sign extended

Q: Then what is does add unsigned (addu) mean since not immediate?

20

MIPS Compare and Branch

• Compare and Branch– BEQ rs, rt, offset if R[rs] == R[rt] then PC-relative branch

– BNE rs, rt, offset <>

• Compare to zero and Branch– BLEZ rs, offset if R[rs] <= 0 then PC-relative branch

– BGTZ rs, offset >

– BLT <

– BGEZ >=

– BLTZAL rs, offset if R[rs] < 0 then branch and link (into R 31)

– BGEZAL >=!

• Remaining set of compare and branch ops take two instructions

• Almost all comparisons are against zero!

21

MIPS jump, branch, compare InstructionsInstruction Example Meaning

branch on equal beq $1,$2,100 if ($1 == $2) go to PC+4+100Equal test; PC relative branch

branch on not eq. bne $1,$2,100 if ($1!= $2) go to PC+4+100Not equal test; PC relative

set on less than slt $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; 2’s comp.

set less than imm. slti $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; 2’s comp.

set less than uns. sltu $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; natural numbers

set l. t. imm. uns. sltiu $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; natural numbers

jump j 10000 go to 10000Jump to target address

jump register jr $31 go to $31For switch, procedure return

jump and link jal 10000 $31 = PC + 4; go to 10000For procedure call

22

Signed vs. Unsigned Comparison

$1= 0…00 0000 0000 0000 0001

$2= 0…00 0000 0000 0000 0010

$3= 1…11 1111 1111 1111 1111

• After executing these instructions:

slt $4,$2,$1 ; if ($2 < $1) $4=1; else $4=0

slt $5,$3,$1 ; if ($3 < $1) $5=1; else $5=0

sltu $6,$2,$1 ; if ($2 < $1) $6=1; else $6=0

sltu $7,$3,$1 ; if ($3 < $1) $7=1; else $7=0

• What are values of registers $4 - $7? Why?

$4 = ; $5 = ; $6 = ; $7 = ;

23

Name Number Usage Preserved across a call?

$zero 0 the value 0 n/a $v0-$v1 2-3 return values no $a0-$a3 4-7 arguments no $t0-$t7 8-15 temporaries no $s0-$s7 16-23 saved yes $t18-$t19 24-25 temporaries no $sp 29 stack pointer yes $ra 31 return address yes

MIPS Assembler Register Convention

• “caller saved”

• “callee saved”

• On Green Card in Column #2 at bottom

24

What C code properly fills in the blank in loop on right?

1: A[i++] >= 10 2: A[i++] >= 10 | A[i] < 0 3: A[i] >= 10 || A[i++] < 0 4: A[i++] >= 10 || A[i] < 0 5: A[i] >= 10 && A[i++] < 0 6 None of the above

Peer Instruction: $s3=i, $s4=j, $s5=@A

do j = j + 1 while (______);

Loop: addiu $s4,$s4,1 # j = j + 1sll $t1,$s3,2 # $t1 = 4 * iaddu $t1,$t1,$s5 # $t1 = @ A[i]lw $t0,0($t1) # $t0 = A[i]addiu $s3,$s3,1 # i = i + 1slti $t1,$t0,10 # $t1 = $t0 < 10beq $t1,$0, Loop # goto Loopslti $t1,$t0, 0 # $t1 = $t0 < 0bne $t1,$0, Loop # goto Loop

26

Green Card: OPCODES, BASE CONVERSION, ASCII (3)

MIPS opcode (31:26)

(1) MIPS funct (5:0)

(2) MIPS funct (5:0)

Binary Deci-mal

Hexa-deci-mal

ASCII

(1) sll add.f 00 0000 0 0 NUL

j srl mul.f 00 0010 2 2 STX

lui sync floor.w.f 00 1111 15 f SI

lbu and cvt.w.f 10 0100 36 24 $

(1) opcode(31:26) == 0(2) opcode(31:26) == 17 ten (11 hex );

if fmt(25:21)==16 ten (10 hex ) f = s (single);if fmt(25:21)==17 ten (11 hex ) f = d (double)

Note: 3-in-1 - Opcodes, base conversion, ASCII!

27

Green Card

• green card /n./ [after the "IBM System/360 Reference Data" card] A summary of an assembly language, even if the color is not green. For example,

"I'll go get my green card so I can check the addressing mode for that instruction."

www.jargon.net

Image from Dave's Green Card Collection: http://www.planetmvs.com/greencard/

28

Peer InstructionWhich instruction has same representation as 35ten?

A. add $0, $0, $0

B. subu $s0,$s0,$s0

C. lw $0, 0($0)

D. addi $0, $0, 35

E. subu $0, $0, $0

F. Trick question! Instructions are not numbers

Registers numbers and names: 0: $0, 8: $t0, 9:$t1, ..15: $t7, 16: $s0, 17: $s1, .. 23: $s7

Opcodes and function fields (if necessary)

add: opcode = 0, funct = 32subu: opcode = 0, funct = 35addi: opcode = 8lw: opcode = 35

opcode rs rt offset

rd functshamtopcode rs rt

opcode rs rt immediate



30

Branch & Pipelines

execute

Branch

Delay Slot

Branch Target

By the end of Branch instruction, the CPU knows whether or not the branch will take place.

However, it will have fetched the next instruction by then, regardless of whether or not a branch will be taken.

Why not execute it?

ifetch execute

ifetch execute

ifetch execute

LL: slt $1, $3, $5

li $3, #7

sub $4, $4, 1

bz $4, LL

addi $5, $3, 1

Time

ifetch execute

31

Delayed Branches

• In the “Raw” MIPS, the instruction after the branch is executed even when the branch is taken

– This is hidden by the assembler for the MIPS “virtual machine”

– allows the compiler to better utilize the instruction pipeline (???)

• Jump and link (jal inst):– Put the return addr. Into link register ($31):

• PC+4 (logical architecture)

• PC+8 physical (“Raw”) architecture delay slot executed

– Then jump to destination address

li $3, #7

sub $4, $4, 1

bz $4, LL

addi $5, $3, 1

subi $6, $6, 2

LL: slt $1, $3, $5

Delay Slot Instruction

32

Filling Delayed Branches

Inst Fetch Dcd & Op Fetch ExecuteBranch:

Inst Fetch Dcd & Op Fetch

Inst Fetch

Executeexecute successoreven if branch taken!

Then branch targetor continue Single delay slot

impacts the critical path

•Compiler can fill a single delay slot with a useful instruction 50% of the time.

• try to move down from above jump

•move up from target, if safe

add $3, $1, $2

sub $4, $4, 1

bz $4, LL

NOP

...

LL: add rd, ...

Is this violating the ISA abstraction?

33

Summary: Salient features of MIPS I

• 32-bit fixed format inst (3 formats)

• 32 32-bit GPR (R0 contains zero) and 32 FP registers (and HI LO)

– partitioned by software convention

• 3-address, reg-reg arithmetic instr.

• Single address mode for load/store: base+displacement– no indirection, scaled

• 16-bit immediate plus LUI

• Simple branch conditions

– compare against zero or two registers for =,– no integer condition codes

• Delayed branch

– execute instruction after a branch (or jump) even if the branch is taken (Compiler can fill a delayed branch with useful work about 50% of the time)

34

And in Conclusion...

• Continued rapid improvement in Computing– 2X every 1.5 years in processor speed;

every 2.0 years in memory size; every 1.0 year in disk capacity; Moore’s Law enables processor, memory (2X transistors/chip/ ~1.5 ro 2.0 yrs)

• 5 classic components of all computers Control Datapath Memory Input Output}

Processor

35

MIPS Machine Instruction Review: Instruction Format Summary

36

Addressing Modes Summary

• Register addressing– Operand is a register (e.g. ALU)

• Base/displacement addressing (ex. load/store)– Operand is at the memory location that is the sum of

– a base register + a constant

• Immediate addressing (e.g. constants)– Operand is a constant within the instruction itself

• PC-relative addressing (e.g. branch)– Address is the sum of PC and constant in instruction (e.g.

branch)

• Pseudo-direct addressing (e.g. jump)– Target address is concatenation of field in instruction and the

PC

37

Addressing Modes Summary

38

Break

39

Computer Performance Metrics

• Response Time (latency)– How long does it take for my job to run?

– How long does it take to execute a job?

– How long must I wait for the database query?

• Throughput– How many jobs can the machine run at once?

– What is the average execution rate?

– How many queries per minute?

• If we upgrade a machine with a new processor what to we increase?

• If we add a new machine to the lab what do we increase?

40

Performance = Execution Time

• Elapsed Time– Counts everything (disk and memory accesses, I/O, etc.)

– A useful number, but often not good for comparison purposes• E.g., OS & multiprogramming time make it difficult to compare CPUs

• CPU time (CPU = Central Processing Unit = processor)– Doesn’t count I/O or time spent running other programs

– Can be broken up into system time, and user time

• Our focus: user CPU time– Time spent executing the lines of code that are “in” our

program

– Includes arithmetic, memory, and control instructions, …

41

Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles

• Clock “ticks” indicate when to start activities

• Cycle time = time between ticks = seconds per cycle• Clock rate (frequency) = cycles per second (1Hz. =

1cycle/sec)

– A 2Ghz clock has a 500 picoseconds (ps) cycle time.

42

Performance and Execution Time

• The program should be something real people care about– Desktop: MS office, edit, compile– Server: web, e-commerce, database– Scientific: physics, weather forecasting

43

Measuring Clock Cycles

• Clock cycles/program is not an intuitive or easily determined value, so

– Clock Cycles = Instructions x Clock Cycles Per Instruction

• Cycles Per Instruction (CPI) used often

• CPI is an average since the number of cycles per instruction varies from instruction to instruction

– Average depends on instruction mix, latency of each inst. Type etc.

• CPIs can be used to compare two implementations of the same ISA, but is not useful alone for comparing different ISAs

– An X86 add is different from a MIPS add

44

Using CPI

• Drawing on the previous equation:

• To improve performance (i.e. reduce execution time)– Increase clock rate (decrease clock cycle time) OR

– Decrease CPI OR

– Reduce the number of instructions

• Designers balance cycle time against the number of cycles required– Improving one factor may make the other one worse

ExecutionTime Instructions CPI ClockCycleTime

Instructions CPIExecutionTime

ClockRate

45

Clock Rate ≠ Performance

• Mobile Intel Pentium 4 Vs Intel Pentium M– 2.4GHz 1.6GHz

– P4 is 50% faster?

• Performance on Mobilemark with same memory and disk– Word, excel, photoshop, powerpoint, etc.

– Mobile Pentium 4 is only 15% faster

• What is the relative CPI?– ExecTime=ICxCPI/ClockRate

– ICxCPIM/1.6 = 1.15xICxCPI4/2.4

– CPI4/CPIM=2.4/(1.15x1.6)=1.3

46

CPI Calculation

• Different instruction types require different numbers of cycles

• CPI is often reported for types of instructions

• where CPIi is the CPI for the type of instructions

and ICi is the count of that type of instruction

1

( )n

i ii

ClockCylces CPI IC

47

CPI Calculation

• To compute the overall average CPI use

1

ni

ii

InstructionCountClockCylces CPI

InstructionCount

48

Computing CPI Example

• Given this machine, the CPI is the sum of CPI x Frequency

• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5

• What fraction of the time for data transfer?

49

What is the Impact of Displacement Based Memory Addressing Mode?

• Assume 50% of MIPS loads and stores have a zero displacement.

50

Speedup

• Speedup allows us to compare different CPUs or optimizations

• Example– Original CPU takes 2sec to run a program

– New CPU takes 1.5sec to run a program

– Speedup = 1.333 or speedup or 33%

CPUtimeOldSpeedup

CPUtimeNew

51

Amdahl’s Law

• If an optimization improves a fraction f of execution time by a factor of a

• This formula is known as Amdahl’s Law

• Lessons from– If f->100%, then speedup = a

– If a->∞, the speedup = 1/(1-f)

• Summary– Make the common case fast

– Watch out for the non-optimized component

1

[(1 ) / ]* (1 ) /

ToldSpeedup

f f a Told f f a

52

Evaluating Performance

• Performance best determined by running a real application– Use programs typical of expected workload

• e.g. compiler/editors, scientific applications, graphics, etc.

• Microbenchmarks– Small programs, synthetic or kernels from larger applications

– Nice for architects and designers

– Can be misleading

• Benchmarks– Collection of real programs that companies have agreed on

– Components: programs, inputs & outputs, measurements rules, metrics

– Can still be abused

55

Benchmarks

• System Performance Evaluation Cooperative (SPEC)

• Scientific computing: Linpack, SpecOMP, SpecHPC,…

• Embedded benchmarks: EEMBC, Dhrystone, …

• Enterprise computing

– TPC-C, TPC-W, TPC-H

– SpecJbb, SpecSFS, SpecMail, Streams,

• Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)

• Other

– 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …

• Watch out: your results will be as good as your benchmarks

– Make sure you know what the benchmark is designed to measure

– Performance is not the only metric for computing systems• Cost, power consumption, reliability, real-time performance, …

56

Summarizing Performance

• Combining results from multiple programs into 1 benchmark score

– Sometimes misleading, always controversial, … and inevitable

– We all like quoting a single number

• 3 types of means– Arithmetic: for times

– Harmonic: for rates

– Geometric: for rations

1 n

i ii i

AM Weight Timen

1

ni

i i

nHM

Weight

Rate

1

1

n n

ii

GM Ratio

57

Using the Means

• Arithmetic mean:– When you have individual performance scores in latency

• Harmonic mean:– When you have individual performance scores in throughput

• Geometric mean:– Nice property: GM(X/Y)= GM(X)/GM(Y)

– But difficult to related back to execution times

• Note– Always look at the results for individual programs

58

Performance Summary

• Performance is specific to a particular programs– Total execution time is a consistent summary of performance

• For a given architecture performance increases come from:

– Increase in clock rate (without adverse CPI effects)

– Improvements in processor organization that lower CPI

– Compiler enhancements that lower CPI and/or instruction count

– Algorithm/Language choices that affect instruction count

• Pitfall: expecting improvement in one aspect of a machine’s performance to affect the total performance

59

Break

60

Translation Hierarchy

• High-level->Assembly->Machine

61

CompilerConverts High-level Language to Machine Code

62

Intermediate Representation (IR)

• Three-address code (TAC)

j= 2 * I + 1;

If (j >= n)

j = 2*I + 3;

return a[j];

t1= 2 * I

t2 = t1 + 1

j = t2

t3 = j < n

If t3 goto L0

t4 = 2 * I

t5 = t4 + 3

j = t5

L0: t6 = a[j]

return t6

Single assignment

63

Optimizing Compilers

• Provide efficient mapping of program to machine– Code selection and ordering– Eliminating minor inefficiencies– Register allocation

• Don’t (usually) improve asymptotic efficiency– Up to programmer to select best overall algorithm– Big-O savings are (often) more important than constant

factors• But constant factors also matter

• Optimization types– Local: inside basic blocks– Global: across basic blocks e.g. loops

64

Limitations of Optimizing Compilers

• Operate Under Fundamental Constraints– Improved code has same output– Improved code has same side effects– Improved code is as correct as original code

• Most analysis is performed only within procedures– Whole-program analysis is too expensive in most cases

• Most analysis is based only on static information– Compiler has difficulty anticipating run-time inputs– Profile driven dynamic compilation becoming more prevalent– Dynamic compilation (e.g. Java JIT)

• When in doubt, the compiler must be conservative

65

Preview: Code Optimization

• Goal: Improve performance by:– Removing redundant work

• Unreachable code

• Common-subexpression elimination

• Induction variable elimination

– Using simpler and faster operations• Dealing with constants in the compiler

• Strength reduction

– Managing register well

ExecutionTime = Instructions x CPI x ClockCycleTime

66

Constant Folding

• Detect & combine values that will be constant.

• Respect laws of arithmetic on target machine.

a = 10 * 5 + 6 – b;

67

Constant Propagation

• When x gets assigned a constant c, replace users of x with uses of c, as long as x remains unchanged.

• Very useful for turning constants into immediates, for code generator.

68

Constant Propagation

69

Reduction in Strength

• Some operations are faster (lower CPI) than others:– Multiplication slower than addition

– Mult or Divide by 2n slower than shift left or right

– Multiplies take 12 cycle on MIPS R2000

– ALU instruction take 1 cycle

• Can replace complex instruction with equivalent simple instructions

• Correctness:– Overflow behavior preserved?

– Other side-effects of operation? (interrupts)

70

Reduction in Strength

• How to replace multiplication with addition?

• Often presents in for loops & array walks. while (i<100) arr[i++] = 0;

71

Common Sub-Expression Elimination

72

Common Sub-Expression Elimination

73

Induction Variable Elimination

while (i<100) arr [i++] = 0;

• Remember replacing multiply with add in loop?

• Why not get rid of induction variable I altogether?

• Induction variable is a variable that is changed by a constant amount on every iteration of a loop.

74

Register Allocation & Assignment

• Register provide fastest data access in the processor.

• Two terms:– What variables belong in register? (allocation)

– What register will hold this value? (assignment)

• Once a value is in a register, we’d like not to spill & restore it to use same value again.

75

Assembly Code Generation

76

Compiler Performance on Bubble Sort

77

Assembler

• Expands macros and pseudo instructions as well as converts constants.

• Primary purpose is to produce an object file– Machine language instructions

– Application data

– Information for memory organization

78

Object File

79

Linker

• Linker combines multiple object modules– Identify where code/data will be placed in memory– Resolve code/data cross references– Produces executable if all references found

• Steps– Place code and data modules in memory– Determine the address of data and instruction labels– Patch both the internal and external references

• Separation between compiler and linker makes standard libraries and efficient solution to maintaining modular code.

80

Loader

• Loader used at run-time– Reads executable file header for size of text/data segments

– Create address space sufficiently large

– Copy program from executable on disk into memory

– Copy arguments to main program’s stack

– Initialize machine registers and set stack pointer

– Jump to start-up routine

– Terminate program when execution completes

81

SPIM System Calls• SPIM provides a small number of OS “system calls”

• Set system call code in $v0 and pass arguments as needed

• See web site for details and examples.

Service Code Argument Result

print_int 1 $a0 = integer

print_float 2 $f12 = float

print_double 3 $f12 = double

print_string 4 $a0 = string

read_int 5 Integer in $v0

read_float 6 Float if $f0

read_double 7 Double in $f0

read_string 8 $a0 = buffer, $a1 = length Data in buffer

Sbrk 9 $a0 = amount Address in $v0

Exit 10

82

MIPS Assembler Directives

• SPIM supports a subset of the MIPS assembler directives

• Some of the directives include– .asciiz – Store a null-terminated string in memory

– .data –Start of data segment

– .global –Identify an exported symbol

– .text –Start of text segment

– .word –Store words in memory

83

HomeWork

• HW1– http://mypage.zju.edu.cn/liupeng

84

Acknowledgements

• These slides contain material from courses: – UCB CS152

– Stanford EE108B

1 lecture 3 mips isa and performance peng liu [email protected]

Documents