1 lecture 3 mips isa and performance peng liu [email protected]
TRANSCRIPT
2
MIPS Instruction Encoding
• MIPS instructions are encoded in different forms, depending upon the arguments
– R-format, I-format, J-format
• MIPS architecture has three instruction formats, all 32 bits in length
– Regularity is simpler and improves performance
• A 6 bit opcode appears at the beginning of each instruction
– Control logic based on decode instruction type
3
Instruction Formats
• I-format: used for instructions with immediates, lw and sw (since the offset counts as an immediate), and the branches (beq and bne),
– (but not the shift instructions; later)
• J-format: used for j and jal
• R-format: used for all other instructions
• It will soon become clear why the instructions have been partitioned in this way.
4
I-Format
5
I-Format Example
6
I-Format Example: Load/Store
7
J-Format (1/2)
• Define “fields” of the following number of bits each:
6 bits 26 bits
opcode target address
• As usual, each field has a name:
• Key Concepts– Keep opcode field identical to R-format and I-format for
consistency.
– Combine all other fields to make room for large target address.
8
J-Format (2/2)
• Summary:– New PC = { PC[31..28], target address, 00 }
• Understand where each part came from!
• Note: In Verilog, { , , } means concatenation { 4 bits , 26 bits , 2 bits } = 32 bit address
– { 1010, 11111111111111111111111111, 00 } = 10101111111111111111111111111100
– We use Verilog in this class
9
R-Format (1/2)• Define “fields” of the following number of bits each: 6 + 5 + 5
+ 5 + 5 + 6 = 32
6 5 5 5 65
opcode rs rt rd functshamt
• For simplicity, each field has a name:
10
R-Format (2/2)
• More fields:– rs (Source Register): generally used to specify register
containing first operand– rt (Target Register): generally used to specify register
containing second operand (note that name is misleading)– rd (Destination Register): generally used to specify register
which will receive result of computation
11
R-Format Example
• MIPS Instruction:add $8,$9,$10
0 9 10 8 320Binary number per field representation:
Decimal number per field representation:
hex representation: 012A 4020hex
decimal representation: 19,546,144ten
000000 01001 01010 01000 10000000000hex
On Green Card: Format in column 1, opcodes in column 3
12
MIPS I Operation Overview
• Arithmetic Logical:– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
– SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory Access:– LB, LBU, LH, LHU, LW, LWL,LWR
– SB, SH, SW, SWL, SWR
13
MIPS logical instructionsInstruction Example Meaning Comment
and and $1,$2,$3 $1 = $2 & $3 3 reg. operands; Logical AND
or or $1,$2,$3 $1 = $2 | $3 3 reg. operands; Logical OR
xor xor $1,$2,$3 $1 = $2 ^$3 3 reg. operands; Logical XOR
nor nor $1,$2,$3 $1 = ~($2 |$3) 3 reg. operands; Logical NOR
and immediate andi $1,$2,10 $1 = $2 & 10 Logical AND reg, constant
or immediate ori $1,$2,10 $1 = $2 | 10 Logical OR reg, constant
xor immediate xori $1, $2,10 $1 = ~$2 &~10 Logical XOR reg, constant
shift left logical sll $1,$2,10 $1 = $2 << 10 Shift left by constant
shift right logical srl $1,$2,10 $1 = $2 >> 10 Shift right by constant
shift right arithm. sra $1,$2,10 $1 = $2 >> 10 Shift right (sign extend)
shift left logical sllv $1,$2,$3 $1 = $2 << $3 Shift left by variable
shift right logical srlv $1,$2, $3 $1 = $2 >> $3 Shift right by variable
shift right arithm. srav $1,$2, $3 $1 = $2 >> $3 Shift right arith. by variable
Q: Can some multiply by 2i ? Divide by 2i ? Invert?
14
M I P S Reference Data :CORE INSTRUCTION SET (1)
NAME MNE-MON-IC
FOR-MAT
OPERATION (in Verilog)
OPCODE/FUNCT (hex)
Add add R R[rd] = R[rs] + R[rt] (1)
0 / 20hex
Add Immediate
addi I R[rt] = R[rs] + SignExtImm (1)(2)
8hex
Branch On Equal
beq I if(R[rs]==R[rt]) PC=PC+4+ BranchAddr (4)
4hex
(1) May cause overflow exception(2) SignExtImm = { 16{immediate[15]}, immediate }(3) ZeroExtImm = { 16{1b’0}, immediate }(4) BranchAddr = { 14{immediate[15]}, immediate, 2’b0}
15
MIPS Data Transfer InstructionsInstruction Comment
sw 500($4), $3 Store word
sh 502($2), $3 Store half
sb 41($3), $2 Store byte
lw $1, 30($2) Load word
lh $1, 40($3) Load halfword
lhu $1, 40($3) Load halfword unsigned
lb $1, 40($3) Load byte
lbu $1, 40($3) Load byte unsigned
lui $1, 40 Load Upper Immediate (16 bits shifted left by 16)
Q: Why need lui?
0000 … 0000
LUI R5
R5
16
Multiply / Divide• Start multiply, divide
– MULT rs, rt
– MULTU rs, rt
– DIV rs, rt
– DIVU rs, rt
• Move result from multiply, divide– MFHI rd
– MFLO rd
• Move to HI or LO– MTHI rd
– MTLO rd
Registers
HI LO
17
MIPS Arithmetic InstructionsInstruction Example Meaning Comments
add add $1,$2,$3 $1 = $2 + $3 3 operands; exception possible
subtract sub $1,$2,$3 $1 = $2 – $3 3 operands; exception possible
add immediate addi $1,$2,100 $1 = $2 + 100 + constant; exception possible
add unsigned addu $1,$2,$3 $1 = $2 + $3 3 operands; no exceptions
subtract unsigned subu $1,$2,$3 $1 = $2 – $3 3 operands; no exceptions
add imm. unsign. addiu $1,$2,100 $1 = $2 + 100 + constant; no exceptions
multiply mult $2,$3 Hi, Lo = $2 x $3 64-bit signed product
multiply unsigned multu$2,$3 Hi, Lo = $2 x $3 64-bit unsigned product
divide div $2,$3 Lo = $2 ÷ $3, Lo = quotient, Hi = remainder
Hi = $2 mod $3
divide unsigned divu $2,$3 Lo = $2 ÷ $3, Unsigned quotient & remainder
Hi = $2 mod $3
Move from Hi mfhi $1 $1 = Hi Used to get copy of Hi
Move from Lo mflo $1 $1 = Lo Used to get copy of Lo
Q: Which add for address arithmetic? Which add for integers?
18
Green Card: ARITHMETIC CORE INSTRUCTION SET (2)
NAME MNE-MON-IC
FOR-MAT
OPERATION (in Verilog)
OPCODE/FMT / FT/ FUNCT
(hex)
Branch On FP True
bc1t FI if (FPcond) PC=PC + 4 +BranchAddr (4)
11/8/1/--
Load FP Single lwc1 I F[rt] = M[R[rs] + SignExtImm] (2)
11/8/1/--
Divide div R Lo=R[rs]/R[rt]; Hi=R[rs]%R[rt]
31/--/--/--
19
When does MIPS Sign Extend?• When value is sign extended, copy upper bit to full value:
Examples of sign extending 8 bits to 16 bits:
00001010 00000000 0000101010001100 11111111 10001100
• When is an immediate operand sign extended?– Arithmetic instructions (add, sub, etc.) always sign extend immediates even for the unsigned versions of the
instructions!
– Logical instructions do not sign extend immediates (They are zero extended)
– Load/Store address computations always sign extend immediates
• Multiply/Divide have no immediate operands however:– “unsigned” treat operands as unsigned
• The data loaded by the instructions lb and lh are extended as follows (“unsigned” don’t extend):– lbu, lhu are zero extended
– lb, lh are sign extended
Q: Then what is does add unsigned (addu) mean since not immediate?
20
MIPS Compare and Branch
• Compare and Branch– BEQ rs, rt, offset if R[rs] == R[rt] then PC-relative branch
– BNE rs, rt, offset <>
• Compare to zero and Branch– BLEZ rs, offset if R[rs] <= 0 then PC-relative branch
– BGTZ rs, offset >
– BLT <
– BGEZ >=
– BLTZAL rs, offset if R[rs] < 0 then branch and link (into R 31)
– BGEZAL >=!
• Remaining set of compare and branch ops take two instructions
• Almost all comparisons are against zero!
21
MIPS jump, branch, compare InstructionsInstruction Example Meaning
branch on equal beq $1,$2,100 if ($1 == $2) go to PC+4+100Equal test; PC relative branch
branch on not eq. bne $1,$2,100 if ($1!= $2) go to PC+4+100Not equal test; PC relative
set on less than slt $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; 2’s comp.
set less than imm. slti $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; 2’s comp.
set less than uns. sltu $1,$2,$3 if ($2 < $3) $1=1; else $1=0 Compare less than; natural numbers
set l. t. imm. uns. sltiu $1,$2,100 if ($2 < 100) $1=1; else $1=0 Compare < constant; natural numbers
jump j 10000 go to 10000Jump to target address
jump register jr $31 go to $31For switch, procedure return
jump and link jal 10000 $31 = PC + 4; go to 10000For procedure call
22
Signed vs. Unsigned Comparison
$1= 0…00 0000 0000 0000 0001
$2= 0…00 0000 0000 0000 0010
$3= 1…11 1111 1111 1111 1111
• After executing these instructions:
slt $4,$2,$1 ; if ($2 < $1) $4=1; else $4=0
slt $5,$3,$1 ; if ($3 < $1) $5=1; else $5=0
sltu $6,$2,$1 ; if ($2 < $1) $6=1; else $6=0
sltu $7,$3,$1 ; if ($3 < $1) $7=1; else $7=0
• What are values of registers $4 - $7? Why?
$4 = ; $5 = ; $6 = ; $7 = ;
23
Name Number Usage Preserved across a call?
$zero 0 the value 0 n/a $v0-$v1 2-3 return values no $a0-$a3 4-7 arguments no $t0-$t7 8-15 temporaries no $s0-$s7 16-23 saved yes $t18-$t19 24-25 temporaries no $sp 29 stack pointer yes $ra 31 return address yes
MIPS Assembler Register Convention
• “caller saved”
• “callee saved”
• On Green Card in Column #2 at bottom
24
What C code properly fills in the blank in loop on right?
1: A[i++] >= 10 2: A[i++] >= 10 | A[i] < 0 3: A[i] >= 10 || A[i++] < 0 4: A[i++] >= 10 || A[i] < 0 5: A[i] >= 10 && A[i++] < 0 6 None of the above
Peer Instruction: $s3=i, $s4=j, $s5=@A
do j = j + 1 while (______);
Loop: addiu $s4,$s4,1 # j = j + 1sll $t1,$s3,2 # $t1 = 4 * iaddu $t1,$t1,$s5 # $t1 = @ A[i]lw $t0,0($t1) # $t0 = A[i]addiu $s3,$s3,1 # i = i + 1slti $t1,$t0,10 # $t1 = $t0 < 10beq $t1,$0, Loop # goto Loopslti $t1,$t0, 0 # $t1 = $t0 < 0bne $t1,$0, Loop # goto Loop
26
Green Card: OPCODES, BASE CONVERSION, ASCII (3)
MIPS opcode (31:26)
(1) MIPS funct (5:0)
(2) MIPS funct (5:0)
Binary Deci-mal
Hexa-deci-mal
ASCII
(1) sll add.f 00 0000 0 0 NUL
j srl mul.f 00 0010 2 2 STX
lui sync floor.w.f 00 1111 15 f SI
lbu and cvt.w.f 10 0100 36 24 $
(1) opcode(31:26) == 0(2) opcode(31:26) == 17 ten (11 hex );
if fmt(25:21)==16 ten (10 hex ) f = s (single);if fmt(25:21)==17 ten (11 hex ) f = d (double)
Note: 3-in-1 - Opcodes, base conversion, ASCII!
27
Green Card
• green card /n./ [after the "IBM System/360 Reference Data" card] A summary of an assembly language, even if the color is not green. For example,
"I'll go get my green card so I can check the addressing mode for that instruction."
www.jargon.net
Image from Dave's Green Card Collection: http://www.planetmvs.com/greencard/
28
Peer InstructionWhich instruction has same representation as 35ten?
A. add $0, $0, $0
B. subu $s0,$s0,$s0
C. lw $0, 0($0)
D. addi $0, $0, 35
E. subu $0, $0, $0
F. Trick question! Instructions are not numbers
Registers numbers and names: 0: $0, 8: $t0, 9:$t1, ..15: $t7, 16: $s0, 17: $s1, .. 23: $s7
Opcodes and function fields (if necessary)
add: opcode = 0, funct = 32subu: opcode = 0, funct = 35addi: opcode = 8lw: opcode = 35
opcode rs rt offset
rd functshamtopcode rs rt
opcode rs rt immediate
rd functshamtopcode rs rt
rd functshamtopcode rs rt
30
Branch & Pipelines
execute
Branch
Delay Slot
Branch Target
By the end of Branch instruction, the CPU knows whether or not the branch will take place.
However, it will have fetched the next instruction by then, regardless of whether or not a branch will be taken.
Why not execute it?
ifetch execute
ifetch execute
ifetch execute
LL: slt $1, $3, $5
li $3, #7
sub $4, $4, 1
bz $4, LL
addi $5, $3, 1
Time
ifetch execute
31
Delayed Branches
• In the “Raw” MIPS, the instruction after the branch is executed even when the branch is taken
– This is hidden by the assembler for the MIPS “virtual machine”
– allows the compiler to better utilize the instruction pipeline (???)
• Jump and link (jal inst):– Put the return addr. Into link register ($31):
• PC+4 (logical architecture)
• PC+8 physical (“Raw”) architecture delay slot executed
– Then jump to destination address
li $3, #7
sub $4, $4, 1
bz $4, LL
addi $5, $3, 1
subi $6, $6, 2
LL: slt $1, $3, $5
Delay Slot Instruction
32
Filling Delayed Branches
Inst Fetch Dcd & Op Fetch ExecuteBranch:
Inst Fetch Dcd & Op Fetch
Inst Fetch
Executeexecute successoreven if branch taken!
Then branch targetor continue Single delay slot
impacts the critical path
•Compiler can fill a single delay slot with a useful instruction 50% of the time.
• try to move down from above jump
•move up from target, if safe
add $3, $1, $2
sub $4, $4, 1
bz $4, LL
NOP
...
LL: add rd, ...
Is this violating the ISA abstraction?
33
Summary: Salient features of MIPS I
• 32-bit fixed format inst (3 formats)
• 32 32-bit GPR (R0 contains zero) and 32 FP registers (and HI LO)
– partitioned by software convention
• 3-address, reg-reg arithmetic instr.
• Single address mode for load/store: base+displacement– no indirection, scaled
• 16-bit immediate plus LUI
• Simple branch conditions
– compare against zero or two registers for =,– no integer condition codes
• Delayed branch
– execute instruction after a branch (or jump) even if the branch is taken (Compiler can fill a delayed branch with useful work about 50% of the time)
34
And in Conclusion...
• Continued rapid improvement in Computing– 2X every 1.5 years in processor speed;
every 2.0 years in memory size; every 1.0 year in disk capacity; Moore’s Law enables processor, memory (2X transistors/chip/ ~1.5 ro 2.0 yrs)
• 5 classic components of all computers Control Datapath Memory Input Output}
Processor
35
MIPS Machine Instruction Review: Instruction Format Summary
36
Addressing Modes Summary
• Register addressing– Operand is a register (e.g. ALU)
• Base/displacement addressing (ex. load/store)– Operand is at the memory location that is the sum of
– a base register + a constant
• Immediate addressing (e.g. constants)– Operand is a constant within the instruction itself
• PC-relative addressing (e.g. branch)– Address is the sum of PC and constant in instruction (e.g.
branch)
• Pseudo-direct addressing (e.g. jump)– Target address is concatenation of field in instruction and the
PC
37
Addressing Modes Summary
38
Break
39
Computer Performance Metrics
• Response Time (latency)– How long does it take for my job to run?
– How long does it take to execute a job?
– How long must I wait for the database query?
• Throughput– How many jobs can the machine run at once?
– What is the average execution rate?
– How many queries per minute?
• If we upgrade a machine with a new processor what to we increase?
• If we add a new machine to the lab what do we increase?
40
Performance = Execution Time
• Elapsed Time– Counts everything (disk and memory accesses, I/O, etc.)
– A useful number, but often not good for comparison purposes• E.g., OS & multiprogramming time make it difficult to compare CPUs
• CPU time (CPU = Central Processing Unit = processor)– Doesn’t count I/O or time spent running other programs
– Can be broken up into system time, and user time
• Our focus: user CPU time– Time spent executing the lines of code that are “in” our
program
– Includes arithmetic, memory, and control instructions, …
41
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles
• Clock “ticks” indicate when to start activities
• Cycle time = time between ticks = seconds per cycle• Clock rate (frequency) = cycles per second (1Hz. =
1cycle/sec)
– A 2Ghz clock has a 500 picoseconds (ps) cycle time.
42
Performance and Execution Time
• The program should be something real people care about– Desktop: MS office, edit, compile– Server: web, e-commerce, database– Scientific: physics, weather forecasting
43
Measuring Clock Cycles
• Clock cycles/program is not an intuitive or easily determined value, so
– Clock Cycles = Instructions x Clock Cycles Per Instruction
• Cycles Per Instruction (CPI) used often
• CPI is an average since the number of cycles per instruction varies from instruction to instruction
– Average depends on instruction mix, latency of each inst. Type etc.
• CPIs can be used to compare two implementations of the same ISA, but is not useful alone for comparing different ISAs
– An X86 add is different from a MIPS add
44
Using CPI
• Drawing on the previous equation:
• To improve performance (i.e. reduce execution time)– Increase clock rate (decrease clock cycle time) OR
– Decrease CPI OR
– Reduce the number of instructions
• Designers balance cycle time against the number of cycles required– Improving one factor may make the other one worse
ExecutionTime Instructions CPI ClockCycleTime
Instructions CPIExecutionTime
ClockRate
45
Clock Rate ≠ Performance
• Mobile Intel Pentium 4 Vs Intel Pentium M– 2.4GHz 1.6GHz
– P4 is 50% faster?
• Performance on Mobilemark with same memory and disk– Word, excel, photoshop, powerpoint, etc.
– Mobile Pentium 4 is only 15% faster
• What is the relative CPI?– ExecTime=ICxCPI/ClockRate
– ICxCPIM/1.6 = 1.15xICxCPI4/2.4
– CPI4/CPIM=2.4/(1.15x1.6)=1.3
46
CPI Calculation
• Different instruction types require different numbers of cycles
• CPI is often reported for types of instructions
• where CPIi is the CPI for the type of instructions
and ICi is the count of that type of instruction
1
( )n
i ii
ClockCylces CPI IC
47
CPI Calculation
• To compute the overall average CPI use
1
ni
ii
InstructionCountClockCylces CPI
InstructionCount
48
Computing CPI Example
• Given this machine, the CPI is the sum of CPI x Frequency
• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5
• What fraction of the time for data transfer?
49
What is the Impact of Displacement Based Memory Addressing Mode?
• Assume 50% of MIPS loads and stores have a zero displacement.
50
Speedup
• Speedup allows us to compare different CPUs or optimizations
• Example– Original CPU takes 2sec to run a program
– New CPU takes 1.5sec to run a program
– Speedup = 1.333 or speedup or 33%
CPUtimeOldSpeedup
CPUtimeNew
51
Amdahl’s Law
• If an optimization improves a fraction f of execution time by a factor of a
• This formula is known as Amdahl’s Law
• Lessons from– If f->100%, then speedup = a
– If a->∞, the speedup = 1/(1-f)
• Summary– Make the common case fast
– Watch out for the non-optimized component
1
[(1 ) / ]* (1 ) /
ToldSpeedup
f f a Told f f a
52
Evaluating Performance
• Performance best determined by running a real application– Use programs typical of expected workload
• e.g. compiler/editors, scientific applications, graphics, etc.
• Microbenchmarks– Small programs, synthetic or kernels from larger applications
– Nice for architects and designers
– Can be misleading
• Benchmarks– Collection of real programs that companies have agreed on
– Components: programs, inputs & outputs, measurements rules, metrics
– Can still be abused
55
Benchmarks
• System Performance Evaluation Cooperative (SPEC)
• Scientific computing: Linpack, SpecOMP, SpecHPC,…
• Embedded benchmarks: EEMBC, Dhrystone, …
• Enterprise computing
– TPC-C, TPC-W, TPC-H
– SpecJbb, SpecSFS, SpecMail, Streams,
• Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)
• Other
– 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …
• Watch out: your results will be as good as your benchmarks
– Make sure you know what the benchmark is designed to measure
– Performance is not the only metric for computing systems• Cost, power consumption, reliability, real-time performance, …
56
Summarizing Performance
• Combining results from multiple programs into 1 benchmark score
– Sometimes misleading, always controversial, … and inevitable
– We all like quoting a single number
• 3 types of means– Arithmetic: for times
– Harmonic: for rates
– Geometric: for rations
1 n
i ii i
AM Weight Timen
1
ni
i i
nHM
Weight
Rate
1
1
n n
ii
GM Ratio
57
Using the Means
• Arithmetic mean:– When you have individual performance scores in latency
• Harmonic mean:– When you have individual performance scores in throughput
• Geometric mean:– Nice property: GM(X/Y)= GM(X)/GM(Y)
– But difficult to related back to execution times
• Note– Always look at the results for individual programs
58
Performance Summary
• Performance is specific to a particular programs– Total execution time is a consistent summary of performance
• For a given architecture performance increases come from:
– Increase in clock rate (without adverse CPI effects)
– Improvements in processor organization that lower CPI
– Compiler enhancements that lower CPI and/or instruction count
– Algorithm/Language choices that affect instruction count
• Pitfall: expecting improvement in one aspect of a machine’s performance to affect the total performance
59
Break
60
Translation Hierarchy
• High-level->Assembly->Machine
61
CompilerConverts High-level Language to Machine Code
62
Intermediate Representation (IR)
• Three-address code (TAC)
j= 2 * I + 1;
If (j >= n)
j = 2*I + 3;
return a[j];
t1= 2 * I
t2 = t1 + 1
j = t2
t3 = j < n
If t3 goto L0
t4 = 2 * I
t5 = t4 + 3
j = t5
L0: t6 = a[j]
return t6
Single assignment
63
Optimizing Compilers
• Provide efficient mapping of program to machine– Code selection and ordering– Eliminating minor inefficiencies– Register allocation
• Don’t (usually) improve asymptotic efficiency– Up to programmer to select best overall algorithm– Big-O savings are (often) more important than constant
factors• But constant factors also matter
• Optimization types– Local: inside basic blocks– Global: across basic blocks e.g. loops
64
Limitations of Optimizing Compilers
• Operate Under Fundamental Constraints– Improved code has same output– Improved code has same side effects– Improved code is as correct as original code
• Most analysis is performed only within procedures– Whole-program analysis is too expensive in most cases
• Most analysis is based only on static information– Compiler has difficulty anticipating run-time inputs– Profile driven dynamic compilation becoming more prevalent– Dynamic compilation (e.g. Java JIT)
• When in doubt, the compiler must be conservative
65
Preview: Code Optimization
• Goal: Improve performance by:– Removing redundant work
• Unreachable code
• Common-subexpression elimination
• Induction variable elimination
– Using simpler and faster operations• Dealing with constants in the compiler
• Strength reduction
– Managing register well
ExecutionTime = Instructions x CPI x ClockCycleTime
66
Constant Folding
• Detect & combine values that will be constant.
• Respect laws of arithmetic on target machine.
a = 10 * 5 + 6 – b;
67
Constant Propagation
• When x gets assigned a constant c, replace users of x with uses of c, as long as x remains unchanged.
• Very useful for turning constants into immediates, for code generator.
68
Constant Propagation
69
Reduction in Strength
• Some operations are faster (lower CPI) than others:– Multiplication slower than addition
– Mult or Divide by 2n slower than shift left or right
– Multiplies take 12 cycle on MIPS R2000
– ALU instruction take 1 cycle
• Can replace complex instruction with equivalent simple instructions
• Correctness:– Overflow behavior preserved?
– Other side-effects of operation? (interrupts)
70
Reduction in Strength
• How to replace multiplication with addition?
• Often presents in for loops & array walks. while (i<100) arr[i++] = 0;
71
Common Sub-Expression Elimination
72
Common Sub-Expression Elimination
73
Induction Variable Elimination
while (i<100) arr [i++] = 0;
• Remember replacing multiply with add in loop?
• Why not get rid of induction variable I altogether?
• Induction variable is a variable that is changed by a constant amount on every iteration of a loop.
74
Register Allocation & Assignment
• Register provide fastest data access in the processor.
• Two terms:– What variables belong in register? (allocation)
– What register will hold this value? (assignment)
• Once a value is in a register, we’d like not to spill & restore it to use same value again.
75
Assembly Code Generation
76
Compiler Performance on Bubble Sort
77
Assembler
• Expands macros and pseudo instructions as well as converts constants.
• Primary purpose is to produce an object file– Machine language instructions
– Application data
– Information for memory organization
78
Object File
79
Linker
• Linker combines multiple object modules– Identify where code/data will be placed in memory– Resolve code/data cross references– Produces executable if all references found
• Steps– Place code and data modules in memory– Determine the address of data and instruction labels– Patch both the internal and external references
• Separation between compiler and linker makes standard libraries and efficient solution to maintaining modular code.
80
Loader
• Loader used at run-time– Reads executable file header for size of text/data segments
– Create address space sufficiently large
– Copy program from executable on disk into memory
– Copy arguments to main program’s stack
– Initialize machine registers and set stack pointer
– Jump to start-up routine
– Terminate program when execution completes
81
SPIM System Calls• SPIM provides a small number of OS “system calls”
• Set system call code in $v0 and pass arguments as needed
• See web site for details and examples.
Service Code Argument Result
print_int 1 $a0 = integer
print_float 2 $f12 = float
print_double 3 $f12 = double
print_string 4 $a0 = string
read_int 5 Integer in $v0
read_float 6 Float if $f0
read_double 7 Double in $f0
read_string 8 $a0 = buffer, $a1 = length Data in buffer
Sbrk 9 $a0 = amount Address in $v0
Exit 10
82
MIPS Assembler Directives
• SPIM supports a subset of the MIPS assembler directives
• Some of the directives include– .asciiz – Store a null-terminated string in memory
– .data –Start of data segment
– .global –Identify an exported symbol
– .text –Start of text segment
– .word –Store words in memory
83
HomeWork
• HW1– http://mypage.zju.edu.cn/liupeng
84
Acknowledgements
• These slides contain material from courses: – UCB CS152
– Stanford EE108B