emulation – outline - ittckulkarni/teaching/eecs768/slides/chapter2.pdf · emulation – outline...

63
EECS 768 Virtual Machines 1 Emulation – Outline Emulation Interpretation basic, threaded, directed threaded other issues Binary translation code discovery, code location other issues Control Transfer Optimizations

Upload: vanhuong

Post on 30-Aug-2018

249 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 1

Emulation – Outline

• Emulation• Interpretation

– basic, threaded, directed threaded– other issues

• Binary translation– code discovery, code location– other issues

• Control Transfer Optimizations

Page 2: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 2

Key VM Technologies

• Emulation – binary in one ISA is executed in processor supporting a different ISA

• Dynamic Optimization – binary is improved for higher performance – may be done as part of emulation– may optimize same ISA (no emulation needed)

HP PA ISA

HP UX

HP Apps.

Optimization

Alpha

Windows

X86 apps

Emulation

Page 3: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 3

Emulation Vs. Simulation• Emulation

– method for enabling a (sub)system to present the same interface and characteristics as another

– ways of implementing emulation• interpretation: relatively inefficient instruction-at-a-time

• binary translation: block-at-a-time optimized for repeated

– e.g., the execution of programs compiled for instruction set A on a machine that executes instruction set B.

• Simulation– method for modeling a (sub)system’s operation

– objective is to study the process; not just to imitate the function

– typically emulation is part of the simulation process

Page 4: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 4

Definitions

• Guest– environment being

supported by underlying platform

• Host– underlying platform

that provides guest environment

Guest

Host

supported by

Page 5: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 5

Definitions (2)

• Source ISA or binary– original instruction set or binary– the ISA to be emulated

• Target ISA or binary– ISA of the host processor– underlying ISA

• Source/Target refer to ISAs• Guest/Host refer to platforms

Source

Target

emulated by

Page 6: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 6

Emulation

• Required for implementing many VMs.• Process of implementing the interface and

functionality of one (sub)system on a (sub)system having a different interface and functionality– terminal emulators, such as for VT100, xterm, putty

• Instruction set emulation– binaries in source instruction set can be executed on

machine implementing target instruction set– e.g., IA-32 execution layer

Page 7: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 7

Interpretation Vs. Translation

• Interpretation– simple and easy to implement, portable– low performance– threaded interpretation

• Binary translation– complex implementation– high initial translation cost, small execution cost– selective compilation

• We focus on user-level instruction set emulation of program binaries.

Page 8: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 8

Interpreter State

• An interpreter needs to maintain the complete architected state of the machine implementing the source ISA– registers– memory

• code• data• stack

Code

Data

Stack

Program Counter

Condition Codes

Reg 0

Reg 1

Reg n-1

.

.

.

Interpreter Code

Page 9: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 9

Decode – Dispatch Interpreter• Decode and dispatch interpreter

– step through the source program one instruction at a time

– decode the current instruction

– dispatch to corresponding interpreter routine

– very high interpretation cost

while (!halt && !interrupt) {inst = code[PC];opcode = extract(inst,31,6);switch(opcode) { case LoadWordAndZero: LoadWordAndZero(inst); case ALU: ALU(inst); case Branch: Branch(inst); . . .}

}Instruction function list

Page 10: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 10

Decode – Dispatch Interpreter (2)

• Instruction function: Load

LoadWordAndZero(inst){RT = extract(inst,25,5);RA = extract(inst,20,5);displacement = extract(inst,15,16);if (RA == 0) source = 0;else source = regs[RA];address = source + displacement;regs[RT] = (data[address]<< 32)>> 32;PC = PC + 4;

}

Page 11: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 11

Decode – Dispatch Interpreter (3)

• Instruction function: ALU

ALU(inst){RT = extract(inst,25,5);RA = extract(inst,20,5);RB = extract(inst, 15,5);source1 = regs[RA];source2 = regs[RB];extended_opcode = extract(inst,10,10);switch(extended_opcode) {

case Add: Add(inst);case AddCarrying: AddCarrying(inst);case AddExtended: AddExtended(inst);. . .}

PC = PC + 4;}

Page 12: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 12

Decode – Dispatch Efficiency

• Decode-Dispatch Loop– mostly serial code– case statement (hard-to-predict indirect jump)– call to function routine– return

• Executing an add instruction– approximately 20 target instructions– several loads/stores and shift/mask steps

• Hand-coding can lead to better performance– example: DEC/Compaq FX!32

Page 13: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 13

Indirect Threaded Interpretation

• High number of branches in decode-dispatch interpretation reduces performance– overhead of 5 branches per instruction

• Threaded interpretation improves efficiency by reducing branch overhead– append dispatch code with each interpretation

routine– removes 3 branches– threads together function routines

Page 14: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 14

Indirect Threaded Interpretation (2)

LoadWordAndZero:RT = extract(inst,25,5);RA = extract(inst,20,5);displacement = extract(inst,15,16);if (RA == 0) source = 0;else source = regs(RA);address = source + displacement;regs(RT) = (data(address)<< 32) >> 32;PC = PC +4;If (halt || interrupt) goto exit;inst = code[PC];opcode = extract(inst,31,6)extended_opcode = extract(inst,10,10);routine = dispatch[opcode,extended_opcode];goto *routine;

Page 15: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 15

Indirect Threaded Interpretation (3)

Add:RT = extract(inst,25,5);RA = extract(inst,20,5);RB = extract(inst,15,5);source1 = regs(RA);source2 = regs[RB];sum = source1 + source2 ;regs[RT] = sum;PC = PC + 4;If (halt || interrupt) goto exit;inst = code[PC];opcode = extract(inst,31,6);extended_opcode = extract(inst,10,10);routine = dispatch[opcode,extended_opcode];goto *routine;

Page 16: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 16

Indirect Threaded Interpretation (4)

• Dispatch occurs indirectly through a table– interpretation routines can be modified and relocated

independently

• Advantages– binary intermediate code still portable– improves efficiency over basic interpretation

• Disadvantages– code replication increases interpreter size

Page 17: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 17

Indirect Threaded Interpretation (5)

source code

dispatch

loop

interpreter

routines

"data"

accesses

Decode-dispatch

source codeinterpreterroutines

Threaded

Page 18: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 18

Predecoding

• Parse each instruction into a pre-defined structure to facilitate interpretation– separate opcode, operands, etc.– reduces shifts / masks significantly– more useful for CICS ISAs

lwz r1, 8(r2)add r3, r3,r1stw r3, 0(r4)

071 2 08

(load word and zero)

083 1 03

373 4 00

(add)

(store word)

Page 19: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 19

Predecoding (2)struct instruction { unsigned long op; unsigned char dest, src1, src2;} code [CODE_SIZE];

Load Word and Zero:RT = code[TPC].dest;RA = code[TPC].src1;displacement = code[TPC].src2; if (RA == 0) source = 0;else source = regs[RA];address = source + displacement;regs[RT] = (data[address]<< 32) >> 32;SPC = SPC + 4; TPC = TPC + 1;If (halt || interrupt) goto exit;

opcode = code[TPC].oproutine = dispatch[opcode];goto *routine;

Page 20: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 20

Direct Threaded Interpretation

• Allow even higher efficiency by– removing the memory access to the centralized table– requires predecoding – dependent on locations of interpreter routines

• loses portability

001048d01 2 08

(load word and zero)

001048003 1 03

001049103 4 00

(add)

(store word)

Page 21: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 21

Direct Threaded Interpretation (2)

• Predecode the source binary into an intermediate structure

• Replace the opcode in the intermediate form with the address of the interpreter routine

• Remove the memory lookup of the dispatch table

• Limits portability since exact locations of the interpreter routines are needed

Page 22: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 22

Direct Threaded Interpretation (3)

Load Word and Zero:RT = code[TPC].dest;RA = code[TPC].src1;displacement = code[TPC].src2; if (RA == 0) source = 0;else source = regs[RA];address = source + displacement;regs[RT] = (data[address]<< 32) >> 32;SPC = SPC + 4;TPC = TPC + 1;If (halt || interrupt) goto exit;routine = code[TPC].op;goto *routine;

Page 23: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 23

Direct Threaded Interpretation (4)

source code

pre-decoder

interpreterroutines

intermediatecode

Page 24: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 24

Interpreter Control Flow

GeneralDecode

(fill-in instructionstructure)

Dispatch

Inst. 1specialized

routine

Inst. 2specialized

routine

Inst. nspecialized

routine

. . .

• Decode for CISC ISA• Individual routines

for each instruction

Page 25: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 25

Interpreter Control Flow (2)

• For CISC ISAs– multiple byte opcode– make

commoncasesfast

Dispatchon

first byte

SimpleInst. 1

specializedroutine

SimpleInst. m

specializedroutine

ComplexInst. m+1

specializedroutine

SharedRoutines

ComplexInst. n

specializedroutine

Prefixset flags

... ...

Page 26: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 26

Binary Translation• Translate source binary program to target binary before

execution– is the logical conclusion of predecoding– get rid of parsing and jumps altogether– allows optimizations on the native code– achieves higher performance than interpretation– needs mapping of source state onto the host state

(state mapping)

Page 27: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 27

Binary Translation (2)

x86 Source Binary

addl %edx,4(%eax)movl 4(%eax),%edxadd %eax,4

Translate to PowerPC Target

r1 points to x86 register context blockr2 points to x86 memory imager3 contains x86 ISA PC value

Page 28: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 28

Binary Translation (3)lwz r4,0(r1) ;load %eax from register blockaddi r5,r4,4 ;add 4 to %eax lwzx r5,r2,r5 ;load operand from memorylwz r4,12(r1) ;load %edx from register blockadd r5,r4,r5 ;perform addstw r5,12(r1) ;put result into %edx addi r3,r3,3 ;update PC (3 bytes)

lwz r4,0(r1) ;load %eax from register blockaddi r5,r4,4 ;add 4 to %eaxlwz r4,12(r1) ;load %edx from register blockstwx r4,r2,r5 ;store %edx value into memory addi r3,r3,3 ;update PC (3 bytes)

lwz r4,0(r1) ;load %eax from register blockaddi r4,r4,4 ;add immediatestw r4,0(r1) ;place result back into %eaxaddi r3,r3,3 ;update PC (3 bytes)

Page 29: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 29

Binary Translation (4)

source code

binarytranslator

binary translatedtarget code

Page 30: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 30

State Mapping• Maintaining the state of the source machine on the host

(target) machine.– state includes source registers and memory contents– source registers can be held in host registers or in host

memory– reduces loads/stores significantly– easier if target registers > source registers

Page 31: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 31

Register Mapping• Map source registers to

target registers– spill registers if needed

• if target registers < source registers– map some to memory– map on per-block basis

• Reduces load/store significantly– improves performance

p r o g r a m c o u n t e r

s t a c k p o i n t e r

s o u r c e I S A t a r g e t I S A

R 3

R 2

r e g 1

r e g 2

r e g n

R 2

R 6

R N + 4

S o u r c e M e m o r yI m a g e

S o u r c e R e g i s t e rB l o c k

R 1

R 5

Page 32: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 32

Register Mapping (2)

r1 points to x86 register context blockr2 points to x86 memory imager3 contains x86 ISA PC valuer4 holds x86 register %eaxr7 holds x86 register %edx

etc.

addi r16,r4,4 ;add 4 to %eax lwzx r17,r2,r16 ;load operand from memoryadd r7,r17,r7 ;perform add of %edxaddi r16,r4,4 ;add 4 to %eaxstwx r7,r2,r16 ;store %edx value into memory

addi r4,r4,4 ;increment %eaxaddi r3,r3,9 ;update PC (9 bytes)

Page 33: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 33

Predecoding Vs.

Binary Translation

• Requirement of interpretation routines during predecoding.

• After binary translation, code can be directly executed.

Page 34: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 34

Code Discovery Problem

• May be difficult to statically translate or predecode the entire source program

• Consider x86 code

mov %ch,0 ??

31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

movl %esi, 0x08030000(%ebp) ??

Page 35: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 35

Code Discovery Problem (2)

• Contributors to code discovery problem – variable-length (CISC) instructions– indirect jumps– data interspersed with code– padding instructions to align branch targets

source ISAinstructions

inst. 1 inst. 2

inst. 3 jump

data

inst. 5 inst. 6uncond. brnch

inst. 8jump indirect to???

data in instructionstream

pad for instructionalignment

reg.

pad

Page 36: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 36

Code Location Problem

• Mapping of the source program counter to the destination PC for indirect jumps– indirect jump addresses in the translated code still

refer to source addresses for indirect jumps

x86 source code movl %eax, 4(%esp) ;load jump address from memoryjmp %eax ;jump indirect through %eax

PowerPC target code addi r16,r11,4 ;compute x86 address lwzx r4,r2,r16 ;get x86 jump address

; from x86 memory imagemtctr r4 ;move to count registerbctr ;jump indirect through ctr

Page 37: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 37

Simplified Solutions

• Fixed-width RISC ISA are always aligned on fixed boundaries

• Use special instruction sets (Java)– no jumps/branches to arbitrary locations– no data or pads mixed with instructions– all code can then be discovered

• Use incremental dynamic translation

Page 38: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 38

Incremental Code Translation

• First interpret– perform code discovery as a by-product

• Translate code– incrementally, as it is discovered

– place translated code in code cache

– use lookup table to save source to target PC mappings

• Emulation process– execute translated block

– lookup next source PC in lookup table• if translated, jump to target PC

• else, interpret and translate

Page 39: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 39

Incremental Code Translation (2)

EmulationManager

sourcebinary

TranslationMemory

SPC to TPCLookupTable

hit

miss

translatorInterpreter

Page 40: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 40

Dynamic Basic Block

• Unit of translation during dynamic translation.• Leaders identify starts of static basic blocks

– first program instruction– instruction following a branch or jump– target of a branch or jump

• Runtime control flow identify dynamic blocks– instruction following a taken branch or jump at

runtime

Page 41: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 41

Dynamic Basic Block (2)

block 1

block 2

block 3

block 4

add...load...store ...

loop: load ...add .....storebrcond skipload...sub...

skip: add...storebrcond loopadd...load...store...jump indirect......

block 5

add...load...store ...

loop: load ...add .....storebrcond skipload...sub...

skip: add...storebrcond loop

loop: load ...add .....storebrcond skip

skip: add...storebrcond loop

...

StaticBasic Blocks

block 1

block 2

block 3

block 4

DynamicBasic Blocks

Page 42: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 42

Flow of Control

• Even after all blocks are translated, control flows between translated blocks and emulation manager.

• EM connects the translated blocks during execution.

• Optimizations can reduce the overhead of going through the EM between every pair of translation blocks.

Page 43: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 43

Flow of Control (2)

translationblock

EmulationManager

translationblock

translationblock

Page 44: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 44

Tracking the Source PC

• Update SPC as part of translated code– place SPC in stub

• General approach– translator returns to EM

via branch-and-link (BL)– SPC placed in stub

immediately after BL– EM uses link register to

find SPC and hash to next target code block

CodeBlock

Branch and Link to EMNext Source PC

EmulationManager

HashTable

CodeBlock

Page 45: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 45

Emulation Manager FlowchartS t a r t w i t h

S P C

L o o k u pS P C - > T P Ci n M a p T a b l e

H i t i n T a b l e ?

B r a n c h t o T P Ca n d

E x e c u t e T r a n s l a t e dB l o c k

G e t S P Cf o r n e x t B l o c k

U s e S P C t o R e a dI n s t s . f r o m S o u r c e

M e m o r y I m a g e- - - - - - - - - - - - - - - - - - - -

I n t e r p r e t , T r a n s l a t ea n d P l a c e i n t o

T r a n l s a t i o n M e m o r y

W r i t e n e wS P C - > T P C

m a p p i n g i n t o T a b l e

N o

Y e s

Page 46: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 46

Translation Chaining

• Translation blocks are linked into chains• If the successor block has not yet being

translated– code is inserted to jump to the EM– later, after jumping to the EM, if the EM finds that

the successor block has being translated, then the jump is modified to instead point directly to the successor

Page 47: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 47

Translation Chaining (2)

translationblock

VMM

translationblock

translationblock

translationblock

VMM

translationblock

translationblock

translationblock

Without Chaining With Chaining

Page 48: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 48

Translation Chaining (3)

• Creating a link:

JAL TM

next SPC

Predecessor

Successor

get nextSPC

Set upchain

LookupSuccessor Jump TPC

12

3

4

5

Page 49: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 49

Translation Chaining (4)

9AC0: lwz r16,0(r4) ;load value from memoryadd r7,r7,r16 ;accumulate sumstw r7,0(r5) ;store to memoryaddic. r5,r5,-1 ;decrement loop count, set cr0beq cr0,pc+12 ;branch if loop exitbl F000 ;branch & link to EM4FDC ;save source PC in link register

9AE4: b 9c08 ;branch along chain51C8 ;save source PC in link register

9C08: stw r7,0(r6) ;store last value of %edxxor r7,r7,r7 ;clear %edxbl F000 ;branch & link to EM6200 ;save source PC in link register

PowerPC Translation

Page 50: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 50

Software Indirect Jump Prediction

• For blocks ending with an indirect jump– chaining cannot be used as destination can change– SPC–TPC map table lookup is expensive

• indirect jump locations seldom change– use profiling to find the common jump addresses– inline frequently used SPC addresses; most frequent

SPC destination addresses given first

If Rx == addr_1 goto target_1Else if Rx == addr_2 goto target_2Else if Rx == addr_3 goto target_3Else hash_lookup(Rx) ; do it the slow way

Page 51: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 51

Dynamic Translation Issues

• Tracking the source PC– SPC used by the emulation manager and interpreter

• Handle self-modifying code– programs modifying (perform stores) code at runtime

• Handle self-referencing code– programs perform loads from the source code

• Provide precise traps– provide precise source state at traps and exceptions

Page 52: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 52

Same – ISA Emulation

• Same source and target ISAs• Applications

– simulation– OS call emulation– program shepherding– performance optimization

Page 53: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 53

Instruction Set Issues• Register architectures

– register mappings, reservation of special registers• Condition codes

– lazy evaluation as needed• Data formats and arithmetic

– floating point– decimal– MMX

• Address resolution– byte vs word addressing

• Data Alignment– natural vs arbitrary

• Byte order– big/little endian

Page 54: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 54

Register Architectures

• GPRs of the target ISA are used for– holding source ISA GPR– holding source ISA special-purpose registers– point to register context block and memory image– holding intermediate emulator values

• Issues– target ISA registers < source ISA registers– prioritizing the use of target ISA registers

Page 55: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 55

Condition Codes

• Condition codes are not used uniformly– IA-32 ISA sets CC implicitly– SPARC and PowerPC set CC explicitly– MIPS ISA does not use CC

• Neither ISA uses CC– nothing to do

• Source ISA does not use CC, target ISA does– easy; additional ins. to generate CC values

Page 56: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 56

Condition Codes (cont…)

• Source ISA has explicit CC, target ISA no CC– trivial emulation of CC required

• Source ISA has implicit CC, target ISA no CC– very difficult and time consuming to emulate– CC emulation may be more expensive than

instruction emulation

Page 57: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 57

Condition Codes (cont…)

• Lazy evaluation– CC are seldom used– only generate CC if required– store the operands and the operation that set each

condition code

• Optimizations can also be performed to analyze code to detect cases where CC generated will never be used

Page 58: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 58

Lazy Condition Code Evaluation

add %ecx,%ebxjmp label1

. . .label1: jz target

R4 ↔ eax PPC toR5 ↔ ebx x86 registerR6 ↔ ecx mappings..R24 ↔ scratch register used by emulation codeR25 ↔ condition code operand 1 ;registersR26 ↔ condition code operand 2 ;used forR27 ↔ condition code operation ;lazy condition

;emulation code R28 ↔ jump table base address

Page 59: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 59

Lazy Condition Code Evaluation (2)

mr r25,r6 ;save operandsmr r26,r5 ;and opcode forli r27,“add” ;lazy condition code emulation add r6,r6,r5 ;translation of addb label1

...label1:

bl genZF ;branch and link genZF codebeq cr0,target ;branch on condition flag

...genZF:

add r29,r28,r27 ;add “opcode” to jump table base mtctr r29 ;copy to counter registerbctr ;branch via jump table

... ...“add”: add. r24,r25,r26 ;perform PowerPC add, set cr0 blr ;return

Page 60: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 60

Data Formats and Arithmetic

• Maintain compatibility of data transformations.• Data formats are arithmetic operations are

standardized– two’s complement representation– IEEE floating point standard– basic logical/arithmetic operations are mostly present

• Exceptions:– IA32 FP uses 80-bit intermediate results– PowerPC and HP PA have multiply-and-add (FMAC)

which has a higher precision on intermediate values– integer divide vs. using FP divide to approximate

• ISAs may have different immediate lengths

Page 61: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 61

Memory Address Resolution

• ISAs can access data items of different sizes– load / stores of bytes, halfwords, full words, as

opposes to only bytes and words

• Emulating a less powerful ISA– no issue

• Emulating a more powerful ISA– loads: load entire word, mask un-needed bits– stores: load entire word, insert data, store word

Page 62: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 62

Memory Data Alignment

• Aligned memory access– word accesses performed with two low order bits

00, halfword access must have lowest bit 0, etc.

• Target ISA does not allow unaligned access– break up all accesses into byte accesses– ISAs provide supplementary instructions to simplify

unaligned accesses– unaligned access traps, and then can be handled

Page 63: Emulation – Outline - ITTCkulkarni/teaching/EECS768/slides/chapter2.pdf · Emulation – Outline • Emulation • Interpretation –basic, threaded, directed threaded –other

EECS 768 Virtual Machines 63

Byte Order

• Ordering of bytes within a word may differ– little endian and big endian

• Target code must perform byte ordering• Guest data image is generally maintained in the

same byte order as assumed by the source ISA• Emulation software modifies addresses when

bytes within words are addressed– can be very inefficient

• Some target ISAs may support both byte orders– e.g., MIPS, IA-64