emulation unit v

EMULATION

• Emulation: a binary compiled for one ISA is executed on a processor that supports a different ISA, e.g., programs compiled for instruction set A are executed on a machine that executes instruction set B.

Ways of implementing emulation:

• Interpretation: instruction-at-a-time, relatively inefficient

• Binary translation: block-at-a-time, optimized for repeated execution

Windows 8 Tablets Won't Run PC Apps, After All

Apps written for x86 Windows PCs won't be able to run on ARM-based Win 8 tablets,

By Paul McDougall InformationWeek

Windows 8 for tablets runs on devices powered by chips designed by U.K.-based ARM.

VIRTUALIZATION vs EMULATION

In emulation, the virtual machine simulates the complete hardware in software. This allows an operating system written for one computer architecture to be run on the architecture that the emulator is written for.

The diagram below illustrates how a Java based PC emulator fits into the hierarchy of programs that uses the real computer's hardware.

Contd…

Virtualization is therefore normally faster than emulation, but the real system must have an architecture identical to that of the guest system.

For example, VMware can provide a virtual environment for running a virtual Windows XP machine "inside" a real one. However, VMware cannot work on any real hardware other than a real x86 PC.

Basically, virtualization uses your computer's actual resources (at least the CPU) for the guest machine, whereas emulation simulates all the hardware in software. This is why virtualization is much faster: guest code runs on your actual CPU.

Emulation

Suppose you develop for a system G (guest, e.g., an ARM-based phone) on your workstation H (host, e.g., an x86 PC). An emulator for G running on H precisely emulates G’s

• CPU,
• memory subsystem, and
• I/O devices.

Ideally, programs running on the emulated G exhibit the same behaviour as when running on a real G (except for timing).

GUEST vs HOST

Guest – environment being supported by the underlying platform

Host – underlying platform that provides the guest environment

SOURCE vs TARGET

Source ISA or binary – original instruction set or binary, i.e., the ISA to be emulated

Target ISA or binary – ISA of the host processor, i.e., the underlying ISA

• Source/Target refer to ISAs
• Guest/Host refer to platforms

Definition

• Emulation: the process of implementing the interface and functionality of one (sub)system on a (sub)system having a different interface and functionality
– e.g., terminal emulators, such as for the VT100

• Instruction set emulation: binaries in the source instruction set can be executed on a machine implementing the target instruction set
– e.g., IA-32 execution layer

Terminal Emulators

• Terminal emulators are software programs that give modern computers and devices interactive access to applications running on mainframe computers.

• Terminals such as the IBM 3270 or VT100, and many others, are no longer produced as physical devices.

• Software running on modern operating systems simulates a "dumb" terminal and is able to render the graphical and text elements of the host application.

WHY EMULATION

Binary translation grew out of emulation techniques in the late 1980s in order to provide a migration path from legacy CISC machines to the newer RISC machines.

Such techniques were developed by hardware manufacturers interested in marketing their new RISC platforms.

Contd…

• It would be very useful if one could take programs written for a CISC processor and use them on a RISC processor.

• In this way, one could take advantage of the capabilities of the newer processor without having to spend time and money redeveloping programs that already exist on the older platform.

• But CISC binaries are incompatible with the RISC processor. Without further help, using CISC programs on a RISC processor means those programs must be rewritten.

• The alternative is a binary-to-binary translator that automatically converts binary code from the CISC to the RISC instruction set.

Emulation Methods

Ways of implementing emulation:

• Interpretation
– Repeats a cycle of fetching a source instruction, analyzing it, and performing the required operation
– Simple and easy to implement, portable
– Low performance
– Variants: basic, threaded, and direct threaded interpretation

• Binary translation
– Translates a block of source instructions to a block of target instructions
– Saves the translated code for repeated use
– Bigger initial translation cost with smaller execution cost
– More advantageous if the translated code is executed frequently
– Issues: code discovery, code location

Interpretation

An interpreter is the easiest way to emulate a CPU. It just reproduces how a basic CPU works.

A basic CPU reads bytes from the memory address pointed to by a special register (the PC, or program counter).

These bytes contain the information about the instruction that the CPU must execute (that is, what the CPU has to do).

The CPU must decode these bytes and decide what it has to do. Then it performs the commanded action, updates the PC, and reads the next byte or bytes.

That is how an interpreting emulator works. First it reads a byte (or the number of bytes that form an instruction in the emulated CPU), then it dispatches on the opcode:

    switch (memory[PC++]) {
    case OPCODE1:
        opcode1();
        break;
    case OPCODE2:
        opcode2();
        break;
    ....
    case OPCODEn:
        opcodeN();
        break;
    }

Interpreter emulator.
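The fetch/decode/execute loop above can be made concrete with a toy machine. The following is only a minimal sketch: the three opcodes, the `ToyCpu` structure, and the helper names are hypothetical and do not belong to any real ISA.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-byte opcodes for a toy accumulator machine. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2 };

typedef struct {
    uint8_t memory[256]; /* emulated memory holding the guest program */
    uint8_t pc;          /* emulated program counter (wraps at 256) */
    int32_t acc;         /* emulated accumulator */
} ToyCpu;

/* Interpret until OP_HALT (or an unknown opcode); return the accumulator. */
int32_t toy_run(ToyCpu *cpu)
{
    assert(cpu);
    for (;;) {
        switch (cpu->memory[cpu->pc++]) {       /* fetch opcode, advance PC */
        case OP_LOADI:                           /* acc = immediate byte */
            cpu->acc = cpu->memory[cpu->pc++];
            break;
        case OP_ADDI:                            /* acc += immediate byte */
            cpu->acc += cpu->memory[cpu->pc++];
            break;
        case OP_HALT:
        default:                                 /* unknown opcode: stop */
            return cpu->acc;
        }
    }
}

/* Example guest program: load 40, add 2, halt. */
int32_t toy_demo(void)
{
    ToyCpu cpu = { .memory = { OP_LOADI, 40, OP_ADDI, 2, OP_HALT } };
    return toy_run(&cpu);
}
```

Calling `toy_demo()` interprets the three-instruction guest program and yields 42. Every guest instruction costs one pass through the `switch`, which is where the interpretation overhead comes from.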

Basic Interpreter

Emulates the whole source machine state.

• An interpreter needs to maintain the complete architected state of the machine implementing the source ISA.

• Guest memory and the context block are kept in the interpreter’s memory (heap):
– guest memory: code and data
– context block: general-purpose registers, PC, CC, control registers

[Figure: the interpreter code operates on the source memory state (code, data, stack) and on the source context block (program counter, condition codes, Reg 0 … Reg n-1).]

Interpreter Overview

Decode-and-dispatch interpreter

• Steps through the source program one instruction at a time
• Decodes the current instruction
• Dispatches to the corresponding interpreter routine
• Very high interpretation cost

Contd…

    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
        case LoadWordAndZero:
            LoadWordAndZero(inst);
            break;
        case ALU:
            ALU(inst);
            break;
        case Branch:
            Branch(inst);
            break;
        . . .
        }
    }

Instruction function: Load

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }
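The interpreter routines rely on an `extract` helper that the slides do not define. A minimal sketch follows, assuming that `extract(inst, msb, len)` returns the `len`-bit field whose most significant bit is bit `msb`, with bit 0 being the least significant bit of a 32-bit word. The `sign_extend16` helper is an additional assumption, included because a 16-bit displacement is normally treated as signed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: return the len-bit field of word whose top bit is msb. */
uint32_t extract(uint32_t word, unsigned msb, unsigned len)
{
    assert(len >= 1 && len <= 32 && msb <= 31 && msb + 1 >= len);
    uint32_t shifted = word >> (msb + 1 - len);
    uint32_t mask = (len == 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    return shifted & mask;
}

/* Hypothetical helper: interpret a 16-bit field as a signed displacement. */
int32_t sign_extend16(uint32_t field)
{
    int32_t low = (int32_t)(field & 0xFFFFu);
    return (field & 0x8000u) ? low - 0x10000 : low;
}
```

With this convention, the word `0x80220008` (opcode 32, RT 1, RA 2, displacement 8) splits into exactly the fields that `LoadWordAndZero` reads with `extract(inst,31,6)`, `extract(inst,25,5)`, `extract(inst,20,5)`, and `extract(inst,15,16)`.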

Contd…

Instruction function: ALU

    ALU(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        extended_opcode = extract(inst, 10, 10);
        switch (extended_opcode) {
        case Add:
            Add(inst);
            break;
        case AddCarrying:
            AddCarrying(inst);
            break;
        case AddExtended:
            AddExtended(inst);
            break;
        . . .
        }
        PC = PC + 4;
    }

Contd…

    while (!halt && !interrupt) {
        switch (opcode) {
        case ALU:
            ALU(inst);
            break;
        · · ·
        }
    }

Indirect Threaded Interpretation

• Threaded interpretation improves efficiency by reducing branch overhead
• Appends the dispatch code to each interpretation routine
• Removes 3 branches
• Threads the function routines together

Put the dispatch code at the end of each interpretation routine:

    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;

Contd…

Dispatch occurs indirectly through a dispatch table:

    routine = dispatch[opcode, extended_opcode];
    goto *routine;
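The `goto *routine` form corresponds to the "labels as values" extension of GCC and Clang; it is not standard C. The sketch below assumes that extension and a hypothetical three-opcode toy ISA, and shows the dispatch code repeated at the end of every routine instead of living in one central loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; not taken from any real ISA. */
enum { T_HALT = 0, T_INC = 1, T_DOUBLE = 2, T_NUM_OPS = 3 };

/* Interpret code (terminated by T_HALT) and return the accumulator. */
int64_t threaded_run(const uint8_t *code)
{
    /* Dispatch table of routine addresses (GNU C labels as values). */
    static void *const dispatch[T_NUM_OPS] = { &&do_halt, &&do_inc, &&do_double };
    int64_t acc = 0;
    unsigned pc = 0;

/* Dispatch code appended to every routine: fetch, table lookup, jump. */
#define DISPATCH() \
    do { assert(code[pc] < T_NUM_OPS); goto *dispatch[code[pc++]]; } while (0)

    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}
```

Each routine jumps straight to the next one, so there is no return to a central `switch`. The remaining cost is the table lookup in `dispatch`, which is why this variant is called indirect.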

Direct threaded interpretation: removes the overhead of accessing the table

Solution: predecoding and direct threading

Predecoding

Extracting the various fields of an instruction is complicated:
• Fields are not aligned, requiring complex bit extraction
• Some related fields needed for decoding are not adjacent
• If the instruction is in a loop, the extraction is repeated on every iteration

Predecoding:
• Pre-parses instructions into a form that is easier to interpret
• Is done before interpretation starts
• Allows direct threaded interpretation

Predecoding

In PPC, the opcode and extended opcode fields are separated, and the register specifiers are not byte-aligned.

Define an instruction format and a predecoded instruction array based on that format:

    struct instruction {
        unsigned long op;    // 32 bit
        unsigned char dest;  // 8 bit
        unsigned char src1;  // 8 bit
        unsigned int src2;   // 16 bit
    } code[CODE_SIZE];

Predecode each instruction based on this format.

Predecoding

Source instructions and their predecoded form:

    Source instruction    op    dest  src1  src2
    lwz r1, 8(r2)         07    1     2     08     (load word and zero)
    add r3, r3, r1        08    3     1     03     (add)
    stw r3, 0(r4)         37    3     4     00     (store word)
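A predecoding pass can be sketched as follows. This is an illustration under stated assumptions: it handles only D-form PowerPC words (opcode, RT, RA, 16-bit displacement, as used by `lwz` and `stw`), it stores the raw primary opcode rather than the small internal codes shown in the table, and the names `predecoded`, `predecode_dform`, and `predecode_program` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Predecoded form, mirroring the slide's struct with fixed-width types. */
struct predecoded {
    uint32_t op;    /* primary opcode (6 bits in the source word) */
    uint8_t  dest;  /* RT field */
    uint8_t  src1;  /* RA field */
    uint16_t src2;  /* 16-bit displacement field */
};

/* Predecode one D-form word laid out as opcode | RT | RA | D. */
struct predecoded predecode_dform(uint32_t inst)
{
    struct predecoded p;
    p.op   = inst >> 26;
    p.dest = (uint8_t)((inst >> 21) & 0x1F);
    p.src1 = (uint8_t)((inst >> 16) & 0x1F);
    p.src2 = (uint16_t)(inst & 0xFFFF);
    return p;
}

/* Predecode a whole program once, before interpretation starts. */
void predecode_program(const uint32_t *src, struct predecoded *out, unsigned n)
{
    assert(src && out);
    for (unsigned i = 0; i < n; i++)
        out[i] = predecode_dform(src[i]);
}
```

After this pass the interpreter routines read byte-aligned struct fields instead of repeating the shifts and masks on every execution of an instruction.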

Contd…

It is efficient to perform the repeated decoding operations only once per source machine address.

Previous interpreter code:

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }

Contd…

A predecoded instruction (contained in an array) can now be executed by the following code.

    struct instruction {
        unsigned long op;
        unsigned char dest;
        unsigned char src1;
        unsigned int src2;
    } code[CODE_SIZE];

    LoadWordAndZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;   /* source PC: one 4-byte source instruction */
        TPC = TPC + 1;   /* index into the predecoded array */
        if (halt || interrupt) goto exit;
        opcode = code[TPC].op;
        routine = dispatch[opcode];
        goto *routine;

Contd…

Even with predecoding, indirect threading includes a centralized dispatch table, which requires a memory access and an indirect jump.

To remove this overhead, replace the instruction opcode in the predecoded format with the address of the interpreter routine:

    Indirect threading:  op = 07        dest = 1  src1 = 2  src2 = 08
    Direct threading:    op = 001048d0  dest = 1  src1 = 2  src2 = 08

Dispatch code with indirect threading:

    if (halt || interrupt) goto exit;
    opcode = code[TPC].op;
    routine = dispatch[opcode];
    goto *routine;

Dispatch code with direct threading:

    if (halt || interrupt) goto exit;
    routine = code[TPC].op;
    goto *routine;
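The difference between the two dispatch sequences can be sketched in code. The example again assumes the GCC/Clang labels-as-values extension and a hypothetical two-opcode toy ISA. The routine table is consulted only during predecoding; at run time each routine jumps through the address stored in the instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy opcodes; not taken from any real ISA. */
enum { D_HALT = 0, D_ADDI = 1, D_NUM_OPS = 2 };

struct dt_inst {
    void   *op;   /* after predecoding: address of the interpreter routine */
    int32_t imm;
};

#define DT_MAX 64

/* raw holds (opcode, immediate) pairs; the last instruction must be D_HALT. */
int64_t direct_threaded_run(const int32_t (*raw)[2], unsigned n)
{
    static void *const routines[D_NUM_OPS] = { &&r_halt, &&r_addi };
    struct dt_inst code[DT_MAX];
    int64_t acc = 0;
    unsigned tpc = 0;

    assert(n >= 1 && n <= DT_MAX && raw[n - 1][0] == D_HALT);
    /* Predecoding: the table is used once per instruction, here only. */
    for (unsigned i = 0; i < n; i++) {
        assert(raw[i][0] >= 0 && raw[i][0] < D_NUM_OPS);
        code[i].op  = routines[raw[i][0]];
        code[i].imm = raw[i][1];
    }

    goto *code[tpc].op;          /* no table lookup from here on */
r_addi:
    acc += code[tpc].imm;
    tpc++;
    assert(tpc < n);
    goto *code[tpc].op;
r_halt:
    return acc;
}
```

One consequence of this design is that the predecoded code now contains host addresses, so it is tied to one particular build of the interpreter and is less portable than the indirect form.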

SUMMARY

Binary Translation

Rather than fetching each opcode of the emulated CPU and executing a function that implements that instruction, it is faster to translate the opcode into equivalent code for the target CPU.

Binary translation takes the native code of the emulated CPU and translates it, using techniques related to compilation, into code for the target CPU.

The translated code is stored and reused every time the emulated CPU executes that code again.

Translate the source binary program to a target binary before execution
– is the logical conclusion of predecoding
– gets rid of parsing and jumps altogether
– allows optimizations on the native code
– achieves higher performance than interpretation
– needs a mapping of the source state onto the host state (state mapping)

x86 Source Binary

    addl %edx,4(%eax)
    movl 4(%eax),%edx
    add  %eax,4

Translate to PowerPC Target

    r1 points to x86 register context block
    r2 points to x86 memory image
    r3 contains x86 ISA PC value

State Mapping

Maintaining the state of the source machine on the host (target) machine.

• State includes source registers and memory contents
• Source registers can be held in host registers or in host memory
• Holding source registers in host registers reduces loads/stores significantly
• Easier if the number of target registers > the number of source registers

Register Mapping

• Map source registers to target registers
• If target registers < source registers
– map some to memory
– map on a per-block basis
• Reduces loads/stores significantly
• Improves performance

    r1 points to x86 register context block
    r2 points to x86 memory image
    r3 contains x86 ISA PC value
    r4 holds x86 register %eax
    r7 holds x86 register %edx
    etc.

    addi  r16,r4,4     ; add 4 to %eax
    lwzx  r17,r2,r16   ; load operand from memory
    add   r7,r17,r7    ; perform add of %edx
    addi  r16,r4,4     ; add 4 to %eax
    stwx  r7,r2,r16    ; store %edx value into memory
    addi  r4,r4,4      ; increment %eax
    addi  r3,r3,9      ; update PC (9 bytes)

Predecoding vs. Binary Translation

• Predecoded code still requires interpretation routines to execute it.
• After binary translation, the code can be executed directly.

Dynamic Binary Translation

• Dynamic binary translation is the process of translating executable code of one architecture to another at runtime.

• The translation is usually done by extracting a sequence of instructions from the source program and then translating them to the target architecture.

• The sequence of instructions is called a translation block. The translated block is typically cached to avoid retranslation if the same source block is encountered again.

Architecture

• The main components of an emulator using dynamic binary translation are the dispatcher, the translator, and the translation cache.

• Dispatcher:
– The purpose of the dispatcher is to control the execution of a program by selecting the next block to be executed based on the given source address.
– It first checks whether the block at the current address is already translated; that is, whether it resides in the cache.
– If the block is found, it is executed; otherwise, the dispatcher calls the translator.

• Translator:
– The translator is responsible for translating a block of source code to an equivalent block of target code and storing the translated block in the translation cache.
– The block can, for instance, be a sequence of instructions from the current point to the next branch. Moreover, the last instruction of the block, the branch, is replaced by a branch to the dispatcher.

• Translation Cache:
– The translation cache stores the translated blocks. When the dispatcher is called, it first checks whether the next block is already translated; that is, whether the block resides in the cache. One requirement for the translation cache is naturally a fast lookup of blocks.
– An important issue concerns the size of the cache:
  • if the size is small, little memory is consumed and better locality is achieved;
  • a large cache fills up more slowly but has worse locality and requires more memory.
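The interplay of dispatcher, translator, and translation cache can be sketched as below. Everything here is a simplifying assumption made for illustration: a translated block is modelled as a host C function that returns the source address of the next block, the translator is a stand-in that only selects one of two hand-written blocks, the source addresses 0x100 and 0x140 are made up, and the cache is a small direct-mapped table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A translated block runs and returns the next source address (0 = stop). */
typedef uint32_t (*block_fn)(int64_t *acc);

#define TC_SLOTS 64   /* power of two, chosen arbitrarily for the sketch */

struct tc_entry { uint32_t src_pc; block_fn code; };
static struct tc_entry tcache[TC_SLOTS];
static unsigned translations;   /* counts calls to the translator */

/* Two hand-written "translated" blocks for hypothetical source addresses. */
static uint32_t block_0x100(int64_t *acc) { *acc += 1; return (*acc < 5) ? 0x100 : 0x140; }
static uint32_t block_0x140(int64_t *acc) { *acc *= 10; return 0; }

/* Stand-in translator: a real one would generate target code here. */
static block_fn translate(uint32_t src_pc)
{
    translations++;
    return (src_pc == 0x100) ? block_0x100 : block_0x140;
}

/* Dispatcher: look up the block, translate on a miss, execute, repeat. */
int64_t dispatch_run(uint32_t src_pc)
{
    int64_t acc = 0;
    while (src_pc != 0) {
        struct tc_entry *e = &tcache[(src_pc >> 2) & (TC_SLOTS - 1)];
        if (e->code == NULL || e->src_pc != src_pc) {   /* cache miss */
            e->src_pc = src_pc;
            e->code = translate(src_pc);
        }
        src_pc = e->code(&acc);   /* block ends by returning to the dispatcher */
    }
    return acc;
}
```

In this sketch the block at 0x100 executes five times but is translated only once, which is the saving the translation cache is meant to provide. A direct-mapped table gives a fast lookup at the price of evicting a block whenever two source addresses map to the same slot.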

• Instruction Selection:
– The instruction selection phase maps a set of source instructions to a set of target instructions. The sizes of those sets do not need to match: for example, a particular PowerPC instruction requires six Alpha instructions to be implemented.
– Another issue in instruction selection is operands. If, for instance, an immediate operand of a source instruction does not fit into the immediate field of the target instruction, that operand must first be stored somewhere (in a register, for instance), which requires an extra instruction.
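The immediate-operand problem can be illustrated with a small sketch. It assumes a target whose immediate field is a 16-bit signed value, so a 32-bit source immediate has to be built in a register from an upper and a lower half. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* True if value fits a 16-bit signed immediate field. */
int fits_simm16(int32_t value)
{
    return value >= -32768 && value <= 32767;
}

/* Split value so that (hi << 16) + sign_extend(lo) equals value, the form
   needed by a "load upper half, then add lower half" instruction pair. */
void split_imm32(int32_t value, uint16_t *hi, int16_t *lo)
{
    assert(hi && lo);
    uint32_t u = (uint32_t)value;
    int32_t low = (int32_t)(u & 0xFFFFu);
    if (low >= 0x8000)
        low -= 0x10000;                      /* the add sign-extends lo */
    *lo = (int16_t)low;
    *hi = (uint16_t)((u + 0x8000u) >> 16);   /* compensate when lo is negative */
}

/* Recombine the halves the way the two target instructions would;
   returns the 32-bit pattern. */
uint32_t join_imm32(uint16_t hi, int16_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)(int32_t)lo;
}
```

A translator would emit the single-instruction form when `fits_simm16` holds and fall back to the two-instruction sequence otherwise, which is the extra instruction mentioned above.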

Condition Codes:

• Many architectures have instructions that set condition codes to reflect state.
• As an example, the arithmetic instructions in the Intel 386 modify the flags register as a side effect (the add instruction, for instance, modifies six bits in that register).
• Generating target instructions to handle those modifications can cause severe overhead.
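To give a feeling for that overhead, the sketch below computes in C the six status flags that the x86 `add` updates (CF, PF, AF, ZF, SF, OF) for a 32-bit addition. The struct and function names are hypothetical; a translator would have to emit comparable target code after every flag-setting source instruction unless it can prove the flags are not needed.

```c
#include <assert.h>
#include <stdint.h>

struct x86_flags { unsigned cf, pf, af, zf, sf, of; };

/* Compute a 32-bit add and the six status flags the x86 ADD updates. */
uint32_t add32_with_flags(uint32_t a, uint32_t b, struct x86_flags *f)
{
    assert(f);
    uint32_t r = a + b;
    uint8_t low = (uint8_t)r;
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;                            /* bit 0 = XOR of the low byte's bits */

    f->cf = r < a;                              /* unsigned carry out of bit 31 */
    f->pf = !(low & 1);                         /* set when the low byte has even parity */
    f->af = ((a ^ b ^ r) >> 4) & 1;             /* carry out of bit 3 */
    f->zf = (r == 0);
    f->sf = r >> 31;
    f->of = ((~(a ^ b) & (a ^ r)) >> 31) & 1;   /* signed overflow */
    return r;
}
```

A single source `add` therefore expands into the addition itself plus roughly a dozen extra operations when all flags must be materialized.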

Register allocation

• The purpose of register allocation is to map the registers of the source architecture to the target architecture.

• The ideal case for register allocation occurs when the target architecture has at least as many available registers as the source architecture.

• If, on the other hand, the target architecture has fewer registers, memory will likely be needed to hold some of the source registers.

Virtual machine: an entity that emulates a guest interface on top of a host machine

Language view:
• Virtual machine = entity that emulates an API (e.g., Java) on top of another
• Virtualizing software = compiler/interpreter

Process view:
• Virtual machine = entity that emulates an ABI on top of another
• Virtualizing software = runtime

Operating system view:
• Virtual machine = entity that emulates an ISA
• Virtualizing software = virtual machine monitor (VMM)
