emulation unit v
TRANSCRIPT
EMULATION
• Emulation: binary in one ISA is executed on processor supporting a different ISA.
ways of implementing emulationInterpretation: relatively inefficient instruction -at-a-time
Binary translation: block-at-a-time optimized or repeated
e.g., the execution of programs compiled for instruction set A on a machine that executes instruction set B.
Windows 8 Tablets Won't Run PC Apps, After All
Apps written for x86 Windows PCs won't be able to run on ARM-based Win 8 tablets,
By Paul McDougall InformationWeek
Windows 8 for tablets runs on devices powered by chips designed by U.K.-based ARM.
VIRTUALIZATION vs EMULATIONIn emulation the virtual
machine simulates the complete hardware in software.
operating system for one computer architecture to be run on the architecture that the emulator is written for.
The diagram below illustrates how a Java based PC emulator fits into the hierarchy of programs that uses the real computer's hardware.
Contd…Virtualization therefore is normally faster
than emulation but the real system has to have an architecture identical to the guest system.
For example, VMWare can provide a virtual environment for running a virtual WindowsXP machine "inside" a real one. However VMWare cannot work on any real hardware other than a real x86 PC.
Basically, virtualization will use your computer's actual resources (at least the CPU) for the guest machine. With emulation, it simulates all the hardware using software.This is why virtualization is much faster, since it's using your actual CPU.
Emulation Suppose you develop for a system G
(guest, e.g. an ARM-base phone) on your workstation H (host, e.g., an x86 PC). An emulator for V running on H precisely emulates G’s
• CPU,• memory subsystem, and• I/O devices.
Ideally, programs running on the emulated G exhibit the same behaviour as when running on a real G (except for timing).
GUEST vs HOST
Guest– environment being supported by underlying platform
Host– underlying platform that provides guest environment
Source VS Target
Source ISA or binary– original instruction set or binary the ISA to be emulated
Target ISA or binary– ISA of the host processor underlying ISA
• Source/Target refer to ISAs• Guest/Host refer to platforms
Definition• Process of implementing the interface and
functionality of one (sub)system on a (sub)system having a different interface and functionality
– terminal emulators, such as for VT100
• Instruction set emulation– binaries in source instruction set can be executed on machine implementing target instruction set
– e.g., IA-32 execution layer
Terminal Emulators• Terminal emulators are software programs that
provide modern computers and devices interactive access to applications running on Mainframe Computer.
• Terminals such as the IBM 370 or VT100 and many others, are no longer produced as physical devices.
• Software running on modern operating systems simulates a "dumb" terminal and is able to render the graphical and text elements of the host application.
WHY EMULATIONBinary translation grew out of emulation
techniques in the late 1980s in order to provide for a migration path from legacy CISC machines to the newer RISC machines.
Such techniques were developed by hardware manufacturers interested in marketing their new RISC platforms.
Contd…• It would be very useful if one could take the
programs written on a CISC and use them on a RISC processor.
• In this way, one could take advantage of the capabilities of the newer processor without having to spend time and money redeveloping programs already existing on the older platform.
• But CISC binaries are incompatible with the RISC. In order to use CISC programs on a RISC processor, those programs now must be rewritten.
• A binary-to-binary translator that can automatically convert binary code from the CISC to the RISC.
Emulation MethodsWays of implementing emulation• Interpretation:
Repeats a cycle of fetch a source instruction, analyze, perform
simple and easy to implement, portablelow performance Basic, Threaded interpretation, Direct Threaded
• Binary Translation: Translates a block of source instruction to a
block of target instruction Save the translated code for repeated useBigger initial translation cost with smaller
execution costMore advantageous if translated code is
executed frequentlyCode discovery, Code location
InterpretationAn interpreter is the easier way to emulate a CPU. It
just reproduces how a basic CPU works. A basic CPU reads bytes from an address of the
memory pointed by a special register (PC or Program Counter).
These bytes contain the information about the instruction that the CPU must execute (that is what the CPU has to do).
The CPU must decode these bytes and decide what it has to do. Then it performs the action commanded, updates PC counter and reads another byte or bytes.
That is how it works an emulator interpreter. At first it reads a byte (or the number of bytes which form an instruction in the emulated CPU).
switch(memory[PC++]){case OPCODE1:
opcode1();break;
case OPCODE2:opcode2();break;....
case OPCODEn:opcodeN();break;
}
Interpreter emulator.
Basic InterpreterEmulates the whole
source machine state
An interpreter needs to maintain the complete architected state of the machine implementing the source ISA
Guest memory and context block is kept in interpreter’s memory (heap) code and data
general-purpose registers, PC, CC, control registers
Code
Data
Stack
Program Counter
Condition Codes
Reg 0
Reg 1
……
Reg n-1
Interpreter Code
Source Memory StateSource Context
Block
Interpreter Overview
Decode-and-dispatch interpreter
Decode and dispatch interpreter step through the source program one
instruction at a time decode the current instruction dispatch to corresponding interpreter
routine very high interpretation cost
Contd…while (!halt && !
interrupt) {inst = code[PC];opcode =
extract(inst,31,6);switch(opcode) { case
LoadWordAndZero: LoadWordAndZero(inst);
case ALU: ALU(inst);
case Branch: Branch(inst);
. . .}}
Instruction function: LoadLoadWordAndZero(inst){
RT = extract(inst,25,5);RA = extract(inst,20,5);displacement =
extract(inst,15,16);if (RA == 0) source = 0; else source = regs[RA]; address = source + displacement; regs[RT] = (data[address]<< 32)>> 32; PC = PC + 4;}
Contd…Instruction function: ALUALU(inst){
RT = extract(inst,25,5);RA = extract(inst,20,5);RB = extract(inst, 15,5);source1 = regs[RA];source2 = regs[RB];extended_opcode = extract(inst,10,10);switch(extended_opcode) {
case Add: Add(inst);case AddCarrying: AddCarrying(inst);case AddExtended: AddExtended(inst);
. . .}PC = PC + 4;}
Contd…While (!halt&&interrupt){ switch(opcode){ case ALU:ALU(inst); · · · · ·}
Indirect Threaded Interpretation
Indirect Threaded InterpretationThreaded interpretation improves
efficiency by reducing branch overhead
append dispatch code with each interpretation routine removes 3 branches threads together function routines
Indirect Threaded Interpretation
Put the dispatch code to the end of each interpretation routine.
Add: RT=extract(inst,25,5); RA=extract(inst,20,5); RB=extract(inst,15,5); source1=regs[RA]; source2=regs[RB]; sum=source1+source2; regs[RT]=sum; PC=PC+4; If (halt || interrupt) goto exit; inst=code[PC]; opcode=extract(inst,31,6); extended_opcode=extract(inst,10,10); routine=dispatch[ opcode, extended_opcode]; goto *routine;}
Contd…Dispatch occurs indirectly through a dispatch table routine =
dispatch[opcode,extended_opcode]; goto *routine;
Directed threaded interpretation: remove the overhead of accessing the
tableSolution: predecoding and direct
threading
Predecoding
Extracting various fields of an instruction is complicatedFields are not aligned, requiring complex bit
extractionSome related fields needed for decoding is not
adjacentIf it is in a loop, this extraction job should be repeated
PredecodingPre-parsing instructions in a form that is easier to
interpreterDone before interpretation startsPredecoding allows direct threaded interpretation
Predecoding
In PPC, opcode & extended opcode field are separated and register specifiers are not byte-aligned
Define instruction format and define an predecode instruction array based on the format
Struct instruction { unsigned long op; // 32 bit unsigned char dest; // 8 bit unsigned char src1; // 8 bit unsigned int src2; // 16 bit } code [CODE_SIZE];
Pre-decode each instruction based on this format
Predecoding
lwz r1, 8(r2)add r3, r3,r1stw r3, 0(r4)
07
1 2 08
08
3 1 03
37
3 4 00
(load word and zero)
(add)
(store word)
Contd…It is efficient to perform the repeated decoding
operations only once per source machine address.
Previous interpreter code:LoadWordAndZero(inst){
RT = extract(inst,25,5);RA = extract(inst,20,5);displacement = extract(inst,15,16);if (RA == 0) source = 0;else source = regs[RA];address = source + displacement;regs[RT] = (data[address]<< 32)>> 32;PC = PC + 4;
Contd…Predecoded instruction(contained in an array) can now be executed by the following code.
struct instruction { unsigned long op; unsigned char dest; unsigned char src1; unsigned char src1;} code [CODE_SIZE];
Load Word and Zero:RT = code[TPC].dest;RA = code[TPC].src1;displacement = code[TPC].src2; if (RA == 0) source = 0;else source = regs[RA];address = source + displacement;regs[RT] = (data[address]<< 32) >> 32SPC = SPC + 4; TPC = TPC + 1;If (halt || interrupt) goto exit; opcode = code[TPC].oproutine = dispatch[opcode];goto *routine;
Contd…Even with predecoding, indirect
threading includes a centralized dispatch table, which requiresMemory access and indirect jump
To remove this overhead, replace the instruction opcode in predecoded format by address of interpreter routine
001048d0
1 2 08
07
1 2 08
If (halt || interrupt) goto exit;opcode= code[TPC].op;routine=dispatch [opcode];goto *routine;
If (halt || interrupt) goto exit;routine= code[TPC].op;goto *routine;
SUMMARY
Binary TranslationFaster than just to get each opcode from the
emulated CPU and execute a function or code, which implements the function of this instruction, would be to translate that opcode to an opcode in the target CPU.
Binary translation means to get the native code for the emulated CPU and translate this code, using techniques related with compilation for example, into code for the target CPU.
The translated code is stored and executed every time that is called the emulation of the CPU.
Translate source binary program to target binary before execution– is the logical conclusion of predecoding– get rid of parsing and jumps altogether– allows optimizations on the native code– achieves higher performance than
interpretation– needs mapping of source state onto
the host state(statemapping)
x86 Source Binaryaddl %edx,4(%eax)movl 4(%eax),%edxadd %eax,4
Translate to PowerPC Targetr1 points to x86 register context blockr2 points to x86 memory imager3 contains x86 ISA PC value
State MappingMaintaining the state of the source
machine on the host(target) machine.
state includes source registers and memory contents
source registers can be held in host registers or in host memory
reduces loads/stores significantly easier if target registers > source
registers
Register MappingMap source registers
to target registers if target registers <
source registers map some to memory map on per-block
basis Reduces load/storesignificantly
improves performance
r1 points to x86 register context blockr2 points to x86 memory imager3 contains x86 ISA PC valuer4 holds x86 register %eaxr7 holds x86 register %edx etc.addi r16,r4,4 ;add 4 to %eaxlwzx r17,r2,r16 ;load operand from
memoryadd r7,r17,r7 ; perform add of %edxaddi r16,r4,4 ; add 4 to %eaxstwx r7,r2,r16 ; store %edx value into
memoryaddi r4,r4,4 ; increment %eaxaddi r3,r3,9 ; update PC (9 bytes)
Predecoding Vs.Binary TranslationRequirement of interpretation
routines during predecoding.After binary translation, code can
be directly executed.
Dynamic Binary Translation• Dynamic binary translation is the process of
translating executable code of one architecture to another at runtime.
• The translation is usually done by extracting a sequence of instructions from the source program and then translating them to the target architecture;
• The sequence of instructions is called a translation block. The translated block is typically cached to avoid the need of retranslation if the untranslated block were encountered again.
Architecture• The main components of an emulator using
dynamic binary translator are the dispatcher, the translator, and the translation cache.
• Dispatcher:– The purpose of the dispatcher is to control the
execution of a program by selecting the next block to be executed based on the given source address.
– It first checks whether the block at the current address is already translated; that is, whether it resides in the cache.
– If the block is found, it is executed; otherwise, the dispatcher calls the translator.
• Translator:– The translator is responsible for translating a
block of source code to an equivalent block of target code and storing the translated block into the translation cache.
– The block can, for instance, be a sequence of instruction from the current point to the next branch. Moreover, the last instruction of the block, the branch, is replaced by a branch to the dispatcher.
• Translator Cache:– The translation cache stores the translated
blocks. When the dispatcher is called, it first checks whether the next block is already translated; that is, whether the block resides in cache. One requirement for the translation cache is naturally a fast lookup of blocks.
– An important issue concerns the size of the cache: • if the size is small, little memory is consumed and
better locality is achieved;• a large cache fills up more slowly but has worse
locality and requires more memory.
• Instruction Selection:– The phase of instruction selection maps a set
of source instructions to a set of target instructions. The sizes of those sets do not need to match: • a particular PowerPC instruction requires six Alpha
instructions to be implemented. • Another issue in instruction selection are operands.
If, for instance, an immediate operand of a source instruction does not fit to the immediate field of the target instruction, that operand must first be stored somewhere (to a register, for instance) which requires an extra instruction.
Condition Codes:Many architectures have instructions which
set condition codes to reflect state. As an example, the arithmetic instructions in
Intel 386 modify the flags register as a side effect (the add instruction, for instance, modifies six bits in that register).
Generating target instructions to handle those modifications can cause a severe overhead.
Register allocation• The purpose of register allocation is to map
the registers of the source architecture to the target architecture.
• The ideal case for register allocation occurs when the target architecture has at least the same amount of available registers as the source architecture.
• If, on the other hand, the target architecture has less registers, memory will likely to be needed to hold some of the source registers.
Virtual machine is an entity that emulates a guest interface on top of a host machine
Language view:• Virtual machine = Entity that emulates an API (e.g., JAVA) on top of another
Virtualizing software = compiler/interpreter Process view:
• Machine = Entity that emulates an ABI on top of another
• Virtualizing software = runtimeOperating system view:
Machine = Entity that emulates an ISAVirtualizing software = virtual machine monitor
(VMM)