binary translation using peephole superoptimizers
![Page 1: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/1.jpg)
Binary Translation Using Peephole Superoptimizers
Sorav Bansal, Alex Aiken
Stanford University
![Page 2: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/2.jpg)
Binary Translation
• Allow one ISA to run on another
• Applications
– Portability (e.g., running legacy software)
– Virtualization
– Backward and Forward Compatibility
– On-chip binary translation
– Java Virtual Machines
![Page 3: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/3.jpg)
Binary Translation
[Figure: deployment diagrams on x86 hardware. In one, a Hypervisor hosts an x86 OS with x86 apps next to a Binary Translator running a powerpc OS and a powerpc app. In the others, a single OS runs x86 apps next to a Binary Translator running a powerpc app.]
![Page 4: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/4.jpg)
Binary Translation Wish-list
Performance
Large Complex ISAs
Retargetability
OS Compatibility
![Page 5: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/5.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 6: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/6.jpg)
Superoptimization
• A superoptimizer is a code generator that uses brute-force search to attempt to find the optimal code
E.g.,
int signum(int x) { if (x > 0) return 1; if (x < 0) return -1; else return 0; }
On Motorola 68020:
add.l d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1
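To see why the four-instruction 68020 sequence computes signum without any branch, it can help to step through it by hand. The following is a minimal Python sketch that models only the 32-bit data registers and the X (extend) flag. The function name `signum_68020` and the arbitrary initial value of d1 are assumptions made for illustration; the flag behaviour is modelled from the usual description of `add`, `subx`, `negx`, and `addx`, not taken from the slides.

```python
MASK = 0xFFFFFFFF  # 32-bit registers


def signum_68020(x):
    """Emulate the four-instruction sequence; input in d0, result in d1."""
    d0 = x & MASK
    d1 = 0xDEADBEEF  # arbitrary: the sequence must not depend on old d1

    # add.l d0, d0   -> d0 = 2*x, X = carry out (the sign bit of x)
    total = d0 + d0
    xflag = total >> 32
    d0 = total & MASK

    # subx.l d1, d1  -> d1 = d1 - d1 - X, X = borrow
    diff = d1 - d1 - xflag
    xflag = 1 if diff < 0 else 0
    d1 = diff & MASK

    # negx.l d0      -> d0 = 0 - d0 - X, X = borrow
    diff = 0 - d0 - xflag
    xflag = 1 if diff < 0 else 0
    d0 = diff & MASK

    # addx.l d1, d1  -> d1 = d1 + d1 + X
    d1 = (d1 + d1 + xflag) & MASK

    # Reinterpret the 32-bit pattern as a signed integer.
    return d1 - (1 << 32) if d1 & 0x80000000 else d1


if __name__ == "__main__":
    for value in (-7, 0, 7):
        print(value, signum_68020(value))
```

The carry from the first instruction captures the sign of x, and the borrow from `negx` captures whether x is non-zero; the last instruction combines the two.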
![Page 7: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/7.jpg)
Superoptimization
• Enumerate all sequences up to a certain length
and
• Compare each enumerated sequence with target function for equivalence
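The enumerate-and-compare loop can be sketched on a toy machine. Everything below is an assumption made for illustration: a two-register, 4-bit machine with five made-up opcodes, and an exhaustive check over all 256 register states in place of a real equivalence test. It is not the instruction set or search procedure used in the talk.

```python
from itertools import product

WIDTH = 4                      # toy word size, small enough to test exhaustively
MASK = (1 << WIDTH) - 1
REGS = (0, 1)

UNARY = {"inc": lambda a: a + 1, "shl": lambda a: a << 1, "neg": lambda a: -a}
BINARY = {"add": lambda a, b: a + b, "mov": lambda a, b: b}

# Every instruction of the toy ISA: (op, reg) or (op, dst, src).
INSTRUCTIONS = [(op, r) for op in UNARY for r in REGS] + \
               [(op, d, s) for op in BINARY for d in REGS for s in REGS]


def run(seq, r0, r1):
    """Execute seq on the toy machine and return the final value of r0."""
    regs = [r0, r1]
    for ins in seq:
        if len(ins) == 2:
            op, r = ins
            regs[r] = UNARY[op](regs[r]) & MASK
        else:
            op, d, s = ins
            regs[d] = BINARY[op](regs[d], regs[s]) & MASK
    return regs[0]


def superoptimize(target, max_len):
    """Return a shortest sequence leaving target(r0) in r0, or None."""
    states = list(product(range(MASK + 1), repeat=2))
    for length in range(1, max_len + 1):          # shortest first
        for seq in product(INSTRUCTIONS, repeat=length):
            if all(run(seq, a, b) == (target(a) & MASK) for a, b in states):
                return list(seq)
    return None


if __name__ == "__main__":
    print(superoptimize(lambda x: 3 * x, 3))
```

Because lengths are tried in increasing order, the first hit is a shortest sequence for this toy ISA. The search space grows exponentially with length, which is why the approach is limited to short sequences.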
![Page 8: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/8.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 9: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/9.jpg)
Peephole Superoptimization
Use a superoptimizer to automatically infer peephole optimizations
Table of Peephole Optimizations (pattern → replace-with):
add $1, reg → inc reg
mul $2, reg → shl reg
… → …
[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]
![Page 10: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/10.jpg)
Peephole Superoptimizer
Step 1
Harvest instruction sequences that can potentially be optimized. Canonicalize and store them.
[Figure: the raw bits of a binary (a.out) are scanned to produce the target sequences.]
Target Sequences
mov %eax, %ecx; mov %ecx, %eax
sub $123, %eax; add $456, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)
…
![Page 11: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/11.jpg)
Peephole Superoptimization
Step 2
Brute-force optimization turns each target sequence into an optimized sequence.
Target Sequences → Optimized Sequences
mov %eax, %ecx; mov %ecx, %eax → mov %eax, %ecx
sub $123, %eax; add $456, %eax → add $333, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) → inc (%eax)
… → …
![Page 12: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/12.jpg)
Equivalence Test
[Figure: flowchart. Two sequences enter the Execution Test. On fail they are not-equivalent; on pass they go to the Boolean Test. On fail they are not-equivalent; on pass they are equivalent.]
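The two-stage test can be sketched as a cheap filter followed by an exact check. In the sketch below each sequence is modelled as a Python function on one 8-bit value, and the boolean test is replaced by exhaustive enumeration over all 256 inputs. That substitution is an assumption made to keep the example self-contained; the talk uses a SAT solver for this stage. All function names are hypothetical.

```python
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1


def execution_test(f, g, trials=16, seed=0):
    """Fast filter: run both sequences on a few random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randrange(MASK + 1)
        if (f(x) & MASK) != (g(x) & MASK):
            return False          # a single mismatch proves non-equivalence
    return True                   # passing proves nothing yet


def boolean_test(f, g):
    """Exact check. Stand-in for a SAT query: enumerate every input."""
    return all((f(x) & MASK) == (g(x) & MASK) for x in range(MASK + 1))


def equivalent(f, g):
    """Only sequences that survive the cheap test reach the exact one."""
    return execution_test(f, g) and boolean_test(f, g)


if __name__ == "__main__":
    # sub $123 ; add $456  versus  add $333
    print(equivalent(lambda x: x - 123 + 456, lambda x: x + 333))
```

Most candidate sequences fail on the first few test vectors, so the expensive exact check runs only on the rare survivors.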
![Page 13: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/13.jpg)
Peephole Superoptimization
Step 3
Table of Peephole Optimizations
mov %eax, %ecx; mov %ecx, %eax → mov %eax, %ecx
sub $123, %eax; add $456, %eax → add $333, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) → inc (%eax)
… → …
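Once the table exists, applying it is a sliding-window lookup. The sketch below is a simplified illustration under several assumptions: instructions are tuples of strings, only plain register operands are canonicalized (memory operands such as `(%eax)` are left out), and the table holds a handful of hand-written entries modelled on the slides. The names `canonicalize`, `peephole`, and `TABLE` are hypothetical.

```python
REGISTERS = {"%eax", "%ebx", "%ecx", "%edx"}

# Canonical pattern -> replacement, with registers named reg0, reg1, ...
TABLE = {
    (("mov", "reg0", "reg1"), ("mov", "reg1", "reg0")): (("mov", "reg0", "reg1"),),
    (("sub", "$123", "reg0"), ("add", "$456", "reg0")): (("add", "$333", "reg0"),),
    (("add", "$1", "reg0"),): (("inc", "reg0"),),
    (("mul", "$2", "reg0"),): (("shl", "reg0"),),
}
MAX_WINDOW = max(len(pattern) for pattern in TABLE)


def canonicalize(window):
    """Rename registers to reg0, reg1, ... in order of first appearance."""
    mapping, canon = {}, []
    for opcode, *operands in window:
        renamed = []
        for operand in operands:
            if operand in REGISTERS:
                mapping.setdefault(operand, f"reg{len(mapping)}")
                renamed.append(mapping[operand])
            else:
                renamed.append(operand)       # immediates stay as they are
        canon.append((opcode, *renamed))
    return tuple(canon), mapping


def peephole(code):
    """Rewrite code using TABLE, trying the longest window first."""
    out, i = [], 0
    while i < len(code):
        for n in range(min(MAX_WINDOW, len(code) - i), 0, -1):
            canon, mapping = canonicalize(code[i:i + n])
            if canon in TABLE:
                back = {v: k for k, v in mapping.items()}
                for opcode, *operands in TABLE[canon]:
                    out.append((opcode, *(back.get(o, o) for o in operands)))
                i += n
                break
        else:                                  # no window matched here
            out.append(code[i])
            i += 1
    return out


if __name__ == "__main__":
    print(peephole([("mov", "%ecx", "%edx"), ("mov", "%edx", "%ecx"),
                    ("add", "$1", "%eax")]))
```

Canonicalizing register names lets one stored rule match the same pattern regardless of which concrete registers the harvested code happened to use.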
![Page 14: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/14.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 15: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/15.jpg)
Application to Binary Translation
• Our approach: Use lots of peephole transformations
pattern (ppc) → translate-to (x86) [ppc→x86 register map]
addi r1,r1,1 → inc %eax [r1→eax]
mullw r1,r1,2 → shl %eax [r1→eax]
add r1,r1,r2 → add %ecx,%eax [r1→eax; r2→ecx]
![Page 16: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/16.jpg)
Peephole Binary Translation
source arch. (ppc) → destination arch. (x86) [register map]
mr r1, r2; mr r2, r1 → mov %eax, %ecx [r1→eax; r2→ecx]
lis r1, 0x12; ori r1, r1, 0x3456 → mov $0x123456, Mr1 [r1→Mr1]
ldl r2, (r1); addi r2, r2, 1; stl r2, (r1) → inc (%eax) [r1→eax; r2→ecx]
… → … […]
![Page 17: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/17.jpg)
Register Map Selection
• The best code may require changing the register map from one code point to another
• The choice of register map affects instruction selection and vice versa
![Page 18: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/18.jpg)
Register Map Selection
Example
powerpc sequence:
P0: li r1, 123
P1: addi r2, r2, 1
P2: subf r2, r1, r2
P3: ori r1, r1, 31
exit
x86 sequence: ?
At entry: r1→Mr1 ; r2→Mr2
At exit: r1→Mr1 ; r2→Mr2
Cost Model
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10
![Page 19: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/19.jpg)
Register Map Selection
Greedy Strategy
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10
entry: r1→Mr1 ; r2→Mr2
P0: li r1, 123 → movl $123, Mr1 [r1→Mr1]; switching 0, instruction 10
P1: addi r2,r2,1 → incl Mr2 [r2→Mr2]; switching 0, instruction 10
P2: subf r2,r1,r2 → subl Mr1, eax [r1→Mr1 ; r2→eax]; switching 10, instruction 10
P3: ori r1,r1,31 → orl $31, Mr1 [r1→Mr1]; switching 0, instruction 10
exit: r1→Mr1 ; r2→Mr2; switching 10
Total instruction cost 40; total switching cost 20
Grand Total 60
![Page 20: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/20.jpg)
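Register Map Selection
Optimal Solution
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10
entry: r1→Mr1 ; r2→Mr2
P0: li r1, 123 → movl $123, eax [r1→eax]; switching 10, instruction 1
P1: addi r2,r2,1 → incl ecx [r2→ecx]; switching 10, instruction 1
P2: subf r2,r1,r2 → subl eax, ecx [r1→eax ; r2→ecx]; switching 0, instruction 1
P3: ori r1,r1,31 → orl $31, eax [r1→eax]; switching 0, instruction 1
exit: r1→Mr1 ; r2→Mr2; switching 20
Total instruction cost 4; total switching cost 40
Grand Total 44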
![Page 21: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/21.jpg)
Register Map Selection
• Use Dynamic Programming
– near-optimal solution
– account for translations spanning multiple instructions
– simultaneously perform instruction-selection and register-mapping
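The greedy and dynamic-programming strategies can be compared on the four-instruction example using the cost model from the slides (instruction 10 if it touches memory, else 1; switching 10 per register moved between memory and a register). The sketch below is a simplification under stated assumptions: each step translates exactly one PowerPC instruction, a register map is reduced to "M" or "R" per PowerPC register, and the candidate list per instruction (`CANDIDATES`) is hypothetical, read off the translations that appear on the two example slides. It is not the authors' algorithm, which also handles multi-instruction rules.

```python
from itertools import product

REGS = ("r1", "r2")
SWITCH_COST = 10                         # R->M or M->R, per register
ENTRY = EXIT = ("M", "M")                # r1->Mr1 ; r2->Mr2
ALL_MAPS = list(product("MR", repeat=len(REGS)))

# Hypothetical candidates: (required register locations, instruction cost).
CANDIDATES = [
    ("li r1,123",     [({"r1": "M"}, 10), ({"r1": "R"}, 1)]),
    ("addi r2,r2,1",  [({"r2": "M"}, 10), ({"r2": "R"}, 1)]),
    ("subf r2,r1,r2", [({"r1": "M", "r2": "R"}, 10),
                       ({"r1": "R", "r2": "R"}, 1)]),
    ("ori r1,r1,31",  [({"r1": "M"}, 10), ({"r1": "R"}, 1)]),
]


def switch_cost(a, b):
    """Cost of moving from register map a to register map b."""
    return SWITCH_COST * sum(x != y for x, y in zip(a, b))


def instr_cost(cands, reg_map):
    """Cheapest candidate compatible with reg_map, or None if there is none."""
    loc = dict(zip(REGS, reg_map))
    costs = [c for req, c in cands if all(loc[r] == v for r, v in req.items())]
    return min(costs) if costs else None


def greedy():
    """Pick the locally cheapest map at every instruction."""
    cur, total = ENTRY, 0
    for _, cands in CANDIDATES:
        options = []
        for m in ALL_MAPS:
            ic = instr_cost(cands, m)
            if ic is not None:
                options.append((switch_cost(cur, m) + ic, m))
        step, cur = min(options)
        total += step
    return total + switch_cost(cur, EXIT)


def dynamic_programming():
    """best[m] = cheapest cost of reaching the current point with map m."""
    best = {ENTRY: 0}
    for _, cands in CANDIDATES:
        nxt = {}
        for m in ALL_MAPS:
            ic = instr_cost(cands, m)
            if ic is not None:
                nxt[m] = min(c + switch_cost(p, m) for p, c in best.items()) + ic
        best = nxt
    return min(c + switch_cost(m, EXIT) for m, c in best.items())


if __name__ == "__main__":
    print("greedy:", greedy(), "dynamic programming:", dynamic_programming())
```

Under these assumptions the greedy walk never pays the up-front switching cost, so it keeps paying for memory operands, while the table-based search accepts early switches that pay off over the following instructions.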
![Page 22: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/22.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 23: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/23.jpg)
PowerPC → x86 Translator Implementation
• Superoptimizer
– Use a PPC emulator (Qemu) for execution test
– Use a SAT solver (zChaff) for boolean test
• Static user-level translator
– ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary
– Translate most (but not all) system calls
![Page 24: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/24.jpg)
Implementation
Endianness: ppc big-endian ; x86 little-endian
– Convert all memory writes to big-endian (source)
– Convert all memory reads to little-endian (dest)
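The endianness rule above can be illustrated with a small model of guest memory. This is a sketch under assumptions: the class `GuestMemory` and its method names are hypothetical, and the byte swap is done with Python integer conversions. How the real translator emits the swap on x86 is not stated in the slides.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "little"), "big")


class GuestMemory:
    """Memory image that always holds data in the source (big-endian) layout."""

    def __init__(self, size):
        self.buf = bytearray(size)

    def store32(self, addr, value):
        # Writes: convert the host value to big-endian bytes.
        self.buf[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")

    def load32(self, addr):
        # Reads: convert big-endian bytes back to a host value.
        return int.from_bytes(self.buf[addr:addr + 4], "big")

    def load8(self, addr):
        # Single bytes need no conversion.
        return self.buf[addr]


if __name__ == "__main__":
    mem = GuestMemory(16)
    mem.store32(0, 0x12345678)
    print(hex(mem.load8(0)), hex(mem.load32(0)), hex(bswap32(0x12345678)))
```

Keeping memory in the source layout means byte-wise accesses by the PowerPC program see the bytes in the order it expects, at the price of a swap on every multi-byte load and store.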
Compiler Optimizations
– Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
– Solution: Cluster data-dependent instructions in basic block before translation
• Many Issues
– Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
![Page 25: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/25.jpg)
Experimental Results
• Setup
– Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
– gcc 4.0.1, glibc 2.3.6
– Use soft-float library
– Statically-linked input executables
• Benchmarks
– Microbenchmarks, SPEC CINT2000
• Metrics
– Compare against natively-compiled code
– Compare against other binary translators
• Qemu, Apple’s Rosetta
![Page 26: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/26.jpg)
Experimental Setup
• For our experiments
– there are around 750 translation rules in the peephole table
– the translation table is computed offline and it can take up to a week to compute the peephole rules
![Page 27: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/27.jpg)
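Experimental Results: Setup
[Figure: C source is compiled with gcc <options> -arch=ppc into a PowerPC executable and with gcc <options> -arch=x86 into an x86 executable. Peephole Binary Translation turns the PowerPC executable into a second x86 executable, and the two x86 executables are compared.]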
![Page 28: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/28.jpg)
Microbenchmarks
emptyloop A bounded for-loop doing nothing
fibo Compute first few fibonacci numbers
quicksort Quicksort on 64-bit integers
mergesort Mergesort on 64-bit integers
bubblesort Bubblesort on 64-bit integers
hanoi1 Towers of Hanoi Algorithm 1
hanoi2 Towers of Hanoi Algorithm 2
hanoi3 Towers of Hanoi Algorithm 3
traverse Traverse a linked list
binsearch Binary search on a sorted array
![Page 29: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/29.jpg)
Microbenchmarks
[Figure: bar chart of performance as a percentage of native (%) for emptyloop, fibo, quicksort, mergesort, bubsort, hanoi1, hanoi2, hanoi3, traverse, and binsearch, with series O0, O2, and O2 -omit-frame-pointer.]
avg: 90% of native
![Page 30: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/30.jpg)
Experimental Results: Microbenchmarks
• We sometimes outperform native performance on these small benchmarks!
– gcc generates better code for powerpc primarily because it has the luxury of many registers
– Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.
![Page 31: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/31.jpg)
Experimental Results: SPEC CINT2000
[Figure: bar chart of performance as a percentage of native (%) for bzip2, gap, gzip, mcf, parser, twolf, and vortex, with series O0 and O2.]
![Page 32: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/32.jpg)
Comparisons with Qemu and Rosetta
• Qemu
– Use same PowerPC and x86 executables as used for our own translator
• Rosetta
– Runs on Mac OS X and hence supports only Mac executables
– Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
– Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory
![Page 33: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/33.jpg)
Comparisons with Qemu and Rosetta
[Figure: two bar charts (-O0 and -O2) of performance as a percentage of native for bzip2, gap, gzip, mcf, parser, twolf, and vortex, with series qemu, rosetta, and peep. Annotations: "avg: 3% faster than rosetta" and "avg: 12% faster than rosetta".]
![Page 34: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/34.jpg)
Translation Time
• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
– majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
– For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
– For other 2K, use optimal computation
![Page 35: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/35.jpg)
Conclusions and Future Work
• A scheme to perform efficient binary translation using a superoptimizer
– Competitive performance
– Simplified Design
• Other applications
– Just-in-time compilation
– Machine virtualization
![Page 36: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/36.jpg)
Q&A Thank you.
![Page 37: Binary Translation Using Peephole Superoptimizers](https://reader035.vdocument.in/reader035/viewer/2022062800/56814183550346895dad70ba/html5/thumbnails/37.jpg)
Backup Slides