![Page 1: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/1.jpg)
SuperPin: Parallelizing Dynamic Instrumentation
for Real-Time Performance
Steven Wallaceand Kim Hazelwood
![Page 2: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/2.jpg)
2 Hazelwood – CGO 2007
Dynamic Binary Instrumentation
Inserts user-defined instructions into executing binaries
•Easily•Efficiently•Transparently
Why?•Detect inefficiencies•Detect bugs•Security checks•Add features
Examples•Valgrind, DynamoRIO, Strata, HDTrans, Pin
EXE
Transform
CodeCache
Execute
Profile
![Page 3: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/3.jpg)
3 Hazelwood – CGO 2007
Intel Pin
• A dynamic binary instrumentation system
• Easy-to-use instrumentation interface
• Supports multiple platforms– Four ISAs – IA32, Intel64, IPF, ARM– Four OSes – Linux, Windows, FreeBSD, MacOS
• Robust and stable (Pin can run itself!)– 12+ active developers– Nightly testing of 25000 binaries on 15 platforms– Large user base in academia and industry – Active mailing list (Pinheads)
• 11,500 downloads
![Page 4: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/4.jpg)
4 Hazelwood – CGO 2007
Our Goal: Improve Performance
The latest Pin overhead numbers …
100%
120%
140%
160%
180%
200%
perlbench
sjeng
xalancbm
k
gobm
k
gcc
h264ref
omnetpp
bzip2
libquantum mcf
astar
hmmer
Rel
ativ
e to
Nat
ive
![Page 5: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/5.jpg)
5 Hazelwood – CGO 2007
Adding Instrumentation
100%
200%
300%
400%
500%
600%
700%
800%
perlbench
sjeng
xalancbm
k
gobm
k
gcc
h264ref
omnetpp
bzip2
libquantum mcf
astar
hmmer
Rel
ativ
e to
Nat
ive Pin
Pin+icount
![Page 6: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/6.jpg)
6 Hazelwood – CGO 2007
Sources of Overhead
Internal
• Compiling code & exit stubs (region detection, region formation, code generation)
• Managing code (eviction, linking)
• Managing directories and performing lookups
• Maintaining consistency (SMC, DLLs)
External
• User-inserted instrumentation
![Page 7: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/7.jpg)
7 Hazelwood – CGO 2007
“Normal Pin” Execution Flow
Instrumentation is interleaved with application
Uninstrumented Application
Instrumented Application
Pin Overhead
Instrumentation Overhead
“Pinned” Application
time
![Page 8: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/8.jpg)
8 Hazelwood – CGO 2007
“SuperPin” Execution Flow
SuperPin creates instrumented slices
Uninstrumented Application
SuperPinned Application
Instrumented Slices
![Page 9: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/9.jpg)
9 Hazelwood – CGO 2007
Issues and Design Decisions
Creating slices• How/when to start a slice• How/when to end a slice
System calls
Merging results
![Page 10: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/10.jpg)
10 Hazelwood – CGO 2007
for k
S6+fo
r k
S5+fo
r k
S4+fo
r kre
cord
si
gr3
, sl
eep
S3+
for k
S2+
sleep
S1+
Execution Timelinefo
r k
S1 S2 S3 S4 S5 S6
dete
ct
sigr4
dete
ct
exit
resu
me
dete
ct
sigr3
dete
ct
sigr6
dete
ct
sigr2
dete
ct
sigr5
resu
me
reco
rd
sigr4
, sl
eep
CPU2
CPU3
CPU4
time
reco
rd
sigr2
, sl
eep
resu
me
resu
me
reco
rd
sigr5
, sl
eep
resu
me
reco
rd
sigr6
, sl
eep
resu
me
original application
instrumentedapplication slices
CPU1
![Page 11: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/11.jpg)
11 Hazelwood – CGO 2007
Starting Slices
How?
• Master application process – ptrace
• Controlling process
• Child slices – fork– Reserve memory for transparency– Each slice has its own code cache (for now)
When?
• Timeouts– Uses a special timer process– Tunable parameter
• System calls
![Page 12: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/12.jpg)
12 Hazelwood – CGO 2007
S1
Handling System Calls
Problem: Don’t want to duplicate system calls in the main application/slices
Solutions:
• brk or anonymous mmap – duplication OK
• frequent calls – record and playback
• default – trigger new timeslice
S2
S1Instr AInstr BSysCallInstr CInstr D
Instr AInstr BMMAPInstr CInstr D
S1
Instr AInstr BSysCallPlayInstr CInstr D
Duplicate Record/Playback End Slice
![Page 13: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/13.jpg)
13 Hazelwood – CGO 2007
Ending Slices
Each slice is responsible for detecting its own end-of-slice condition
S4+
record sig4, sleep
resume
detect sig5
Challenges: • Need to efficiently capture a point in time (signature)• Need to efficiently detect when we’ve reached that point
![Page 14: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/14.jpg)
14 Hazelwood – CGO 2007
Signature Detection
End-of-slice conditions:
1. System calls – easy to detect
2. Timeouts at arbitrary points – harder to detect
Signature match:
• Instruction pointer
• Architectural state
• Top of stack
![Page 15: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/15.jpg)
15 Hazelwood – CGO 2007
Implementing Signature Detection
Uses Pin’s lightweight conditional analysis• INS_InsertIfCall – lightweight inlined check• INS_InsertThenCall – heavyweight (conditional)
analysis routine
Instrument the end-of-slice instruction pointer
1. Lightweight check – two registers
2. Heavyweight check – full architectural state
3. Heavyweight check – top 100 words on the stack
• Lightweight triggers heavyweight: ~2%
• Stack check fails: ~0%
![Page 16: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/16.jpg)
16 Hazelwood – CGO 2007
Performance Results
Icount1 – Instruments every instruction with count++
% pin –t icount1 -- ./helloHello CGOCount: 496043
Icount2 – Instruments every basic block with count += bblength
% pin –t icount2 -- ./helloHello CGOCount: 496043
![Page 17: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/17.jpg)
17 Hazelwood – CGO 2007
Performance – icount1
% pin –t icount1 -- <benchmark>
0%
500%
1000%
1500%
2000%
2500%
3000%
ammp
applu
apsi art
bzip2
crafty
eon
equake
facerec
fma3d
galgel
gap
gcc
gzip
lucas
mcf
mesa
mgrid
parser
perlbm
sixtrack
swim
twolf
vortex vpr
wupwis
AVG
Pin SuperPin
![Page 18: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/18.jpg)
18 Hazelwood – CGO 2007
Performance – icount2
% pin –t icount2 -- <benchmark>
0%
200%
400%
600%
800%
1000%am
mp
applu
apsi art
bzip2
crafty
eon
equake
facerec
fma3d
galgel
gap
gcc
gzip
lucas
mcf
mesa
mgrid
parser
perlbm
sixtrack
swim
twolf
vortex vpr
wupwis
AVG
Pin SuperPin
![Page 19: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/19.jpg)
19 Hazelwood – CGO 2007
Performance Scalability
0
100
200
300
400
500
600
1 2 4 8 12 16Max # Running Slices
Ru
nti
me
(sec
)
Running on an 8-processor HT-enabled machine (16 virtual processors)
![Page 20: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/20.jpg)
20 Hazelwood – CGO 2007
Overhead Categorization
Where is SuperPin spending its time?• Executing the application• Fork overheads• Sleeping (waiting to start a slice)• Pipeline delays
for k for k for k for k
S1 S2 S3 S4
reco
rd
sigr2
, sl
eep
dete
ct
sigr3
resu
me
reco
rd
sigr4
, sl
eep
dete
ct
exit
resu
me
sleep
dete
ct
sigr2
resu
me
dete
ct
sigr4
reco
rd
sigr3
, sl
eep
resu
me
CPU3
S2+ S4+
S1+ S3+
![Page 21: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/21.jpg)
21 Hazelwood – CGO 2007
Overhead Categorization
0
40
80
120
160
200
Ru
nti
me
(sec
)
.5s 1s 2s 4s
Timeslice (sec)
native fork&others sleep pipeline
![Page 22: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/22.jpg)
22 Hazelwood – CGO 2007
The SuperPin API
You may write SuperPin-aware Pintools:– SP_Init(fun)– SP_AddSliceBeginFunction(fun,val)– SP_AddSliceEndFunction(fun,val)– SP_EndSlice()– SP_CreateSharedArea(local,size,merge)
You may also control (via switches):– Spmsec {value}: milliseconds per timeslice– Spmp {value}: maximum slice count– Spsysrecs {value}: maximum syscalls per slice
![Page 23: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/23.jpg)
23 Hazelwood – CGO 2007
Pin Instrumentation API – icount2
VOID DoCount(INT32 c) { icount += c;}
VOID Trace(TRACE trace, VOID *v) {for (BBL bbl=Head(trace); Valid(bbl); bbl=Next(bbl)) { INS_InsertCall(BBL_InsHead(bbl), IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_INT32, BBL_NumIns(bbl), IARG_END); }
}
int main(INT32 argc, CHAR **argv) {
PIN_Init(argc, argv);
TRACE_AddInstrumentFunction(Trace);
PIN_StartProgram();
return 0;
}
![Page 24: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/24.jpg)
24 Hazelwood – CGO 2007
SuperPin Version of Icount2
VOID DoCount(INT32 c) {//same as before}
VOID ToolReset(INT32 c) { icount = 0;} VOID Merge(INT32 sliceNum) { *sharedData += icount;}
VOID Trace(TRACE trace, VOID *v) {//same as before}
int main(INT32 argc, CHAR **argv) { PIN_Init(argc, argv); SP_Init(ToolReset); sharedData = (UINT64*) SP_CreateSharedArea(&icount, sizeof(icount),0); SP_AddSliceEndFunction(Merge); TRACE_AddInstrumentFunction(Trace); PIN_StartProgram(); return 0; }
![Page 25: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/25.jpg)
25 Hazelwood – CGO 2007
SuperPin Limitations
Not all instrumentation tasks are a good fit
Great fit – independent tasks• Instruction profiling (counts, distributions)• Trace generation
Requires massaging – dependent tasks• Branch prediction• Data cache simulation
– Assume a starting state, resolve later
Stick with regular Pin – path modification• Adaptive execution
![Page 26: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/26.jpg)
26 Hazelwood – CGO 2007
Future (Rainy Day) Extensions
• Adaptive parallelism detection– Hardware feedback: adapts to available processors
– OS feedback: adapts to present load
• Adaptive slice timeouts
• Slice-shared code caches
• Multithreaded application support
![Page 27: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/27.jpg)
27 Hazelwood – CGO 2007
SuperPin Summary
Allows users to leverage available parallelism for certain instrumentation tasks
• Hides the gory details
• Enables significant speedup (for the right tasks … on the right machines)
• Exposed as Pin API extensions
Download it today!
http://rogue.colorado.edu/pin
![Page 28: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/28.jpg)
28 Hazelwood – CGO 2007
Thank You
![Page 29: SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance Steven Wallace and Kim Hazelwood](https://reader036.vdocument.in/reader036/viewer/2022062305/56649cca5503460f9499227e/html5/thumbnails/29.jpg)
29 Hazelwood – CGO 2007
Merging Results
Each slice is a separate process must merge slice-local results into a collective total
Merge routines:
• Addition
• Concatenation
• User-defined
Executed in program order
Use shared memory