rtr: 1 byte/kilo- i nstruction race recording
Post on 31-Jan-2016
41 Views
Preview:
DESCRIPTION
TRANSCRIPT
RTR: 1 Byte/Kilo-InstructionRace Recording
Min Xu Rastislav Bodik Mark D. Hill
2
% gcc sim.c% a.outSegmentation fault%
% gdb a.outgdb> runProgram received SIGSEGV.In get() at hash.c:4545 a = bucket->d;
% gdb a.outgdb> runProgram exited normally.gdb>
% gcc para-sim.c% a.outSegmentation fault%
Why Do You Need a Recorder?
% gdb a.out loggdb> runProgram received SIGSEGV.In get() at para-hash.c:6767 a = bucket->d;
% gcc para-sim.c% a.outSegmentation faultRace recorded in “log”%
3Ideally …
% gdb a.out loggdb> runProgram received SIGSEGV.In get() at para-hash.c:6767 a = bucket->d;
% gcc para-sim.c% a.outSegmentation faultRace recorded in “log”%
Long recording:small logLow runtime
overheadLow cost
Applicability:Programs – data race
Systems – non-SC
4Better and Better Recorders
Low Cost
Low Overhead
Small Log Size
Applicability
Data Race Non-SC
InstRply ’87 Y Y
B. & G. ’91 YNetzer’93 YDéjà Vu
’98 YRecPlay
’00JaRec ’04
FDR ’03BugNet ’05
YReEnact
‘03 Y
CORD ‘06 YStrata ’06 YRTR ‘06 Y Y
1 Byte/Kilo-Instruction[ASPLOS’06]
A New Recorder
HardwareAcceleration
[ISCA’03]
LessHardware[ASPLOS’06]
SC & TSO[ASPLOS’06]
Result: One more step toward practical
This talk covers only RTR• Regulated Transitive Reduction algorithm
6Outline
Race Recording
RTR Algorithm
Results with Commercial Workloads
Compress log during recording replay more “regularly”
Conclusion
Technically, what’s race recording?
8Race Recording
X=6
X = 1
X++
print(X)
X = 1
X++
print(X)
-X = X*5
--
---
X = X*5-
Thread IThread J
Original Replay
X=10
Recording
X= 6
-X = X*5
--
Log
Thread IThread J
9
Goal: Reproduce same conflicts with minimum log data
Terminologies and Assumptions
ld A
Thread I Thread J
Recording
st B
st C
sub
ld B
add
st C
ld B
st A
st C
Thread I Thread J
Replay
Log
ld D
st D
ld A
st B
st C
sub
ld B
add
st C
ld B
st A
st C
ld D
st D
Conflicts(red)
Dependence(black)
Regulated Transitive Reduction (RTR)
11Log All Conflicts
1
2
3
4
5
6
1
2
3
4
5
6
ld A
Thread I Thread J
Replay
st B
st C
sub
ld B
add
st C
ld B
st A
st C
ld D
st D
Log J: 23 14 35 46
Log I: 23
Log Size: 5*16=80 bytes(10 integers)
Dependence Log
16 bytes
But too many conflicts
12Netzer’s Transitive Reduction (TR)
1
2
3
4
5
6
1
2
3
4
5
6
ld A
Thread I Thread J
Replay
st B
st C
sub
ld B
add
st C
ld B
st A
st C
ld D
st D
TR reduced Log J: 23
35 46
Log I: 23
Log Size: 64 bytes(8 integers)
TR Reduced Log
How to further reduce log size?
13The Intuition of the RTR Algorithm
After Reduction
From I to J
From J to I
Vectors
Vectors“Regulate” Replay
14
Stricter Dependences to Aid Vectorization
1
2
3
4
1
2
3
4
ld A
Thread I Thread J
Replay
st B
st C
add
st C
ld B
st Ald D
5 5sub st C
6 6ld B st D
Log J: 23 45
Log I: 23
Log Size: 48 bytes(6 integers)
New Reduced Log
stricter
Reduced
Fewer dependencies to log
15Compress Vectorized Dependencies
1
2
3
4
5
6
1
2
3
4
5
6
ld A
Thread I Thread J
Replay
st B
st C
sub
ld B
add
st C
ld B
st A
st C
ld D
st D
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log Size: 40 bytes(5 integers)
Vectorized Log
VectorDeps.
TRRTR: fewer deps + fewer byte/dep
16Deadlock Avoidance of RTR
1
2
3
4
5
6
1
2
3
4
5
6
ld A
Thread I Thread J
Recording
st B
st C
sub
ld B
add
st C
ld B
st A
st C
ld D
st D
Limit the strict dependencies (see paper)
i:4j:1 j:2 i:3 i:4
Replay Cycle
Results with Commercial Workloads
18Full-system Simulation Method
Commercial server hardware• GEMS: http://www.cs.wisc.edu/gems• Full-system (OS + application) executions• 4-core CMP (Sequential Consistent)
• 1-way in-order issue, 2 GHz, • 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory
Commercial server software• Apache – static web serving• SpecJBB – middleware• OLTP – TPC-C like• Zeus – static web serving
L2C0
C1
C3
C2
19Log Size: 1 byte/KI
0.0
0.5
1.0
1.5
2.0byte/core/KI
ApacheJBB OLTP Zeus AVG0
50
100
150
200KB/core/s
ApacheJBB OLTP Zeus AVG
Less buffer, longer recording, smaller logs
20RTR vs. Netzer’s TR
TR
RTR
0
20
40
60
80
100
ApacheJBB OLTP ZeusAVG
Lo
g
Siz
e 28% smaller log• TR was “optimal”
21Why Does RTR Work Well?
RTR•Instructions execute at similar speed•Dependencies are often “vectorizable”
A New Recorder
HardwareAcceleration
[ISCA’03]
1 Byte/Kilo-Instruction[ASPLOS’06]
LessHardware[ASPLOS’06]
SC & TSO[ASPLOS’06]
Result: One more step toward practical
“Less hardware” & “TSO” not covered• Equally important• More details in the paper
23Conclusion
Race recording Counter nondeterminism
RTR 1 byte/kilo-instruction•Based on Netzer’s transitive reduction•Create stricter dependencies•Vectorize dependencies to compress log•Avoid overly-strict hence no deadlock
Future work•Support snooping, SMT, replayer
top related