Effective and Inexpensive (Memory) Race Recording
Min Xu
Thesis Defense
05/04/2006
Electrical and Computer Engineering Department, UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood
Overview
Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism
A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Support both SC & TSO (x86-like)
[Figure: an effective and inexpensive race recorder: long recording, more applicable, low overhead, low cost]
Thesis Contributions
Effective
• RTR Algorithm: Small Log Size
• Order-Value Hybrid: SC & TSO Applicability
Inexpensive
• Coherence Piggyback: Low Runtime Overhead
• Set/LRU Approximation: Low Cost Hardware
Outline
Motivation & Problem (5 slides)
An Effective and Inexpensive Race Recorder (21 slides)
• RTR Algorithm
• Coherence Piggyback
• Set/LRU Approximation
• Order-Value Hybrid
Evaluation Method & Results (6 slides)
Conclusion & My Other Research (3 slides)
Motivation & Problem
Multithreaded Debugging

% gcc hash.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45    a = bucket->d;

% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-hash.c
% a.out
Segmentation fault
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67    a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%
Race Recording
[Figure: Thread I runs X = 1; X++; print(X) while Thread J runs X = X*5. Depending on the interleaving, the final value is X=6 or X=10. Recording the race in a log lets the replay reproduce the original interleaving and the original value.]
Recording for Multithreaded Replay
Race Recording (the focus of this thesis)
• Not an issue for a single thread
• Create the same general & data races
Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not the focus
Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not the focus
A Good Race Recorder
[Figure: the para-hash.c crash is recorded in “log” and then reproduced under gdb from that log]
• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability
Desired & Existing Race Recorders
[Table comparing recorders against the desired recorder]
• Criteria: Recording Length (small log size), Applicability (MP, racy code, SC, TSO), Overhead (negligible slowdown), Cost (little hardware)
• Recorders: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder
RTR Algorithm: Small Log Size
Problem Formulation
Reproduce exact same conflicts: no more, no less
[Figure: recording and replay of Thread I and Thread J (loads and stores to A, B, C, D, plus add and sub), connected by a log; conflicts are drawn in red, dependences in black]
Log All Conflicts
Detect conflicts, write log
Assign IC (logical timestamps), but too many conflicts
Dependence log (16 bytes per dependence)
• Log J: 2→3, 1→4, 3→5, 4→6
• Log I: 2→3
• Log size: 5*16 = 80 bytes (10 integers)
[Figure: Thread I and Thread J, instruction counts 1 to 6, with every conflict logged]
Netzer’s Transitive Reduction
TR reduced log
• Log J: 2→3, 3→5, 4→6
• Log I: 2→3
• Log size: 64 bytes (8 integers)
[Figure: same example, Thread I and Thread J at instruction counts 1 to 6, with the transitively implied dependence removed]
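The reduction on this slide can be reproduced with a small sketch. It assumes a simplified pairwise filter with scalar instruction counts, not Netzer's full vector-clock algorithm; the function name `pairwise_reduce` and the 8 bytes per integer (taken from "10 integers = 80 bytes") are assumptions for illustration.

```python
# Hypothetical helper: drop dependences already implied by an earlier logged
# dependence from the same source thread (pairwise reduction, scalar ICs).
def pairwise_reduce(deps):
    """deps: list of (src_ic, dst_ic) from one source thread to one
    destination thread. Returns the subset that still must be logged."""
    kept = []
    newest_src = 0  # largest source IC already ordered before the destination
    for src_ic, dst_ic in sorted(deps, key=lambda d: d[1]):
        if src_ic > newest_src:
            kept.append((src_ic, dst_ic))
            newest_src = src_ic
        # otherwise: implied by program order plus an earlier kept dependence
    return kept

log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]  # I -> J dependences from the slide
log_i = [(2, 3)]                          # J -> I dependence from the slide
reduced_j = pairwise_reduce(log_j)
reduced_i = pairwise_reduce(log_i)
total_integers = 2 * (len(reduced_j) + len(reduced_i))
print(reduced_j, reduced_i, total_integers * 8, "bytes")
```

The dependence 1→4 is dropped because 2→3 already orders I's instruction 1 before J's instruction 4, which leaves 8 integers (64 bytes).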
The Intuition of the New RTR Algorithm
[Figure: after reduction, the dependences from I to J and from J to I are grouped into vectors; the vectors regulate replay (RTR)]
Stricter Dependences to Aid Vectorization
New reduced log
• Log J: 2→3, 4→5
• Log I: 2→3
• Log size: 48 bytes (6 integers)
[Figure: same example; a stricter dependence is introduced so that the original dependences it implies can be reduced]
Compress Vectorized Dependencies
Vectorized log
• Log J: x=3,5, ∆=1
• Log I: x=3, ∆=1
• Log size: 40 bytes (5 integers)
Reduce log size to KB/core/second
[Figure: same example with the remaining dependences drawn as vector dependences]
Coherence Piggyback: Low Runtime Overhead
Detect Conflicts
[Figure: recording of Thread I (1: ld A, 2: st B, 3: st C) and Thread J (1: add, 2: st C, 3: ld B, 4: st A), with per-address metadata such as A.readers and A.writer]
I:1 ld A
  A.readers.add(I, 1)
I:2 st B
  B.writer = (I, 2)
J:2 st C
  C.writer = (J, 2)
I:3 st C
  if (C.writer != I) log(WAW)
  foreach reader in C.readers: if (reader != I) log(WAR)
  C.readers.clear()
  C.writer = (I, 3)
J:3 ld B
  if (B.writer != J) log(RAW)
  B.readers.add(J, 3)
…
Expensive in software
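A minimal software sketch of the bookkeeping on this slide follows. The `Shadow` class, the `load`/`store` helpers, and the chosen interleaving are assumptions for illustration; the slide only gives the per-address `readers` and `writer` fields and the WAW/WAR/RAW checks.

```python
from collections import defaultdict

class Shadow:
    """Per-address metadata: last writer and the readers since that write."""
    def __init__(self):
        self.writer = None   # (thread, ic) of the last store
        self.readers = {}    # thread -> ic of its latest load

shadow = defaultdict(Shadow)
log = []  # entries: (kind, (src_thread, src_ic), (dst_thread, dst_ic))

def load(thread, ic, addr):
    s = shadow[addr]
    if s.writer is not None and s.writer[0] != thread:
        log.append(("RAW", s.writer, (thread, ic)))
    s.readers[thread] = ic

def store(thread, ic, addr):
    s = shadow[addr]
    if s.writer is not None and s.writer[0] != thread:
        log.append(("WAW", s.writer, (thread, ic)))
    for reader, reader_ic in s.readers.items():
        if reader != thread:
            log.append(("WAR", (reader, reader_ic), (thread, ic)))
    s.readers.clear()
    s.writer = (thread, ic)

# One possible interleaving of the slide's example (assumed order).
load("I", 1, "A")
store("I", 2, "B")
store("J", 2, "C")
store("I", 3, "C")
load("J", 3, "B")
store("J", 4, "A")
for entry in log:
    print(entry)
```

Every load and store pays for a metadata lookup and update, which is why the slide calls this expensive in software.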
Use Cache and Cache Coherence
Proc I cache (Tag, State, Data, Timestamp)
• A, S, …, 1
• B, M, …, 4
Proc J cache (Tag, State, Data, Timestamp)
• A, S, …, 3
• B, I, …, 2
The cache state plays the role of A.readers/A.writer and B.readers/B.writer
[Figure: a ld B sends a Get/S request; the data response carries the timestamp; the RAW is detected and logged]
Detect conflict in hardware with little runtime cost
Cache Evictions and Writebacks
[Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp) and the directory entry of A: Shared(I,J), Owner(). A st A triggers coherence messages (Get/S, Inv, Ack) after a line has been replaced by C (M, timestamp 3); the question is which timestamp the Ack carries. The WAR is still detected and logged.]
OK with nonsilent eviction & directory eviction
Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction
Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies
Hardware Implementation
Cache
• Eviction/writeback: solved, more details later
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved
Processor
• Out-of-order/prefetching: solved
• Unordered message: solved
• Counter overflow: solved
• Thread migration: not yet solved
Set/LRU Approximation: Low Cost Hardware
Timestamp Approximation
[Figure: one set of I’s cache (Tag, State, Data, Timestamp) holds A, S, …, 1 and B, M, …, 2; C (M, timestamp 3) replaces a line; directory of A: Shared(I). Recording of Thread I and Thread J at instruction counts 1 to 3.]
Use current IC of thread I for the evicted line
Correct, but more evictions mean more logged conflicts
Tradeoff: hardware cost vs. log size
Set/LRU Approximation
[Figure: same example; one set of I’s cache holds A, S, …, 1 and B, M, …, 2, and C (M, timestamp 3) replaces a line, compared against using the current IC of thread I]
LRU guarantees B’s TS > A’s TS
Set/LRU better preserves reducibility
Small cache: more misses but still a small log
Hardware Cost of Timestamps
Coupled timestamp memory: overhead proportional to cache size
• Not flexible
• 64B line + 64b (24b) timestamp: 12.5% (4.7%) overhead
• 192 KB for a 4MB L2
Need to modify cache
[Figure: coupled timestamp memory; each cache line holds Tag, State, Data, Timestamp (A, S, …, 1 and B, M, …, 2)]
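The overhead figures on this slide follow from simple arithmetic, sketched below. The helper names are hypothetical; the line size, timestamp widths, and L2 size are the ones quoted on the slide, and the 192 KB figure is assumed to use 24-bit timestamps.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def overhead_percent(timestamp_bits):
    """Timestamp bits as a percentage of the data bits in one cache line."""
    return 100.0 * timestamp_bits / (LINE_BYTES * 8)

def coupled_storage_kb(timestamp_bits):
    """Total timestamp storage when every L2 line carries one timestamp."""
    lines = L2_BYTES // LINE_BYTES
    return lines * timestamp_bits / 8 / 1024

print(overhead_percent(64))    # 12.5
print(overhead_percent(24))    # 4.6875, quoted as 4.7% on the slide
print(coupled_storage_kb(24))  # 192.0
```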
Decoupled Timestamp Memory
Decoupling: small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way: 99% transitive reduction
• Timestamp memory: 24 KB
No need to modify cache
[Figure: the coupled timestamp memory (Tag, State, Data, Timestamp) is split into an unmodified cache (Tag, State, Data) and a separate timestamp memory (Tag, Timestamp: A, 1 and B, 2)]
From 192 KB to 24 KB: 8x reduction
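A decoupled, set-associative timestamp memory could be modeled roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, and the miss policy (return the newest timestamp evicted from the set as an upper bound) is one conservative reading of "LRU guarantees B's TS > A's TS", not necessarily the thesis design.

```python
from collections import OrderedDict

class TimestampMemory:
    """Set-associative timestamp store with LRU replacement (illustrative)."""
    def __init__(self, num_sets=32, ways=64, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.evicted_bound = [0] * num_sets  # newest timestamp evicted per set

    def _locate(self, addr):
        line = addr // self.line_bytes
        return line, line % self.num_sets

    def touch(self, addr, ic):
        """Record an access to addr's line at instruction count ic."""
        line, idx = self._locate(addr)
        entries = self.sets[idx]
        entries.pop(line, None)
        entries[line] = ic  # most recently used entry goes last
        if len(entries) > self.ways:
            _, old_ts = entries.popitem(last=False)  # evict the LRU entry
            self.evicted_bound[idx] = max(self.evicted_bound[idx], old_ts)

    def lookup(self, addr):
        """Exact timestamp on a hit; an upper bound for evicted lines on a miss."""
        line, idx = self._locate(addr)
        return self.sets[idx].get(line, self.evicted_bound[idx])

tsm = TimestampMemory(num_sets=1, ways=2)
tsm.touch(0, 1)     # line A
tsm.touch(64, 2)    # line B
tsm.touch(128, 3)   # line C evicts A, the LRU entry
print(tsm.lookup(0), tsm.lookup(64), tsm.lookup(128))
```

Because the evicted entry is always the oldest in its set, the bound stays below every timestamp still stored there, which is usually tighter than falling back to the thread's current IC.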
Order-Value Hybrid: SC & TSO Applicability
Recording with Total Store Order (TSO)
Majority of existing MP are non-SC
TSO is well defined, x86-like
Example (initially A=B=0)
• Thread I: 1: st A,1; 2: ld B
• Thread J: 1: st B,1; 2: ld A
• SC: the loads can return A=1 B=0, A=0 B=1, or A=1 B=1
• TSO: the loads can also return A=0 B=0
TSO Execution
[Figure: Thread I (1: st A,1; 2: ld B) and Thread J (1: st B,1; 2: ld A), initially A=B=0. Each store waits in its core's write buffer (WrBuf) while both loads read A=0 B=0 from the memory system; the buffers then drain and memory becomes A=1 B=1.]
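The write-buffer behavior on this slide can be simulated in a few lines. The `Core` class is a hypothetical model of a FIFO store buffer with forwarding, written only to show how both loads can observe 0 under TSO.

```python
class Core:
    """A core with a FIFO write buffer in front of shared memory."""
    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = []  # FIFO of (addr, value)

    def store(self, addr, value):
        self.write_buffer.append((addr, value))  # not yet globally visible

    def load(self, addr):
        for buf_addr, value in reversed(self.write_buffer):
            if buf_addr == addr:
                return value  # forward from the core's own buffer
        return self.memory[addr]

    def drain(self):
        for addr, value in self.write_buffer:
            self.memory[addr] = value
        self.write_buffer.clear()

memory = {"A": 0, "B": 0}
core_i, core_j = Core(memory), Core(memory)
core_i.store("A", 1)             # I1: st A,1 (buffered)
core_j.store("B", 1)             # J1: st B,1 (buffered)
b_seen_by_i = core_i.load("B")   # I2: ld B bypasses I's buffered store
a_seen_by_j = core_j.load("A")   # J2: ld A bypasses J's buffered store
core_i.drain()
core_j.drain()
print(b_seen_by_i, a_seen_by_j, memory)
```

No sequentially consistent interleaving yields this outcome, so a recorder that logs only an SC-style order cannot reproduce it.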
Order-Value-Hybrid Recording
[Figure: recording of Thread I (1: st A,1; 2: ld B) and Thread J (1: st B,1; 2: ld A), initially A=B=0, with write buffers in front of the memory system. Recording events: StartMonitor A, StartMonitor B, A Changed!, StopMonitor B. One WAR is omitted from the order log and the loaded value is logged instead. During replay the logged value A=0 is used.]
Hybrid Recording with TR and RTR
Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]
Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively
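On the replay side, the hybrid scheme amounts to consulting a value log before reading memory. The sketch below is an assumption about how such a lookup could be organized; `value_log`, `replay_load`, and the `(thread, ic)` key are hypothetical names, not the thesis interface.

```python
# Hypothetical value log: (thread, ic) -> value the load saw while recording.
# J's "ld A" saw A=0, and its WAR to I's "st A" was omitted from the order log.
value_log = {("J", 2): 0}

def replay_load(thread, ic, addr, memory):
    """Return the value a replayed load must observe."""
    logged = value_log.get((thread, ic))
    if logged is not None:
        return logged        # ordering was dropped, so trust the value log
    return memory[addr]      # ordering was enforced, so memory is already right

# I's store to A may already be visible when J replays "ld A".
from_value_log = replay_load("J", 2, "A", {"A": 1, "B": 1})
# I's "ld B" has no value-log entry; the order log made J's "st B" wait.
from_memory = replay_load("I", 2, "B", {"A": 1, "B": 0})
print(from_value_log, from_memory)
```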
Evaluation Method & Results
Put-it-together: Determinizer/CMP
[Figure: a 4-core CMP (Core1 to Core4) with a shared L2 cache (L1 Dir). Each core has L1_I$, L1_D$, a timestamp memory (TSM), an IC counter, an L1 coherence controller, a Log, a TR Reg, and an RTR Reg.]
Simulation Method
Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
• 1-way in-order issue, 2 GHz
• 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving
Log Size: 1 byte/kilo-instr
Well within the capability of current machines
• Long recording (days to months) needs improvement
[Figure: two bar charts for Apache, JBB, OLTP, Zeus, and AVG: log size in byte/core/kilo-instr (axis 0.0 to 2.0) and in KB/core/s (axis 0 to 200)]
Runtime Overhead
[Figure: two bar charts comparing Baseline and With race recorder for Apache, JBB, OLTP, and Zeus: Execution Time and Interconnection Msg. B/W (axes 0 to 100)]
Our recorder can be “always-on”
Benefits of RTR and Set/LRU (Log Size)
[Figure: two bar charts of log size (axes 0 to 100) for Apache, JBB, OLTP, Zeus, and AVG. Improvement by RTR: Pairwise-TR vs. Our RTR. Effectiveness of Set/LRU: Perfect TSM vs. 24KB Set/LRU TSM.]
Why Do RTR and Set/LRU Work Well?
RTR
• Processors execute instructions at similar speeds
• Therefore, we can find “vectorizable” dependencies
Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”
Sensitivity and Scalability
A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when the TSM is large
• Partial timestamp: 24 bits are enough
• Associativity: higher is better for RTR
Scalability of the recorder
• Studied with a modest number of processors (2p to 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores
Conclusion & My Other Research
Race Recording
Race recording: key to combat nondeterminism
My thesis: an effective & inexpensive recorder
• RTR algorithm: small log size
• Coherence piggyback: negligible slowdown
• Timestamp approximation: low hardware cost
• Order-value hybrid: support SC & TSO
Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay
Serializability Violation Detector [PLDI’05]
Like a race detector
No a priori annotation requirement
• “critical sections” are inferred
Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition
[Figure: an inferred “critical section” over shared variables: Read in1, Read in2, Write out1, Write out2, with Write local and Read local inside]
Publications
FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)
SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition
RTR, Set/LRU & Hybrid
• Submitted for publication
Thank you!
Acknowledgements
Joint work with my advisors
• Mark Hill, Ras Bodik
Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller
Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen
Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun
Deterministic Replay is Useful
Deterministic replay logically recreates a program execution
Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])
Future applications
• Data Recovery
• Replay-based Synchronization
Multicore and Multithreading
Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1
Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time
Race Recording: Key to Determinism
Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but
Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Most support only sequential consistency
Need a better race recorder
Recording/Replay & Debugging
[Figure: an online recorder follows processors P1 to P4, taking Checkpoints A, B, and C and storing logs A, B, and C until a crash dumps “core”. The deterministic replayer then reads Checkpoint B and replays from logs B and C up to the crash.]
Deterministic Replay & Fault Tolerance
Fault Recovery
• Replay after a failure
Fault Detection
• Replay then compare
(Courtesy of VMware)
Future: Record/Replay & Undo/Redo
VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo
[Figure: a Windows XP virtual machine]
Future: Replay-based Synchronization
Three steps
• Coarse-grain sync. to fine-grain sync. to hardware sync.
Results: higher performance
Works only if static control flow & fixed data addr
• DSP kernels
[Figure: during recording one thread runs ld A, st B, Unlock() and the other runs lock(), st A, ld B; during replay the lock operations are dropped and the log orders ld A, st B against st A, ld B]
Race Recording Related Work
Total-order recorders (low replay parallelism)
• Bacon ’91 (Hardware): bus transactions, large log, low overhead
• RecPlay ’00, JaRec ’04: Lamport clocks, small log, low overhead (sync only)
• R&C ’90, Déjà Vu ’98: scheduling, small log, low overhead (non-MP)
Partial-order recorders (high replay parallelism)
• Bacon ’91 (Hardware): bus transaction groups, large log, low overhead
• Instant Replay ’87: variable version, large log, high overhead
• Netzer ’93: vector clocks, small log, high overhead
Correctness of Order-Value-Hybrid
Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value
TR and TSO
TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen
[Figure: recording of Thread I and Thread J at instruction counts 1 to 3 (stores to A, B, and C, then ld B and ld A); one dependence is marked as one that must not be reduced]
RTR and TSO
The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it
[Figure: recording of Thread I and Thread J at instruction counts 1 to 4 (st A, add, add, sub, st B, ld A, ld C, ld B); some accesses are ordered in the write buffer; the old window for j:3 is replaced by a new window for j:3, and the new window no longer allows the problematic dependence]
Deadlock Avoidance of RTR
[Figure: recording of Thread I and Thread J at instruction counts 1 to 6 (same example as before), showing a replay cycle through i:4, j:1, j:2, i:3, i:4]
Avoid deadlock by adhering to an SC total order
Recording Race-free Executions
No data races
Only need to record synchronization race
Deterministic replay up until the first data race
Replay Parallelism
Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations
Directory Protocols
Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements
Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements
A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster
Snooping Protocols
Key problem is combined/implicit response
• Not a problem for AMD Hammer
[Figure: Proc I cache (A, S, …, 1; B, M, …, 4) and Proc J cache (A, S, …, 3; B, I, …, 2), columns Tag, State, Data, Timestamp. A st A issues a Get/X and pulls the Shared signal; the WAR is detected and logged with the current IC.]
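Nonsilent Evictions
[Figure: Proc I cache (A, S, …, 1; B, M, …, 4) and Proc J cache (A, S, …, 3; B, I, …, 2), columns Tag, State, Data, Timestamp. Line C (M, timestamp 3) causes an eviction; the directory entry of A keeps Shared(J), Owner(), StickyS(I,J). On a st A, the request (Get/S on the slide) is answered with an Ack carrying a timestamp from the timestamp memory.]
Directory eviction: more false conflicts, like snooping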
Out-of-Order & Hardware Prefetching
Speculative execution
• No IC assigned yet
Hardware prefetching
• No IC assigned
Key idea: receive observation
• Can associate a ld/st with the currently committing instruction
Unordered Messages in Interconnect
Messages may arrive out of order
Can affect reduction
But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas
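A sequence number makes both bullets straightforward, as the sketch below suggests. The message layout `(seq, ic_delta)` and the function name are assumptions for illustration; the sketch also assumes no message is missing and that each delta is relative to the previous message from the same sender.

```python
def reconstruct_ics(messages):
    """messages: iterable of (seq, ic_delta) in arrival order.
    Returns (seq, absolute_ic) pairs in the sender's original order."""
    ic = 0
    ordered = []
    for seq, delta in sorted(messages):  # restore the sender's order
        ic += delta                      # undo the delta compression
        ordered.append((seq, ic))
    return ordered

arrived = [(2, 5), (0, 100), (1, 7)]  # delivered out of order
restored = reconstruct_ics(arrived)
print(restored)
```

Sending deltas keeps each message small because consecutive instruction counts from one sender tend to be close together.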
Integer Overflow
IC and timestamps may overflow
IC: make it 64-bit, will not overflow for a long time
Timestamps: use approximation techniques
• MSB of IC + LSB of Timestamps
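One way to read "MSB of IC + LSB of Timestamps" is sketched below: only the low bits of a timestamp are stored, and the high bits are borrowed from the current IC. The function name, the default 24-bit width, and the assumption that a timestamp is never newer than the current IC nor more than 2**width behind it are all illustrative.

```python
def expand_timestamp(current_ic, low_bits, width=24):
    """Rebuild a full timestamp from its low `width` bits, assuming the
    timestamp is <= current_ic and less than 2**width behind it."""
    mask = (1 << width) - 1
    candidate = (current_ic & ~mask) | (low_bits & mask)
    if candidate > current_ic:
        candidate -= 1 << width  # the low bits wrapped; borrow from the MSBs
    return candidate

same_epoch = expand_timestamp(0x1000005, 0x000003)
previous_epoch = expand_timestamp(0x1000005, 0xFFFFFE)
print(hex(same_epoch), hex(previous_epoch))
```

A timestamp older than the window cannot be told apart from a newer one, so a real design would need to treat such entries conservatively.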
Varying TSM Size
[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, axis 0 to 3) against the size of the timestamp memory (2 KB to 2048 KB) for 1TS-RTR, 1TS-TR, 2TS-RTR, and 2TS-TR; 64 ways, full timestamps, Set/LRU]
Varying Associativity
[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log bandwidth (MB/core/second, log axis 0.01 to 10) against the associativity of the timestamp memory (2 to 1024) for CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, and SetLRU-RTR; 64KB, full R/W timestamps]
Varying Partial Timestamp Width
[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log bandwidth (MB/core/second, log axis 0.01 to 10) against the partial timestamp width (10 to 30 bits) for TR and RTR; 64 sets, 64 ways, Set/LRU]
Log Size Scaling
[Figure: log size (MB/core/s, axis 0.0 to 1.0) against the number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus]
In Retrospect …
What are you most proud of?
• RTR improves TR after 13 years
What would you do differently if doing it again?
• “replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier
What should the industry do?
• Implement the recorder as a VMM extension