Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism
DESCRIPTION
Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism. Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC); Gilles Pokam and Cristiano Pereira (Intel). iacoma.cs.uiuc.edu. Record-and-Replay (RnR).
TRANSCRIPT
Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism
Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC)
Gilles Pokam and Cristiano Pereira (Intel)
iacoma.cs.uiuc.edu
Record-and-Replay (RnR)
• Record execution of a parallel program or a whole machine
– Save non-deterministic events in a log
• During replay, use the recorded log to enforce the same execution
– Each thread follows the same sequence of instructions
• Use cases
– Debugging
– Security
– High availability
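The record/replay idea above can be illustrated with a minimal software sketch. It is not how Cyrus works internally (Cyrus records inputs in the kernel and memory interleavings in hardware); the `Recorder`, `Replayer` and `program` names are hypothetical and only show the principle of logging non-deterministic events and feeding them back during replay.

```python
import random


class Recorder:
    """Runs live and saves every non-deterministic value in a log."""

    def __init__(self):
        self.log = []

    def nondet(self, produce):
        value = produce()       # ask the live, non-deterministic source
        self.log.append(value)  # remember what it returned
        return value


class Replayer:
    """Ignores the live source and returns the logged values in order."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, produce):
        return next(self._values)


def program(env):
    # Hypothetical program whose result depends on non-deterministic input.
    total = 0
    for _ in range(5):
        total += env.nondet(lambda: random.randint(0, 9))
    return total


recorder = Recorder()
recorded_result = program(recorder)
replayed_result = program(Replayer(recorder.log))
```

Because the replayed run consumes the log instead of calling `random`, it follows the same path as the recorded run and produces the same result.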
Contribution: Cyrus RnR System
• Application-level RnR
– RnR one or more programs in isolation
– What users typically need
• Fast replay
– Replay-time parallelism
– Flexibly trade off parallelism for log size
• Unintrusive HW
– No changes to snoopy cache coherence protocol
Capturing Non-determinism
• Sources of non-determinism
– Program inputs
– Memory access interleavings
• How to capture?
– OS kernel extension to capture program inputs
– HW support to capture memory interleavings (HW-assisted RnR)
• This talk: recording memory interleavings
Recording Interleaving as Chunks
• Inter-processor data dependences manifest as coherence messages
• Capture interleavings as ordered chunks of instructions
[Figure: instruction streams of P0 and P1 over time; a store to A on one processor and a load of A on the other are linked by a coherence request (Req) and response (Resp), and the streams are then cut into ordered chunks]
Restriction: Unintrusive HW
• Unmodified snoopy protocols
– In some coherence transactions, there is no reply
[Figure: RAW, WAW and WAR dependences between P0 and P1, showing P1's read (rd) or write (wr) request and the resulting data or invalidation (invl) traffic]
Only source is always aware → Use source-only recording
• Requirements for HW-assisted RnR:
– Do not augment or add coherence messages
– Do not rely on explicit replies
Challenge 1: Enable Replay Parallelism
• Key to fast replay
– Overlapped replay of chunks from different threads
• Previous work:
– DAG-based ordering (Karma [ICS’2011]): requires explicit replies and augments coherence messages
[Figure: a P0→P1 dependence with its predecessor and successor chunks; in the source-only case the successor chunk is marked "?"]
Challenge 2: Application-Level RnR
Turn hardware on only when a recorded application runs.
Four cases:
(1) src=monitoring, dst=monitoring
(2) src=monitoring, dst=not monitoring
(3) src=not monitoring, dst=monitoring
(4) src=not monitoring, dst=not monitoring
Issues of source-only recording:
• Cannot distinguish between (1) and (2)
• (2) may result in a dependence later
• Not recording in (3) and (4)
[Figure: processors P0, P1 and P2 running the monitored application and non-monitored code, with communications (1) to (4) between them]
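Challenge 2: Application-Level RnR
• Treat (2) as an Early Dependence
– Defer and assign it to the next chunk of the target processor
• (3) and (4) superseded by context switches
– At context switch, record a Serialization Dependence to all other processors
[Figure: processors P0, P1 and P2 with dependences (1) to (4); dependence (2) is deferred to a later chunk and serialization (ser) dependences are added at context switches]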
Key: On-the-Fly Backend Software Pass
• Transforms source-only log to DAG (for parallelism)
• Fixes the Early and Serialization dependences
– To support app-level RnR
• Can trade replay parallelism for log size
[Figure: Recording Processors → Source-only Log → On-the-fly Backend → DAG of Chunks → Replaying Processors]
Memory Race Recording Unit (RRU)
• HW module that observes coherence transactions and cache evictions
• Tracks loads/stores of the chunk in a signature
• Keeps signatures for multiple recent chunks
• Records for each chunk
– # of instructions
– Timestamp (# of coh. transactions)
– Dependences for which the chunk is source
• Dumps recorded chunks into a log in memory
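The RRU's bookkeeping can be sketched in software to make the list above concrete. This is a rough model under several assumptions that the slides do not state: exact address sets stand in for hardware signatures, a chunk ends when a snooped request conflicts with it, the chunk timestamp is taken at that moment, and only memory instructions are counted. Cache evictions are not modelled. The class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    reads: set = field(default_factory=set)    # stand-in for the read signature
    writes: set = field(default_factory=set)   # stand-in for the write signature
    n_instr: int = 0                           # memory instructions only (assumption)
    timestamp: Optional[int] = None            # set when the chunk ends (assumption)
    successors: dict = field(default_factory=dict)  # dst processor -> request timestamp


class RRU:
    def __init__(self, proc_id, history=4):
        self.proc_id = proc_id
        self.current = Chunk()
        self.recent = deque(maxlen=history)    # recently ended chunks still tracked
        self.log = []                          # source-only log of ended chunks

    def local_access(self, addr, is_write):
        self.current.n_instr += 1
        (self.current.writes if is_write else self.current.reads).add(addr)

    def end_chunk(self, now):
        self.current.timestamp = now
        self.recent.append(self.current)
        self.log.append(self.current)
        self.current = Chunk()

    def snoop(self, requester, addr, is_write, now):
        """Observe another processor's request; record a dependence if we are its source."""
        def conflicts(chunk):
            return addr in chunk.writes or (is_write and addr in chunk.reads)

        if conflicts(self.current):
            self.current.successors.setdefault(requester, now)
            self.end_chunk(now)
            return
        for chunk in reversed(self.recent):    # newest ended chunk first
            if conflicts(chunk):
                chunk.successors.setdefault(requester, now)
                return


rru = RRU(proc_id=0)
rru.local_access("A", is_write=True)
rru.snoop(requester=2, addr="A", is_write=False, now=100)  # ends the chunk at 100
rru.snoop(requester=1, addr="A", is_write=False, now=150)  # hits the recent chunk
```

Only the source side records anything here: the requesting processor never sends an acknowledgement and no coherence message is changed, which is the property the unintrusive design asks for.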
RRUs Record Source-Only Log
[Figure: timeline of P0, P1 and P2 with timestamps 100, 150, 200, 250 and 300; accesses Rd B, Wr D, Wr A, Rd D, Rd A and Wr B are grouped into chunks C00, C01, C10 and C20]
Chunk | TS  | Successor Vector (P0, P1, P2)
C00   | 100 | -, 150, 100
C01   | 200 | -, 200, 200
C10   | 250 | -, -, 250
C20   | 300 | -, -, -
Backend Pass Creates DAG
[Figure: the source-only log entries for C00, C01, C10 and C20 (timestamps and successor vectors) converted into a DAG, with chunks grouped as Chunks of P0, Chunks of P1 and Chunks of P2]
• Finds the target chunk for each recorded dependency
– Creates bidirectional links between src and dst chunks
• This algorithm is called MaxPar
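A minimal sketch of such a backend pass follows, using the four-chunk example log. The matching rule is an assumption, not something the slides spell out: a chunk's timestamp is taken to mark when it ended, a successor-vector entry is taken to be the timestamp of the request that created the dependence, and the target is the first chunk on the destination processor whose timestamp is not earlier than that entry. The chunk-to-processor mapping (C00 and C01 on P0, C10 on P1, C20 on P2) is inferred from the chunk names. `LOG` and `build_dag` are hypothetical names.

```python
from bisect import bisect_left
from collections import defaultdict

# (chunk id, processor, timestamp, successor vector {dst processor: timestamp})
LOG = [
    ("C00", 0, 100, {1: 150, 2: 100}),
    ("C01", 0, 200, {1: 200, 2: 200}),
    ("C10", 1, 250, {2: 250}),
    ("C20", 2, 300, {}),
]


def build_dag(log):
    # Per-processor chunk lists, sorted by timestamp.
    per_proc = defaultdict(list)
    for chunk_id, proc, ts, _ in log:
        per_proc[proc].append((ts, chunk_id))
    for chunks in per_proc.values():
        chunks.sort()

    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for chunk_id, _, _, succ_vector in log:
        for dst, dep_ts in succ_vector.items():
            chunks = per_proc[dst]
            timestamps = [ts for ts, _ in chunks]
            i = bisect_left(timestamps, dep_ts)  # first chunk with ts >= dep_ts
            if i == len(chunks):
                continue                         # target chunk not in the log yet
            target = chunks[i][1]
            successors[chunk_id].add(target)     # src -> dst link
            predecessors[target].add(chunk_id)   # dst -> src link
    return successors, predecessors


successors, predecessors = build_dag(LOG)
```

Storing both `successors` and `predecessors` mirrors the bidirectional links: a replaying processor can check that all predecessors of a chunk have finished before starting it, and can notify successors when it is done.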
Trading Replay Parallelism for Log Size
[Figure: the DAG of C00, C01, C10 and C20 next to a stitched version in which C00 and C01 are merged (C00+C01); example log entries use the fields CPU, TID, SIZE, PTV and STV, and a reduced format uses only TID and SIZE]
• Stitched: less parallelism, smaller log
• Serial: no parallelism, even smaller log
• StSerial: no parallelism, smallest log
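To illustrate why a serial log needs no dependence fields, the sketch below flattens a chunk DAG into one total order with a topological sort. This is one plausible way to serialize a DAG, not necessarily the procedure Cyrus uses. The `PREDS` graph is a hypothetical DAG for the four example chunks (cross-processor dependences plus program order between C00 and C01), and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: chunk -> set of chunks that must be replayed before it.
PREDS = {
    "C00": set(),
    "C01": {"C00"},                  # program order on the same processor
    "C10": {"C00", "C01"},
    "C20": {"C00", "C01", "C10"},
}


def serialize(preds):
    """Return a single replay order that respects every dependence."""
    return list(TopologicalSorter(preds).static_order())


order = serialize(PREDS)
position = {chunk: i for i, chunk in enumerate(order)}
```

Once chunks are replayed strictly in this order, each log entry only has to say which thread runs next and for how long, which matches the reduced TID and SIZE fields shown for the serial formats.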
Evaluation
• Using Simics
– Full-system simulation with OS
– Wrote a Linux kernel module that records application inputs and controls RRUs
• Model 8 + 1 processors
– 8 processors for the app
– 1 processor for the backend
• 10 SPLASH-2 benchmarks
Replay Time Normalized to Recording
• Large difference between MaxPar and Serial replay
• On 8 processors, unoptimized MaxPar replay is only 50% slower than recording
Conclusions
• Cyrus: RnR system that supports
– Application-level RnR
– Unintrusive hardware
– Flexible replay parallelism
• Key idea: On-the-fly software backend pass
• On 8 processors:
– Large difference between MaxPar and Serial replay
– Unoptimized replay of MaxPar is only 50% slower than recording
– Negligible recording overhead
• Upcoming ISCA’13 paper describes our FPGA RnR prototype