Execution Replay for
Multiprocessor Virtual Machines
George W. Dunlap, Dominic Lucchetti,
Michael A. Fetterman, Peter M. Chen
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads
Execution Replay
CPU
Memory
Disk
Network
Keyboard, mouse
Interrupts
Uses of Execution Replay
• Reconstructing state
– Fault tolerance
• Reconstructing execution
– Debugging
– Realistic trace generation
• Both
– Intrusion analysis
Single-processor Replay
• Basic principles well understood
– Log all non-deterministic inputs
– Timing of asynchronous events
• Minimal overhead (Dunlap02)
– 13% worst case
– Log for months or years
• Available commercially
– VMware: Record/Replay
Replay for Multiprocessors
• Memory races in multiprocessor VMs
• The Ordering Requirement
• The CREW Protocol
– Implementing with page protections
– Relation to the Ordering Requirement
– Generating constraints from CREW events
• DMA-capable devices and CREW
• Performance
The Multiprocessor Challenge
• Interleaved reads and writes
– Fine-grained non-determinism
– Much more difficult
• Existing solutions
– Hardware modification
– Software instrumentation
• SMP-ReVirt
– Hardware MMU to detect sharing
Multiprocessor Replay
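[Figure: processors P1 and P2 access a shared variable n in memory; the writes n=3 and n=5 race with the test if (n<4), so the outcome depends on how the accesses interleave.]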
Ordering Memory Accesses
• Preserving order will reproduce execution
– a→b: “a happens-before b”
– Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
– they both access the same memory, and
– one of them is a write
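The ordering rule above can be written as a small predicate. This is a minimal sketch, not the SMP-ReVirt implementation: the `Access` record and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One memory access (hypothetical record, for illustration only)."""
    cpu: int        # processor that issued the access
    addr: int       # memory location touched
    is_write: bool  # True for a write, False for a read


def must_be_ordered(x: Access, y: Access) -> bool:
    """Apply the slide's rule: same memory, and at least one access is a write."""
    return x.addr == y.addr and (x.is_write or y.is_write)


# Usage: a write and a read of the same location conflict; two reads do not.
write_p1 = Access(cpu=1, addr=0x1000, is_write=True)
read_p2 = Access(cpu=2, addr=0x1000, is_write=False)
read_p1 = Access(cpu=1, addr=0x1000, is_write=False)
conflict = must_be_ordered(write_p1, read_p2)
no_conflict = must_be_ordered(read_p1, read_p2)
```

Accesses on the same processor are already ordered by program order, so in practice only pairs from different processors need a logged constraint.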
Constraints: Enforcing order
• To guarantee a→d, any one of these constraints suffices (given program order a→b and c→d):
– a→d
– b→d
– a→c
– b→c
• Suppose we need b→c
– b→c is necessary
– a→d is redundant
[Figure: P1 executes a then b; P2 executes c then d; the diagram marks an overconstrained ordering.]
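One way to see why a→d becomes redundant once b→c is enforced is to test reachability over the combined ordering graph. The sketch below assumes that P1 runs a then b and P2 runs c then d; the function names are hypothetical and chosen only for this example.

```python
def reachable(edges, src, dst):
    """Return True if dst can be reached from src by following ordering edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for before, after in edges:
            if before == node and after not in seen:
                if after == dst:
                    return True
                seen.add(after)
                stack.append(after)
    return False


# Assumed program order: P1 runs a then b, P2 runs c then d.
PROGRAM_ORDER = {("a", "b"), ("c", "d")}


def is_redundant(constraint, kept):
    """A constraint is redundant if program order plus the kept constraints imply it."""
    src, dst = constraint
    return reachable(PROGRAM_ORDER | set(kept), src, dst)


# With b->c logged, a->d follows by transitivity: a->b, b->c, c->d.
a_to_d_redundant = is_redundant(("a", "d"), kept=[("b", "c")])
# The reverse does not hold: logging a->d says nothing about b->c.
b_to_c_redundant = is_redundant(("b", "c"), kept=[("a", "d")])
```

Logging only the tightest constraint (here b→c) keeps the log small while still implying the weaker ones.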
CREW Protocol
• Each shared object in one of two states:
– Concurrent-Read: all processors can read, none can write
– Exclusive-Write: one processor (the owner) can read and write; others have no access
CREW protocol, cont.
• Enforced with hardware MMU
– Read/write
– Read-only
– None
• Change CREW states on demand
– Fault, fixup, re-execute
• CREW event
– Increasing or reducing permission due to CREW state changes
CREW Property
• If two instructions on different processors:
– access the same page, and
– one of them is a write,
• then there will be a CREW event on each processor between them.
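The protocol and its property can be illustrated with a toy model of a single page. This is a simplified sketch under stated assumptions, not the hypervisor code: the `CrewPage` class, the permission constants, and the event tuple layout are all hypothetical, and real page-table and fault handling are left out.

```python
NONE, READ, READ_WRITE = 0, 1, 2  # per-processor permissions, weakest to strongest


class CrewPage:
    """One shared page under a simplified CREW model (illustrative only)."""

    def __init__(self, n_cpus):
        # Start in Concurrent-Read: every processor may read, none may write.
        self.perm = [READ] * n_cpus
        self.events = []  # CREW events as (cpu, old_perm, new_perm), in order

    def _set(self, cpu, new):
        old = self.perm[cpu]
        if old != new:
            self.perm[cpu] = new
            self.events.append((cpu, old, new))

    def access(self, cpu, is_write):
        """Return True if the access faulted and the CREW state was fixed up."""
        needed = READ_WRITE if is_write else READ
        if self.perm[cpu] >= needed:
            return False  # permission already sufficient, no fault
        n = len(self.perm)
        if is_write:
            # Exclusive-Write: the faulting cpu becomes the owner.
            target = [READ_WRITE if c == cpu else NONE for c in range(n)]
        else:
            # Concurrent-Read: everyone read-only.
            target = [READ] * n
        for c in range(n):  # apply privilege reductions first...
            if target[c] < self.perm[c]:
                self._set(c, target[c])
        for c in range(n):  # ...then privilege increases
            if target[c] > self.perm[c]:
                self._set(c, target[c])
        return True  # the caller would now re-execute the faulting instruction


# Usage: cpu 0 reads, then cpu 1 writes the same page.
page = CrewPage(n_cpus=2)
first_read_faulted = page.access(cpu=0, is_write=False)
write_faulted = page.access(cpu=1, is_write=True)
```

In this model the read by cpu 0 and the later write by cpu 1 are separated by one CREW event on each processor (a reduction on cpu 0 and an increase on cpu 1), which is the property stated above.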
Generating Constraints
• State: Concurrent Read
– All processors read-only
• d*: CREW fault
• New state: P2 Exclusive
• r: privilege reduction
– Read to None
• i: privilege increase
– Read to Read/write
• Log timing of r and i
• Constraint:
– r → i
[Figure: timelines for P1 and P2 showing accesses a and d, the faulting access d*, and the CREW events r and i.]
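The step from logged CREW events to replay constraints might look like the following sketch. The `Event` record, its per-processor logical `time` field, and both function names are assumptions for illustration; the talk does not say how event timing is represented.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """A logged CREW event (hypothetical layout)."""
    cpu: int   # processor on which the event occurred
    time: int  # assumed per-processor logical time of the event
    kind: str  # "r" for privilege reduction, "i" for privilege increase


def constraints_for_fault(reductions, increase):
    """Every reduction must happen before the increase it enables: r -> i."""
    return [(r, increase) for r in reductions]


def can_replay(increase, constraints, progress):
    """During replay, allow an increase only once each required reduction has run.

    progress maps cpu -> logical time already replayed on that cpu.
    """
    return all(progress[r.cpu] >= r.time
               for r, i in constraints if i == increase)


# Usage: P2's write faults; P1 drops Read to None (r), P2 gains Read/write (i).
r = Event(cpu=1, time=120, kind="r")
i = Event(cpu=2, time=95, kind="i")
log = constraints_for_fault([r], i)
blocked = not can_replay(i, log, progress={1: 100, 2: 94})
allowed = can_replay(i, log, progress={1: 120, 2: 94})
```

During replay, P2 is held at i until P1 has reached r, which reproduces the recorded order of the conflicting accesses.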
Direct Memory Access
• Device accesses memory directly
• Logically another processor
– Reads and writes need to be ordered
– IOMMU: can’t fault/fixup/re-execute
• Observation: Transaction model
• Device: non-preemptible actor
Prototype: SMP-ReVirt
• Modified Xen hypervisor
• Implement logging, CREW protocol
• Details in paper
Evaluation questions
• What is the overhead?
• What affects performance?
– In paper
• When might I want to use MP?
– Log with 1, 2, or N cpus?
Evaluation Workloads
• SPLASH2 parallel application suite
– FMM, LU, ocean, radix, water-spatial, radiosity
• Kernel-build
• Dbench
Predicting results
• Key changes in sharing attributes
– 4096-byte sharing granularity
– “Miss” is very expensive
• SPLASH2
– Good: high spatial locality / low false sharing
– Bad: random access patterns / high false sharing
• The Linux kernel
– Tuned to 16-byte cacheline
– Involving the kernel may be expensive
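The effect of the coarser sharing granularity can be shown with simple address arithmetic. The sketch below uses the 4096-byte and 16-byte figures from the slide; the function names and the example addresses are made up for illustration.

```python
PAGE_SIZE = 4096      # sharing granularity under page-protection CREW
CACHELINE_SIZE = 16   # granularity the slide says the kernel is tuned to


def sharing_unit(addr, granularity):
    """Index of the sharing unit (page or cache line) that contains addr."""
    return addr // granularity


def falsely_shared(addr_a, addr_b, granularity):
    """Different data that lands in the same sharing unit is falsely shared."""
    return (addr_a != addr_b
            and sharing_unit(addr_a, granularity) == sharing_unit(addr_b, granularity))


# Usage: two variables 64 bytes apart, each written by a different processor.
var_a, var_b = 0x1000, 0x1040
at_cacheline = falsely_shared(var_a, var_b, CACHELINE_SIZE)  # separate lines
at_page = falsely_shared(var_a, var_b, PAGE_SIZE)            # same page
```

Data laid out to avoid cache-line conflicts can still collide at page granularity, and each such collision costs a fault and a CREW state change rather than a cache miss.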
Single-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial, kernel-build, radiosity, and dbench. Series: unmodified 1-cpu guest; logging 1-cpu guest. Bar labels as extracted: 1.00, 1.04, 1.01, 1.00, 1.03, 1.13, 1.00, 1.05.]
Log Growth Rate
Workload        Log growth (GB/day)   Days to fill 300GB
FMM             0.234                 1280
LU              0.237                 1261
Ocean           0.232                 1295
Radix           0.292                 1025
Water-spatial   0.232                 1296
Kernel-build    0.564                 531
Radiosity       0.231                 1295
Dbench          0.557                 538
2-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: unmodified 2-cpu guest; logging 2-cpu guest; logging 1-cpu guest. Bar labels as extracted: 1.51, 1.00, 1.08, 1.60, 1.48, 2.10, 1.90, 1.76, 1.96, 1.74, 1.83, 1.99.]
2-processor, cont.
[Figure: bar chart of normalized runtime for radiosity and dbench. Series: unmodified 2-cpu guest; logging 2-cpu guest; logging 1-cpu guest. Bar labels as extracted: 8.70, 7.21, 1.85, 1.88.]
Log Growth Rate
Workload        Log growth (GB/day)   Days to fill 300GB
FMM             34.5                  8.7
LU              3.2                   92.7
Ocean           4.3                   69.1
Radix           39.8                  7.5
Water-spatial   36.3                  8.25
Kernel-build    43.3                  6.9
Radiosity       88.4                  3.4
Dbench          77.0                  3.9
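The last column appears to be the 300GB capacity divided by the growth rate. A quick check of that arithmetic, with a hypothetical helper name, is sketched below.

```python
def days_to_fill(rate_gb_per_day, capacity_gb=300.0):
    """Days until a log growing at a constant rate fills the given capacity."""
    return capacity_gb / rate_gb_per_day


# Usage: growth rates taken from the table above.
fmm_days = round(days_to_fill(34.5), 1)        # table lists 8.7
radiosity_days = round(days_to_fill(88.4), 1)  # table lists 3.4
dbench_days = round(days_to_fill(77.0), 1)     # table lists 3.9
```

A few rows differ slightly from this calculation (for LU, 300 / 3.2 is about 93.8, while the table lists 92.7), presumably because the listed rates are rounded.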
4-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: unmodified domain, 4 cpus; CREW logging, 4 cpus; CREW logging, 2 cpus*; CREW logging, 1 cpu. Bar labels as extracted: 7.36, 1.12, 1.28, 4.20, 1.72, 9.03.]
Recap
• Memory races in multiprocessor VMs
• The Ordering Requirement
• The CREW Protocol
– Implementing with page protections
– Relation to the Ordering Requirement
– Generating constraints from CREW events
• DMA-capable devices and CREW
• Performance
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads
Questions