Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan,

Satish Narayanasamy, Peter M. Chen, and Jason Flinn

University of Michigan, Ann Arbor

Deterministic Replay

• Record and reproduce non-deterministic events

1) Offline uses: replay repeatedly after the original run
• Debugging
• Forensics

2) Online uses: record and replay concurrently
• Fault tolerance
• Decoupled runtime checks

We focus on online replay for multiprocessors.

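Online Deterministic Replay Uses

• Need to record and replay concurrently
• Both recording and replaying should be efficient

[Figure, left panel "Fault Tolerance": a server writes a request log; a replica replays it to keep the same state and takes over when the server faults before responding.]

[Figure, right panel "Decoupled Runtime Checks": the application runs segments A, B, C; replay plus check of each segment (A + Check, B + Check, C + Check) is spread over processors P1 to P4.]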

Past Solutions for Deterministic Replay

Uniprocessor replay
• Program input (e.g. system calls, signals)
• Thread scheduling

Multiprocessor replay: + shared memory dependencies
• Instrument every memory operation → 10-100x
  PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06]
• Page protection → 2-9x
  SMP-ReVirt [Dunlap, VEE’08]
• Offline search → slow replay
  ODR [Altekar, SOSP’09], PRES [Park, SOSP’09], Replay-SAT [Lee, MICRO’09]
• Hardware support → custom HW
  FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08]

Overview of Our Approach

Goal: Efficient online software-only multiprocessor replay

Key Idea: Speculation + Check
1) Speculate that the program is data-race-free
2) Detect mis-speculation using a cheap check
3) Roll back and retry on mis-speculation

[Figure: a multi-threaded fork at Checkpoint A creates the replayed process (threads T1’, T2’, state A’) from the recorded process (threads T1, T2). Both execute Lock(l)/Unlock(l) sequences under the "speculate race free" assumption; at Checkpoint B the check B’ == B? is performed.]

Roadmap

• Motivation/Overview
• Respec Design
  1. Speculate data race free
  2. Detect mis-speculation
  3. Rollback and retry on mis-speculation
• Evaluation
• Conclusion

Deterministic Replay of Data-race-free Programs

Observation
• Reproducing program input and the happens-before order of synchronization operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99]

1) Program input (e.g. system calls, signals)
• Record: log system call effects
• Replay: emulate system calls

2) Synchronization operations
• Record and replay happens-before order
• Instrument common (not all) synchronization primitives in glibc

Slide annotation: + total order
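
The record-then-enforce idea for synchronization operations can be illustrated with a minimal Python sketch. It is not Respec's glibc instrumentation: `OrderedLock`, its `log`, and the `run` helper are hypothetical names, and the sketch records a simple total order of acquisitions on a single lock, then forces a second run to follow that order.

```python
import threading

class OrderedLock:
    """Hypothetical wrapper: records lock acquisition order, or enforces a logged one."""

    def __init__(self, replay_log=None):
        self._lock = threading.Lock()
        self._cond = threading.Condition()
        self.log = []                   # thread ids, in the order they acquired the lock
        self._replay_log = replay_log   # order to enforce in replay mode, else None
        self._next = 0                  # index of the next logged acquisition to allow

    def _my_turn(self, tid):
        return (self._next < len(self._replay_log)
                and self._replay_log[self._next] == tid)

    def acquire(self, tid):
        if self._replay_log is not None:
            # Replay: block until the log says this thread acquires next.
            with self._cond:
                self._cond.wait_for(lambda: self._my_turn(tid))
        self._lock.acquire()
        self.log.append(tid)

    def release(self):
        self._lock.release()
        if self._replay_log is not None:
            with self._cond:
                self._next += 1
                self._cond.notify_all()

def run(lock, n_threads=3, iterations=5):
    """Run workers that append their id to a shared trace inside the critical section."""
    trace = []

    def worker(tid):
        for _ in range(iterations):
            lock.acquire(tid)
            trace.append(tid)
            lock.release()

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return trace

recorded = OrderedLock()                          # record mode
recorded_trace = run(recorded)
replayed = OrderedLock(replay_log=recorded.log)   # replay mode
replayed_trace = run(replayed)
```

Because every access to `trace` is protected by the lock, the toy program is race free, so enforcing the logged order is enough to make `replayed_trace` equal `recorded_trace`.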

What if a program is NOT race free?

Problem
• Need to detect mis-speculation
• A data race detector is too heavyweight

Insight: External Determinism is sufficient
• Not necessary to replay data races
• Ensure that the replayed process produces the same visible effects as the recorded process to an external observer

Visible effects = System output + Final program state

Solution: Divergence checks
• Detect mis-speculation when the replay is not externally deterministic

Divergence Check #1 – System Output

1) System Output Check
• For every system call, compare the system call arguments
• Ensure that the replay produces the same output as the recorded process

[Figure: after a multi-threaded fork at Start A / Start A’, the recorded process (T1, T2) and the replayed process (T1’, T2’) run the same Lock(l)/Unlock(l) sequence. The recorded process issues SysRead X and SysWrite O; the replayed process issues SysRead X’ and SysWrite O’; the check is O’ == O?]
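
As a rough illustration of the system output check, the following Python sketch compares the system calls issued by a replay against the recorded log, entry by entry. The `Syscall` record, the `Divergence` exception, and `check_system_output` are hypothetical names for this sketch; the actual check lives in the kernel.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Syscall:
    name: str
    args: tuple   # includes output buffers, e.g. the bytes passed to write

class Divergence(Exception):
    """Raised when the replay is not externally deterministic."""

def check_system_output(recorded_log, replayed_calls):
    """Compare each replayed system call with the log entry at the same position."""
    for index, (recorded, replayed) in enumerate(zip(recorded_log, replayed_calls)):
        if recorded != replayed:
            raise Divergence(f"system call {index}: {replayed} != {recorded}")
    if len(recorded_log) != len(replayed_calls):
        raise Divergence("different number of system calls")

recorded_log = [Syscall("read", (3, 64)), Syscall("write", (1, b"x=1"))]
check_system_output(recorded_log, list(recorded_log))   # matching replay: no exception
```

A replay whose `write` carries `b"x=2"` instead of `b"x=1"` would raise `Divergence` at index 1.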

Benign Data Races

• Not all races cause divergence checks to fail
• A data race is inconsequential if system output matches

[Figure: after a multi-threaded fork at Start A / Start A’, one thread writes x=1 while the other repeatedly tests x!=0? in both the recorded process (T1, T2) and the replayed process (T1’, T2’). Both processes then issue the same SysWrite(x), so the check reports Success.]

Divergence due to Data Races

[Figure: after a multi-threaded fork at Start A / Start A’, the racing writes x=1 and x=2 occur in different orders in the recorded process (T1, T2) and the replayed process (T1’, T2’). The two SysWrite(x) calls differ, so the check reports Fail.]

1) Need to roll back to the beginning
2) Need to buffer system output until the end

Divergence Check #2 – Program State

2) Program state check
• Compare register and memory state at semi-regular intervals (epochs)
• Construct a safe intermediate point
  – To release buffered output
  – To roll back to in case of mis-speculation

[Figure: the recorded process (T1, T2, Start A) and the replayed process (T1’, T2’, Start A’) are divided into epochs. The first epoch ends with a successful check and releases its output. In the second epoch the racing writes x=1 and x=2 occur in different orders before SysWrite(x), so the check B’ == B? at Checkpoint B fails.]
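
The interaction between the program state check and buffered output can be sketched in a few lines of Python. This is an illustrative model only: states are plain `(registers, memory)` pairs, and `end_epoch`, `buffered_output`, and `external_world` are hypothetical names.

```python
def end_epoch(recorded_state, replayed_state, buffered_output, external_world):
    """Commit the epoch if both states match; otherwise squash its output.

    Each state is a (registers, memory) pair of plain dicts in this sketch.
    Returns True on commit, False when the caller must roll back.
    """
    committed = recorded_state == replayed_state
    if committed:
        external_world.extend(buffered_output)   # B' == B: safe to release output
    buffered_output.clear()                      # released or squashed either way
    return committed

external_world = []

# Epoch 1: states match, so the buffered write becomes externally visible.
state_b = ({"pc": 0x400}, {"x": 1})
ok = end_epoch(state_b, ({"pc": 0x400}, {"x": 1}), [b"x=1"], external_world)

# Epoch 2: a race left x different in the replay, so nothing is released.
failed = end_epoch(({"pc": 0x480}, {"x": 2}), ({"pc": 0x480}, {"x": 1}),
                   [b"x=2"], external_world)
```

After both calls, `external_world` holds only the output of the committed epoch.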

Recovery from Mis-speculation

Rollback
• Roll back both the recorded and replayed processes to the previous checkpoint

Re-execute
• Optimistically re-run the failed epoch
• On repeated failure, switch to a uniprocessor execution model
  – Record and replay only one thread at a time
  – Parallel execution resumes after the failed interval

[Figure, first panel: starting from Checkpoint A (A == A’), the threads T1, T2 and T1’, T2’ execute the racing writes x=1 and x=2 in different orders; the check B’ == B? at Checkpoint B fails.]

[Figure, second panel: both processes restart from Checkpoint A (A == A’) and re-execute x=1 and x=2; the check B’ == B? at Checkpoint B is repeated and execution continues to Checkpoint C with the check C’ == C?]
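
The recovery policy above (roll back, retry optimistically, then fall back to one thread at a time) can be written as a short Python sketch. The function and parameter names are hypothetical, `copy.deepcopy` stands in for restoring a checkpoint, and the retry limit of one is an assumption rather than a value taken from the slides.

```python
import copy

def run_epoch_with_recovery(checkpoint, run_parallel, run_serial, max_retries=1):
    """Run one epoch from `checkpoint`, retrying and then serializing on divergence.

    run_parallel(state) -> (recorded_end_state, replayed_end_state)
    run_serial(state)   -> end_state; assumed deterministic (one thread at a time)
    """
    for _ in range(1 + max_retries):
        # Each attempt starts both processes from a fresh copy of the checkpoint.
        recorded, replayed = run_parallel(copy.deepcopy(checkpoint))
        if recorded == replayed:
            return recorded, "parallel"
    # Repeated failure: execute the epoch with a uniprocessor model instead.
    return run_serial(copy.deepcopy(checkpoint)), "uniprocessor"

# Example: the first parallel attempt diverges (x=2 vs. x=1), the retry matches.
attempts = iter([({"x": 2}, {"x": 1}), ({"x": 2}, {"x": 2})])
state, mode = run_epoch_with_recovery(
    {"x": 0},
    run_parallel=lambda start: next(attempts),
    run_serial=lambda start: {"x": 2},
)
```

Here `mode` is `"parallel"` because the optimistic retry succeeded; an epoch that keeps diverging ends up in `"uniprocessor"` mode.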

Speculative Execution

Speculator [Nightingale et al. SOSP’05]

• Buffer output during speculation

• Block execution if speculative execution is not feasible

• Release buffered output on commit

• Undo speculative changes and squash buffered output on mis-speculation

Roadmap

• Motivation/Overview
• Respec Design
• Evaluation
  1. Performance results
  2. Breakdown of performance overhead
  3. Rollback frequency and overhead
• Conclusion

Evaluation Setup

Test Environment
• 2 GHz 8-core Xeon processor with 3 GB of RAM
• Run 1 to 4 worker threads (excluding control threads)
• Collect the average of 10 trials (except pbzip2 and aget)

Benchmarks
• PARSEC suite
  – blackscholes, bodytrack, fluidanimate, swaptions, streamcluster
• SPLASH-2 suite
  – ocean, raytrace, volrend, water-nsq, fft, and radix
• Real applications
  – pbzip2, pfscan, aget, and Apache

Record and Replay Performance

[Figure: bar chart of relative overhead (y-axis, 0 to 2) per worker-thread count (1 to 4) for blackscholes, bodytrack, fluidanimate, swaptions, streamcluster, ocean, raytrace, volrend, water-nsq, fft, radix, pfscan, pbzip2, aget, and Apache.]

• Overhead: 18% for 2 threads, 55% for 4 threads
• Real applications (including Apache) showed <50% for 4 threads

1) Redundant Execution Overhead (25%)

• Cost of running two executions (lower bound for online replay)
• Mainly due to sharing limited resources, chiefly the memory system
• Contributes 25% of the total overhead for 4 threads

[Figure: relative overhead (y-axis, 0 to 2) per benchmark and thread count. Legend: Redundant execution overhead (25%).]

2) Epoch Overhead (17%)

• Due to checkpoint cost
• Due to artificial epoch barrier cost
• Contributes 17% of the total overhead for 4 threads

[Figure: relative overhead (y-axis, 0 to 2) per benchmark and thread count. Legend: Epoch overhead (17%), Redundant execution overhead (25%).]

3) Memory Comparison Overhead (16%)

• Optimization 1: compare dirty pages only
• Optimization 2: parallelize the comparison
• Contributes 16% of the total overhead for 4 threads

[Figure: relative overhead (y-axis, 0 to 2) per benchmark and thread count. Legend: Memory comparison overhead (16%), Epoch overhead (17%), Redundant execution overhead (25%).]
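
The first optimization, comparing only pages written during the epoch, can be sketched in Python as follows. The flat `bytearray` memory model, the 4096-byte page size, and the `differing_pages` helper are assumptions for illustration; how the dirty-page set is actually collected is not shown on the slide.

```python
PAGE_SIZE = 4096   # assumed page size for this sketch

def differing_pages(recorded_mem, replayed_mem, dirty_pages):
    """Return the page numbers, among the dirty ones, whose contents differ."""
    mismatches = []
    for page in sorted(dirty_pages):
        start = page * PAGE_SIZE
        end = start + PAGE_SIZE
        if recorded_mem[start:end] != replayed_mem[start:end]:
            mismatches.append(page)
    return mismatches

# Three-page address spaces; only pages 0 and 1 were written during the epoch.
recorded_mem = bytearray(3 * PAGE_SIZE)
replayed_mem = bytearray(3 * PAGE_SIZE)
recorded_mem[10] = replayed_mem[10] = 7   # same write in both processes (page 0)
recorded_mem[PAGE_SIZE + 5] = 1           # racing write, recorded value (page 1)
replayed_mem[PAGE_SIZE + 5] = 2           # racing write, replayed value (page 1)

mismatched = differing_pages(recorded_mem, replayed_mem, dirty_pages={0, 1})
```

Page 2 is never touched by the comparison. Since each page is compared independently, the loop could also be split across worker threads, which is the spirit of the second optimization.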

4) Logging Overhead (42%)

• Overhead of logging synchronization operations and system calls
• Main cost for applications with fine-grained synchronization
• Contributes 42% of the total overhead for 4 threads

[Figure: relative overhead (y-axis, 0 to 2) per benchmark and thread count. Legend: Logging and other overhead (42%), Memory comparison overhead (16%), Epoch overhead (17%), Redundant execution overhead (25%).]

Rollback Frequency and Overhead

App. | Threads | Rollback frequency | Overhead | Avg. overhead
Pbzip2 (100 runs) | 4 | 84% none | 41% | 45%
 | | 15% once | 66% |
 | | 1% twice | 105% |
Aget (50 runs) | 4 | 80% none | 6% | 6%
 | | 18% once | 6% |
 | | 2% twice | 6% |

• Pbzip2 (16%) and Aget (20%) invoke one or more rollbacks
• Pbzip2: rollbacks contribute <10% of total overhead
• Aget: rollback overhead is negligible
• Frequent checkpoints => short epochs => small amount of work to be redone
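
The average overhead column is consistent with a frequency-weighted mean of the per-outcome overheads. The short Python check below works through that arithmetic; treating the no-rollback overhead (41%) as the baseline when estimating the share due to rollbacks is an assumption of this sketch, not something stated on the slide.

```python
def expected_overhead(outcomes):
    """outcomes: list of (fraction_of_runs, overhead_percent) pairs."""
    return sum(fraction * overhead for fraction, overhead in outcomes)

pbzip2 = [(0.84, 41), (0.15, 66), (0.01, 105)]
aget = [(0.80, 6), (0.18, 6), (0.02, 6)]

pbzip2_avg = expected_overhead(pbzip2)   # 34.44 + 9.90 + 1.05 = 45.39
aget_avg = expected_overhead(aget)       # about 6

# Share of pbzip2's average overhead beyond the no-rollback case (assumed baseline).
rollback_share = (pbzip2_avg - 41) / pbzip2_avg
```

Rounded, the averages are 45% and 6%, matching the table, and `rollback_share` comes out just under 0.10, in line with the "<10%" bullet.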

Conclusion

Goal: Deterministic replay for multithreaded programs
• Software-only: no custom hardware
• Online: record and replay concurrently

Contributions to replay
• Speculation: speculate race-free, and roll back and retry if needed
• External determinism: match system output and program states

Results
• Performance overhead when recording and replaying concurrently
  – 2 threads: 18%
  – 4 threads: 55%

Thank you

Benign Data Races

Benign data races could cause frequent rollbacks
• Performance (NOT correctness) issue
• The latest Java and C++ memory models prohibit benign races => there are only harmful races [Manson et al. POPL’05], [Boehm et al. PLDI’08]
• Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords
• Could automatically detect and instrument

Implementation

Modify Linux 2.6.27 kernel
• Deterministic replay
  – Multithreaded fork
  – Record/replay program input (e.g. system calls, signals, …)
  – Compare program state (memory and register contents)
• Speculator [Nightingale et al. SOSP’05]
  – Checkpoint and rollback
  – Buffer system output or propagate speculative states

Modify glibc 2.5.1
• Support recording/replaying low-level synchronization operations
  – e.g. lock, unlock, futex wait, futex wake

Handling System Calls

Replayed process
1) Emulate most system calls
• Supply the logged return value and the data copied into the process
2) Re-execute some system calls
• Create or delete threads: clone, exit, …
• Modify address space: mmap2, mprotect, …

Problem
• Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table)
• Process can NOT transition from replaying to live execution

Solution
• Recreate the OS state by re-executing native/virtualized system calls
  ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02]

Multi-threaded Fork (Checkpoint)

Copy-on-write fork
• Linux’s fork duplicates only a single thread
• Need a new copy-on-write primitive for checkpointing multithreaded processes
• Should checkpoint a thread at a safe point
  – kernel entry/exit (system call)

Multi-threaded fork
1) The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point
2) Once all threads reach the barrier, the initiating thread creates the checkpoint, then lets the other threads continue execution

Semi-regular checkpoints
• Adaptive epoch length
  – To bound the amount of work that must be redone on rollback
• Output-triggered commit
  – To provide acceptable latency for interactive tasks
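
The two-step barrier protocol can be modeled at user level with a small Python sketch. It only illustrates the ordering: `MultiThreadedFork` and `worker` are hypothetical names, the shared dictionary stands in for process state, and an eager `copy.deepcopy` replaces the kernel's copy-on-write checkpoint.

```python
import copy
import threading

class MultiThreadedFork:
    """Hypothetical model: checkpoint shared state once every thread is at a safe point."""

    def __init__(self, n_threads):
        self._safe_point = threading.Barrier(n_threads)   # step 1: everyone arrives
        self._resume = threading.Barrier(n_threads)       # step 2: checkpoint is done
        self.checkpoint = None

    def reach_safe_point(self, process_state, initiator):
        self._safe_point.wait()
        if initiator:
            # All other threads are parked at the second barrier, so the state is stable.
            self.checkpoint = copy.deepcopy(process_state)
        self._resume.wait()

def worker(tid, fork, process_state):
    process_state[tid] = "before"     # work done before the checkpoint
    fork.reach_safe_point(process_state, initiator=(tid == 0))
    process_state[tid] = "after"      # work done after the checkpoint

process_state = {}
fork = MultiThreadedFork(n_threads=3)
threads = [threading.Thread(target=worker, args=(tid, fork, process_state))
           for tid in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The checkpoint captures every thread's "before" state and none of the "after" updates, because no thread passes the second barrier until the initiator has finished copying.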

Benefits of Program State Check

1) Allow Respec to commit epochs and release system output
• Buffer output during speculation
• Safe to release output on commit after matching program state

2) Reduce the amount of execution that must be redone when a check fails

3) Allow broader uses of the replay system
• Tolerating non-fail-stop faults (e.g. transient hardware faults)
  – Need to detect latent faults
• Parallelizing security and reliability checks

Offline Replay with Respec

Respec Log
• Kernel’s system calls + user-level synchronization operations
• MD5 checksum of address space and register state

Problem: Not all races are logged
• Offline replay is NOT guaranteed to succeed
• Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed

Solution
• Offline replay search tools can be used, e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09]
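
A checksum of this kind can be computed with Python's standard `hashlib`, as in the sketch below. The serialization of registers and memory is invented for illustration (Respec's actual log format is not described on the slide), and `epoch_checksum` is a hypothetical helper.

```python
import hashlib

def epoch_checksum(registers, address_space):
    """MD5 digest over register values (in sorted name order) and raw memory bytes."""
    digest = hashlib.md5()
    for name in sorted(registers):
        digest.update(f"{name}={registers[name]:#x};".encode())
    digest.update(bytes(address_space))
    return digest.hexdigest()

logged = epoch_checksum({"eip": 0x8048000, "esp": 0xBFFF0000}, bytearray(b"\x01\x02"))
same = epoch_checksum({"esp": 0xBFFF0000, "eip": 0x8048000}, bytearray(b"\x01\x02"))
diverged = epoch_checksum({"eip": 0x8048000, "esp": 0xBFFF0000}, bytearray(b"\x01\x03"))
```

An offline replay attempt could presumably compare its own digest at each epoch boundary with the logged one and treat a mismatch as a sign that a race resolved differently.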

Non-Deterministic Program Input

• e.g. I/O, DMA, interrupts, signals, RDTSC, context switch, page fault

• Asynchronous interrupts (caused by external sources)
  – e.g. I/O, timer, disk read completion

• Synchronous interrupts (= traps)
  – e.g. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss

• x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode)
  – e.g. rdtsc (read timestamp counter), rdpmc (read performance monitoring counter)

Rollback Frequency and Overhead (Pbzip2)

Threads | Rollback frequency | Original time (sec) | Type | Respec time (sec) | Slowdown
1 | 0% | 4.59 | overall | 4.83 | 5%
2 | 13% once | 2.35 | w/o rollback | 2.70 | 15%
 | | | w/ rollback | 2.97 | 26%
 | | | overall | 2.73 | 16%
3 | 9% once, 2% twice | 1.64 | overall | 1.03 | 24%
4 | 84% no rollback, 15% once, 1% twice | 1.33 | overall | 1.93 | 45%

• Out of 100 runs, 13-16% of executions invoke one or more rollbacks
• Rollbacks contribute 8% of Respec's total overhead

Rollback Frequency and Overhead (Aget)

Threads | Rollback frequency | Original time (sec) | Type | Respec time (sec) | Slowdown
1 | 10% once, 2% twice | 2.05 | w/o rollback | 2.19 | 7%
 | | | w/ rollback | 2.21 | 8%
 | | | overall | 2.19 | 7%
2 | 20% once, 2% twice | 1.93 | overall | 2.17 | 13%
3 | 24% once | 1.94 | w/o rollback | 2.08 | 7%
 | | | w/ rollback | 2.09 | 8%
 | | | overall | 2.08 | 7%
4 | 18% once, 2% twice | 1.96 | w/ rollback | 2.08 | 6%
 | | | overall | 2.08 | 6%

• Out of 50 runs, 14-24% of executions invoke one or more rollbacks
• Performance impact is negligible (due to very frequent checkpoints)
