
Effective and Inexpensive (Memory) Race Recording

Min Xu

Thesis Defense

05/04/2006

Electrical and Computer Engineering Department, UW-Madison

Advisors: Mark Hill, Rastislav Bodik

Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

2. Overview

Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism

A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Supports both SC & TSO (x86-like)

[Figure: an effective and inexpensive race recorder offers long recording, broader applicability, low overhead, and low cost.]

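3. Thesis Contributions

[Figure: four contributions of an effective and inexpensive recorder, each paired with the property it delivers.]
• RTR Algorithm: Small Log Size
• Coherence Piggyback: Low Runtime Overhead
• Set/LRU Approximation: Low Cost Hardware
• Order-Value Hybrid: SC & TSO Applicability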

4. Outline

• Motivation & Problem (5 slides)
• An Effective and Inexpensive Race Recorder (21 slides)
  • RTR Algorithm
  • Coherence Piggyback
  • Order-Value Hybrid
• Evaluation Method & Results (6 slides)
• Conclusion & My Other Research (3 slides)

Motivation & Problem

6. Multithreaded Debugging

Single-threaded program:

    % gcc hash.c
    % a.out
    Segmentation fault
    %

    % gdb a.out
    gdb> run
    Program received SIGSEGV.
    In get() at hash.c:45
    45 a = bucket->d;

Multithreaded program, rerun under gdb:

    % gcc para-hash.c
    % a.out
    Segmentation fault
    %

    % gdb a.out
    gdb> run
    Program exited normally.
    gdb>

Multithreaded program with race recording:

    % gcc para-hash.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    %

    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;

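7. Race Recording

Thread I: X = 1; X++; print(X)
Thread J: X = X*5

[Figure: original execution, recording, and replay of the two threads. Depending on where Thread J's X = X*5 falls, print(X) shows X=6 or X=10. The log records the order of the racing accesses so that replay prints the same value as the original run.]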
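The race on this slide can be sketched in a few lines of Python. This is only an illustration: the op names (`set1`, `inc`, `mul5`), the `run` and `replay` helpers, and the use of a total-order schedule as the "log" are assumptions made for the sketch. The recorder described in this talk logs dependences between threads, not a full schedule.

```python
# Hypothetical model of the slide's example; op names are illustrative.
PROGRAMS = {
    "I": ["set1", "inc", "print"],   # X = 1; X++; print(X)
    "J": ["mul5"],                   # X = X*5
}

def run(schedule):
    """Run the threads in the given order of thread names; return what print(X) shows."""
    x, printed = 0, None
    pc = {"I": 0, "J": 0}            # per-thread program counter
    for thread in schedule:
        op = PROGRAMS[thread][pc[thread]]
        pc[thread] += 1
        if op == "set1":
            x = 1
        elif op == "inc":
            x += 1
        elif op == "mul5":
            x *= 5
        elif op == "print":
            printed = x
    return printed

def replay(log):
    """Replaying the recorded order reproduces the recorded output."""
    return run(log)

recorded = ["I", "J", "I", "I"]      # J's multiply lands between X = 1 and X++
print(run(recorded))                 # 6
print(run(["I", "I", "J", "I"]))     # 10
print(replay(recorded))              # 6
```

Without the log, a rerun may pick the second schedule and print 10 instead of 6, which is why the bug on the previous slide does not reproduce under gdb.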

8. Recording for Multithreaded Replay

Race Recording (the focus of this talk)
• Not an issue for a single thread
• Recreate the same general & data races

Checkpointing
• Provides a snapshot of the program state
• Many proposals (e.g., SafetyNet); not the focus

Input Recording
• Provides repeatable inputs
• Some proposals (e.g., part of FDR); not the focus

9. A Good Race Recorder

• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability

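10. Desired & Existing Race Recorders

[Table: existing recorders rated against the desired recorder on four criteria; the per-recorder ratings are not recoverable.]

Criteria, with what the desired recorder provides:
• Recording length: small log size
• Applicability: MP, Racey code, SC, TSO
• Overhead: negligible slowdown
• Cost: little hardware

Recorders compared: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, our recorder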

RTR Algorithm: Small Log Size

12. Problem Formulation

Reproduce exact same conflicts: no more, no less

[Figure: a two-thread trace (Thread I and Thread J, with loads and stores to A, B, C, D plus add and sub) shown during recording and during replay from the log. Conflicts are drawn in red, dependences in black.]

13. Log All Conflicts

• Detect conflicts, write log
• Assign IC (logical timestamps)
• But too many conflicts

[Figure: the two-thread example with instruction counts 1 to 6 in each thread and every conflict drawn as a dependence.]

Dependence log
• Log J: 2→3, 1→4, 3→5, 4→6
• Log I: 2→3
• Log size: 5*16 = 80 bytes (10 integers); each dependence takes 16 bytes

14. Netzer’s Transitive Reduction

[Figure: the same two-thread example; a dependence implied by another one is dropped.]

TR reduced log
• Log J: 2→3, 3→5, 4→6
• Log I: 2→3
• Log size: 64 bytes (8 integers)
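A minimal offline sketch of the reduction on this slide, assuming only two threads and a log of `(source IC, destination IC)` pairs per direction. The function name `transitive_reduce` and the pair encoding are illustrative, not taken from the thesis. A dependence a→b is dropped when another dependence a2→b2 with a2 ≥ a and b2 ≤ b already enforces it through program order.

```python
def transitive_reduce(deps):
    """Drop every (src_ic, dst_ic) dependence implied by another one.

    a -> b is implied by a2 -> b2 when a2 >= a and b2 <= b, because
    program order gives a before a2 and b2 before b.
    """
    deps = list(dict.fromkeys(deps))          # remove exact duplicates, keep order
    kept = []
    for a, b in deps:
        implied = any((a2, b2) != (a, b) and a2 >= a and b2 <= b
                      for a2, b2 in deps)
        if not implied:
            kept.append((a, b))
    return kept

log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]      # dependences from I to J (previous slide)
log_i = [(2, 3)]                              # dependences from J to I

reduced_j = transitive_reduce(log_j)
reduced_i = transitive_reduce(log_i)
print(reduced_j)                              # [(2, 3), (3, 5), (4, 6)]
print(16 * (len(reduced_j) + len(reduced_i))) # 64 bytes at 16 bytes per dependence
```

The dropped dependence is 1→4, which 2→3 already implies. A hardware recorder has to make this decision online, which is what the later slides on Pairwise-TR address.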

15. The Intuition of the New RTR Algorithm

[Figure: after reduction, the remaining dependences from I to J and from J to I are grouped into vectors; the vectors regulate replay (RTR).]

16. Stricter Dependences to Aid Vectorization

[Figure: the same two-thread example; reduced dependences are replaced by a stricter one so that the log becomes regular.]

New reduced log
• Log J: 2→3, 4→5
• Log I: 2→3

17. Compress Vectorized Dependencies

[Figure: the same two-thread example with the dependences drawn as vectors.]

Vectorized log
• Log J: x=3,5, ∆=1
• Log I: x=3, ∆=1

Reduce log size to KB/core/second
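The compression idea can be sketched as follows. The record layout `(first_dst, stride, count, delta)` is an assumption for illustration; the slide only shows the notation `x=3,5, ∆=1`, read here as "for x in {3, 5}, instruction x−∆ of the other thread precedes instruction x". The helper `implies` states when a stricter dependence covers a weaker one, as on the previous slide.

```python
def implies(strict, dep):
    """A stricter dependence (later source, earlier destination) covers dep."""
    return strict[0] >= dep[0] and strict[1] <= dep[1]

def vectorize(deps):
    """Group (src_ic, dst_ic) pairs into runs (first_dst, stride, count, delta).

    A run covers dependences with the same delta = dst - src whose
    destinations form an arithmetic progression.
    """
    runs = []
    for src, dst in sorted(deps, key=lambda d: d[1]):
        delta = dst - src
        if runs:
            first, stride, count, d = runs[-1]
            last = first + stride * (count - 1)
            if d == delta and (count == 1 or dst - last == stride):
                new_stride = dst - first if count == 1 else stride
                runs[-1] = (first, new_stride, count + 1, d)
                continue
        runs.append((dst, 0, 1, delta))
    return runs

def expand(runs):
    """Inverse of vectorize: list the individual dependences again."""
    return [(first + k * stride - delta, first + k * stride)
            for first, stride, count, delta in runs
            for k in range(count)]

# 4->5 is stricter than both 3->5 and 4->6, so it can replace them.
assert implies((4, 5), (3, 5)) and implies((4, 5), (4, 6))

runs = vectorize([(2, 3), (4, 5)])
print(runs)            # [(3, 2, 2, 1)]
print(expand(runs))    # [(2, 3), (4, 5)]
```

One record now stands for two dependences; longer regular runs compress further, which is the source of the log-size reduction claimed on the slide.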

Coherence Piggyback: Low Runtime Overhead

19. Detect Conflicts

Recording example:
Thread I: 1: ld A, 2: st B, 3: st C
Thread J: 1: add, 2: st C, 3: ld B, 4: st A

Per-address metadata (e.g., A.readers, A.writer) is updated on every access:

    i:1 ld A:  A.readers.add(I, 1)
    i:2 st B:  B.writer = (I, 2)
    j:2 st C:  C.writer = (J, 2)
    i:3 st C:  if (C.writer != I) log(WAW)
               foreach C.readers
                   if (reader != I) log(WAR)
               C.readers.clear()
               C.writer = (I, 3)
    j:3 ld B:  if (B.writer != J) log(RAW)
               B.readers.add(J, 3)
    …

Expensive in software
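The bookkeeping on this slide can be written out as a small software detector. This is a sketch under assumptions: the class name `ConflictDetector`, the log tuple layout, and the global interleaving used below are chosen for illustration. It shows why the software approach is expensive: every load and store pays for a metadata lookup and update.

```python
from collections import defaultdict

class ConflictDetector:
    """Track the last writer and the readers of each address; log conflicts."""

    def __init__(self):
        self.writer = {}                     # addr -> (thread, ic) of last writer
        self.readers = defaultdict(dict)     # addr -> {thread: ic} since last write
        self.log = []                        # (kind, src_thread, src_ic, dst_thread, dst_ic)

    def load(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("RAW", w[0], w[1], thread, ic))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("WAW", w[0], w[1], thread, ic))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", reader, reader_ic, thread, ic))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

# One possible interleaving of the slide's example (an assumption).
d = ConflictDetector()
d.load("I", 1, "A")
d.store("I", 2, "B")
d.store("J", 2, "C")
d.store("I", 3, "C")    # WAW with j:2
d.load("J", 3, "B")     # RAW with i:2
d.store("J", 4, "A")    # WAR with i:1
for entry in d.log:
    print(entry)
```

The three logged entries correspond to the dependences j:2→i:3, i:2→j:3, and i:1→j:4.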

20. Use Cache and Cache Coherence

Proc I cache:
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J cache:
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

[Figure: the cached copies play the role of A.readers/A.writer and B.readers/B.writer. Proc J's ld B sends a Get/S request; the data response carries a timestamp, so the RAW is detected and logged.]

Detect conflicts in hardware with little runtime cost

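21. Cache Evictions and Writebacks

[Figure: st A between Proc I and Proc J, with Directory of A: Shared(I,J), Owner(). Messages: Get/S, Inv, Ack. Which timestamp accompanies the Ack once the line has been evicted? The WAR is still detected and logged.]

OK with nonsilent eviction & directory eviction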

22. Implement TR and RTR in Hardware

Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamps)
• Enables pairwise transitive reduction

Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies

23. Hardware Implementation

Cache
• Eviction/writeback: solved (more details later)
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved

Processor
• Out-of-order/prefetching: solved
• Unordered message: solved
• Counter overflow: solved
• Thread migration: not yet solved

Set/LRU Approximation: Low Cost Hardware

25. Timestamp Approximation

[Figure: the recording example with one set of I's cache shown. Once a line's timestamp is no longer available, use the current IC of thread I. Axes of the trade-off: hardware cost vs. log size.]

Correct, but more evictions mean more logged conflicts

27. Set/LRU Approximation

[Figure: the same recording example with one set of I's cache shown.]
• Use current IC of thread I
• LRU guarantees B’s TS > A’s TS
• Set/LRU better preserves reducibility
• Small cache: more misses but still a small log

28. Hardware Cost of Timestamps

Coupled timestamp memory: overhead grows with cache size
• Not flexible
• 64B line + 64b (24b) timestamp: 12.5% (4.7%) overhead
• 192 KB for a 4MB L2

Need to modify cache
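The figures on this slide follow from simple arithmetic, reproduced below. The function names are illustrative; the line size, timestamp widths, and L2 size are the ones stated on the slide.

```python
LINE_BYTES = 64                  # cache line size from the slide
L2_BYTES = 4 * 1024 * 1024       # 4MB L2 from the slide

def relative_overhead(ts_bits):
    """Timestamp bits as a fraction of the data bits in one line."""
    return ts_bits / (LINE_BYTES * 8)

def coupled_size_kb(ts_bits):
    """Total timestamp storage when every L2 line carries a timestamp."""
    lines = L2_BYTES // LINE_BYTES
    return lines * ts_bits / 8 / 1024

print(f"{relative_overhead(64):.1%}")   # 12.5%
print(f"{relative_overhead(24):.1%}")   # 4.7%
print(coupled_size_kb(24))              # 192.0
```

The 192 KB figure therefore corresponds to the 24-bit timestamp option: 65,536 lines times 3 bytes each.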

29. Decoupled Timestamp Memory

Decoupling: small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way: 99% transitive reduction
• Timestamp memory: 24 KB

No need to modify cache

Cache:
Tag  State  Data
A    S      …
B    M      …

Timestamp Memory:
Tag  Timestamp
A    1
B    2

From 192 KB (coupled timestamp memory) to 24 KB: 8x reduction

30. Order-Value Hybrid: SC & TSO Applicability

31. Recording with Total Store Order (TSO)

• Majority of existing MPs are non-SC
• TSO is well defined, x86-like

Example (initially A=B=0):
Thread I: 1: st A,1   2: ld B
Thread J: 1: st B,1   2: ld A

[Figure: under SC the two loads can observe A=1 B=0, A=0 B=1, or A=1 B=1. TSO additionally allows A=0 B=0.]

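32. TSO Execution

Example (initially A=B=0):
Thread I: 1: st A,1   2: ld B
Thread J: 1: st B,1   2: ld A

[Figure: each thread's store waits in its write buffer (WrBuf) while its load reads the memory system, which still holds A=0 B=0. Both loads return 0; the memory system becomes A=1 B=1 only after the buffers drain.]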
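The difference between SC and TSO for this example can be checked by brute force. The sketch below enumerates every execution of the two-thread example, with and without per-thread FIFO write buffers. The function name `outcomes` and the modeling choices (a separate "drain" event per buffered store, loads checking the thread's own buffer first) are assumptions made for illustration.

```python
def outcomes(tso):
    """Return all (value I loads from B, value J loads from A) pairs.

    tso=False: stores update memory immediately (SC).
    tso=True: stores enter a per-thread FIFO write buffer and drain later.
    """
    programs = {"I": [("st", "A"), ("ld", "B")],
                "J": [("st", "B"), ("ld", "A")]}
    results = set()

    def explore(mem, bufs, pcs, loaded):
        if all(pcs[t] == 2 for t in programs) and not any(bufs.values()):
            results.add((loaded["I"], loaded["J"]))
            return
        for t in programs:
            if pcs[t] < 2:                       # execute t's next instruction
                op, addr = programs[t][pcs[t]]
                mem2, loaded2 = dict(mem), dict(loaded)
                bufs2 = {k: list(v) for k, v in bufs.items()}
                if op == "st" and tso:
                    bufs2[t].append(addr)        # buffered, not yet visible
                elif op == "st":
                    mem2[addr] = 1
                else:                            # load: own buffer first, then memory
                    loaded2[t] = 1 if addr in bufs2[t] else mem2[addr]
                explore(mem2, bufs2, {**pcs, t: pcs[t] + 1}, loaded2)
            if bufs[t]:                          # drain t's oldest buffered store
                mem2 = dict(mem)
                bufs2 = {k: list(v) for k, v in bufs.items()}
                mem2[bufs2[t].pop(0)] = 1
                explore(mem2, bufs2, pcs, loaded)

    explore({"A": 0, "B": 0}, {"I": [], "J": []}, {"I": 0, "J": 0}, {})
    return results

print(sorted(outcomes(tso=False)))   # [(0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(tso=True)))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The extra `(0, 0)` outcome under TSO cannot be explained by any interleaving of the four instructions, which is why an order-only log is not enough for TSO.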

33. Order-Value-Hybrid Recording

Example (initially A=B=0):
Thread I: 1: st A,1   2: ld B
Thread J: 1: st B,1   2: ld A

[Figure: recording and replay of the TSO execution. Recording: loaded locations are monitored (Start Monitor A, Start Monitor B); when a monitored location changes (“A Changed!”), the WAR is omitted from the order log and the loaded value is logged instead; monitoring then stops (Stop Monitor B). Replay: the logged value A=0 is used for the load.]

34. Hybrid Recording with TR and RTR

Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]

Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Evaluation Method & Results

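36. Put-it-together: Determinizer/CMP

[Figure: a 4-core CMP (Core 1 to Core 4) with a shared L2 cache acting as the L1 directory and a timestamp memory (TSM) per core. Inside each core: L1 I$ and L1 D$, TSM, IC, L1 coherence controller, log, TR register, and RTR register.]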

37. Simulation Method

Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
  • 1-way in-order issue, 2 GHz
  • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory

Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

38. Log Size: 1 byte/kilo-instr

Well within the capability of current machines
• Long recording (days to months) needs improvement

[Figure: two bar charts of log size for Apache, JBB, OLTP, Zeus, and AVG: one in byte/core/kilo-instr (axis 0.0 to 2.0), one in KB/core/s (axis 0 to 200).]

39. Runtime Overhead

[Figure: two bar charts (axis 0 to 100) comparing the baseline with the race recorder for Apache, JBB, OLTP, and Zeus: execution time and interconnection message bandwidth.]

Our recorder can be “always-on”

40. Benefits of RTR and Set/LRU (Log Size)

[Figure: two bar charts of log size (axis 0 to 100) for Apache, JBB, OLTP, Zeus, and AVG. Improvement by RTR: Pairwise-TR vs. our RTR. Effectiveness of Set/LRU: perfect TSM vs. 24KB Set/LRU TSM.]

41. Why Do RTR and Set/LRU Work Well?

RTR
• Processors execute instructions at similar speeds
• Therefore, we can find “vectorizable” dependencies

Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”

42. Sensitivity and Scalability

A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24 bits are enough
• Associativity: higher is better for RTR

Scalability of the recorder
• Studied with modest processor counts (2p to 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores

Conclusion & My Other Research

44. Race Recording

Race recording: key to combating nondeterminism

My thesis: an effective & inexpensive recorder
• RTR algorithm: small log size
• Coherence piggyback: negligible slowdown
• Timestamp approximation: low hardware cost
• Order-value hybrid: supports SC & TSO

Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay

45. Serializability Violation Detector [PLDI’05]

Like a race detector

No a priori annotation requirement
• “Critical sections” are inferred

Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition

[Figure: an inferred “critical section” over shared variables, with accesses Read in1, Read in2, Write out1, Write out2, Write local, Read local.]

46. Publications

FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)

SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition

RTR, Set/LRU & Hybrid
• Submitted for publication

Thank you!



48. Acknowledgements

Joint work with my advisors
• Mark Hill, Ras Bodik

Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller

Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen

Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

49. Deterministic Replay is Useful

Deterministic replay is logically recreating a program execution

Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])

Future applications
• Data Recovery
• Replay-based Synchronization

50. Multicore and Multithreading

Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1

Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time

51. Race Recording: Key to Determinism

Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but

Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Some support only sequential consistency

Need a better race recorder

52. Recording/Replay & Debugging

[Figure: timeline of processors P1 to P4. The online recorder takes Checkpoints A, B, and C and stores logs A, B, and C until a crash dumps “core”. The deterministic replayer then reads Checkpoint B and replays from logs B and C up to the crash.]

53. Deterministic Replay & Fault Tolerance

Fault Recovery
• Replay after a failure

Fault Detection
• Replay then compare

[Figure courtesy of VMware]

54. Future: Record/Replay & Undo/Redo

VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo

[Figure: Windows XP screenshot]

55. Future: Replay-based Synchronization

Three steps
• Coarse-grain sync. → fine-grain sync. → hardware sync.

Results: higher performance

Works only if static control flow & fixed data addr
• DSP kernels

[Figure: recording: one thread executes ld A, st B, Unlock() and the other executes lock(), st A, ld B. Replay: the same accesses (ld A, st B; st A, ld B) are ordered by the log.]

56. Race Recording Related Work

Total-order recorders (low replay parallelism)
• Bacon ’91 (hardware): bus transactions; large log; low overhead
• RecPlay ’00, JaRec ’04: Lamport clocks; small log; low overhead (sync only)
• R&C ’90, Déjà Vu ’98: scheduling; small log; low overhead (non-MP)

Partial-order recorders (high replay parallelism)
• Bacon ’91 (hardware): bus transaction groups; large log; low overhead
• Instant Replay ’87: variable version; large log; high overhead
• Netzer ’93: vector clocks; small log; high overhead

57. Correctness of Order-Value-Hybrid

Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

58. TR and TSO

TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen

[Figure: recording example in which Thread I and Thread J each execute two stores (to A, B, C) followed by a load (ld B, ld A); one dependence is marked “must not be reduced”.]

59. RTR and TSO

The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it

[Figure: recording example with Thread I and Thread J executing stores and loads (st A, st B, ld A, ld C, ld B) plus add and sub. The stores are ordered in the write buffer. The old window for j:3 would expose the ordered loads; the new window for j:3 does not allow that dependence.]

60. Deadlock Avoidance of RTR

Avoid deadlock by adhering to an SC total order

[Figure: the two-thread recording example with instruction counts 1 to 6, showing a replay cycle through i:4, j:1, j:2, i:3, i:4.]

61. Recording Race-free Executions

No data races

Only need to record synchronization races

Deterministic replay up until the first data race

62. Replay Parallelism

Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

63. Directory Protocols

Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements

Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements

A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

64. Snooping Protocols

Key problem is combined/implicit response
• Not a problem for AMD Hammer

[Figure: st A between Proc I and Proc J. Labels: Get/X, Pull Shared, + Current IC; the WAR is detected and logged.]

65. Nonsilent Evictions

[Figure: st A between Proc I and Proc J after an eviction. Directory of A: Shared(J), Owner(), StickyS(I,J). Labels: Get/S, Ack with timestamp, timestamp memory, eviction.]

Directory eviction: more false conflicts, like snooping

66. Out-of-Order & Hardware Prefetching

Speculative execution
• No IC assigned yet

Hardware prefetching
• No IC assigned

Key idea: receive observation
• Can associate a ld/st with current commit instruction

67. Unordered Messages in Interconnect

Messages arrive out of order

Can affect reduction

Better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas

68. Integer Overflow

IC and timestamps may overflow

IC: make it 64-bit; will not overflow for a long time

Timestamps: use approximation techniques
• MSB of IC + LSB of Timestamps
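One way to read "MSB of IC + LSB of Timestamps" is sketched below: only the low `width` bits of each timestamp are stored, and the full value is rebuilt from the upper bits of the thread's current 64-bit IC. The function name `reconstruct`, the default width of 24 bits, and the wrap-around rule are assumptions for illustration, not the thesis's circuit. The result is exact only when the true timestamp lies within 2^width instructions of the current IC.

```python
def reconstruct(current_ic, low_bits, width=24):
    """Rebuild a full timestamp from its stored low bits and the current IC.

    Assumes the true timestamp is at most current_ic and less than
    2**width instructions old.
    """
    mask = (1 << width) - 1
    candidate = (current_ic & ~mask) | low_bits
    if candidate > current_ic:        # low bits belong to the previous epoch
        candidate -= 1 << width
    return candidate

true_ts = 0x12FFFFF0
stored = true_ts & 0xFFFFFF           # only 24 bits are kept in the timestamp memory
print(hex(reconstruct(0x13000010, stored)))   # 0x12fffff0
print(reconstruct(7, 5))                      # 5
```

For a timestamp older than the window, this rule returns a value that is too recent. That yields a stricter dependence than the real one, in the same spirit as the "current IC" approximation used elsewhere in the talk.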

69. Varying TSM Size

[Figure: four panels (Apache, OLTP, SPECjbb, Zeus) plotting log bandwidth (MB/core/second, axis 0 to 3) against the size of the timestamp memory (2 to 2048 KB) for 1TS-RTR, 1TS-TR, 2TS-RTR, and 2TS-TR. Configuration: 64 ways, full timestamps, Set/LRU.]

70. Varying Associativity

[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log bandwidth (MB/core/second, log axis 0.01 to 10) against the associativity of the timestamp memory (2 to 1024) for CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, and SetLRU-RTR. Configuration: 64KB, full R/W timestamps.]

71. Varying Partial Timestamp Width

[Figure: four panels (Zeus, SPECjbb, OLTP, Apache) plotting log bandwidth (MB/core/second, log axis 0.01 to 10) against partial timestamp width (10 to 30) for TR and RTR. Configuration: 64 sets, 64 ways, Set/LRU.]

72. Log Size Scaling

[Figure: log size (MB/core/s, axis 0.0 to 1.0) against number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus.]

73. In Retrospect …

What are you most proud of?
• RTR improves TR after 13 years

What would you do differently if doing it again?
• “Replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier

What should the industry do?
• Implement the recorder as a VMM extension
