A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASPLOS’06)

Post on 01-Jan-2016


A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording

(ASPLOS’06)

Min Xu, Rastislav Bodik, Mark D. Hill

Shimin Chen

LBA Reading Group Presentation

Why Do You Need a Recorder?

    % gcc sim.c
    % a.out
    Segmentation fault
    %

    % gdb a.out
    gdb> run
    Program received SIGSEGV.
    In get() at hash.c:45
    45 a = bucket->d;

    % gcc para-sim.c
    % a.out
    Segmentation fault
    %

    % gdb a.out
    gdb> run
    Program exited normally.
    gdb>

    % gcc para-sim.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    %

    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;

Ideally …

    % gcc para-sim.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    %

    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;

Long recording: small log
Low runtime overhead
Low cost
Applicability:
• Programs – data race
• Systems – non-SC

Flight Data Recorder (ISCA’03)

Full-system Record-Replay
• Recording memory races:
  • Assumes Sequential Consistency (SC)
  • Record order of instruction interleaving
  • Target cache-coherence multiprocessor server
  • Piggyback on coherence protocol: little extra H/W
• Recording system states: SafetyNet
• Recording I/Os

Results:
• Non-trivial recording interval: 1 second
• Negligible runtime overhead: less than 2%
• Can be “Always On”

RTR

Better memory race log compression
• 1 byte per kilo-instruction

Dealing with Total Store Ordering

In this talk, I will try to describe a full picture combining FDR and RTR.

Outline

• Introduction
• Recording System State
• Recording Input/Output
• Recording Memory Races
• Dealing with TSO
• Summary

Recording System State (based on SafetyNet)

• Purpose: reconstruct the initial state (registers, TLB, main memory) at the beginning of the replay interval
• Policy: FDR’s 1-second replay interval
  • Take a logical checkpoint every 1/3 second
  • Reserve memory space to store logs for 4 checkpoints
• Logical checkpoint:
  • Quiesce entire system to take a physical checkpoint
  • Registers and TLB states (4248 bytes/processor on SPARC V9)
  • Log old value of a cache line upon first update
  • Add an “already-updated” bit per cache line
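The "log the old value on first update" rule can be illustrated with a small software model. This is only a sketch under assumptions: the class name `UndoLog`, its methods, and the dictionary standing in for memory are hypothetical, and it keeps a single checkpoint interval rather than the four that the slide reserves space for.

```python
class UndoLog:
    """Toy model of logging a cache line's old value on its first update."""

    def __init__(self, memory):
        self.memory = memory      # line address -> current value
        self.updated = set()      # lines whose "already-updated" bit is set
        self.log = []             # (line address, old value) records

    def checkpoint(self):
        # Start a new interval: clear every "already-updated" bit and the log.
        self.updated.clear()
        self.log.clear()

    def store(self, line, value):
        # Only the first update of a line within the interval is logged.
        if line not in self.updated:
            self.log.append((line, self.memory.get(line)))
            self.updated.add(line)
        self.memory[line] = value

    def restore(self):
        # Undo in reverse order to rebuild the state at the checkpoint.
        for line, old in reversed(self.log):
            self.memory[line] = old
        self.checkpoint()


mem = UndoLog({"A": 1, "B": 2})
mem.checkpoint()
mem.store("A", 10)
mem.store("A", 11)    # second update of A: not logged again
mem.store("B", 20)
logged = list(mem.log)
mem.restore()
```

After `restore()`, the model's memory again holds the values it had at the checkpoint, and the log contained one record per modified line regardless of how many times the line was written.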


FDR paper


Recording I/O

I/O loads

Instruction count + interrupt number

DMA store values


Log All Dependence

(Figure: replay of two threads, with cross-thread dependences drawn between instructions.)

    #   Thread I   Thread J
    1   ld A       add
    2   st B       st C
    3   st C       ld B
    4   ld D       st A
    5   sub        st C
    6   ld B       st D

Dependence Log
Log J: 2→3, 1→4, 3→5, 4→6
Log I: 2→3
Log Size: 5*16=80 bytes (10 integers); each dependence takes 16 bytes

But too many dependences
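As a rough illustration of what "log all dependences" means, the sketch below derives the two logs from a global memory order. The interleaving in `TRACE`, the placement of instructions in each thread, and the function name are assumptions chosen to be consistent with the slide's logs; the slide itself does not give them in this form.

```python
def log_all_dependences(trace):
    """Return {receiving thread: [(source index, own index), ...]}.

    trace lists (thread, index, op, address) in one global memory order.
    With only two threads, the source thread of each entry is implied.
    """
    last_writer = {}   # address -> (thread, index) of the latest store
    readers = {}       # address -> {thread: index} of loads since that store
    logs = {}
    for thread, index, op, addr in trace:
        if addr is None:          # non-memory instruction such as add or sub
            continue
        sources = []
        writer = last_writer.get(addr)
        if writer is not None and writer[0] != thread:
            sources.append(writer[1])
        if op == "st":
            # A store also depends on earlier loads by other threads.
            sources += [i for t, i in readers.get(addr, {}).items() if t != thread]
            last_writer[addr] = (thread, index)
            readers[addr] = {}
        else:
            readers.setdefault(addr, {})[thread] = index
        for src in sources:
            logs.setdefault(thread, []).append((src, index))
    return logs


# Assumed interleaving of Thread I and Thread J.
TRACE = [
    ("I", 1, "ld", "A"), ("I", 2, "st", "B"), ("J", 1, "add", None),
    ("J", 2, "st", "C"), ("J", 3, "ld", "B"), ("I", 3, "st", "C"),
    ("J", 4, "st", "A"), ("I", 4, "ld", "D"), ("J", 5, "st", "C"),
    ("J", 6, "st", "D"), ("I", 5, "sub", None), ("I", 6, "ld", "B"),
]
logs = log_all_dependences(TRACE)
size_bytes = 16 * sum(len(deps) for deps in logs.values())
```

Under this interleaving, `logs["J"]` holds four entries and `logs["I"]` one, which gives the 80-byte total quoted on the slide.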

Netzer’s Transitive Reduction (TR), approximated by FDR

(Figure: replay of the same two-thread example, Thread I and Thread J, instructions 1–6 each.)

TR Reduced Log
Log J: 2→3, 3→5, 4→6 (TR reduced: 1→4 is no longer logged)
Log I: 2→3
Log Size: 64 bytes (8 integers)

How to further reduce log size?
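A simplified view of the reduction for a single sender and a single receiver: a dependence is skipped when its source is no later than the largest source already logged, because that earlier entry together with program order already enforces it. The function name and the list-based input are assumptions, and multi-thread transitivity is deliberately left out of this sketch.

```python
def transitive_reduction(deps):
    """Drop dependences already implied by an earlier logged one.

    deps: [(source index, receiver index), ...] ordered by receiver index,
    all coming from the same sender thread.
    """
    largest_seen = 0      # largest source index logged so far for this sender
    kept = []
    for src, dst in deps:
        if src > largest_seen:
            kept.append((src, dst))
            largest_seen = src
        # Otherwise an earlier kept entry plus program order implies it.
    return kept


log_j = transitive_reduction([(2, 3), (1, 4), (3, 5), (4, 6)])
log_i = transitive_reduction([(2, 3)])
size_bytes = 16 * (len(log_j) + len(log_i))
```

Here 1→4 is dropped because 2→3 was logged first: instruction 1 precedes instruction 2 in Thread I, and instruction 3 precedes instruction 4 in Thread J.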

RTR

Actively creating artificial dependencies
• Stricter
• Vectorized

The Intuition of the RTR Algorithm

(Figure: dependences after reduction, from I to J and from J to I, grouped into vectors. The vectors “regulate” replay.)

Stricter Dependences to Aid Vectorization

(Figure: replay of the same two-thread example, Thread I and Thread J, instructions 1–6 each. The dependence 4→5 is marked “stricter”; the dependences it implies are marked “reduced”.)

New Reduced Log
Log J: 2→3, 4→5
Log I: 2→3
Log Size: 48 bytes (6 integers)

Fewer dependencies to log
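One way to picture a stricter dependence: two neighbouring entries (s1→d1) and (s2→d2) can be replaced by a single entry from the later source to the earlier receiver, since that one entry implies both. The sketch below accepts such a merge only when it agrees with a recorded global order, a conservative condition chosen for this example to rule out cycles. It is not the paper's hardware algorithm, and the interleaving in `RECORDED` is an assumption.

```python
def make_stricter(deps, order):
    """Greedily merge neighbouring dependences into one stricter dependence.

    deps:  [(source index in I, receiver index in J), ...] sorted by receiver.
    order: {(thread, index): position} in the recorded global order.
    """
    merged = []
    for src, dst in deps:
        if merged:
            prev_src, prev_dst = merged[-1]
            # Latest source paired with the earliest receiver implies both.
            candidate = (max(prev_src, src), prev_dst)
            # Accept only if the candidate agrees with the recorded order.
            if order[("I", candidate[0])] < order[("J", candidate[1])]:
                merged[-1] = candidate
                continue
        merged.append((src, dst))
    return merged


# Assumed recorded interleaving of the two threads.
RECORDED = ["I1", "I2", "J1", "J2", "J3", "I3", "J4", "I4", "J5", "J6", "I5", "I6"]
order = {(name[0], int(name[1:])): pos for pos, name in enumerate(RECORDED)}
log_j = make_stricter([(2, 3), (3, 5), (4, 6)], order)
```

With this order, 3→5 and 4→6 collapse into 4→5, while 2→3 stays separate because instruction 3 of Thread I ran after instruction 3 of Thread J.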

Compress Vectorized Dependencies

(Figure: replay of the same two-thread example, Thread I and Thread J, instructions 1–6 each, with the remaining dependences drawn as vectors.)

Vectorized Log
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log Size: 40 bytes (5 integers)

TR → RTR: fewer deps + fewer bytes/dep
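The size arithmetic on this slide can be reproduced with a toy encoding in which dependences sharing the same ∆ (receiver index minus source index) are stored as a list of receiver indices x plus one ∆. This encoding and the helper names are assumptions made for illustration; the slides do not spell out the actual record format. The 8 bytes per integer follows from the earlier "80 bytes (10 integers)".

```python
def vectorize(deps):
    """Group consecutive dependences with the same delta = receiver - source.

    Returns [(list of receiver indices x, delta), ...].
    """
    vectors = []
    for src, dst in deps:
        delta = dst - src
        if vectors and vectors[-1][1] == delta:
            vectors[-1][0].append(dst)
        else:
            vectors.append(([dst], delta))
    return vectors


def size_in_integers(vectors):
    # Each vector stores its receiver indices plus a single delta.
    return sum(len(xs) + 1 for xs, _ in vectors)


vec_j = vectorize([(2, 3), (4, 5)])
vec_i = vectorize([(2, 3)])
total_bytes = 8 * (size_in_integers(vec_j) + size_in_integers(vec_i))
```

Thread J's two dependences share ∆=1 and fit in three integers, Thread I's single dependence takes two, giving five integers in total.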


H/W Considerations

(IC) Instruction count per core -- easy

(VIC[p]) Record previously seen senders’ largest time stamps for transitive reduction

(CTS[b]) Time stamp per cache block:
• i.e. record IC upon load/store commits
• At commit time:
  • Figure out memory address – how difficult?
  • Write CTS: decoupled timestamp memory

H/W Considerations Cont’d

Piggyback on cache coherence messages
• FDR: CTS[b]
• RTR: CTS[b] & sender’s IC

Logic to perform algorithm at the receiver side
• FDR: integer comparison, update VIC[sender], generate log record
• RTR: in addition, max/min, integer subtraction

Augment directory structure
• Record last owner for evicted blocks

Cache must respond to inquiries about evicted blocks: reply with CTS[SET/LRU]


Total Store Ordering

FIFO Write buffer
• A store commits by placing its value into write buffer
• A store is ordered when it exits the write buffer and updates the memory
• Stores are ordered in commit order (FIFO)

Load can obtain values from write buffer or from memory system
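The write-buffer behaviour described above can be mimicked with a small model. The class `TSOCore` and its method names are hypothetical; the sketch only shows commit versus ordering of stores and load forwarding, not a full TSO implementation.

```python
from collections import deque


class TSOCore:
    """Toy core with a FIFO write buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.write_buffer = deque()   # FIFO of (address, value)

    def store(self, addr, value):
        # The store commits here but is not yet visible to other cores.
        self.write_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory.
        for buffered_addr, value in reversed(self.write_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]

    def drain_one(self):
        # The oldest store is ordered when it leaves the buffer.
        addr, value = self.write_buffer.popleft()
        self.memory[addr] = value


memory = {"x": 0, "y": 0}
p0, p1 = TSOCore(memory), TSOCore(memory)
p0.store("x", 1)
p1.store("y", 1)
r0 = p0.load("y")     # p1's store is still in p1's write buffer
r1 = p1.load("x")     # p0's store is still in p0's write buffer
own = p0.load("x")    # forwarded from p0's own write buffer
p0.drain_one()
p1.drain_one()
```

Both cross-core loads see the old value even though each core has already committed its store, an outcome that a sequentially consistent machine would not allow. This is the kind of behaviour that an instruction-interleaving log based on SC cannot express directly.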

Problems with TSO

/* XXX */ is memory order

The two examples create cycles that will result in replay deadlocks

Solution

Identify problematic load instructions
• Monitor invalidation in [t1, t2]
• t1: the load (or the previous store that feeds the load) is ordered at memory
• t2: all preceding instructions are ordered

Log load values and replay these load instructions by values

HW: similar to the misspeculation detection circuitry in SC systems (e.g. MIPS R10000)

Insufficient for supporting Processor Consistency and other more relaxed models

Conclusion

RTR: 1 byte/kilo-instruction
• Based on Netzer’s transitive reduction
• Creates stricter dependencies
• Vectorizes dependencies to compress the log
• Avoids overly strict dependencies, hence no deadlock
