rtr: 1 byte/kilo- i nstruction race recording

Post on 31-Jan-2016

41 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

RTR: 1 Byte/Kilo- I nstruction Race Recording. Rastislav Bodik. Mark D. Hill. Min Xu. Why Do You Need a Recorder?. % gdb a.out gdb> run Program received SIGSEGV. In get() at hash.c:45 45 a = bucket->d;. % gcc sim .c % a.out Segmentation fault %. % gcc para- sim .c % a.out - PowerPoint PPT Presentation

TRANSCRIPT

RTR: 1 Byte/Kilo-InstructionRace Recording

Min Xu Rastislav Bodik Mark D. Hill

2

% gcc sim.c% a.outSegmentation fault%

% gdb a.outgdb> runProgram received SIGSEGV.In get() at hash.c:4545 a = bucket->d;

% gdb a.outgdb> runProgram exited normally.gdb>

% gcc para-sim.c% a.outSegmentation fault%

Why Do You Need a Recorder?

% gdb a.out loggdb> runProgram received SIGSEGV.In get() at para-hash.c:6767 a = bucket->d;

% gcc para-sim.c% a.outSegmentation faultRace recorded in “log”%

3Ideally …

% gdb a.out loggdb> runProgram received SIGSEGV.In get() at para-hash.c:6767 a = bucket->d;

% gcc para-sim.c% a.outSegmentation faultRace recorded in “log”%

Long recording:small logLow runtime

overheadLow cost

Applicability:Programs – data race

Systems – non-SC

4Better and Better Recorders

Low Cost

Low Overhead

Small Log Size

Applicability

Data Race Non-SC

InstRply ’87 Y Y

B. & G. ’91 YNetzer’93 YDéjà Vu

’98 YRecPlay

’00JaRec ’04

FDR ’03BugNet ’05

YReEnact

‘03 Y

CORD ‘06 YStrata ’06 YRTR ‘06 Y Y

1 Byte/Kilo-Instruction[ASPLOS’06]

A New Recorder

HardwareAcceleration

[ISCA’03]

LessHardware[ASPLOS’06]

SC & TSO[ASPLOS’06]

Result: One more step toward practical

This talk covers only RTR• Regulated Transitive Reduction algorithm

6Outline

Race Recording

RTR Algorithm

Results with Commercial Workloads

Compress log during recording replay more “regularly”

Conclusion

Technically, what’s race recording?

8Race Recording

X=6

X = 1

X++

print(X)

X = 1

X++

print(X)

-X = X*5

--

---

X = X*5-

Thread IThread J

Original Replay

X=10

Recording

X= 6

-X = X*5

--

Log

Thread IThread J

9

Goal: Reproduce same conflicts with minimum log data

Terminologies and Assumptions

ld A

Thread I Thread J

Recording

st B

st C

sub

ld B

add

st C

ld B

st A

st C

Thread I Thread J

Replay

Log

ld D

st D

ld A

st B

st C

sub

ld B

add

st C

ld B

st A

st C

ld D

st D

Conflicts(red)

Dependence(black)

Regulated Transitive Reduction (RTR)

11Log All Conflicts

1

2

3

4

5

6

1

2

3

4

5

6

ld A

Thread I Thread J

Replay

st B

st C

sub

ld B

add

st C

ld B

st A

st C

ld D

st D

Log J: 23 14 35 46

Log I: 23

Log Size: 5*16=80 bytes(10 integers)

Dependence Log

16 bytes

But too many conflicts

12Netzer’s Transitive Reduction (TR)

1

2

3

4

5

6

1

2

3

4

5

6

ld A

Thread I Thread J

Replay

st B

st C

sub

ld B

add

st C

ld B

st A

st C

ld D

st D

TR reduced Log J: 23

35 46

Log I: 23

Log Size: 64 bytes(8 integers)

TR Reduced Log

How to further reduce log size?

13The Intuition of the RTR Algorithm

After Reduction

From I to J

From J to I

Vectors

Vectors“Regulate” Replay

14

Stricter Dependences to Aid Vectorization

1

2

3

4

1

2

3

4

ld A

Thread I Thread J

Replay

st B

st C

add

st C

ld B

st Ald D

5 5sub st C

6 6ld B st D

Log J: 23 45

Log I: 23

Log Size: 48 bytes(6 integers)

New Reduced Log

stricter

Reduced

Fewer dependencies to log

15Compress Vectorized Dependencies

1

2

3

4

5

6

1

2

3

4

5

6

ld A

Thread I Thread J

Replay

st B

st C

sub

ld B

add

st C

ld B

st A

st C

ld D

st D

Log J: x=3,5, ∆=1

Log I: x=3, ∆=1

Log Size: 40 bytes(5 integers)

Vectorized Log

VectorDeps.

TRRTR: fewer deps + fewer byte/dep

16Deadlock Avoidance of RTR

1

2

3

4

5

6

1

2

3

4

5

6

ld A

Thread I Thread J

Recording

st B

st C

sub

ld B

add

st C

ld B

st A

st C

ld D

st D

Limit the strict dependencies (see paper)

i:4j:1 j:2 i:3 i:4

Replay Cycle

Results with Commercial Workloads

18Full-system Simulation Method

Commercial server hardware• GEMS: http://www.cs.wisc.edu/gems• Full-system (OS + application) executions• 4-core CMP (Sequential Consistent)

• 1-way in-order issue, 2 GHz, • 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory

Commercial server software• Apache – static web serving• SpecJBB – middleware• OLTP – TPC-C like• Zeus – static web serving

L2C0

C1

C3

C2

19Log Size: 1 byte/KI

0.0

0.5

1.0

1.5

2.0byte/core/KI

ApacheJBB OLTP Zeus AVG0

50

100

150

200KB/core/s

ApacheJBB OLTP Zeus AVG

Less buffer, longer recording, smaller logs

20RTR vs. Netzer’s TR

TR

RTR

0

20

40

60

80

100

ApacheJBB OLTP ZeusAVG

Lo

g

Siz

e 28% smaller log• TR was “optimal”

21Why Does RTR Work Well?

RTR•Instructions execute at similar speed•Dependencies are often “vectorizable”

A New Recorder

HardwareAcceleration

[ISCA’03]

1 Byte/Kilo-Instruction[ASPLOS’06]

LessHardware[ASPLOS’06]

SC & TSO[ASPLOS’06]

Result: One more step toward practical

“Less hardware” & “TSO” not covered• Equally important• More details in the paper

23Conclusion

Race recording Counter nondeterminism

RTR 1 byte/kilo-instruction•Based on Netzer’s transitive reduction•Create stricter dependencies•Vectorize dependencies to compress log•Avoid overly-strict hence no deadlock

Future work•Support snooping, SMT, replayer

top related