WeeFence: Toward Making Fences Free in TSO Yuelu Duan, Abdullah Muzahid, Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu ISCA, June 2013


Page 1: WeeFence: Toward Making Fences Free in TSO

WeeFence:

Toward Making Fences Free in TSO

Yuelu Duan, Abdullah Muzahid, Josep Torrellas

Department of Computer Science

University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

ISCA, June 2013

Page 2: WeeFence: Toward Making Fences Free in TSO

Duan, Muzahid, Torrellas

Toward Making Fences Free

Fence: a Primitive for Parallelism

• Instruction inserted by programmers or compilers

• Prevents the compiler and HW from reordering memory accesses

Example sequence: Write y; Fence; Read x; Read z

• The fence waits until the accesses before it are finished:

– loads retired

– writes retired + drained from write buffer

• Until then, the accesses after the fence cannot be observed by another processor
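As a rough illustration of the sequence on this slide, the sketch below writes it in C++ with `std::atomic_thread_fence`. The variables `x`, `y`, `z` come from the slide; the function name `write_fence_read` is hypothetical, and the choice of a sequentially consistent fence over relaxed accesses is an assumption about how such a fence would be expressed in source code.

```cpp
#include <atomic>

// Shared locations named after the slide's example.
std::atomic<int> x{0}, y{0}, z{0};

// Hypothetical helper that performs "Write y; Fence; Read x; Read z".
// The sequentially consistent fence separates the write to y from the
// two reads that follow it.
int write_fence_read() {
    y.store(1, std::memory_order_relaxed);                // Write y
    std::atomic_thread_fence(std::memory_order_seq_cst);  // Fence
    int rx = x.load(std::memory_order_relaxed);           // Read x
    int rz = z.load(std::memory_order_relaxed);           // Read z
    return rx + rz;
}
```

On x86 a compiler typically lowers this fence to a full-barrier instruction, which is where the write-buffer drain described above takes place.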

Page 3: WeeFence: Toward Making Fences Free in TSO


Use of Fences

• Programmers insert fences in codes with fine-grain sharing

– Work-stealing algorithm in Cilk

• Compilers insert fences in C++

– A programmer who uses an intentional data race for performance declares the variable as atomic

– Compiler inserts a fence after the access and does not reorder it

– Hardware does not reorder across the fence

Concurrency coordination with low overhead (much less than locks)

Expensive: the cost of a fence in a Xeon-based desktop is 20 to 200 cycles
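The compiler-inserted case can be sketched in C++ as follows. The names `ready`, `payload`, `producer`, and `consumer` are hypothetical; the only point is that declaring the racy variable as `std::atomic` with the default ordering leaves it to the compiler to emit whatever fence the target needs.

```cpp
#include <atomic>

// Hypothetical flag that two threads race on intentionally.
std::atomic<bool> ready{false};
int payload = 0;

// Writer side: publish the payload, then set the flag.
void producer() {
    payload = 42;
    ready.store(true);  // default ordering is std::memory_order_seq_cst
}

// Reader side: returns true and fills `out` only once the flag is set.
bool consumer(int& out) {
    if (!ready.load()) return false;  // default seq_cst load
    out = payload;
    return true;
}
```

Because the ordering is attached to the declaration and its accesses, the programmer writes no explicit fence; the cost shows up only in the generated code.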

Page 4: WeeFence: Toward Making Fences Free in TSO


What if Fences Were Free?

• Programmers could write faster fine-grained concurrent algorithms

• C++/Java programs could guarantee Sequential Consistency (SC) at little performance cost:

– Programmers would declare all shared variables as atomic

– Hardware would skip fences while retaining correctness

Page 5: WeeFence: Toward Making Fences Free in TSO


Current Fences Perform Speculation

• Reads following fences can load data speculatively

– If no processor observes it, no problem

– If coherence transaction received, squash and retry

• Still: speculative reads cannot retire until the WB is drained

[Figure: snapshots of the ROB and the WB (write buffer) for the sequence Write (w1, w2), Fence (f), Read (r)]

Page 6: WeeFence: Toward Making Fences Free in TSO


Proposal: WeeFence (or WFence)

• Eliminate any stall in the pipeline

• Post-fence read retires before the pre-fence writes have drained

– “Skip” the fence

Substantial gains when write misses pile up before the fence

[Figure: snapshots of the ROB and the WB for the sequence Write (w1, w2), Fence (f), Read (r), with the read marked "Spec execution"]

Page 7: WeeFence: Toward Making Fences Free in TSO


But… Reordering Can Cause Incorrect Execution

Initially: x = y = 0

PA                 PB
A0: x = 1          B0: y = 1
fence              fence
A1: t0 = y         B1: t1 = x

With fences: t0 = 1 or t1 = 1 or both = 1

With WFences: the reads can bypass the pending writes, for example in the order A1, B0, B1, A0, which gives t0 = t1 = 0

Unintuitive bug: Sequential Consistency (SC) violation

Solution: Stall reads if reordering can cause a dependence cycle
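The two-processor example can be replayed with a toy model of TSO write buffers. This is only a sketch of one interleaving, not a memory-model checker: `Core`, `Memory`, `Outcome`, and `run` are hypothetical names, and a conventional fence is modeled simply as draining the write buffer before the next read.

```cpp
#include <deque>
#include <map>
#include <string>
#include <utility>

using Memory = std::map<std::string, int>;

// One core with a FIFO write buffer, as in TSO.
struct Core {
    std::deque<std::pair<std::string, int>> wb;  // pending writes, oldest first

    void write(const std::string& addr, int v) { wb.emplace_back(addr, v); }

    // A read forwards from the core's own youngest buffered write to the
    // same address, and otherwise reads shared memory.
    int read(const Memory& mem, const std::string& addr) const {
        for (auto it = wb.rbegin(); it != wb.rend(); ++it)
            if (it->first == addr) return it->second;
        auto m = mem.find(addr);
        return m == mem.end() ? 0 : m->second;
    }

    // Draining makes every buffered write globally visible, in order.
    void drain(Memory& mem) {
        for (const auto& w : wb) mem[w.first] = w.second;
        wb.clear();
    }
};

struct Outcome { int t0; int t1; };

// Replays A0, A1, B0, B1 in that issue order. With use_fence == true each
// core drains its write buffer before its read; with false the read is
// performed while the core's own write is still buffered.
Outcome run(bool use_fence) {
    Memory mem{{"x", 0}, {"y", 0}};
    Core pa, pb;
    pa.write("x", 1);              // A0: x = 1 (buffered)
    if (use_fence) pa.drain(mem);  // fence in PA
    int t0 = pa.read(mem, "y");    // A1: t0 = y
    pb.write("y", 1);              // B0: y = 1 (buffered)
    if (use_fence) pb.drain(mem);  // fence in PB
    int t1 = pb.read(mem, "x");    // B1: t1 = x
    pa.drain(mem);
    pb.drain(mem);
    return {t0, t1};
}
```

In this model `run(true)` ends with t1 = 1, while `run(false)` ends with t0 = t1 = 0, because the write to x becomes visible only after both reads, matching the order A1, B0, B1, A0.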

Page 8: WeeFence: Toward Making Fences Free in TSO


How WFence Works

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

PB: wr y; Wfence2; rd x

[Figure labels: (1) PS = {x}; (2) execute]

Page 9: WeeFence: Toward Making Fences Free in TSO


How WFence Works

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

PB: wr y; Wfence2; rd x

[Figure labels: (1) PS = {x} sent to the Table; (2) execute; (3) PS = {y} sent to the Table; (4) local check; (5) stall]

Page 10: WeeFence: Toward Making Fences Free in TSO


How WFence Works (II)

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

PB: wr y; wr x (no fence present in TSO)

[Figure labels: (1) PS = {x} sent to the Table; (2) execute; (3) y added to the BS]

Page 11: WeeFence: Toward Making Fences Free in TSO


How WFence Works (II)

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

PB: wr y; wr x (no fence present in TSO)

[Figure labels: (1) PS = {x} sent to the Table; (2) execute; (3) y added to the BS; (4) coherence: squash or bounce]

Page 12: WeeFence: Toward Making Fences Free in TSO


Summary: How WFence Works

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

[Figure labels: (1) PS = {x} sent to the Table; (2) z (from the Table); (3) check; (4) y added to the BS; (5) execute; (6) squash or bounce]
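The numbered steps of the summary can be read as the following software model. It is a sketch of one interpretation of the step labels, not the hardware design: `ReorderTable`, `WFenceCore`, and all method names are hypothetical, addresses are plain strings, and exact sets stand in for whatever encoding the hardware uses.

```cpp
#include <map>
#include <set>
#include <string>

using Addr = std::string;
using AddrSet = std::set<Addr>;

// Toy stand-in for the Table: maps a core id to the Pending Set (PS)
// that the core's WFence deposited.
struct ReorderTable {
    std::map<int, AddrSet> pending;

    // Step (1): store this core's PS. Step (2): reply with the union of
    // the PSs currently deposited by all other cores.
    AddrSet deposit(int core, const AddrSet& ps) {
        AddrSet remote;
        for (const auto& entry : pending)
            if (entry.first != core)
                remote.insert(entry.second.begin(), entry.second.end());
        pending[core] = ps;
        return remote;
    }

    void remove(int core) { pending.erase(core); }
};

struct WFenceCore {
    int id;
    AddrSet remote_ps;  // reply received from the table
    AddrSet bypass;     // Bypass Set (BS): addresses read past the fence

    void wfence(ReorderTable& t, const AddrSet& ps) {
        remote_ps = t.deposit(id, ps);
    }

    // Steps (3)-(5): check the post-fence read against the remote PSs.
    // Returns false if the read must stall; otherwise records the address
    // in the BS and lets the read execute.
    bool try_read(const Addr& a) {
        if (remote_ps.count(a) != 0) return false;
        bypass.insert(a);
        return true;
    }

    // Step (6): an incoming coherence request for an address in the BS
    // has to be squashed or bounced.
    bool must_squash_or_bounce(const Addr& a) const {
        return bypass.count(a) != 0;
    }

    // Once the pre-fence writes have drained, the fence state is dropped.
    void complete(ReorderTable& t) {
        t.remove(id);
        bypass.clear();
        remote_ps.clear();
    }
};
```

Replaying the two-fence example with this model, PA deposits {x} and reads y freely, while PB, which deposits {y} afterwards, receives {x} back and therefore stalls its read of x.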

Page 13: WeeFence: Toward Making Fences Free in TSO


Summary: How WFence Works

PS: Pending Set

BS: Bypass Set

PA: wr x; Wfence1; rd y

[Figure labels: (1) PS = {x} sent to the Table; (2) z (from the Table); (3) check; (4) y added to the BS; (5) execute; (6) squash or bounce]

[Figure annotations: the Table is the Global Reorder Table (GRT), in shared memory (signatures); register in the processor (signature); list of addresses in the cache]
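The slide says some of these structures are kept as signatures but does not give the encoding. The sketch below assumes a Bloom-filter-style signature, a common way to summarize a set of addresses in a fixed number of bits. The class name, the 64-bit size, the 64-byte line granularity, and both hash functions are illustrative assumptions.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size address signature with two hash functions.
class Signature {
public:
    void insert(std::uint64_t addr) {
        bits_.set(hash1(addr));
        bits_.set(hash2(addr));
    }

    // True for every inserted address. May also be true for an address
    // that was never inserted (false positive), never the reverse.
    bool may_contain(std::uint64_t addr) const {
        return bits_.test(hash1(addr)) && bits_.test(hash2(addr));
    }

    void clear() { bits_.reset(); }

private:
    static constexpr std::size_t kBits = 64;

    // Both hashes drop the low 6 bits, so all bytes of an assumed
    // 64-byte cache line map to the same signature bits.
    static std::size_t hash1(std::uint64_t a) {
        return static_cast<std::size_t>((a >> 6) % kBits);
    }
    static std::size_t hash2(std::uint64_t a) {
        return static_cast<std::size_t>(((a >> 6) * 0x9E3779B97F4A7C15ULL) >> 58);
    }

    std::bitset<kBits> bits_;
};
```

With such an encoding a false positive can only cause an unnecessary stall or bounce; it cannot hide a real conflict, which is the property a conservative check needs.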

Page 14: WeeFence: Toward Making Fences Free in TSO


WFence

• Cycles are rare: WFence typically executes without stalling the processor

– No reordering constraints

• Compatible with conventional fences

• Handles cycles that involve any number of processors

• No compiler support needed: off-the-shelf executable

[Figure: PA: wr x; Wfence1; rd y. PB: wr y; wr x]

Page 15: WeeFence: Toward Making Fences Free in TSO


Distributed Global Reorder Table (GRT)

• Distribute the GRT like the directory, into modules with address ranges

• WFence works as usual if its PS communicates with a single GRT module

– Most common case due to locality (first-touch page allocation)

• Otherwise, it reverts to a conventional fence

– Eliminates potential protocol races

Small machine: GRT associated with the bus controller

Larger machine: Distributed GRT
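The single-module rule can be sketched as a small decision function. The mapping from addresses to GRT modules is an assumption (fixed-size address ranges assigned to modules round-robin), as are the names `grt_module` and `can_use_wfence`; the slide only states that modules own address ranges, like a directory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mapping: consecutive address ranges of range_bytes each are
// assigned to the GRT modules in round-robin order.
std::size_t grt_module(std::uint64_t addr, std::uint64_t range_bytes,
                       std::size_t num_modules) {
    return static_cast<std::size_t>((addr / range_bytes) % num_modules);
}

// Returns true if every address in the Pending Set maps to the same GRT
// module, in which case the fence can proceed as a WFence. Otherwise the
// fence would revert to a conventional fence.
bool can_use_wfence(const std::vector<std::uint64_t>& ps,
                    std::uint64_t range_bytes, std::size_t num_modules) {
    for (std::size_t i = 1; i < ps.size(); ++i) {
        if (grt_module(ps[i], range_bytes, num_modules) !=
            grt_module(ps[0], range_bytes, num_modules)) {
            return false;
        }
    }
    return true;
}
```

For example, with 4 KB ranges and four modules, a PS holding 0x1000 and 0x1040 stays within one module, while a PS holding 0x1000 and 0x2000 spans two and would fall back to a conventional fence.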

Page 16: WeeFence: Toward Making Fences Free in TSO


Evaluation

• Simulations of 8-core multicore (centralized and distributed GRT)

• Experiment with kernels (Peterson, Worksteal…)

– Kernels have explicit fences

– Goal: Remove all the fence stall time with WFence

• Experiment with applications (SPLASH-2 and Parsec)

– A compiler pass conservatively inserts fences to guarantee SC

– Goal: The resulting fences, if implemented with WFence, induce negligible overhead

Page 17: WeeFence: Toward Making Fences Free in TSO


Kernels with Fences: Execution Time

WFence eliminates most of the fence stall time (11% reduction in execution time)

[Figure: execution time of the kernels, Baseline vs. WFence]

Page 18: WeeFence: Toward Making Fences Free in TSO


Applications with Fences for SC: Execution Overhead

• Fences for SC induce only 2% overhead (rather than 36%)

• Disabled compiler optimizations add an additional 4%

• With distributed GRT, 2% additional overhead

Page 19: WeeFence: Toward Making Fences Free in TSO


Conclusions

• Today’s fences are expensive. If they were free:

– Programmers could write faster fine-grained algorithms

– C++/Java compilers could guarantee SC at little cost

• WFence:

– Executes without stalling the processor (cycles are rare)

– Compatible with conventional fences

– No compiler support needed: off-the-shelf executable

– Effective:

• Eliminates fence stall from kernels (11% execution reduction)

• Supports SC in applications with only 2% overhead

• Our Future Work: Show how WFence can help parallel programming


Page 21: WeeFence: Toward Making Fences Free in TSO


Also in the Paper

• Show that deadlock is not possible

• Value forwarding

• Multiple WFences per thread

• Application to release consistency

• Detailed characterization of WFences

• Scalability with the number of processors

Page 22: WeeFence: Toward Making Fences Free in TSO


Timeline

PA: Write (W); Fence; Read (R)

[Figure: two timelines that compare when R retires]

Events labeled on the Conventional Fence timeline: W, Fence, R enter ROB; R completes (Spec); W retires; W completes; WB drains; Fence retires and completes; R retires

Events labeled on the WFence timeline: WFence executes; round trip to GRT; response to WFence; R completes (Spec); W retires; WFence retires; R reaches ROB head and retires; R remains speculative; W completes; WFence completes

Highlighted interval: difference in R retirement times
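The difference in R retirement times can be put into a back-of-the-envelope model. Everything here is an assumption made for illustration: the struct `Timing`, its fields, and the rule that under a conventional fence R retires once it has completed and the write buffer has drained, while under WFence R retires once it has completed and the response from the GRT has arrived. The cycle counts used with it are made up, not measurements from the talk.

```cpp
#include <algorithm>

// All fields are cycle counts measured from the moment W, Fence, and R
// enter the ROB. The values are inputs chosen by the caller.
struct Timing {
    int read_done;   // R completes speculatively
    int wb_drained;  // W completes and the write buffer is drained
    int grt_reply;   // response to the WFence arrives from the GRT
};

// Conventional fence (assumed rule): R waits for the write buffer to drain.
int read_retire_conventional(const Timing& t) {
    return std::max(t.read_done, t.wb_drained);
}

// WFence (assumed rule): R waits only for the GRT round trip.
int read_retire_wfence(const Timing& t) {
    return std::max(t.read_done, t.grt_reply);
}

// Cycles by which R retires earlier under WFence in this model.
int retirement_gain(const Timing& t) {
    return read_retire_conventional(t) - read_retire_wfence(t);
}
```

With the hypothetical values read_done = 10, wb_drained = 200, and grt_reply = 30, the model gives retirement at cycle 200 for the conventional fence and cycle 30 for WFence, a gain of 170 cycles. The gain shrinks as the write miss gets shorter.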

Page 23: WeeFence: Toward Making Fences Free in TSO


Execution Overhead: Distributed GRT

[Figure: execution overhead with the distributed GRT]

Page 24: WeeFence: Toward Making Fences Free in TSO


Distributed GRT

Page 25: WeeFence: Toward Making Fences Free in TSO


Intel Processor