WeeFence:
Toward Making Fences Free in TSO
Yuelu Duan, Abdullah Muzahid, Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
ISCA, June 2013
Duan, Muzahid, Torrellas
Toward Making Fences Free
Fence: a Primitive for Parallelism
• Instruction inserted by programmers or compilers
• Prevents the compiler and HW from reordering memory accesses
Example: Write y; Fence; Read x; Read z
Read x and Read z cannot be observed by another processor until the accesses before the fence are finished:
• loads retired
• writes retired + drained from write buffer
Use of Fences
Concurrency coordination with low overhead (much less than locks)
• Programmers insert fences in codes with fine-grain sharing
– Work-stealing algorithm in Cilk
• Compilers insert fences in C++
– Programmer uses an intentional data race for performance and declares the variable as atomic
– Compiler inserts a fence after the access and does not reorder across it
– Hardware does not reorder across the fence
Expensive: the cost of a fence on a Xeon-based desktop is 20 to 200 cycles
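To make the fine-grain sharing case concrete, here is a minimal C++ sketch of a fixed-capacity work-stealing deque in the Chase-Lev style. It is not Cilk's actual implementation. The class name WorkDeque, its method names, and the capacity of 64 are assumptions made for this example. The part to notice is take(): a write to bottom_, then a fence, then a read of top_, which is the write, fence, read pattern whose cost the talk is about.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified work-stealing deque (Chase-Lev style, no resizing).
class WorkDeque {
public:
    // Owner only: add a task at the bottom.
    void push(int task) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed);
        assert(b - top_.load(std::memory_order_acquire) < kCapacity);  // not full
        slots_[b % kCapacity].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);
        bottom_.store(b + 1, std::memory_order_relaxed);
    }

    // Owner only: remove a task from the bottom. Returns false if none was obtained.
    bool take(int& out) {
        std::int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);             // write
        std::atomic_thread_fence(std::memory_order_seq_cst);     // fence
        std::int64_t t = top_.load(std::memory_order_relaxed);   // read
        if (t > b) {                                             // deque was empty
            bottom_.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        out = slots_[b % kCapacity].load(std::memory_order_relaxed);
        if (t == b) {  // last element: race against thieves with a CAS on top_
            bool won = top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed);
            bottom_.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }

    // Any other thread: remove a task from the top. Returns false if empty or if it lost a race.
    bool steal(int& out) {
        std::int64_t t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::int64_t b = bottom_.load(std::memory_order_acquire);
        if (t >= b) return false;
        int task = slots_[t % kCapacity].load(std::memory_order_relaxed);
        if (!top_.compare_exchange_strong(
                t, t + 1, std::memory_order_seq_cst, std::memory_order_relaxed)) {
            return false;
        }
        out = task;
        return true;
    }

private:
    static constexpr std::int64_t kCapacity = 64;  // assumed capacity for the sketch
    std::atomic<std::int64_t> top_{0};
    std::atomic<std::int64_t> bottom_{0};
    std::atomic<int> slots_[kCapacity];  // a slot is only read after it has been written
};
```

The owner thread calls push and take, and any other thread calls steal. Without the fence in take(), the owner's read of top_ could be performed before its write to bottom_ becomes visible, and the owner and a thief could both return the same task.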
What if Fences Were Free?
• Programmers could write faster fine-grained concurrent algorithms
• C++/Java programs could guarantee Sequential Consistency (SC) at
little performance cost:
– Programmers would declare all shared variables as atomic
– Hardware would skip fences while retaining correctness
Current Fences Perform Speculation
• Reads following fences can load data speculatively
– If no processor observes it, no problem
– If coherence transaction received, squash and retry
• Still: speculative reads cannot retire until the WB is drained
[Figure: ROB and WB (write buffer) contents over time for a Write (w1, w2), Fence (f), Read (r) sequence]
Proposal: WeeFence (or WFence)
• Eliminate any stall in the pipeline
• Post-fence read retires before the pre-fence writes have drained
– “Skip” the fence
Substantial gains when write misses pile up before the fence
[Figure: ROB and WB contents under WFence for a Write (w1, w2), Fence (f), Read (r) sequence, with the read in speculative execution]
But… Reordering Can Cause Incorrect Execution
Initially: x = y = 0
PA: A0: x = 1; fence; A1: t0 = y
PB: B0: y = 1; fence; B1: t1 = x
With fences: t0 = 1 or t1 = 1 or both = 1
With WFences, if the reads are simply reordered: t0 = t1 = 0 (order A1, B0, B1, A0)
Unintuitive bug: Sequential Consistency (SC) violation
[Figure: dependence cycle between PA (wr x, fence, rd y) and PB (wr y, fence, rd x)]
Solution: Stall reads if reordering can cause a dependence cycle
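The two-processor example on this slide can be checked by brute force. The C++ sketch below enumerates every global order of the four accesses A0, A1, B0, B1 and collects the reachable (t0, t1) pairs, once with each processor's write kept before its read (fences respected) and once with that constraint dropped. The function name outcomes and the encoding of the accesses are assumptions made for this illustration. It models only this four-access example, where the only possible reordering within a processor is a read passing the earlier write.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <set>
#include <utility>

// A0: x = 1, A1: t0 = y (processor PA); B0: y = 1, B1: t1 = x (processor PB).
enum Op { A0, A1, B0, B1 };

// Returns every (t0, t1) outcome reachable over all global orders of the
// four accesses. If fences_enforced is true, each processor's write must
// come before its read; otherwise the read may bypass the write.
std::set<std::pair<int, int>> outcomes(bool fences_enforced) {
    std::set<std::pair<int, int>> seen;
    std::array<int, 4> order = {{A0, A1, B0, B1}};  // sorted start for next_permutation
    do {
        auto pos = [&order](int op) {
            return std::find(order.begin(), order.end(), op) - order.begin();
        };
        if (fences_enforced && (pos(A0) > pos(A1) || pos(B0) > pos(B1))) {
            continue;  // this order violates a fence; try the next permutation
        }
        int x = 0, y = 0, t0 = -1, t1 = -1;
        for (int op : order) {
            switch (op) {
                case A0: x = 1; break;
                case A1: t0 = y; break;
                case B0: y = 1; break;
                case B1: t1 = x; break;
            }
        }
        assert(t0 != -1 && t1 != -1);  // both reads were performed
        seen.insert(std::make_pair(t0, t1));
    } while (std::next_permutation(order.begin(), order.end()));
    return seen;
}
```

With fences enforced, the pair (0, 0) is never produced. With the constraint dropped, it is reachable, for example through the order A1, B0, B1, A0 listed on the slide.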
How WFence Works
PS: Pending Set
BS: Bypass Set
[Figure: PA executes wr x, Wfence1, rd y; PB executes wr y, Wfence2, rd x. Labeled steps: (1) Wfence1 deposits its PS {x} in the Table; (2) rd y executes; (3) Wfence2 deposits its PS {y} in the Table; (4) local check of rd x against x; (5) rd x stalls]
How WFence Works (II)
PS: Pending Set
BS: Bypass Set
[Figure: PA executes wr x, Wfence1, rd y; PB executes wr y, wr x with no fence between them (no fence present in TSO). Labeled steps: (1) Wfence1 deposits its PS {x} in the Table; (2) rd y executes; (3) y enters the BS; (4) a coherence transaction that hits the BS causes a squash or a bounce]
Summary: How WFence Works
PS: Pending Set
BS: Bypass Set
[Figure: PA executes wr x, Wfence1, rd y. Labeled steps: (1) the PS holding x goes to the Table; (2) z comes back; (3) check; (4) y enters the BS; (5) rd y executes; (6) squash or bounce]
Implementation labels in the figure:
• Table: Global Reorder Table (GRT) in shared memory (signatures)
• Register in the processor (signature)
• List of addresses in the cache
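The steps on these slides can be approximated in software. The C++ sketch below is a toy model, not the hardware design: it uses exact address sets where the slides mention signatures, and every name in it (Grt, Processor, remote_ps, post_fence_read, incoming_write, fence_complete) is an assumption made for illustration. It models three actions: depositing a PS in the table and getting back the other processors' pending addresses, checking a post-fence read locally against them, and rejecting an incoming coherence write that hits the BS. Retrying a stalled read and choosing between squash and bounce are not modeled.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Addr = std::string;
using AddrSet = std::set<Addr>;

// Hypothetical stand-in for the Global Reorder Table (GRT).
struct Grt {
    std::map<int, AddrSet> pending;  // processor id -> Pending Set (PS)

    // Store this fence's PS and return the union of all other processors' PSs.
    AddrSet deposit(int pid, const AddrSet& ps) {
        AddrSet remote;
        for (const auto& entry : pending) {
            if (entry.first != pid) {
                remote.insert(entry.second.begin(), entry.second.end());
            }
        }
        pending[pid] = ps;
        return remote;
    }

    void remove(int pid) { pending.erase(pid); }
};

enum class ReadResult { Executed, Stalled };
enum class WriteResult { Allowed, Bounced };

struct Processor {
    explicit Processor(int pid) : id(pid) {}

    // WFence executes: deposit the pending pre-fence writes, keep the reply locally.
    void wfence(Grt& grt, const AddrSet& pending_writes) {
        remote_ps = grt.deposit(id, pending_writes);
        fence_active = true;
    }

    // Post-fence read: local check against the other processors' pending writes.
    ReadResult post_fence_read(const Addr& a) {
        assert(fence_active);
        if (remote_ps.count(a) != 0) return ReadResult::Stalled;
        bypass_set.insert(a);  // the read bypassed the fence: remember it in the BS
        return ReadResult::Executed;
    }

    // Coherence request from another processor that wants to write address a.
    WriteResult incoming_write(const Addr& a) const {
        return bypass_set.count(a) != 0 ? WriteResult::Bounced : WriteResult::Allowed;
    }

    // The pre-fence writes have drained: the fence completes and state is cleared.
    void fence_complete(Grt& grt) {
        grt.remove(id);
        remote_ps.clear();
        bypass_set.clear();
        fence_active = false;
    }

    int id;
    AddrSet remote_ps;   // reply from the table
    AddrSet bypass_set;  // BS
    bool fence_active = false;
};
```

Replaying the two-processor example: PA deposits {x} and its rd y executes, then PB deposits {y}, receives {x}, and its rd x stalls. A write to y arriving at PA is bounced until PA's fence completes.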
WFence
• Cycles are rare: WFence typically executes without stalling the processor
– No reordering constraints
• Compatible with conventional fences
• Handles cycles involving any number of processors
• No compiler support needed: off-the-shelf executable
[Figure: PA executes wr x, Wfence1, rd y; PB executes wr y, wr x]
Distributed Global Reorder Table (GRT)
• Distribute the GRT like the directory, into modules with address ranges
• WFence works as usual if its PS communicates with a single GRT module
– Most common case due to locality (first-touch page allocation)
• Otherwise, it reverts to a conventional fence
– Eliminates potential protocol races
Small machine: GRT associated with the bus controller
Larger machine: Distributed GRT
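The fallback rule for the distributed GRT can be sketched as a small decision function in C++. The mapping from address to module used here (fixed-size address ranges assigned round-robin to modules) and the names grt_module, choose_fence, and FenceKind are assumptions made for illustration. The slides only say that modules cover address ranges, like the directory. Treating an empty PS as a WFence is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class FenceKind { WFence, Conventional };

// Hypothetical mapping: consecutive ranges of range_bytes go to modules round-robin.
std::size_t grt_module(std::uint64_t addr, std::uint64_t range_bytes,
                       std::size_t num_modules) {
    assert(range_bytes > 0 && num_modules > 0);
    return static_cast<std::size_t>((addr / range_bytes) % num_modules);
}

// A fence whose Pending Set maps to one GRT module runs as a WFence;
// otherwise it reverts to a conventional fence.
FenceKind choose_fence(const std::vector<std::uint64_t>& pending_set,
                       std::uint64_t range_bytes, std::size_t num_modules) {
    for (std::size_t i = 1; i < pending_set.size(); ++i) {
        if (grt_module(pending_set[i], range_bytes, num_modules) !=
            grt_module(pending_set[0], range_bytes, num_modules)) {
            return FenceKind::Conventional;
        }
    }
    return FenceKind::WFence;  // zero or one module involved
}
```

With 4 KB ranges and four modules, pending writes to 0x1000 and 0x1008 fall in the same module and keep the WFence. Pending writes to 0x1000 and 0x2000 fall in different modules and revert to a conventional fence.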
Evaluation
• Simulations of 8-core multicore (centralized and distributed GRT)
• Experiment with kernels (Peterson, Worksteal…)
– Kernels have explicit fences
– Goal: Remove all the fence stall time with WFence
• Experiment with applications (SPLASH-2 and Parsec)
– A compiler pass conservatively inserts fences to guarantee SC
– Goal: The resulting fences, if implemented with WFence, induce
negligible overhead
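For readers unfamiliar with the kind of kernel evaluated here, the following is a minimal C++ sketch of Peterson's two-thread lock. It is not the code used in the evaluation, and the class and member names are assumptions. It uses sequentially consistent atomics instead of a hand-placed fence instruction. On x86, compilers typically implement such stores with xchg or with a mov followed by mfence, so a full fence still separates the two writes in lock() from the reads in the spin loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch of Peterson's algorithm for threads 0 and 1.
class PetersonLock {
public:
    PetersonLock() {
        interested_[0].store(false);
        interested_[1].store(false);
        turn_.store(0);
    }

    void lock(int me) {
        assert(me == 0 || me == 1);
        const int other = 1 - me;
        interested_[me].store(true);  // write (seq_cst)
        turn_.store(other);           // write (seq_cst): let the other thread go first
        // The reads below must not be performed before the writes above are visible.
        while (interested_[other].load() && turn_.load() == other) {
            // spin
        }
    }

    void unlock(int me) {
        assert(me == 0 || me == 1);
        interested_[me].store(false);
    }

    bool is_interested(int id) const { return interested_[id].load(); }

private:
    std::atomic<bool> interested_[2];
    std::atomic<int> turn_;
};
```

Each call to lock() pays for the fence even when the other thread is not contending, which is the stall time the kernel experiments aim to remove.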
Kernels with Fences: Execution Time
WFence eliminates most of the fence stall time (11% reduction in execution time)
[Figure: execution time of the kernels, Baseline vs. WFence]
Applications with Fences for SC: Execution Overhead
• Fences for SC induce only 2% overhead (rather than 36%)
• Disabled compiler optimizations add an additional 4%
• With distributed GRT, 2% additional overhead
Conclusions
• Today’s fences are expensive. If they were free:
– Programmers could write faster fine-grained algorithms
– C++/Java compilers could guarantee SC at little cost
• WFence:
– Executes without stalling the processor (cycles are rare)
– Compatible with conventional fences
– No compiler support needed: off-the-shelf executable
– Effective:
• Eliminates fence stall from kernels (11% reduction in execution time)
• Supports SC in applications with only 2% overhead
• Our Future Work: Show how WFence can help parallel programming
Also in the Paper
• Show that deadlock is not possible
• Value forwarding
• Multiple WFences per thread
• Application to release consistency
• Detailed characterization of WFences
• Scalability with the number of processors
Timeline
[Figure: two timelines for a Write, Fence, Read sequence on PA.
Events shown for WFence: W, Fence, R enter ROB; WFence executes; round trip to GRT; response to WFence; R completes (Spec); WFence retires; R reaches ROB head and retires; W retires; W completes; WFence completes.
Events shown for a conventional fence: W, Fence, R enter ROB; R completes (Spec); R remains speculative; W retires; W completes; WB drains; Fence retires and completes; R retires.
The figure marks the difference in R retirement times.]
Execution Overhead: Distributed GRT
Distributed GRT
Intel Processor