Evaluating Synchronization on Shared Address Space Multiprocessors: Methodology & Performance

Sanjeev Kumar

Dongming Jiang

Rohit Chandra

Jaswinder Pal Singh

Classic Study on Synchronization

• Software Algorithms for Locks and Barriers [Mellor-Crummey et al., TOCS’91]

– Multiprocessor machines: BBN Butterfly, Sequent Symmetry

– Microbenchmarks

• Little benefit from special hardware support

– Handle memory/network contention in software

Case for Hardware Support

• Fetch&Op [Laudon et al., ISCA’97]

– Origin 2000

– Microbenchmarks (Counter & Barrier)

• QOLB [Kagi et al., ISCA’97]

– Simulations

– Microbenchmarks & Applications (Locks)

• Better performance with Hardware Support

Our Study

• Re-examine synchronization

– 64-processor Origin 2000

  • New architectures: CC-NUMA

  • New primitives: LL-SC

– Applications (SPLASH-2) and microbenchmarks

• Applications: little benefit from H/W support

– Locks: small performance benefit sometimes

– Barriers: load imbalance dominates

Outline

• Background

• Performance evaluation: Microbenchmarks

– Synchronization primitives on Origin 2000

– Lock and Barrier algorithms and performance

• Performance evaluation: Applications

• Is further hardware support valuable?

• Conclusions

Synchronization Primitives on Origin 2000

• LL-SC

– 2 instructions, cached

– Flexible

• Fetch&Op

– Special locations, uncached

– Inflexible (e.g. no atomic swap)

• Performance tradeoffs: Wait

– LL-SC: spinning in cache (cache coherence)

– Fetch&Op: spinning generates traffic (no cache coherence)

• Performance tradeoffs: Atomic update

– LL-SC: contention causes retries

– Fetch&Op: contention at memory
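The atomic-update tradeoff can be illustrated with a minimal C++ sketch. This is not Origin 2000 code: `std::atomic` stands in for both primitives. The `compare_exchange_weak` retry loop plays the role of an LL-SC sequence (it is allowed to fail spuriously, which is how it maps onto LL-SC hardware), and `fetch_add` plays the role of a Fetch&Op increment. All function names here are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// LL-SC style increment: read, compute, try to commit, retry on failure.
// Returns the value observed before the increment.
int llsc_style_increment(std::atomic<int>& x, long& retries) {
    int old = x.load();
    // On failure compare_exchange_weak reloads `old`, so we simply loop.
    while (!x.compare_exchange_weak(old, old + 1)) {
        ++retries;  // contention shows up as retries
    }
    return old;
}

// Fetch&Op style increment: one atomic operation, no retry loop in software.
int fetchop_style_increment(std::atomic<int>& x) {
    return x.fetch_add(1);
}

// Hypothetical driver: each thread increments both counters `iters` times.
// Returns true when neither counter lost an update.
bool run_increment_demo(int threads, int iters) {
    std::atomic<int> a{0}, b{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            long retries = 0;
            for (int i = 0; i < iters; ++i) {
                llsc_style_increment(a, retries);
                fetchop_style_increment(b);
            }
        });
    }
    for (auto& th : pool) th.join();
    return a.load() == threads * iters && b.load() == threads * iters;
}
```

The retry loop is also what makes LL-SC flexible: any read-modify-write can be placed between the load and the conditional store, whereas a Fetch&Op location supports only its fixed set of operations.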

Lock Algorithms (1)

• Simple

– One location

– Atomic test-and-set: LL-SC or Fetch&Op

[Figure: Simple lock. Four processors (P) poll a single "Available?" location, which currently reads "No".]
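A minimal sketch of the simple lock, assuming C++ `std::atomic` as a stand-in for the hardware test-and-set (the study itself used LL-SC or Fetch&Op on the Origin 2000). The class name and the driver function are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SimpleLock {
    std::atomic<bool> held{false};  // the single shared location
public:
    void lock() {
        // exchange() acts as test-and-set: it writes true and returns the old value.
        while (held.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // every waiter keeps polling the same location
        }
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Hypothetical driver: a plain counter protected by the lock.
long run_simple_lock_demo(int threads, int iters) {
    SimpleLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Because all waiters hammer one location, this algorithm is the one most exposed to contention as the processor count grows.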

Lock Algorithms (2)

• Ticket

– Like in a bakery

– Proportional backoff

– Atomic fetch-and-increment: LL-SC or Fetch&Op

[Figure: Ticket lock. Next-Ticket = 132, Now-Serving = 125; the four processors (P) are labeled 126, 127, 132 and 125.]
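The ticket lock with proportional backoff can be sketched as follows. This is an illustrative C++ version, not the code measured in the study: `fetch_add` stands in for the atomic fetch-and-increment, and `kBackoffUnit` and the use of `yield()` as the delay are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
    static constexpr unsigned kBackoffUnit = 1;  // hypothetical tuning constant
public:
    void lock() {
        const unsigned my = next_ticket.fetch_add(1);  // take a ticket
        for (;;) {
            const unsigned serving = now_serving.load(std::memory_order_acquire);
            if (serving == my) return;  // our turn
            // Proportional backoff: wait longer the further back we are in line.
            const unsigned ahead = my - serving;  // unsigned math handles wraparound
            for (unsigned i = 0; i < ahead * kBackoffUnit; ++i) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() { now_serving.fetch_add(1, std::memory_order_release); }
};

// Hypothetical driver: a plain counter protected by the lock.
long run_ticket_lock_demo(int threads, int iters) {
    TicketLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Only one atomic update is needed per acquire; waiting is done with ordinary reads of `now_serving`, spaced out by the backoff.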

Lock Algorithms (3)

• MCS

– Queuing

– Local spinning

– Atomic compare-and-swap: LL-SC, not Fetch&Op

[Figure: MCS queuing lock. Four processors (P) and a Queue pointer; the flag locations shown hold 0.]
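A compact sketch of an MCS-style queue lock, written with C++ `std::atomic` under default (sequentially consistent) ordering for simplicity. It follows the usual textbook formulation rather than the code used in the study; `MCSNode`, `MCSLock` and the driver are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};  // last node in the waiting queue
public:
    void lock(MCSNode& me) {
        me.next.store(nullptr);
        me.locked.store(true);
        MCSNode* pred = tail.exchange(&me);  // atomic swap: append ourselves
        if (pred != nullptr) {
            pred->next.store(&me);           // link behind the predecessor
            while (me.locked.load()) {       // local spinning on our own node
                std::this_thread::yield();
            }
        }
    }
    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load();
        if (succ == nullptr) {
            MCSNode* expected = &me;
            // Compare-and-swap: if we are still the tail, the queue becomes empty.
            if (tail.compare_exchange_strong(expected, nullptr)) return;
            // A successor swapped itself in but has not linked yet; wait for the link.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false);  // hand the lock to the successor
    }
};

// Hypothetical driver: each thread owns one queue node.
long run_mcs_lock_demo(int threads, int iters) {
    MCSLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            MCSNode node;
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;  // critical section
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The swap and compare-and-swap on `tail` are the operations that need LL-SC here, which matches the slide's note that this lock cannot be built from Fetch&Op.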

Lock-Delay Microbenchmark

[Figure: Lock-Delay microbenchmark. x-axis: Processors (1 to 64); y-axis: Time (us), 0 to 12. Series: Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), TicketProp (Fetch&Op).]

Barrier Algorithms (1)

• Central

– Increment a counter

– Wait on a location

[Figure: Central barrier. Four processors (P) share an "Arrived" counter (value 5) and a "Go" flag (value No).]
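The central barrier can be sketched as a counter plus a "go" location. The version below is a minimal C++ illustration, not the study's code; the sense-reversal trick used to make the barrier reusable is an assumption added for the example, as are all names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class CentralBarrier {
    const int n;
    std::atomic<int> arrived{0};  // the shared counter
    std::atomic<bool> go{false};  // the location everyone waits on
public:
    explicit CentralBarrier(int threads) : n(threads) {}

    // local_sense is per-thread state and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (arrived.fetch_add(1) + 1 == n) {
            arrived.store(0);      // last arriver resets the counter...
            go.store(local_sense); // ...and then releases everyone
        } else {
            while (go.load() != local_sense) std::this_thread::yield();
        }
    }
};

// Hypothetical driver: after each barrier, every thread must see all arrivals
// for that phase. Returns true if that held in every phase.
bool run_central_barrier_demo(int threads, int phases) {
    CentralBarrier barrier(threads);
    std::vector<std::atomic<int>> count(phases);
    for (auto& c : count) c.store(0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            bool sense = false;
            for (int p = 0; p < phases; ++p) {
                count[p].fetch_add(1);
                barrier.wait(sense);
                if (count[p].load() != threads) ok.store(false);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

Both the counter and the flag are single hot locations, so every processor contends for the same two addresses.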


Barrier Algorithms (2)

• Tournament

– Tree of locations

– Spin on different locations

– Avoid hotspot and contention

[Figure: Tournament barrier. Four processors (P) at the leaves of a tree of flag locations, all shown holding 0.]
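A simplified tournament barrier can be sketched as below. This is an illustrative C++ version under stated assumptions, not the implementation measured in the study: the thread count must be a power of two, flags carry a per-thread episode number instead of a sense bit, and all names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TournamentBarrier {
    int n, rounds;
    std::vector<std::atomic<int>> arrive;   // arrive[r * n + w]: round-r loser signals winner w
    std::vector<std::atomic<int>> release;  // release[i]: thread i spins here after losing
    std::vector<int> epoch;                 // per-thread episode count (own entry only)

    static int log2_of(int v) { int r = 0; while ((1 << r) < v) ++r; return r; }
public:
    // Assumption: `threads` is a power of two.
    explicit TournamentBarrier(int threads)
        : n(threads), rounds(log2_of(threads)),
          arrive(rounds * threads), release(threads), epoch(threads, 0) {
        for (auto& a : arrive) a.store(0);
        for (auto& r : release) r.store(0);
    }

    void wait(int id) {
        const int e = ++epoch[id];
        int r = 0;
        for (; r < rounds; ++r) {
            const int stride = 1 << r;
            if (id % (2 * stride) == 0) {
                // Winner of round r: wait until the paired loser has arrived.
                while (arrive[r * n + id].load() != e) std::this_thread::yield();
            } else {
                // Loser of round r: tell the winner, then spin on our own flag.
                arrive[r * n + (id - stride)].store(e);
                while (release[id].load() != e) std::this_thread::yield();
                break;
            }
        }
        // Wake-up: release the losers this thread beat, highest round first.
        for (int k = r - 1; k >= 0; --k) release[id + (1 << k)].store(e);
    }
};

// Hypothetical driver: after each barrier, every thread must see all arrivals.
bool run_tournament_barrier_demo(int threads, int phases) {
    TournamentBarrier barrier(threads);
    std::vector<std::atomic<int>> count(phases);
    for (auto& c : count) c.store(0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (int p = 0; p < phases; ++p) {
                count[p].fetch_add(1);
                barrier.wait(t);
                if (count[p].load() != threads) ok.store(false);
            }
        });
    }
    for (auto& th : pool) th.join();
    return ok.load();
}
```

Each flag has exactly one writer and one reader per episode, so no single location becomes a hotspot. The price is a longer critical path of log2(P) rounds up the tree and back down.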


Barrier-Null Microbenchmark

[Figure: Barrier-Null microbenchmark. x-axis: Processors (1 to 64); y-axis: Time (us), 0 to 160. Series: Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), Central Hybrid (LL-SC, Fetch&Op).]

Microbenchmarks Summary

• LL-SC

– Simplest algorithms perform poorly, e.g. Simple lock and Central barrier

– Smarter algorithms perform much better

• Fetch&Op supports faster synchronization

Outline

• Background

• Performance evaluation: Microbenchmarks

• Performance evaluation: Applications

• Is further hardware support valuable?

• Conclusions

Choosing Applications: Methodology

• Applications from SPLASH-2

– Undo optimizations (added locks and barriers)

• Problem size

– At least 25-fold speedup on 64 processors

• Base case

– Best LL-SC lock and barrier

Base Performance

[Figure: Execution time breakdown (0 to 1) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial and Water-NSq. Components: Compute+Communication, Lock, Barrier.]

[Figure: Speedups (0 to 64) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial and Water-NSq.]

Application performance using Different Locks

• Better algorithm helps

• Fetch&Op traffic hurts

[Figure: Speedups normalized to the base case (MCS, LL-SC) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq and uBench. y-axis: 0 to 1.2; one off-scale bar is labeled 1.65. Series: Simple (LL-SC), TicketProp (LL-SC), TicketProp (Fetch&Op).]

Application performance using Different Barriers

• Load-imbalance dominates

• Fetch&Op traffic hurts

[Figure: Speedups normalized to the base case (Tournament, LL-SC) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq and uBench. y-axis: 0 to 1.2; two off-scale bars are labeled 1.52 and 2.68. Series: Central (LL-SC), Central (Fetch&Op), Central (Hybrid).]

Applications Summary

• LL-SC

– Locks: better algorithm helps

– Barriers: load imbalance dominates

• Fetch&Op

– Traffic due to spinning hurts performance

• Different from the microbenchmarks

Outline

• Background

• Performance evaluation: Microbenchmarks

• Performance evaluation: Applications

• Is further hardware support valuable?

– Locks

– Barriers

• Conclusions

Sensitivity to Lock Performance

Adding round-trip network delays

[Figure: Normalized speedups (0 to 1.2) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial and Water-Nsq as lock cost grows. x-axis: Base, 1 Round-trip, 2 Round-trips, 3 Round-trips, 4 Round-trips. The Raytrace and Radiosity curves are called out.]

Extrapolate: 20-30% improvement from better hardware

When do faster locks help Applications?

• Applications sensitive to lock performance

– Raytrace, Radiosity (~20-30%)

• Substantial time in synchronization

• Small contended critical sections

– Critical section size = actual + lock overhead

• Lock overhead dilates the critical section

• Effect on performance depends on size of critical section

– 2 apps: ~5 us (1-2 updates to shared locations)
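The dilation argument can be written out as a short worked relation. The symbols below are notation introduced here for illustration and do not appear in the slides: $T_{\text{actual}}$ is the work done inside the critical section, $T_{\text{lock}}$ the lock overhead on the critical path, and $N$ the number of processors contending for the lock.

```latex
T_{\text{effective}} = T_{\text{actual}} + T_{\text{lock}},
\qquad
\frac{T_{\text{effective}}}{T_{\text{actual}}} = 1 + \frac{T_{\text{lock}}}{T_{\text{actual}}},
\qquad
T_{\text{serialized}} \approx N \cdot T_{\text{effective}}
```

When $T_{\text{actual}}$ is only a few microseconds, a lock overhead of comparable size changes the ratio noticeably. If, hypothetically, $T_{\text{lock}} = T_{\text{actual}}$, the effective critical section doubles, and so does the serialized time under full contention. For long critical sections the same overhead is negligible.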

Can we fix contention problems in these cases in the Application?

• Yes. Fix was fairly easy

– Raytrace

  • Global counter → Partial reductions

– Radiosity

  • Single buffer allocation queue → Multiple queues

  • Tasks added to local queue → Distribute

• Significant performance improvement

– Raytrace: 90%, Radiosity: 220%
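The "global counter to partial reductions" idea can be sketched in C++ as follows. This is a generic illustration of the technique, not the actual Raytrace change; both function names are hypothetical, and `std::mutex` stands in for the application's lock.

```cpp
#include <cassert>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Before: every update goes through one lock-protected global counter.
long count_with_global_lock(int threads, int iters) {
    std::mutex m;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> guard(m);  // small, contended critical section
                ++total;
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}

// After: each thread accumulates privately; partial sums are combined once.
long count_with_partial_reductions(int threads, int iters) {
    std::vector<long> partial(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&partial, t, iters] {
            long local = 0;
            for (int i = 0; i < iters; ++i) ++local;  // no synchronization needed
            partial[t] = local;                        // one write per thread
        });
    }
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Both versions compute the same total, but the second removes the lock from the inner loop entirely, so lock speed no longer matters for that code path.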

Barriers

• Load-imbalance dominates

• Other applications

– Well-balanced with little communication

  • Like the microbenchmarks; real applications?

– Well-balanced computation & communication

  • SOR: nearest neighbor on a grid

  • Barriers: 61% of execution time

  • Imbalance still dominates: communication imbalance

Summary & Conclusions

• Fetch&Op does not help applications

– At least for well-known lock & barrier algorithms

• Using applications is important

• Little benefit from hardware support

– Locks: helps sometimes, but fixable in the application

– Barriers: load imbalance dominates

• Sound methodology

Tournament barrier with Fetch&Op

• Worse performance

– Preliminary measurements indicated worse overhead in addition to traffic

• Barrier performance did not make a difference in the applications

Small problem size

• Raytrace : Decreases lock time

• Barnes : Load-imbalance increases

• Water-Nsq : Load-imbalance and Serialization

• Ocean & SOR : Barrier time remains same

• Radiosity & Water-Spatial : Not available

SOR Breakdown

[Figure: SOR execution time breakdown (0 to 1) per processor. x-axis: Processors (1 to 64). Components: Compute + Communication, Barrier.]

Load imbalance dominates time spent in barriers
