[email protected] Licentiate Thesis Seminar Sept 25, 2003
Licentiate Thesis Seminar
Uppsala University, 25/9 – 2003
Efficient Synchronization and Coherence for Nonuniform Communication Architectures
Zoran Radovic
[email protected]
Introduction: Cache
A cache ($) works as a “scratch pad” (Swedish: kladdpapper) between the processor (P) and memory.
Program: A = 5; B = A + 75
Resulting values: A: 5, B: 80
Introduction: Cache Coherence
[Figure: processors P1, P2 and P3 (web server, database server, etc.), each with a cache, share one memory holding A and B.]
A = 5; B = A + 75 gives A: 5 in all three caches and B: 80
A = A + 1 gives A: 6, which reaches the other caches by cache-to-cache transfer
Shared-memory programs rely on cache coherence: A:=0, A:=56, Y:=X, BARRIER, LOCK(CS), UNLOCK
Inside a Real Thing ...
Nonuniform Memory Access Architecture (NUMA)
Many NUMA optimizations have been proposed:
  Page migration speeds up accesses to “private” data
  Page replication speeds up reads of “shared” data
These do not help communication, e.g., cache-to-cache transfers
[Figure: two nodes, each with four processors (P) with caches ($) and a local memory, connected by a switch. Access time ratio: 1 for local memory, 2 – 10 for remote memory.]
A “new” property of NUMAs: Nonuniform Communication Architecture (NUCA)
NUCA examples (NUCA ratios):
  1992: Stanford DASH (~ 4.5)
  1996: Sequent NUMA-Q (~ 10)
  1999: Sun WildFire (~ 6)
  2000: Compaq DS-320 (~ 3.5)
  Future (Today): CMP, SMT (~ 10)
[Figure: the same two-node system. NUCA ratio: 1 within a node, 2 – 10 between nodes.]
NUCA optimizations are getting important for future architectures!
Outline
Introduction
NUCA Locks
  Paper A: RH Lock
  Paper B: HBO Locks
Beating the Real Thing …
  Paper C: DSZOOM – Software-based Shared Memory
  Paper D: THROOM – POSIX Front-end
  Paper E: SAIT & Write Permission Cache (WPC)
Contributions
Future Work
Synchronization Basics
Locks are used to protect critical section (CS) data
CS examples:
  Bank account status
  Global counters
  Number of on-line visitors
  …
Example:
  A:=0
  BARRIER
  LOCK(L)
  A:=A+1
  UNLOCK(L)
  LOCK(L)
  B:=A+5
  UNLOCK(L)
Synchronization Example
Locks are used to protect critical section (CS) data
[Figure: processors P1, P2, … P4, each with a cache ($), share one memory that holds the CS flag and the CS data.]
Waiting processors: Test / Spin, Test / Spin, Test / Spin
Lock holder: Lock CS flag (write a BUSY token to the flag), Update CS data, Unlock
Lock handover: the next processor does Lock CS flag, Update CS data, Unlock
“CS efficiency”
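The test/spin and "write a BUSY token to the flag" steps on this slide can be sketched as a test-and-test-and-set (TATAS) lock. This is a minimal illustration with C11 atomics, not the code used in the thesis; the type `tatas_lock_t`, the function names, and the token values `FREE` and `BUSY` are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { FREE = 0, BUSY = 1 };            /* hypothetical token values */

typedef struct { atomic_int flag; } tatas_lock_t;   /* the "CS flag" */

static void tatas_init(tatas_lock_t *l)
{
    atomic_init(&l->flag, FREE);
}

/* One attempt: returns true if the caller now owns the lock. */
static bool tatas_try_acquire(tatas_lock_t *l)
{
    /* "Test": a plain read that can be served from the local cache. */
    if (atomic_load_explicit(&l->flag, memory_order_relaxed) != FREE)
        return false;
    /* "Set": atomically write BUSY; we own the lock only if it was FREE. */
    return atomic_exchange_explicit(&l->flag, BUSY, memory_order_acquire) == FREE;
}

static void tatas_acquire(tatas_lock_t *l)
{
    while (!tatas_try_acquire(l))
        ;                               /* spin until the flag is taken */
}

static void tatas_release(tatas_lock_t *l)
{
    /* "Unlock": publish the CS data updates, then free the flag. */
    atomic_store_explicit(&l->flag, FREE, memory_order_release);
}
```

A thread would wrap its update of the CS data between `tatas_acquire(&l)` and `tatas_release(&l)`. Every release invalidates the flag in all spinning caches, which is the source of the test and invalidation traffic discussed on the following slides.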
Large System Synchronization
[Figure: three nodes (P1 … P4, P5 … P8, P9 … P12), each with caches ($) and a local memory, connected by a switch. One processor at a time does Lock, Update, Unlock while all the others keep issuing Test operations.]
Three problems under contention with Spin (Test&Set) locks:
  1) Test and invalidation traffic
  2) Lock handover
  3) CS efficiency
Vasaloppet: “Contention Problem in Sweden”
Traditional cross-country ski race, 90 km …
85.6533 km to go… CS
Spin Locks Under Contention
[Figure: Critical Section (CS) Cost versus Amount of Contention for spin locks and spin locks w/ backoff.]
IF (more contention) THEN less efficient CS …
“The more important the slower it runs…”
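The "spin locks w/ backoff" curve refers to waiters that pause between attempts so that they generate less traffic. A minimal sketch of one common variant, exponential backoff, is given below; the constants `BACKOFF_BASE` and `BACKOFF_CAP` and all function names are assumptions chosen for illustration, not values from the thesis.

```c
#include <stdatomic.h>

enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };   /* hypothetical tuning constants */

typedef struct { atomic_int flag; } backoff_lock_t;   /* 0 = free, 1 = busy */

/* Next delay: double the current one, but never exceed the cap. */
static unsigned next_backoff(unsigned delay)
{
    unsigned doubled = delay * 2;
    return doubled > BACKOFF_CAP ? BACKOFF_CAP : doubled;
}

/* Busy-wait for roughly `iterations` loop iterations without touching shared data. */
static void spin_delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

static void backoff_acquire(backoff_lock_t *l)
{
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        /* Test, then test-and-set, as in a plain TATAS lock. */
        if (atomic_load_explicit(&l->flag, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(&l->flag, 1, memory_order_acquire) == 0)
            return;                      /* lock acquired */
        spin_delay(delay);               /* stay off the interconnect for a while */
        delay = next_backoff(delay);
    }
}

static void backoff_release(backoff_lock_t *l)
{
    atomic_store_explicit(&l->flag, 0, memory_order_release);
}
```

Backoff lowers the traffic, but a waiter may still be pausing when the lock becomes free, so the lock handover can get slower as the delays grow.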
Making it Scalable: Queues …
  First-come, first-served order
  Starvation avoidance
  Maximal fairness
  Reduced traffic
Queue-based locks
  HW: QOLB ‘89
  SW: MCS ‘91
  SW: CLH ‘93
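To make the queue idea concrete, here is a compact sketch of an MCS-style software queue lock written with C11 atomics. It follows the well-known structure of that lock (each waiter spins on a flag in its own queue node), but it is an illustration only; the type and function names are chosen for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool locked;                /* true while this waiter must spin */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;   /* NULL = free */

static void mcs_init(mcs_lock_t *l)
{
    atomic_init(&l->tail, NULL);
}

static void mcs_acquire(mcs_lock_t *l, mcs_node_t *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Join the queue by swapping ourselves in as the new tail. */
    mcs_node_t *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred == NULL)
        return;                                    /* queue was empty: lock is ours */
    atomic_store_explicit(&pred->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;                                          /* spin on our own node only */
}

static void mcs_release(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is still enqueuing; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock to the next waiter in first-come, first-served order. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each thread passes its own `mcs_node_t` to both calls. Because the handover follows arrival order, the next owner is often in another node, which is the behaviour the NUCA locks later in the talk try to avoid.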
Queue-based Locks
[Figure: CS Cost versus Amount of Contention for spin locks, spin locks w/ backoff, and queue-based locks.]
IF (more contention) THEN constant CS cost …
Raytrace Speedup
[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) for TATAS and MCS on a 2-node Sun WildFire (WF), 14 + 14 CPUs, NUCA ratio = 6.]
This Thesis
[Figure: CS Cost versus Amount of Contention for spin locks, spin locks w/ backoff, queue-based locks, and NUCA locks.]
IF (more contention) THEN more efficient CS …
“The more important the faster it runs…”
Raytrace Speedup
[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) for TATAS, MCS and NUCA locks on a 2-node Sun WildFire (WF), 14 + 14 CPUs.]
NUCA Locks
[Figure: three nodes of processors (P) with caches ($) and local memories, connected by a switch. Lock/Unlock operations appear within one node; Test operations come from the other nodes.]
  1) Reduces traffic (one CPU per node is testing…)
  2) Improves lock handover
  3) More efficient CS (local traffic is cheaper)
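One way to obtain node-local handover is to store the holder's node id in the lock word and let waiters in other nodes back off longer than waiters in the holder's node. The sketch below illustrates that idea only. It is a simplified assumption-based example, not the RH or HBO algorithms from Papers A and B, and it does not implement the "one CPU per node is testing" throttling. The names `nuca_lock_t`, `LOCK_FREE_TOKEN`, `LOCAL_BACKOFF` and `REMOTE_BACKOFF` are hypothetical.

```c
#include <stdatomic.h>

enum { LOCK_FREE_TOKEN = -1 };                       /* hypothetical "free" value */
enum { LOCAL_BACKOFF = 16, REMOTE_BACKOFF = 512 };   /* hypothetical delays */

typedef struct { atomic_int owner_node; } nuca_lock_t;   /* holder's node id, or free */

static void nuca_init(nuca_lock_t *l)
{
    atomic_init(&l->owner_node, LOCK_FREE_TOKEN);
}

/* Choose the delay from where the current holder runs. */
static unsigned nuca_backoff_for(int holder_node, int my_node)
{
    return holder_node == my_node ? LOCAL_BACKOFF : REMOTE_BACKOFF;
}

static void nuca_delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

static void nuca_acquire(nuca_lock_t *l, int my_node)
{
    for (;;) {
        int seen = LOCK_FREE_TOKEN;
        /* Try to install our node id; on failure `seen` holds the holder's node id. */
        if (atomic_compare_exchange_strong_explicit(&l->owner_node, &seen, my_node,
                memory_order_acquire, memory_order_relaxed))
            return;
        /* Neighbours of the holder retry sooner than remote waiters. */
        nuca_delay(nuca_backoff_for(seen, my_node));
    }
}

static void nuca_release(nuca_lock_t *l)
{
    atomic_store_explicit(&l->owner_node, LOCK_FREE_TOKEN, memory_order_release);
}
```

With these delays, a waiter in the holder's node is more likely to win the next handover, so the lock and the CS data tend to stay in one node for several critical sections. A real design also has to bound how long remote nodes can be kept waiting.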
Application Performance
Raytrace Speedup
[Figure, Sun WildFire (WF): Speedup (0 to 8) versus Number of Processors (0 to 28) for TATAS and MCS.]
Application Performance
Raytrace Speedup
[Figure, Sun WildFire (WF): Speedup (0 to 8) versus Number of Processors (0 to 28) for TATAS, MCS, HBO, HBO_GT and RH Lock.]
Total Traffic: Raytrace
[Figure: Local Transactions and Global Transactions for TATAS, TATAS_EXP, MCS and HBO_GT; vertical axis from 0.0 to 1.4.]
Outline
Introduction
NUCA Locks
  Paper A: RH Lock
  Paper B: HBO Locks
Beating the Real Thing …
  Paper C: DSZOOM – Software-based Shared Memory
  Paper D: THROOM – POSIX Front-end
  Paper E: SAIT & Write Permission Cache (WPC)
Contributions
Future Work
Servers vs. Clusters
[Figure: the shared-memory constructs A:=0, A:=56, Y:=X, BARRIER, LOCK(CS), UNLOCK shown for a server and for a cluster; the cluster side is marked “??”.]
Popular Solutions
Solution 1: more hardware (HW-DSM)
  Transparent for programmers
  Usually good scalability
  Expensive, hard verification, long time to market …
Solution 2: simple HW + software (SW-DSM)
  Can use more complex (adaptive) protocols
  Traditionally poor scalability for many programs
  Shorter time to market, simple to upgrade/customize
The DSZOOM proposal
DSZOOM Cluster
DSZOOM Nodes:
  Each node consists of an unmodified workstation/server
  Server’s hardware provides memory protocols for caches and memory within each machine
+ DSZOOM Cluster Network:
  “Standard” and fast cluster interconnect
  Inexpensive user-level remote memory access
+ DSZOOM software
  Memory protocols between nodes, synchronization
Problems with Traditional SW-DSMs
Large coherence units (4–8 kB)
  False Sharing!
  Weaker Memory Models [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
Protocol agent messaging is slow
  Most efficiency lost in interrupt/poll
[Figure: two nodes, each with CPUs, memory (Mem) and a protocol agent (Prot. agent); one CPU issues LD a.]
Our proposal: DSZOOM
Run entire protocol in requesting-processor
  No protocol agent communication!
Assumes user-level remote memory access
  put, get, and atomics [ InfiniBand ]
Fine-grain memory protocols (64 bytes)
Hardware-like memory models [Shasta, Blizzard, Sirocco]
Global Coherency Action
Read data modified in a third node: 3–hop read
“Blocking directory” protocol
[Figure: three nodes (Node 1, Node 2, Node 3) holding the requestor (LD a), the directory (DIR) and the memory (Mem) with write permission. Steps: 1. atomic; 2a. atomic; 2b. get data; 3a. put; 3b. put.]
Squeezing protocols into binaries…
Original Program (excerpt):
  ...
  cmp %g0, %l5
  bne 0x24431
  nop
  ldd [%o0 + 16], %f4
  clr %l5
  ...
Load being instrumented:
  ld [%o1 + 64], %o0
DSZOOM Program, with Fast-path Protocol Code after the load:
  ld [%o1 + 64], %o0
  mov 255, %g6
  and %g6, %o0, %g6
  cmp %g6, 170
  bne 0x24450
  nop
Slow-path Protocol Code (C-code)
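The inserted fast-path instructions mask the low byte of the loaded value and compare it with 170 (0xAA). Read in C, this looks like an inline check that decides whether the slow-path protocol code must run. The sketch below is one plausible reading, under the assumption that a low byte of 170 marks data that may be invalid. The slide does not state which branch direction leads to the slow path, and `slow_path_load` and `slow_path_calls` are hypothetical stand-ins.

```c
#include <stdint.h>

enum { MAGIC_BYTE = 170 };          /* the constant in the slide's cmp instruction */

static unsigned slow_path_calls;    /* hypothetical counter, for illustration only */

/* Hypothetical stand-in for the slow-path protocol code (C code in the slides). */
static uint32_t slow_path_load(const uint32_t *addr)
{
    slow_path_calls++;
    return *addr;                   /* a real protocol would first fetch valid data */
}

/* Instrumented load: the original load followed by an inline check. */
static uint32_t checked_load(const uint32_t *addr)
{
    uint32_t value = *addr;                     /* ld  [%o1 + 64], %o0      */
    if ((value & 255u) == MAGIC_BYTE)           /* mov/and, cmp %g6, 170    */
        return slow_path_load(addr);            /* assumed: possible miss   */
    return value;                               /* fast path: use the value */
}
```

In the common case the check costs a few extra instructions per load and no call, which is what keeps the instrumentation overhead low.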
Compilation Process
[Figure: build flow with the boxes Unmodified Application, GNU gcc, Parallel Programming Constructs, (Un)executable a.out, EEL, Memory Protocols (C-code), DSZOOM Run-Time Library, and link.]
Results
Execution Times in Seconds (16 CPUs)
[Figure: Execution time [seconds] (0 to 10) for three configurations: E6000 16 CPUs (HW, one 16-CPU node), CC-NUMA 2x8 (HW, two 8-CPU nodes), DSZOOM-WF 2x8 (SW, two 8-CPU nodes).]
Outline
Introduction
NUCA Locks
  Paper A: RH Lock
  Paper B: HBO Locks
Beating the Real Thing …
  Paper C: DSZOOM – Software-based Shared Memory
  Paper D: THROOM – POSIX Front-end
  Paper E: SAIT & Write Permission Cache (WPC)
Contributions
Future Work
THROOM
Towards Higher Transparency …
[Figure: the DSZOOM build flow (Unmodified Application, GNU gcc, Parallel Programming Constructs, (Un)executable a.out, EEL, Memory Protocols (C-code), DSZOOM Run-Time Library, link) next to the THROOM flow, where the a.out of an Unmodified POSIX thread (Pthread) Application goes through EEL with the Memory Protocols (C-code).]
Transparent runtime support:
  -- memory allocation
  -- thread creation / termination
  -- synchronization
  -- I/O
  …
SAIT Overview
SAIT = SPARC Assembler Instrumentation Tool
  Instrument assembler files
  More information about programs is available
  Support for liveness analysis
[Figure: Source File -> cc -> .s assembler output -> SAIT (with snippets.txt) -> .s instrumented assembler -> ld -> a.out. The instrumented code calls a User Library (e.g., protocols), which is linked in by ld.]
Used in several UART projects!
Write Permission Cache (WPC)
Store instrumentation is expensive…
[Figure: processors (P), each with a WPC, share one memory. A Store A raises the question “Write Permission?”; the WPC holds the entries Write permission: A, B, D.]
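The idea suggested by the figure can be sketched as a tiny software cache that remembers the line for which write permission was obtained most recently, so that repeated stores to that line skip the expensive check. The code below is a one-entry illustration under assumptions: `acquire_write_permission`, `permission_checks` and the flush policy are hypothetical, and only the 64-byte line size comes from the slides.

```c
#include <stdint.h>
#include <stdbool.h>

enum { LINE_BYTES = 64 };                 /* fine-grain coherence unit from the slides */

static uintptr_t wpc_line;                /* line whose write permission is cached */
static bool      wpc_valid;
static unsigned  permission_checks;       /* hypothetical counter of slow checks */

/* Hypothetical stand-in for the expensive write-permission check/acquire. */
static void acquire_write_permission(uintptr_t line)
{
    (void)line;
    permission_checks++;
}

/* Instrumented store: consult the one-entry WPC before the expensive check. */
static void checked_store(uint32_t *addr, uint32_t value)
{
    uintptr_t line = (uintptr_t)addr / LINE_BYTES;
    if (!wpc_valid || wpc_line != line) {         /* WPC miss */
        acquire_write_permission(line);
        wpc_line = line;
        wpc_valid = true;
    }
    *addr = value;                                /* perform the original store */
}

/* Drop the cached permission; when to do this is an assumption of this sketch. */
static void wpc_flush(void)
{
    wpc_valid = false;
}
```

Stores with good spatial locality then pay for one permission check per line instead of one per store. A larger coherence unit or a second entry should raise the hit rate further.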
Contributions
Nonuniform Communication Architecture (NUCA)
Several NUCA-locks that exploit NUCAs:
  RH lock
  Three HBO locks
DSZOOM: Novel SW-DSM system
THROOM: Supporting POSIX binaries on clusters
SAIT: SPARC Assembler Instrumentation Tool
WPC: Write Permission Cache
Future Work
NUCA locks for the DSZOOM system
Instrumentation optimizations
  Compiler support
  Optimizing backend
Further WPC studies/optimizations
Protocol optimizations
  Adaptive Invalidate/Update
  “Push based” protocols
Fairness Study
2-node Sun WildFire, 28 CPUs
[Figure: Number of Finished Processors (0 to 28) versus Time [seconds] (0 to 15) for TATAS, MCS and HBO_GT.]
Traditional Microbenchmark
For each thread:
for (i = 0; i < iterations; i++) {
    LOCK(L);
    /* null/small Critical Section */
    UNLOCK(L);
}
Lock performance
Traditional microbenchmark
[Figure, Sun WildFire (WF): Time [microseconds] (0 to 60) versus Number of Processors (0 to 28) for TATAS, MCS and HBO_GT.]
[Figure: Node handoffs [%] (0 to 100) versus Number of Processors (0 to 28) for TATAS, MCS and HBO_GT.]
New Microbenchmark
for (i = 0; i < iterations; i++) {
    LOCK(L);
    delay(critical_work); // CS
    UNLOCK(L);
    static_delay();
    random_delay();
}
More realistic node handoffs for queue-locks
Constant number of processors
Control the “amount of contention”
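A self-contained version of the per-thread loop might look as follows. It is a sketch under assumptions: the lock is a plain `atomic_flag` spin lock standing in for `L`, the delays are simple busy loops, and the parameters `static_work` and `max_random_work` are hypothetical knobs replacing `static_delay()` and `random_delay()`, whose definitions the slide does not give.

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_flag bench_lock = ATOMIC_FLAG_INIT;   /* stand-in for lock L */
static long shared_counter;                         /* CS data */

static void LOCK(void)
{
    while (atomic_flag_test_and_set_explicit(&bench_lock, memory_order_acquire))
        ;                                           /* spin until the flag is clear */
}

static void UNLOCK(void)
{
    atomic_flag_clear_explicit(&bench_lock, memory_order_release);
}

/* Busy-wait for roughly `iterations` loop iterations. */
static void delay(unsigned iterations)
{
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

/* Per-thread loop of the new microbenchmark. */
static void run_thread(int iterations, unsigned critical_work,
                       unsigned static_work, unsigned max_random_work)
{
    for (int i = 0; i < iterations; i++) {
        LOCK();
        shared_counter++;                           /* touch the CS data */
        delay(critical_work);                       /* CS */
        UNLOCK();
        delay(static_work);                         /* static_delay() */
        if (max_random_work > 0)
            delay((unsigned)rand() % max_random_work);   /* random_delay() */
    }
}
```

Raising `critical_work` relative to the two delays outside the critical section increases the amount of contention while the number of processors stays constant. With real threads, each thread would need its own random number state, since `rand()` is not guaranteed to be thread-safe.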
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure, 2-node WF, 14 + 14 CPUs: Time [seconds] (3 to 12) versus critical_work (0 to 2000) for TATAS, MCS and HBO_GT.]
[Figure: Node handoffs [%] (0 to 60) versus critical_work (0 to 2000).]
Results (2)
Normalized Execution Time Breakdowns (16 CPUs)
[Figure: stacked bars (0% to 100%) with the components Store, Load, Locks, Barriers, ILC and Task for the SW configuration (two 8-CPU nodes), instrumented with EEL.]
Instrumentation Performance
Program         Problem Size                        %LD   %ST   Instrumentation Overhead
FFT             1,048,576 points (48.1 MB)          19.0  16.5  1.38
LU-Cont         1024×1024, block 16 (8.0 MB)        15.5   9.4  1.59
LU-Non-Cont     1024×1024, block 16 (8.0 MB)        16.7  11.1  1.50
Radix           4,194,304 items (36.5 MB)           15.6  11.6  1.13
Barnes-Hut      16,384 bodies (32.8 MB)             23.8  31.1  1.03
FMM             32,768 particles (8.1 MB)           17.5  13.6  1.06
Ocean-Cont      514×514 (57.5 MB)                   27.0  23.9  1.34
Ocean-Non-Cont  258×258 (22.9 MB)                   11.6  28.0  1.24
Radiosity       Room (29.4 MB)                      26.3  27.2  1.07
Raytrace        Car (32.2 MB)                       19.0  18.1  1.21
Water-nsq       2,197 mols., 2 steps (2.0 MB)       13.4  16.2  1.06
Water-sp        2,197 mols., 2 steps (1.5 MB)       15.7  13.9  1.09
Average                                             18.4  18.3  1.22
1-entry WPC
[Figure: bars (0 to 32) per application (FFT, LU-cont, LU-non-cont, Radix, Barnes, Cholesky, FMM, Ocean-cont, Ocean-non-cont, Radiosity, Raytrace, Water-nsq, Water-sp); axis label “Average # Stores until UNLOCK”; series: 64 bytes, 128 bytes, 256 bytes.]
2-entry WPC
[Figure: bars (0 to 32) per application (FFT, LU-cont, LU-non-cont, Radix, Barnes, Cholesky, FMM, Ocean-cont, Ocean-non-cont, Radiosity, Raytrace, Water-nsq, Water-sp); axis label “Average # Stores until UNLOCK”; series: 64 bytes, 128 bytes, 256 bytes.]