DSZOOM@wmpi2001 Uppsala Architecture Research Team (UART) 1
Implementing Low Latency Distributed Software-Based Shared Memory
Zoran Radović and Erik Hagersten, {zoranr, eh}@it.uu.se
Uppsala University, Information Technology, Department of Computer Systems
Problems with Traditional SW-DSMs
• Page-sized coherence unit: false sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, Shasta, GeNIMA, …]
• Protocol agent messaging is slow: most efficiency is lost in interrupt/poll
[Figure: two nodes, each with CPUs, Mem, and a Prot. agent; a CPU issues LD x.]
Our proposal: DSZOOM
• Run the entire protocol in the requesting processor: no protocol agent communication!
• Assumes user-level remote memory access: put, get, and atomics [InfiniBand]
• Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two nodes, each with CPUs, Mem, and a DIR; the protocol runs in the CPU that issues LD x, using an atomic on the DIR and a get.]
Outline
• Motivation
• General DSZOOM Overview
• Experimentation Environment
• DSZOOM-WF Implementation Details
• Performance Results
• Improved DSZOOM… [SC2001]
DSZOOM Cluster
DSZOOM Nodes:
• Each node consists of an unmodified SMP multiprocessor
• SMP hardware keeps coherence among the caches and the memory within each SMP node
DSZOOM Cluster Network:
• Non-coherent cluster interconnect
• Inexpensive user-level remote memory access
• Remote atomic operations [e.g., InfiniBand]
Current DSZOOM Hardware
Two E6000 connected through a hardware-coherent interface (Sun-WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive
16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
Run as 16-way SMP, 2×8 HW-ccNUMA, and 2×8 SW-DSM
Compilation Process
[Figure: compilation flow. Boxes: Unmodified SPLASH-2 Application; DSZOOM-WF Implementation of PARMACS Macros; m4; GNU gcc; DSZOOM-WF Run-Time Library; a.out; EEL; Coherence Protocols; Binary.]
Process and Memory Distribution
[Figure: processes in Cabinet 1 and Cabinet 2 are created with fork and bound with pset_bind. Each process has its own Stack, Text & Data, and Heap (PRIVATE_DATA) plus a G_MEM region, labeled 0x80000000, made up of Cabinet_1_G_MEM and Cabinet_2_G_MEM. Two segments are created with shmget (shmid = A in the physical memory of Cabinet 1, shmid = B in the physical memory of Cabinet 2) and mapped into the processes with shmat. One mapping is labeled "Aliasing".]
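As a rough illustration of the layout in the figure, the sketch below maps an address to the cabinet whose segment backs it. The base address 0x80000000 is the label from the slide; the segment size, the ordering of the two cabinet segments, and the names `CABINET_SEGMENT_BYTES` and `home_cabinet` are assumptions made only for this example.

```python
G_MEM_BASE = 0x80000000            # address label shown in the figure
CABINET_SEGMENT_BYTES = 1 << 28    # hypothetical size of each cabinet's segment
NUM_CABINETS = 2


def home_cabinet(addr):
    """Return 1 or 2 for an address inside G_MEM, or None for private data.

    Assumes Cabinet_1_G_MEM is mapped first and Cabinet_2_G_MEM directly
    after it, which the slide does not state explicitly.
    """
    offset = addr - G_MEM_BASE
    if not 0 <= offset < NUM_CABINETS * CABINET_SEGMENT_BYTES:
        return None
    return offset // CABINET_SEGMENT_BYTES + 1
```

With such a mapping, every process can decide locally whether a global address lives in its own cabinet or in the remote one.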
So far …
[Figure: compilation flow with the final box labeled "(Un)executable". Boxes: Unmodified SPLASH-2 Application; DSZOOM-WF Implementation of PARMACS Macros; m4; GNU gcc; DSZOOM-WF Run-Time Library; a.out; EEL; Coherence Protocols.]
Squeezing Protocols into Binaries …
Static Binary Instrumentation
EEL: Machine-independent Executable Editing Library implemented in C++
• Replace global loads with snippets containing fine-grain access control checks
• Insert coherence protocols
Fine-grain Access Control Checks
The "magic" value is a small integer corresponding to an IEEE floating-point NaN [Blizzard-S, Sirocco-S]
Floating-point load example:
    ld    [address],%reg      // original LD
    fcmps %fcc0,%reg,%reg     // compare reg with itself
    fbe,pt %fcc0,hit          // if (reg == reg) goto hit
    nop
    // call global coherence load routine (Coherence Protocols)
    hit:
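The same check can be sketched in a high-level language. The example below is only an illustration of the idea, not the DSZOOM snippet itself: it assumes the magic bit pattern 0xFFFFFFFF (the integer -1, which is a single-precision NaN), and the names `MAGIC`, `checked_load`, and `miss_handler` are hypothetical.

```python
import math
import struct

# Assumed magic pattern: the integer -1, which is a NaN when read as a float.
MAGIC_BITS = 0xFFFFFFFF


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


MAGIC = bits_to_float(MAGIC_BITS)


def checked_load(memory, index, miss_handler):
    """Load a value and fall back to the coherence routine on a possible miss."""
    value = memory[index]          # original LD
    if value == value:             # a NaN never compares equal to itself
        return value               # hit: fast path, no protocol involved
    return miss_handler(index)     # possible miss: run the coherence routine
```

The fast path costs one extra compare and branch per load, which is where the instrumentation overhead comes from.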
Modified-Shared-Invalid (MSI)
[Figure: MEM_STORE example on a cache line in G_MEM (Cabinet_1_G_MEM and Cabinet_2_G_MEM, "Aliasing"), with shared and invalid cache lines marked. Distributed DIR with one DIR_ENTRY per cache line; each DIR_ENTRY holds a LOCK and presence bits.
Presence bits before MEM_STORE: 0 0 0 0 0 0 0 1
Presence bits after MEM_STORE: 0 0 0 0 0 0 1 0]
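The before/after presence bits can be reproduced with a small model of a directory entry. This is a sketch under assumptions: bit i of the presence field is taken to mean that holder i has a copy, the lock is modeled as a plain flag rather than a remote atomic, and `DirEntry` and `mem_store` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class DirEntry:
    locked: bool = False
    presence: int = 0      # assumed encoding: bit i set => holder i has a copy


def mem_store(entry, writer, width=8):
    """Make `writer` the only holder; return the holders to invalidate."""
    if entry.locked:
        raise RuntimeError("entry busy; a real protocol would retry")
    entry.locked = True    # stands in for acquiring the LOCK atomically
    victims = [h for h in range(width)
               if (entry.presence >> h) & 1 and h != writer]
    entry.presence = 1 << writer
    entry.locked = False   # release the LOCK
    return victims
```

Starting from presence bits 00000001, a store by holder 1 invalidates holder 0 and leaves 00000010, matching the figure.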
Read Data from Home Node: 2-hop read
[Figure: the Requestor sends 1a. f&s to the DIR and 1b. get to Mem at the home node, receives the data, and then sends 2. put. Legend: small packet (~10 bytes); large packet (~68 bytes); message on the critical path; message off the critical path.]
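The message sequence in the figure can be sketched as ordinary function calls made by the requestor, with no code running on the home node. The slide does not expand "f&s"; this sketch reads it as fetch-and-set, which is an assumption. `HomeNode`, the dictionary layout of the directory, and all function names are hypothetical stand-ins for remote memory operations.

```python
class HomeNode:
    """Stand-in for a remote node's memory (Mem) and directory (DIR)."""

    def __init__(self, lines):
        self.mem = dict(lines)
        self.dir = {addr: {"lock": 0, "presence": 0} for addr in lines}


def fetch_and_set(home, addr):
    """Model of "1a. f&s": return the old entry and set its lock bit."""
    old = dict(home.dir[addr])
    home.dir[addr]["lock"] = 1
    return old


def get(home, addr):
    """Model of "1b. get": fetch the data from the home node's memory."""
    return home.mem[addr]


def put(home, addr, presence):
    """Model of "2. put": write back the new entry, releasing the lock."""
    home.dir[addr] = {"lock": 0, "presence": presence}


def two_hop_read(home, addr, requestor):
    """Run the whole read in the requestor: f&s and get, then put."""
    old = fetch_and_set(home, addr)
    if old["lock"]:
        raise RuntimeError("entry busy; a real protocol would retry")
    data = get(home, addr)
    put(home, addr, old["presence"] | (1 << requestor))
    return data
```

Only the f&s and the get sit on the critical path in this model; the put can complete after the data is already available to the requestor.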
Instrumentation Performance
| Program | Problem Size | %LD | %ST | Instrumentation Overhead |
|---|---|---|---|---|
| FFT | 1,048,576 points (48.1 MB) | 26.1 | 22.2 | 1.43 |
| LU-Cont | 1024×1024, block 16 (8.0 MB) | 22.7 | 14.5 | 1.68 |
| LU-Non-Cont | 1024×1024, block 16 (8.0 MB) | 23.9 | 16.6 | 1.42 |
| Radix | 4,194,304 items (36.5 MB) | 24.1 | 14.9 | 1.15 |
| Barnes-Hut | 16,384 bodies (32.8 MB) | 37.5 | 50.5 | 1.25 |
| FMM | 32,768 particles (8.1 MB) | 25.5 | 22.9 | 1.12 |
| Ocean-Cont | 514×514 (57.5 MB) | 28.6 | 26.2 | 1.34 |
| Ocean-Non-Cont | 258×258 (22.9 MB) | 15.5 | 31.6 | 1.21 |
| Radiosity | Room (29.4 MB) | 31.1 | 35.0 | 1.11 |
| Raytrace | Car (32.2 MB) | 28.8 | 31.5 | 1.53 |
| Water-nsq | 2,197 mols., 2 steps (2.0 MB) | 24.5 | 32.4 | 1.21 |
| Water-sp | 2,197 mols., 2 steps (1.5 MB) | 25.5 | 27.6 | 1.21 |
| Average | | 26.2 | 27.2 | 1.30 |
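As a quick check on the table, the overhead column can be averaged and converted to percentages. The sketch reads each entry as a slowdown factor (1.43 meaning 43% slower), which is an interpretation of the column rather than something the slide spells out; the helper name `overhead_percent` is hypothetical.

```python
# Instrumentation overhead column from the table, read as slowdown factors.
OVERHEAD = {
    "FFT": 1.43, "LU-Cont": 1.68, "LU-Non-Cont": 1.42, "Radix": 1.15,
    "Barnes-Hut": 1.25, "FMM": 1.12, "Ocean-Cont": 1.34,
    "Ocean-Non-Cont": 1.21, "Radiosity": 1.11, "Raytrace": 1.53,
    "Water-nsq": 1.21, "Water-sp": 1.21,
}


def overhead_percent(factor):
    """Convert a slowdown factor such as 1.43 into a percentage such as 43."""
    return round((factor - 1.0) * 100)


mean_overhead = sum(OVERHEAD.values()) / len(OVERHEAD)
lowest = min(OVERHEAD, key=OVERHEAD.get)
highest = max(OVERHEAD, key=OVERHEAD.get)

print(f"mean factor: {mean_overhead:.3f}")
print(f"range: {overhead_percent(OVERHEAD[lowest])}% ({lowest}) "
      f"to {overhead_percent(OVERHEAD[highest])}% ({highest})")
```

The extremes are Radiosity and LU-Cont, which gives the 11–68% range of checking overheads.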
Normalized Instrumentation Overhead Breakdown (Seq. Exec.)
[Figure: stacked bars on a 0–100% axis. Legend: f-p-ST-snippet, int-ST-snippet, f-p-LD-snippet, int-LD-snippet, E6000 seq.]
Results (1): Execution Times in Seconds (16 CPUs)
[Figure: bar chart, y-axis from 0 to 12 seconds. Series: E6000 16 Processors; ccNUMA 2x8; DSZOOM-WF 2x8 CL128.]
Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Figure: stacked bars on a 0–100% axis. Legend: Store, Load, Locks, Barriers, Task.]
Conclusions
• DSZOOM completely eliminates asynchronous messaging between protocol agents
• Consistently competitive and stable performance in spite of high instrumentation overhead
• 35% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 11–68%
Improved DSZOOM… [SC2001]
Protocol/Overall optimizations
• Coherency unit variations
Synchronization improvements
• More balanced execution between cabinets
Better instrumentation
• More detailed backward slice algorithm
SC2001 Teaser: Execution Times in Seconds (16 CPUs)
[Figure: bar chart, y-axis from 0 to 12 seconds. Series: ccNUMA 2x8; DSZOOM-WF 2x8 CL128; DSZOOM Today.]
DSZOOM’s Home Page: http://www.it.uu.se/research/group/uart