ASPLOS 2010 -- Pittsburgh
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
3/17/2010
1. Introduction: Heterogeneous Computing
Heterogeneous Parallel Systems:
• CPU: sequential, control-intensive code
• Accelerators: massively data-parallel code
Existing programming models are DMA based:
• Explicit memory copy
• Programmer-managed memory coherence
[Figure: CPU and ACC, each with its own IN and OUT data]
Outline
1. Introduction
2. Motivation
3. ADSM: Asymmetric Distributed Shared Memory
4. GMAC: Global Memory for ACcelerators
5. Experimental Results
6. Conclusions
2.1 Motivation: Reference System
[Figure: reference system]
• CPU (N cores) and GPU-like accelerator connected by a PCIe bus
• System memory (RAM): low latency, strong consistency, small page size
• Device memory (RAM): high bandwidth, weak consistency, large page size
2.2 Motivation: Memory Requirements
• High memory bandwidth requirements
• Non fully-coherent systems:
  • Long-latency coherence traffic
  • Different coherence protocols
• Accelerator memory always growing (e.g. 6GB NVIDIA Fermi, 16GB PowerXCell 8i)
2.3 Motivation: DMA-Based Programming
• Duplicated Pointers
• Explicit Coherence Management
• CUDA Sample Code
[Figure: CPU and GPU, with copies of foo on both sides]
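void compute(FILE *file, int size) {
    float *foo, *dev_foo;                  /* duplicated pointers: host and device */
    foo = (float *)malloc(size);
    fread(foo, size, 1, file);
    cudaMalloc((void **)&dev_foo, size);
    /* explicit coherence management: copy input to the device */
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    /* explicit coherence management: copy output back to the host */
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}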
3.1 ADSM: Unified Virtual Address Space
• Unified Virtual Shared Address Space
  • CPU: accesses both system and accelerator memory
  • Accelerator: accesses only its own memory
  • Under ADSM, both use the same virtual address when referencing the shared object
[Figure: CPU and ACC with data objects bar, baz and foo; foo is the Shared Data Object, shown for both System Memory and Device Memory]
3.2 ADSM: Simplified Code
• Simpler CPU code than in DMA-based programming models
• Hardware-independent code
[Callouts on the code: Single Pointer, Data Assignment, Peer DMA, Legacy Support]
void compute(FILE *file, int size) {
float *foo;
foo = adsmMalloc(size);
fread(foo, size, 1, file);
kernel<<<Dg, Db>>>(foo, size);
cpuComputation(foo);
adsmFree(foo);
}
[Figure: CPU and GPU referencing a single foo]
3.3 ADSM: Memory Distribution
• Asymmetric Distributed Shared Memory principles:
  • The CPU accesses objects in accelerator memory, but not vice versa
  • All coherence actions are performed by the CPU
• Thrashing is unlikely to happen:
  • Synchronization variables: interrupt-based and dedicated hardware
  • False sharing: data object sharing granularity
3.4 ADSM: Consistency and Coherence
• Release consistency:
  • Consistency is only relevant from the CPU perspective
  • Implicit release/acquire at accelerator call/return
[Figure: object Foo between CPU and ACC at Accelerator Call and at Accelerator Return]
• Memory coherence:
  • Data ownership information enables eager data transfers
  • The CPU maintains coherence
4. Global Memory for Accelerators
• ADSM implementation
• User-level shared library
• GNU / Linux Systems
• NVIDIA CUDA GPUs
4.1 GMAC: Overall Design
• Layered design:
  • Multiple memory consistency protocols
  • Operating system and accelerator independent code
[Figure: layer stack]
• CUDA-like Front-End
• Memory Manager (Different Policies)
• Kernel Scheduler (FIFO)
• Operating System Abstraction Layer
• Accelerator Abstraction Layer (CUDA)
4.2 GMAC: Unified Address Space
[Figure: System Virtual Address Space and GPU Physical Address Space]
• The virtual address space is formed by the GPU and system physical memories
• The GPU memory address range cannot be selected
• Allocate the same virtual memory address range in both GPU and CPU
• Accelerator virtual memory would ease this process
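The idea of allocating the same virtual address range on the CPU side can be sketched as follows. This is only an illustration under stated assumptions, not GMAC's code: `map_at_fixed_address` and `simulated_device_address` are hypothetical helpers, and the "device address" is simulated with an ordinary free address because no GPU is involved. The address is passed to `mmap` as a hint, without `MAP_FIXED`, so an existing mapping is never overwritten; if the kernel places the mapping elsewhere, the helper reports failure.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map `size` bytes of system memory at exactly
 * `wanted`. Returns the mapping on success, or NULL if the kernel could not
 * place it there (for example because the range is already in use). */
static void *map_at_fixed_address(void *wanted, size_t size) {
    void *got = mmap(wanted, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (got == MAP_FAILED)
        return NULL;
    if (got != wanted) {          /* hint not honored: undo and report failure */
        munmap(got, size);
        return NULL;
    }
    return got;
}

/* Stand-in for an address dictated by the accelerator: reserve a range,
 * remember where it was, and release it again. */
static void *simulated_device_address(size_t size) {
    void *p = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    munmap(p, size);
    return p;
}
```

A second request for the same range while the first mapping is still alive returns NULL, which is the situation where a fallback such as address translation becomes necessary.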
4.3 GMAC: Coherence Protocols
• Batch-Update: copy all shared objects
• Lazy-Update: copy modified / needed shared objects
  • Data object granularity
  • Detect CPU read/write accesses to shared objects
• Rolling-Update: copy only modified / needed memory
  • Memory block size granularity
  • Fixed maximum number of modified blocks in system memory; flush data when the maximum is reached
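The Rolling-Update rule (a fixed maximum number of modified blocks, with a flush when the maximum is reached) can be sketched as a small ring of dirty block ids. This is a minimal sketch under assumptions: the names `rolling_t`, `rolling_write` and `rolling_call`, the value of `ROLLING_SIZE`, and the choice to flush the oldest block first are all illustrative and are not taken from GMAC; `flush_block` only counts transfers instead of performing a DMA copy.

```c
#include <stddef.h>

#define ROLLING_SIZE 2            /* assumed maximum number of dirty blocks */

typedef struct {
    int dirty[ROLLING_SIZE];      /* ring of dirty block ids, oldest at head */
    size_t head;                  /* index of the oldest dirty block */
    size_t count;                 /* number of dirty blocks currently tracked */
    int flushed;                  /* total number of blocks flushed so far */
    int last_flushed;             /* id of the last flushed block, -1 if none */
} rolling_t;

static void rolling_init(rolling_t *r) {
    r->head = 0;
    r->count = 0;
    r->flushed = 0;
    r->last_flushed = -1;
}

/* Stand-in for the transfer of one block to accelerator memory. */
static void flush_block(rolling_t *r, int block) {
    r->flushed++;
    r->last_flushed = block;
}

/* Flush the oldest dirty block and drop it from the ring. */
static void flush_oldest(rolling_t *r) {
    flush_block(r, r->dirty[r->head]);
    r->head = (r->head + 1) % ROLLING_SIZE;
    r->count--;
}

/* Record a CPU write to `block`; flush the oldest block if the ring is full. */
static void rolling_write(rolling_t *r, int block) {
    for (size_t i = 0; i < r->count; i++)
        if (r->dirty[(r->head + i) % ROLLING_SIZE] == block)
            return;               /* block is already dirty */
    if (r->count == ROLLING_SIZE)
        flush_oldest(r);
    r->dirty[(r->head + r->count) % ROLLING_SIZE] = block;
    r->count++;
}

/* At accelerator call, flush whatever is still dirty. */
static void rolling_call(rolling_t *r) {
    while (r->count > 0)
        flush_oldest(r);
}
```

With this structure, transfers triggered by `rolling_write` happen while the CPU is still producing data, and only the remaining blocks are left for the accelerator call.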
5.1 Results: GMAC vs. CUDA
• Batch-Update overheads:
  – Copy output data on call
  – Copy non-used data
• Similar performance for CUDA, Lazy-Update and Rolling-Update
5.2 Results: Lazy vs. Rolling on 3D Stencil
• Extra data copy for small data objects
• Trade-off between bandwidth and page fault overhead
6. Conclusions
• A unified virtual shared address space simplifies programming of heterogeneous systems
• Asymmetric Distributed Shared Memory:
  • The CPU accesses accelerator memory, but not vice versa
  • Coherence actions are only executed by the CPU
• Experimental results show no performance degradation
• Memory translation in accelerators is key to implementing ADSM efficiently
Thank you for your attention
Eager to start using GMAC?
http://code.google.com/p/adsm/
[email protected]@googlegroups.com
Backup Slides
4.4 GMAC: Memory Mapping
• Allocation might fail if the range is in use
• Software: allocate a different address space and provide a translation function (gmacSafePtr())
• Hardware: implement virtual memory in the GPU
[Figure: System Virtual Address Space and GPU Physical Address Space]
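The software option can be pictured as a lookup from a CPU address to the corresponding accelerator address. The sketch below is an assumption-based illustration of such a translation and does not show how gmacSafePtr() is implemented or what its signature is: `mapping_t` and `translate_to_device` are hypothetical names, and the linear table scan is chosen only for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one shared object that could not be mapped at the
 * same address on both sides. */
typedef struct {
    uintptr_t host_base;   /* start of the object in the CPU address space */
    uintptr_t dev_base;    /* start of the object in accelerator memory */
    size_t size;           /* object size in bytes */
} mapping_t;

/* Translate a CPU address into the matching accelerator address.
 * Returns 0 when the address does not fall inside any known object. */
static uintptr_t translate_to_device(const mapping_t *table, size_t n,
                                     uintptr_t host_addr) {
    for (size_t i = 0; i < n; i++) {
        const mapping_t *m = &table[i];
        if (host_addr >= m->host_base && host_addr - m->host_base < m->size)
            return m->dev_base + (host_addr - m->host_base);
    }
    return 0;
}
```

The cost of this approach is that every pointer passed to the accelerator has to go through the translation, which is why the hardware option (virtual memory in the GPU) is listed as the alternative.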
4.5 GMAC: Protocol States
• Protocol States: Invalid, Read-only, Dirty
[Figure: state diagram with states Invalid, Read Only and Dirty and transitions Call, Read, Write and Flush; a second diagram with states Invalid and Dirty and transitions Call and Return]
• Batch-Update:
  • Call / Return
• Lazy-Update:
  • Call / Return
  • Read / Write
• Rolling-Update:
  • Call / Return
  • Read / Write
  • Flush
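One plausible reading of the state diagrams can be written as a transition function. This is a sketch under assumptions, since the slide text does not spell out the arrows: here Call invalidates the CPU copy, Read moves Invalid to Read-only, Write leads to Dirty, Flush (Rolling-Update only) moves Dirty to Read-only, and Return changes state only under Batch-Update, where everything is copied back. The type and function names are illustrative.

```c
typedef enum { INVALID, READ_ONLY, DIRTY } state_t;
typedef enum { EV_CALL, EV_RETURN, EV_READ, EV_WRITE, EV_FLUSH } event_t;
typedef enum { BATCH_UPDATE, LAZY_UPDATE, ROLLING_UPDATE } protocol_t;

/* Next state of one shared object (or block) on the CPU side. */
static state_t next_state(protocol_t p, state_t s, event_t e) {
    if (p == BATCH_UPDATE) {
        /* Only Call / Return matter: everything is copied both ways. */
        if (e == EV_CALL)
            return INVALID;
        if (e == EV_RETURN)
            return DIRTY;
        return s;
    }
    switch (e) {
    case EV_CALL:
        return INVALID;                      /* accelerator now owns the data */
    case EV_READ:
        return s == INVALID ? READ_ONLY : s; /* fetch on first CPU read */
    case EV_WRITE:
        return DIRTY;                        /* CPU copy is now the newest */
    case EV_FLUSH:
        /* Assumed: only Rolling-Update flushes eagerly before the call. */
        return (p == ROLLING_UPDATE && s == DIRTY) ? READ_ONLY : s;
    default:
        return s;                            /* EV_RETURN: assumed no change */
    }
}
```

Keeping the three protocols in one function makes the difference visible: Batch-Update reacts to two events, Lazy-Update to four, and Rolling-Update adds Flush.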
4.6 GMAC: Rolling vs. Lazy
• Batch-Update: transfer on kernel call
• Rolling-Update: transfer while CPU computes
5.3 Results: Break-down of Execution
5.4 Results: Rolling Size vs. Block Size
• No appreciable effect on most benchmarks
• Small rolling sizes lead to performance aberrations
• Prefer relatively large rolling sizes
6.1 Conclusions: Wish-list
• GPU anonymous memory mappings:
  • GPU to CPU mappings never fail
  • Dynamic memory re-allocations
• GPU dynamic pinned memory:
  • No intermediate data copies on flush
• Peer DMA:
  • Speed-up I/O operations
  • No intermediate copies on GPU-to-GPU copies