
Page 1: GPU Memory Details

GPU Memory Details
Martin Kruliš (v1.1)
29. 10. 2015

Page 2: GPU Memory Details

Overview

[Figure: memory hierarchy diagram. The Host (CPU and Host Memory, ~ 25 GBps) is connected to the GPU Device over PCI Express (16/32 GBps). The GPU Chip contains SMPs, each with cores, registers, and an L1 Cache, backed by a shared L2 Cache; the off-chip Global Memory on the device is reached at > 100 GBps.]

Note that details about host memory interconnection are platform specific.

Page 3: GPU Memory Details

Host-Device Transfers

PCIe Transfers
◦ Much slower than internal GPU data transfers
◦ Issued explicitly by the host code
  cudaMemcpy(dst, src, size, direction);
  With one exception: when the GPU memory is mapped into the host memory space
◦ The transfer call has a significant overhead
  Bulk transfers are preferred

Overlapping
◦ Up to 2 asynchronous transfers can run while the GPU is computing (see the sketch below)
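
A minimal sketch of both transfer styles discussed above; buffer and function names are illustrative and error checking is omitted:

    #include <cuda_runtime.h>

    void transfer_example(const float *hostIn, float *hostOut, size_t n) {
        float *devBuf;
        cudaMalloc(&devBuf, n * sizeof(float));

        // Synchronous bulk transfer over PCIe: host -> device
        cudaMemcpy(devBuf, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

        // ... kernels operating on devBuf would be launched here ...

        // Synchronous transfer back: device -> host
        cudaMemcpy(hostOut, devBuf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devBuf);
    }

    // Overlapping variant: an asynchronous copy issued into a stream can run
    // while kernels execute in other streams; the host buffer must be
    // page-locked (e.g., allocated with cudaMallocHost) for the copy to
    // actually proceed asynchronously.
    void async_transfer(float *pinnedHost, float *devBuf, size_t n, cudaStream_t s) {
        cudaMemcpyAsync(devBuf, pinnedHost, n * sizeof(float),
                        cudaMemcpyHostToDevice, s);
    }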

Page 4: GPU Memory Details

Global Memory

Global Memory Properties
◦ Off-chip, but on the GPU device
◦ High bandwidth and high latency
  ~ 100 GBps, 400-600 clock cycles
◦ Operated in transactions
  Continuous aligned segments of 32 B - 128 B
  The number of transactions depends on the caching model, GPU architecture, and memory access pattern

Page 5: GPU Memory Details

Global Memory

Global Memory Caching
◦ Data are cached in the L2 cache
  Relatively small (up to 2 MB on new Maxwell GPUs)
◦ On CC 2.x (Fermi), data are also cached in the L1 cache
  Configurable by compiler flags:
  -Xptxas -dlcm=ca (Cache Always, i.e., also in L1; default)
  -Xptxas -dlcm=cg (Cache Global, i.e., L2 only)
◦ CC 3.x (Kepler) reserves L1 for local memory caching and register spilling
◦ CC 5.x (Maxwell) separates the L1 cache from shared memory and unifies it with the texture cache

Page 6: GPU Memory Details

Global Memory

Coalesced Transfers
◦ The number of transactions caused by a global memory access depends on the access pattern (see the sketch below)
◦ Certain access patterns are optimized
◦ CC 1.x
  Threads sequentially access an aligned memory block
  Subsequent threads access subsequent words
◦ CC 2.0 and later
  Threads access an aligned memory block
  Accesses within the block can be permuted
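
A minimal sketch contrasting the two pattern families; the kernel names are illustrative. The first kernel matches the optimized patterns above, the second does not:

    // Coalesced: consecutive threads read consecutive 4 B words, so one
    // warp's accesses fall into a few 32-128 B transactions.
    __global__ void coalesced_copy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch words that are `stride` elements
    // apart, which can spread one warp's accesses over many transactions.
    __global__ void strided_copy(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }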

Page 7: GPU Memory Details

Global Memory

Access Patterns
◦ Perfectly aligned sequential access

Page 8: GPU Memory Details

Global Memory

Access Patterns
◦ Perfectly aligned with permutation

Page 9: GPU Memory Details

Global Memory

Access Patterns
◦ Continuous sequential, but misaligned

Page 10: GPU Memory Details

Global Memory

Coalesced Loads Impact

Page 11: GPU Memory Details

Shared Memory

Memory Shared by the SM
◦ Divided into banks
  Each bank can be accessed independently
  Consecutive 32-bit words are in consecutive banks
  Optionally, a 64-bit word division is used (CC 3.x)
◦ Bank conflicts are serialized
  Except for reading the same address (broadcast)

Compute capability | Mem. size | # of banks | Bank bandwidth
1.x                | 16 kB     | 16         | 32 bits / 2 cycles
2.x                | 48 kB     | 32         | 32 bits / 2 cycles
3.x                | 48 kB     | 32         | 64 bits / 1 cycle

Page 12: GPU Memory Details

Shared Memory

Linear Addressing
◦ Each thread in a warp accesses a different memory bank
◦ No collisions (see the sketch below)
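
A minimal sketch of linear addressing in shared memory, assuming the block size equals the illustrative TILE constant:

    #define TILE 256   // illustrative number of threads per block

    __global__ void shared_linear(const float *in, float *out) {
        __shared__ float tile[TILE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Linear addressing: thread t of a warp uses bank (t mod 32),
        // so no two threads of the same warp hit the same bank.
        tile[threadIdx.x] = in[i];
        __syncthreads();

        out[i] = 2.0f * tile[threadIdx.x];
    }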

Page 13: GPU Memory Details

Shared Memory

Linear Addressing with Stride
◦ Each thread accesses the 2*i-th item
◦ 2-way conflicts (2x slowdown) on CC < 3.0
◦ No collisions on CC 3.x
  Due to the 64 bits per cycle throughput

Page 14: GPU Memory Details

Shared Memory

Linear Addressing with Stride
◦ Each thread accesses the 3*i-th item
◦ No collisions, since the number of banks is not divisible by the stride
  (in general, a stride s over 32 banks causes gcd(s, 32)-way conflicts, so odd strides are conflict-free)

Page 15: GPU Memory Details

Shared Memory

Broadcast
◦ One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20
◦ Broadcasts are served one at a time on CC 1.x
  I.e., the pattern described above causes a 2-way conflict
◦ CC 2.x and 3.x serve all broadcasts simultaneously
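
A minimal sketch of a pure broadcast, where every thread of a warp reads the same shared memory word; the kernel name and the bank index are illustrative:

    __global__ void broadcast_read(const float *in, float *out) {
        __shared__ float s[32];
        if (threadIdx.x < 32)
            s[threadIdx.x] = in[threadIdx.x];
        __syncthreads();

        // Every thread reads the same word (bank #12); this is served as a
        // single broadcast, not as a 32-way bank conflict.
        out[blockIdx.x * blockDim.x + threadIdx.x] = s[12];
    }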

Page 16: GPU Memory Details

Shared Memory

Shared Memory vs. L1 Cache
◦ On most devices, they are the same resource
◦ The division can be set for each kernel by
  cudaFuncSetCacheConfig(kernel, cacheConfig);
  The cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of the 64 kB for the preferred one)

Shared Memory Configuration
◦ Some devices (CC 3.x) can configure the memory banks by
  cudaFuncSetSharedMemConfig(kernel, config);
  The config selects between 32-bit and 64-bit mode
  (see the sketch below)
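
A minimal sketch of both configuration calls, assuming a hypothetical kernel named myKernel:

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) {   // illustrative kernel
        data[threadIdx.x] *= 2.0f;
    }

    void configure_kernel() {
        // Prefer shared memory over L1 (e.g., 48 kB shared / 16 kB L1)
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // On CC 3.x devices, switch the shared memory banks to 64-bit mode
        cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeEightByte);
    }

Both calls are per-kernel preferences; on devices where the corresponding resource is not configurable, they simply have no effect.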

Page 17: GPU Memory Details

Registers
◦ One register pool per multiprocessor
  8-64k of 32-bit registers (depending on CC)
  Register allocation is determined by the compiler
◦ As fast as the cores (no extra clock cycles)
◦ Read-after-write dependency
  24 clock cycles
  Can be hidden if there are enough active warps
◦ The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
  The programmer has no direct control over these conflicts

Page 18: GPU Memory Details

Local Memory

Per-thread Global Memory
◦ Allocated automatically by the compiler
  The compiler may report the amount of allocated local memory (use --ptxas-options=-v)
◦ Large structures and arrays are placed here
  Instead of in the registers
◦ Register Pressure
  There are not enough registers to accommodate the data of the thread
  The registers are spilled into the local memory
  Can be mitigated by selecting smaller thread blocks
  (see the sketch below)
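
A minimal sketch of code that typically ends up in local memory; the kernel and the array size are illustrative, and the comment shows the compiler flag mentioned above:

    // Compile with: nvcc --ptxas-options=-v example.cu
    // ptxas then reports registers, spill loads/stores, and local memory use.
    __global__ void local_array_example(const int *indices, float *out, int n) {
        // A large, dynamically indexed per-thread array usually cannot be
        // held in registers and is placed in local memory instead.
        float scratch[64];
        for (int k = 0; k < 64; ++k)
            scratch[k] = k * 0.5f;

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scratch[indices[i] & 63];   // dynamic index forces local memory
    }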

Page 19: GPU Memory Details

Constant and Texture Memory

Constant Memory
◦ A special 64 KB memory space for read-only data
  8 KB is the cache working set per multiprocessor
◦ CC 2.x introduces the LDU (LoaD Uniform) instruction
  The compiler uses it to force loading of read-only, thread-independent variables through the constant cache
  (see the sketch below)

Texture Memory
◦ The texture cache is optimized for 2D spatial locality
◦ Additional functionality like fast data interpolation, a normalized coordinate system, or handling of the boundary cases
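
A minimal sketch of constant memory usage; the symbol name, its size, and the kernel are illustrative:

    __constant__ float coeffs[16];   // lives in the 64 KB constant memory space

    __global__ void apply_coeff(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // The coefficient index does not depend on the thread, so the read is
        // uniform within the warp and is served well by the constant cache.
        if (i < n) out[i] = in[i] * coeffs[blockIdx.x % 16];
    }

    void upload_coeffs(const float *hostCoeffs) {
        // Constant memory is written from the host side
        cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
    }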

Page 20: GPU Memory Details

Memory Allocation

Global Memory
◦ cudaMalloc(), cudaFree()
◦ Dynamic in-kernel allocation
  malloc() and free() called from a kernel
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)

Shared Memory
◦ Statically (e.g., __shared__ int foo[16];)
◦ Dynamically (by a kernel launch parameter)
  extern __shared__ float bar[];
  float *bar1 = &(bar[0]);
  float *bar2 = &(bar[size_of_bar1]);
  (see the sketch below)
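
A minimal sketch of the dynamic variant, where the shared allocation is sized by the third kernel launch parameter; names are taken from the slide, the launch wrapper is illustrative:

    __global__ void dyn_shared_kernel(int size_of_bar1, int size_of_bar2) {
        extern __shared__ float bar[];          // sized at launch time
        float *bar1 = &(bar[0]);                // first logical array
        float *bar2 = &(bar[size_of_bar1]);     // second logical array

        if (threadIdx.x < size_of_bar1) bar1[threadIdx.x] = 0.0f;
        if (threadIdx.x < size_of_bar2) bar2[threadIdx.x] = 1.0f;
    }

    void launch_example(int blocks, int threads, int size_of_bar1, int size_of_bar2) {
        // Third launch parameter: total dynamic shared memory in bytes
        size_t sharedBytes = (size_of_bar1 + size_of_bar2) * sizeof(float);
        dyn_shared_kernel<<<blocks, threads, sharedBytes>>>(size_of_bar1, size_of_bar2);
    }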

Page 21: GPU Memory Details

Implications and Guidelines

Global Memory
◦ Data should be accessed in a coalesced manner
◦ Hot data should be manually cached in shared memory

Shared Memory
◦ Bank conflicts need to be avoided
  Redesigning data structures in a column-wise manner
  Using strides that are not divisible by the number of banks
  (both guidelines are combined in the sketch below)

Registers and Local Memory
◦ Use as few as possible, avoid register spilling
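
A minimal sketch combining the shared memory guidelines above: a hot tile is manually cached in shared memory and padded by one element so that column-wise accesses do not collide in a single bank. The example is a matrix transpose, assuming square dimensions that are multiples of the illustrative TILE size and a TILE x TILE thread block:

    #define TILE 32   // illustrative tile size

    __global__ void transpose_tiled(const float *in, float *out, int width) {
        // +1 padding: a column of the tile spans different banks
        // (stride 33 is not divisible by the 32 banks)
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Coalesced load of the hot data into shared memory
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;

        // Coalesced store; the transposed read from shared memory is
        // conflict-free thanks to the padding
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }

Without the +1 padding, the threads of a warp reading a tile column would all hit the same bank, turning the read into a 32-way conflict.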

Page 22: GPU Memory Details

Implications and Guidelines

Memory Caching
◦ Data structures should be designed to utilize the caches in the best way possible
  The working set of the active blocks should fit into the L2 cache
◦ Provide maximum information to the compiler
  Use const for constant data
  Use __restrict__ to indicate that no pointer aliasing will occur

Data Alignment
◦ Operate on 32-bit/64-bit values only
◦ Align data structures to suitable powers of 2
  (see the sketch below)
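
A minimal sketch of these hints; the kernel and the structure are illustrative:

    // const + __restrict__ promise the compiler that the input is read-only
    // and that the pointers do not alias, enabling more aggressive caching.
    __global__ void scale(const float * __restrict__ in,
                          float * __restrict__ out,
                          float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = factor * in[i];
    }

    // Aligning the structure to 16 B lets one element be read with a single
    // wide access instead of several narrower ones.
    struct __align__(16) Particle {
        float x, y, z, mass;
    };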

Page 23: GPU Memory Details

Maxwell Architecture

What is new in Maxwell
◦ L1 merges with the texture cache
  Data are cached in L1 the same way as in Fermi
◦ Shared memory is independent
  64 kB or 96 kB, no longer shared with L1
◦ Shared memory uses 32-bit banks
  A return to the Fermi-like style, keeping the aggregated bandwidth
◦ Faster shared memory atomic operations

Page 24: GPU Memory Details

Discussion