CODE GPU WITH CUDA: MEMORY SUBSYSTEM
TRANSCRIPT
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
- GPU memory types
- Vector transaction
- Coalesced memory access
- Memory hierarchy
- Request trajectory
- Hardware-supported atomics
- Texture, constant, shared memory types
- Register spilling
OUT OF SCOPE
- Computer graphics capabilities
- Organization of the texture interpolation HW
GPU MEMORY TYPES
On-chip memory resides on the SM:
- Register file (RF)
- Shared (SMEM)
Off-chip memory resides in the GPU's RAM:
- Global (GMEM)
- Constant (CMEM)
- Texture (TEX)
- Local (LMEM)
VECTOR TRANSACTION
The SM has dedicated LD/ST units to handle memory accesses. Global memory accesses are serviced on a per-warp basis.
COALESCED MEMORY ACCESS
The first architecture, sm_10, defines a coalesced access as an affine access aligned to a 128-byte line. The other, now obsolete, sm_1x architectures have strict coalescing rules, too. Modern GPUs have more relaxed requirements and define a coalesced transaction as a transaction that fits in a cache line.

COALESCED MEMORY ACCESS (CONT)
- A request is coalesced if the warp loads only the bytes it needs
- The fewer cache lines the warp needs, the more coalesced the access
- Address alignment to the cache-line size is still preferred
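The contrast above can be sketched in two toy kernels (names are illustrative, not from the slides; assumes float data and a 128-byte L1 line):

```cuda
// Coalesced: consecutive threads touch consecutive 4-byte words, so one warp
// reads exactly one 128 B L1 line (or four 32 B L2 lines).
__global__ void copy_coalesced(float* dst, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: with stride 32, each thread's word lands in a different 128 B line,
// so one warp touches 32 lines and most of every fetched line is wasted.
__global__ void copy_strided(float* dst, const float* src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```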
MEMORY HIERARCHY
GPU memory has 2 levels of caches.
CACHE CHARACTERISTICS

                   L1, Fermi         L1, Kepler   L2, Fermi        L2, Kepler
size, KB           16/48             16/32/48     up to 768        up to 1536
line width         128 B             128 B        32 B             32 B
latency, clocks    56                -            282              158
mode               R, non-coherent   -            R&W, coherent, write-back
associativity      2x64 / 6x64       -            ?                ?
usage              gmem, sys         sys          gmem, sys, tex
MEMORY REQUEST TRAJECTORY: LD.E
Fermi: fully-cached load
- LD/ST units compute the physical address and the number of cache lines the warp requests (the L1 line is 128 B)
- On an L1 hit, return the line; else go to L2
- L2 subdivides the 128 B line into 4 x 32 B (the L2 line size). If all required 32 B lines are found in L2, return the result; else go to gmem
Kepler:
- discrete GPUs: like Fermi, but bypass L1
- integrated GPUs: the same as Fermi
DUALITY OF THE CACHE LINE
The following requests are equal from the gmem point of view. The 32 B granularity is useful if the access pattern is close to random.
LOAD CACHING CONFIGURATIONS
- Default (cache all): LD (no special suffix)
- Cache only in L2 (cache global): LD.CG
- Bypass caches (cache volatile): LD.CV
- Cache streaming: LD.CS

LD R8, [R6];
LD.CG R4, [R16];
LD.CV R14, [R14];
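A minimal sketch of how these suffixes are requested from source code (the function name is mine; `ld.global.cg` is the PTX operator that lowers to LD.CG in SASS):

```cuda
// Per-load: request a specific caching operator via inline PTX.
// Per-module alternative: set the default for all global loads with
//   nvcc -Xptxas -dlcm=cg   (cache global)  or  -dlcm=cv  (cache volatile).
__device__ float load_cg(const float* p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}
```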
MEMORY REQUEST TRAJECTORY: ST.E
- A store instruction invalidates the cache line in L1 on all SMs, if present (L1s are per-SM and non-coherent)
- The request goes directly to L2. The default write strategy is write-back; it can be configured as write-through
- An L2 hit costs ~160 clocks when a write-back is not needed
- On an L2 miss, the request goes to gmem (penalty > 350 clocks)
- L2 is multi-ported
WIDE & NARROW TYPES
Wide:
- The GPU supports wide memory transactions
- Only 64- and 128-bit transactions are supported, since they map to 2 (respectively 4) 32-bit registers

/*1618*/ LD.E.128 R8, [R14];
/*1630*/ ST.E.128 [R18], R8;

Narrow:
- Example: an SOA store of a uchar2 results in 2 store transactions

struct uchar2 {
    unsigned char x;
    unsigned char y;
};

/*02c8*/ ST.E.U8 [R6+0x1], R0;
/*02d0*/ ST.E.U8 [R6], R3;
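A sketch of how these transactions are obtained from CUDA C (kernel names are illustrative; pointers for the float4 case must be 16-byte aligned):

```cuda
// Wide: loading through float4 makes the compiler emit LD.E.128 / ST.E.128,
// i.e. one 128-bit transaction per thread.
__global__ void copy_vec4(float4* dst, const float4* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];  // one 128-bit load + one 128-bit store
}

// Narrow: storing a whole uchar2 value is a single 16-bit transaction,
// whereas writing .x and .y separately (the SOA case above) produces
// two 8-bit ST.E.U8 stores.
__global__ void store_uchar2(uchar2* dst, uchar2 v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = v;  // one 16-bit store rather than two 8-bit stores
}
```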
GMEM ATOMIC OPERATIONS
Performed in L2, per 32 B cache line. Here "shared address" means the same cache line.

throughput       Fermi, per clock   Kepler, per clock
shared address   1/9th              1
independent      24                 64

ATOM (returns the old value):
ATOM.E.INC R4, [R6], R8;

RED (reduction, no return value):
RED.E.ADD [R2], R0;
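Whether ATOM or RED is emitted depends on whether the atomic's return value is consumed; a hedged sketch (kernel and variable names are mine):

```cuda
// Return value ignored: the compiler is free to emit RED.E.ADD.
__global__ void histogram(unsigned int* bins, const unsigned char* keys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[keys[i]], 1u);  // fire-and-forget reduction
}

// Return value used as a slot index: requires ATOM.E.ADD.
__global__ void enqueue(int* counter, int* queue, int value)
{
    int slot = atomicAdd(counter, 1);   // old value is needed
    queue[slot] = value;
}
```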
TEXTURE HARDWARE
- Legacy from graphics
- Read-only; always loads through the interpolation hardware
- Two-level: a dedicated L1, and an L2 shared between texture and global loads

property                 Fermi     sm_30   sm_35
L1 hit latency, clocks   no data   104     108
L1 line size, B          no data   128     128
L1 size, KB              8         12      4 sbp x 12
(set) x (way)            no data   4x24    4x24
L2 hit latency, clocks   no data   212     229
L2 miss penalty, clocks  no data   316     351
READ-ONLY DATA CACHE
The L1 texture cache is opened up for global loads, bypassing the interpolation hardware. Supported on sm_35.

/*0288*/ TEXDEPBAR 0x0;
/*0290*/ LDG.E.64 R8, [R4];
/*0298*/ TEXDEPBAR 0x0;
/*02a0*/ LDG.E.64 R4, [R8];
/*02a8*/ IADD R6, R6, 0x4;
/*02b0*/ TEXDEPBAR 0x0;
/*02b8*/ LDG.E.64 R8, [R4];
/*02c8*/ ISETP.LT.AND P0, PT, R6, R7, PT;
/*02d0*/ TEXDEPBAR 0x0;

The size is 48 KB (4 sub-partitions x 12 KB). Different warps go through different sub-partitions; a single warp can use up to 12 KB.
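A sketch of the two usual ways to route global loads through this read-only path (kernel name is illustrative):

```cuda
// Both variants apply to sm_35 and later.
__global__ void scale(float* dst, const float* __restrict__ src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // 1. const + __restrict__ lets the compiler prove the data is read-only
    //    for the kernel's lifetime and emit LDG on its own.
    float a = src[i];

    // 2. __ldg() forces the read-only data path explicitly.
    float b = __ldg(&src[i]);

    dst[i] = a + b;
}
```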
CONSTANT MEMORY
Optimized for uniform access across the warp. Used for:
- Compile-time constants
- Kernel parameters and configurations
There are 2-3 levels of caches. Latency: 4-800 clocks.
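A minimal sketch of the uniform-access pattern constant memory is optimized for (array and kernel names are mine):

```cuda
__constant__ float c_coeffs[16];  // filled from the host before launch

__global__ void apply_filter(float* dst, const float* src, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i] * c_coeffs[k];  // uniform index: one broadcast per warp
}

// Host side:
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
```

If the index diverged across the warp (e.g. `c_coeffs[threadIdx.x % 16]`), the accesses would be serialized instead of broadcast.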
LOAD UNIFORM
The LDU instruction can employ the constant cache hierarchy for a global memory location. LDU = load a (block-)uniform variable from memory. Requirements:
- The variable resides in global memory
- The pointer is prefixed with the const keyword
- The memory access is uniform across all threads in the block (not dependent on threadIdx)

__global__ void kernel(test_t* g_dst, const test_t* g_src)
{
    const int tid = /* */;
    g_dst[tid] = g_src[0] + g_src[blockIdx.x];
}

/*0078*/ LDU.E R0, [R4];
/*0080*/ LDU.E R2, [R2];
SHARED MEMORY
Banked: successive 4-byte words are placed in successive banks.
- sm_1x: 16 banks x 4 B
- sm_2x: 32 banks x 4 B
- sm_3x: 32 banks x 8 B (64-bit)
Atomic operations are done in a lock/unlock manner:

(void)atomicAdd(&smem[0], src[threadIdx.x]);

/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;
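The banking rule above is why shared-memory tiles are commonly padded; a sketch of the classic transpose trick (assumes a square matrix whose width w is a multiple of 32):

```cuda
#define TILE 32

__global__ void transpose(float* dst, const float* src, int w)
{
    // +1 pad: without it, reading a column hits one bank 32 times
    // (on sm_2x: 32 banks of 4 B); with it, the column spreads over all banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = src[y * w + x];  // row write: conflict-free
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    dst[y * w + x] = tile[threadIdx.x][threadIdx.y];  // padded column read: conflict-free
}
```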
REGISTER SPILLING
- Local memory is the memory where registers are spilled
- It physically resides in gmem, but is likely cached
- A spilled local variable requires a full cache line, because spilling is done per warp
- Addressing is resolved by the compiler
- Stores are cached in L1
- Analogous to CPU stack variables
LDL/STL ACCESS OPERATIONS
Store:
- writes the line to L1
- if the line is evicted, it is written to L2
- the line can also be evicted from L2; in that case it is written to DRAM
Load:
- requests the line from L1; on a hit, the operation is complete
- on a miss, the line is requested from L2
- on an L2 miss, the line is requested from DRAM
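A sketch of a pattern that typically forces spilling, and how to see it (kernel name and sizes are illustrative):

```cuda
// Indexing a per-thread array with a runtime value prevents the compiler from
// keeping it in registers, so it is placed in local memory (LMEM) and
// accessed with LDL/STL.
__global__ void spilly(float* dst, const int* idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float buf[64];                 // large and dynamically indexed -> LMEM
    for (int k = 0; k < 64; ++k)
        buf[k] = dst[i] * k;
    dst[i] = buf[idx[i] & 63];     // runtime index defeats register promotion
}

// Compile with: nvcc -Xptxas -v ...
// ptxas then reports the stack frame and spill store/load byte counts.
```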
FINAL WORDS
- The SM has dedicated LD/ST units to handle memory accesses
- Global memory accesses are serviced on a per-warp basis
- A coalesced transaction is a transaction that fits in a cache line
- GPU memory has 2 levels of caches
- One L1 cache line consists of 4 L2 lines; the coalescing unit manages the number of L2 lines actually required
- 64-bit and 128-bit memory transactions are natively supported
- Atomic operations on global memory are done in L2
- Register spilling is fully cached for both reads and writes
THE END
by cuda.geek, 2013-2015