CODE GPU WITH CUDA: MEMORY SUBSYSTEM
TRANSCRIPT
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
- GPU memory types
- Vector transaction
- Coalesced memory access
- Memory hierarchy
- Request trajectory
- Hardware-supported atomics
- Texture, constant, shared memory types
- Register spilling
OUT OF SCOPE
- Computer graphics capabilities
- Organization of the texture interpolation HW
GPU MEMORY TYPES
On-chip memory resides on the SM:
- Register file (RF)
- Shared (SMEM)
Off-chip memory resides in the GPU's RAM:
- Global (GMEM)
- Constant (CMEM)
- Texture (TEX)
- Local (LMEM)
VECTOR TRANSACTION
The SM has dedicated LD/ST units to handle memory accesses. Global memory accesses are serviced on a per-warp basis.
COALESCED MEMORY ACCESS
The first architecture, sm_10, defines a coalesced access as an affine access aligned to a 128-byte line. The other, now obsolete, sm_1x architectures have strict coalescing rules, too. Modern GPUs have more relaxed requirements and define a coalesced transaction as a transaction that fits in a cache line.

COALESCED MEMORY ACCESS (CONT)
- A request is coalesced if the warp loads only the bytes it needs
- The fewer cache lines the warp needs, the more coalesced the access
- Address alignment to the cache-line size is still preferred
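The contrast above can be sketched in two toy kernels (names are illustrative, not from the slides; assumes float data and a 128-byte L1 line):

```cuda
// Coalesced: consecutive threads touch consecutive 4-byte words, so one warp
// reads exactly one 128 B L1 line (or four 32 B L2 lines).
__global__ void copy_coalesced(float* dst, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: with stride 32, each thread's word lands in a different 128 B line,
// so one warp touches 32 lines and most of every fetched line is wasted.
__global__ void copy_strided(float* dst, const float* src, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        dst[i] = src[i];
}
```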
MEMORY HIERARCHY
GPU memory has 2 levels of caches.
CACHE CHARACTERISTICS

                   L1, Fermi         L1, Kepler   L2, Fermi        L2, Kepler
size, KB           16/48             16/32/48     up to 768        up to 1536
line width         128 B             128 B        32 B             32 B
latency, clocks    56                -            282              158
mode               R, non-coherent   -            R&W, coherent, write-back
associativity      2x64 / 6x64       -            ?                ?
usage              gmem, sys         sys          gmem, sys, tex
MEMORY REQUEST TRAJECTORY: LD.E
Fermi: fully-cached load
- LD/ST units compute the physical address and the number of cache lines the warp requests (the L1 line is 128 B)
- On an L1 hit, return the line; else go to L2
- L2 subdivides the 128 B line into 4 x 32 B (the L2 line size). If all required 32 B lines are found in L2, return the result; else go to gmem
Kepler:
- discrete GPUs: like Fermi, but bypass L1
- integrated GPUs: the same as Fermi
DUALITY OF THE CACHE LINE
The following requests are equal from the gmem point of view. The 32 B granularity is useful if the access pattern is close to random.
LOAD CACHING CONFIGURATIONS
- Default (cache all): LD (no special suffix)
- Cache only in L2 (cache global): LD.CG
- Bypass caches (cache volatile): LD.CV
- Cache streaming: LD.CS

LD R8, [R6];
LD.CG R4, [R16];
LD.CV R14, [R14];
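A minimal sketch of how these suffixes are requested from source code (the function name is mine; `ld.global.cg` is the PTX operator that lowers to LD.CG in SASS):

```cuda
// Per-load: request a specific caching operator via inline PTX.
// Per-module alternative: set the default for all global loads with
//   nvcc -Xptxas -dlcm=cg   (cache global)  or  -dlcm=cv  (cache volatile).
__device__ float load_cg(const float* p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}
```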
MEMORY REQUEST TRAJECTORY: ST.E
- A store instruction invalidates the cache line in L1 on all SMs, if present (L1s are per-SM and non-coherent)
- The request goes directly to L2. The default write strategy is write-back; it can be configured as write-through
- An L2 hit costs ~160 clocks when a write-back is not needed
- On an L2 miss, the request goes to gmem (penalty > 350 clocks)
- L2 is multi-ported
WIDE & NARROW TYPES
Wide:
- The GPU supports wide memory transactions
- Only 64- and 128-bit transactions are supported, since they map to 2 (respectively 4) 32-bit registers

/*1618*/ LD.E.128 R8, [R14];
/*1630*/ ST.E.128 [R18], R8;

Narrow:
- Example: an SOA store of a uchar2 results in 2 store transactions

struct uchar2 {
    unsigned char x;
    unsigned char y;
};

/*02c8*/ ST.E.U8 [R6+0x1], R0;
/*02d0*/ ST.E.U8 [R6], R3;
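A sketch of how these transactions are obtained from CUDA C (kernel names are illustrative; pointers for the float4 case must be 16-byte aligned):

```cuda
// Wide: loading through float4 makes the compiler emit LD.E.128 / ST.E.128,
// i.e. one 128-bit transaction per thread.
__global__ void copy_vec4(float4* dst, const float4* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];  // one 128-bit load + one 128-bit store
}

// Narrow: storing a whole uchar2 value is a single 16-bit transaction,
// whereas writing .x and .y separately (the SOA case above) produces
// two 8-bit ST.E.U8 stores.
__global__ void store_uchar2(uchar2* dst, uchar2 v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = v;  // one 16-bit store rather than two 8-bit stores
}
```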
GMEM ATOMIC OPERATIONS
Performed in L2, per 32 B cache line. Here "shared address" means the same cache line.

throughput       Fermi, per clock   Kepler, per clock
shared address   1/9th              1
independent      24                 64

ATOM (returns the old value):
ATOM.E.INC R4, [R6], R8;

RED (reduction, no return value):
RED.E.ADD [R2], R0;
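Whether ATOM or RED is emitted depends on whether the atomic's return value is consumed; a hedged sketch (kernel and variable names are mine):

```cuda
// Return value ignored: the compiler is free to emit RED.E.ADD.
__global__ void histogram(unsigned int* bins, const unsigned char* keys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[keys[i]], 1u);  // fire-and-forget reduction
}

// Return value used as a slot index: requires ATOM.E.ADD.
__global__ void enqueue(int* counter, int* queue, int value)
{
    int slot = atomicAdd(counter, 1);   // old value is needed
    queue[slot] = value;
}
```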
TEXTURE HARDWARE
- Legacy from graphics
- Read-only; always loads through the interpolation hardware
- Two-level: a dedicated L1, and an L2 shared between texture and global loads

property                 Fermi     sm_30   sm_35
L1 hit latency, clocks   no data   104     108
L1 line size, B          no data   128     128
L1 size, KB              8         12      4 sbp x 12
(set) x (way)            no data   4x24    4x24
L2 hit latency, clocks   no data   212     229
L2 miss penalty, clocks  no data   316     351
READ-ONLY DATA CACHE
The L1 texture cache is opened up for global loads, bypassing the interpolation hardware. Supported on sm_35.

/*0288*/ TEXDEPBAR 0x0;
/*0290*/ LDG.E.64 R8, [R4];
/*0298*/ TEXDEPBAR 0x0;
/*02a0*/ LDG.E.64 R4, [R8];
/*02a8*/ IADD R6, R6, 0x4;
/*02b0*/ TEXDEPBAR 0x0;
/*02b8*/ LDG.E.64 R8, [R4];
/*02c8*/ ISETP.LT.AND P0, PT, R6, R7, PT;
/*02d0*/ TEXDEPBAR 0x0;

The size is 48 KB (4 sub-partitions x 12 KB). Different warps go through different sub-partitions; a single warp can use up to 12 KB.
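A sketch of the two usual ways to route global loads through this read-only path (kernel name is illustrative):

```cuda
// Both variants apply to sm_35 and later.
__global__ void scale(float* dst, const float* __restrict__ src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // 1. const + __restrict__ lets the compiler prove the data is read-only
    //    for the kernel's lifetime and emit LDG on its own.
    float a = src[i];

    // 2. __ldg() forces the read-only data path explicitly.
    float b = __ldg(&src[i]);

    dst[i] = a + b;
}
```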
CONSTANT MEMORY
Optimized for uniform access across the warp. Used for:
- Compile-time constants
- Kernel parameters and configurations
There are 2-3 levels of caches. Latency: 4-800 clocks.
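A minimal sketch of the uniform-access pattern constant memory is optimized for (array and kernel names are mine):

```cuda
__constant__ float c_coeffs[16];  // filled from the host before launch

__global__ void apply_filter(float* dst, const float* src, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i] * c_coeffs[k];  // uniform index: one broadcast per warp
}

// Host side:
//   float h_coeffs[16] = { ... };
//   cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
```

If the index diverged across the warp (e.g. `c_coeffs[threadIdx.x % 16]`), the accesses would be serialized instead of broadcast.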
LOAD UNIFORM
The LDU instruction can employ the constant cache hierarchy for a global memory location. LDU = load a (block-)uniform variable from memory. Requirements:
- The variable resides in global memory
- The pointer is prefixed with the const keyword
- The memory access is uniform across all threads in the block (not dependent on threadIdx)

__global__ void kernel(test_t* g_dst, const test_t* g_src)
{
    const int tid = /* */;
    g_dst[tid] = g_src[0] + g_src[blockIdx.x];
}

/*0078*/ LDU.E R0, [R4];
/*0080*/ LDU.E R2, [R2];
SHARED MEMORY
Banked: successive 4-byte words are placed in successive banks.
- sm_1x: 16 banks x 4 B
- sm_2x: 32 banks x 4 B
- sm_3x: 32 banks x 8 B (64-bit)
Atomic operations are done in a lock/unlock manner:

(void)atomicAdd(&smem[0], src[threadIdx.x]);

/*0050*/ SSY 0x80;
/*0058*/ LDSLK P0, R3, [RZ];
/*0060*/ @P0 IADD R3, R3, R0;
/*0068*/ @P0 STSUL [RZ], R3;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S;
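The banking rule above is why shared-memory tiles are commonly padded; a sketch of the classic transpose trick (assumes a square matrix whose width w is a multiple of 32):

```cuda
#define TILE 32

__global__ void transpose(float* dst, const float* src, int w)
{
    // +1 pad: without it, reading a column hits one bank 32 times
    // (on sm_2x: 32 banks of 4 B); with it, the column spreads over all banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = src[y * w + x];  // row write: conflict-free
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    dst[y * w + x] = tile[threadIdx.x][threadIdx.y];  // padded column read: conflict-free
}
```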
REGISTER SPILLING
- Local memory is the memory where registers are spilled
- It physically resides in gmem, but is likely cached
- A spilled local variable requires a full cache line, because spilling is done per warp
- Addressing is resolved by the compiler
- Stores are cached in L1
- Analogous to CPU stack variables
LDL/STL ACCESS OPERATIONS
Store:
- writes the line to L1
- if the line is evicted, it is written to L2
- the line can also be evicted from L2; in that case it is written to DRAM
Load:
- requests the line from L1; on a hit, the operation is complete
- on a miss, the line is requested from L2
- on an L2 miss, the line is requested from DRAM
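A sketch of a pattern that typically forces spilling, and how to see it (kernel name and sizes are illustrative):

```cuda
// Indexing a per-thread array with a runtime value prevents the compiler from
// keeping it in registers, so it is placed in local memory (LMEM) and
// accessed with LDL/STL.
__global__ void spilly(float* dst, const int* idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float buf[64];                 // large and dynamically indexed -> LMEM
    for (int k = 0; k < 64; ++k)
        buf[k] = dst[i] * k;
    dst[i] = buf[idx[i] & 63];     // runtime index defeats register promotion
}

// Compile with: nvcc -Xptxas -v ...
// ptxas then reports the stack frame and spill store/load byte counts.
```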
FINAL WORDS
- The SM has dedicated LD/ST units to handle memory accesses
- Global memory accesses are serviced on a per-warp basis
- A coalesced transaction is a transaction that fits in a cache line
- GPU memory has 2 levels of caches
- One L1 cache line consists of 4 L2 lines; the coalescing unit manages the number of L2 lines actually required
- 64-bit and 128-bit memory transactions are natively supported
- Atomic operations on global memory are done in L2
- Register spilling is fully cached for both reads and writes
THE END
by cuda.geek, 2013-2015