Lecture: Manycore GPU Architectures and Programming, Part 2

CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE569/
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
How is the GPU controlled?

• The CUDA API is split into:
  – The CUDA Management API
  – The CUDA Kernel API
• The CUDA Management API is for a variety of operations
  – GPU memory allocation, data transfer, execution, resource creation
  – Mostly regular C functions and calls
• The CUDA Kernel API is used to define the computation to be performed by the GPU
  – C extensions
How is the GPU controlled?

• A CUDA kernel:
  – Defines the operations to be performed by a single thread on the GPU
  – Just as a C/C++ function defines work to be done on the CPU
  – Syntactically, a kernel looks like C/C++ with some extensions

__global__ void kernel(...) {
    ...
}

  – Every CUDA thread executes the same kernel logic (SIMT)
  – Initially, the only difference between threads are their thread coordinates
How are GPU threads organized?

• CUDA thread hierarchy
  – Warp = SIMT Group
  – Thread Block = SIMT Groups that run concurrently on an SM
  – Grid = All Thread Blocks created by the same kernel launch
• Launching a kernel is simple and similar to a function call.
  – kernel name and arguments
  – # of thread blocks/grid and # of threads/block to create:

kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);
How are GPU threads organized?

• In CUDA, only thread blocks and grids are first-class citizens of the programming model.
• The number of warps created and their organization are implicitly controlled by the kernel launch configuration, but never set explicitly.

kernel<<<nblocks, threads_per_block>>>(arg1, arg2, ...);

The <<<nblocks, threads_per_block>>> part is the kernel launch configuration.
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – One-dimensional blocks and grids:

int nblocks = 4;
int threads_per_block = 8;
kernel<<<nblocks, threads_per_block>>>(...);

[Figure: a one-dimensional grid of four blocks, Block 0 through Block 3, each with 8 threads]
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional blocks and grids:

dim3 nblocks(2, 2);
dim3 threads_per_block(4, 2);
kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• GPU threads can be configured in one-, two-, or three-dimensional layouts
  – Two-dimensional grid and one-dimensional blocks:

dim3 nblocks(2, 2);
int threads_per_block = 8;
kernel<<<nblocks, threads_per_block>>>(...);
How are GPU threads organized?

• On the GPU, the number of blocks and threads per block is exposed through intrinsic thread coordinate variables:
  – Dimensions
  – IDs

Variable                                   Meaning
gridDim.x, gridDim.y, gridDim.z            Number of blocks in a kernel launch.
blockIdx.x, blockIdx.y, blockIdx.z         Unique ID of the block that contains the current thread.
blockDim.x, blockDim.y, blockDim.z         Number of threads in each block.
threadIdx.x, threadIdx.y, threadIdx.z      Unique ID of the current thread within its block.
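To make these variables concrete, here is a small illustrative kernel (not from the slides; the kernel name and launch shape are made up) that prints each thread's coordinates:

#include <cstdio>

__global__ void print_coords() {
    // Each thread reads the built-in coordinate variables that identify it
    printf("block (%d,%d) of %dx%d, thread (%d,%d) of %dx%d\n",
           blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
           threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
}

int main() {
    print_coords<<<dim3(2, 2), dim3(4, 2)>>>();  // 2x2 grid of 4x2 blocks = 32 threads
    cudaDeviceSynchronize();                     // wait so device printf output is flushed
    return 0;
}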
How are GPU threads organized?

• To calculate a globally unique ID for a thread on the GPU inside a one-dimensional grid and one-dimensional block:

kernel<<<4, 8>>>(...);

__global__ void kernel(...) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    ...
}

[Figure: four blocks (Block 0 through Block 3) of 8 threads each, numbered 0-7 within a block; the highlighted thread has blockIdx.x = 2, blockDim.x = 8, threadIdx.x = 2, so tid = 2 * 8 + 2 = 18]
How are GPU threads organized?

• Thread coordinates offer a way to differentiate threads and identify thread-specific input data or code paths.
  – Link data and computation: a mapping

__global__ void kernel(int *arr) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 32) {
        arr[tid] = f(arr[tid]);   // code path for threads with tid < 32
    } else {
        arr[tid] = g(arr[tid]);   // code path for threads with tid >= 32
    }
}

• Thread Divergence: recall that the useless code path is executed, but then disabled, in the SIMT execution model
How is GPU memory managed?

• CUDA Memory Management API
  – Allocation of GPU memory
  – Transfer of data from the host to GPU memory
  – Freeing GPU memory
  – Foo(int A[][N]) { }

Host Function     CUDA Analogue
malloc            cudaMalloc
memcpy            cudaMemcpy
free              cudaFree
How is GPU memory managed?

cudaError_t cudaMalloc(void **devPtr, size_t size);
  – Allocate size bytes of GPU memory and store their address at *devPtr

cudaError_t cudaFree(void *devPtr);
  – Release the device memory allocation stored at devPtr
  – Must be an allocation that was created using cudaMalloc
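Both calls return a cudaError_t. The slides do not show error handling; one common pattern, sketched here as an assumption (the CHECK macro is hypothetical, cudaGetErrorString is a real API), is a small wrapper macro:

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a readable message if a CUDA call fails
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage:
//   CHECK(cudaMalloc((void **)&d_arr, nbytes));
//   CHECK(cudaFree(d_arr));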
How is GPU memory managed?

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
  – Transfers count bytes from the memory pointed to by src to dst
  – kind can be:
    • cudaMemcpyHostToHost
    • cudaMemcpyHostToDevice
    • cudaMemcpyDeviceToHost
    • cudaMemcpyDeviceToDevice
  – The locations of dst and src must match kind, e.g. if kind is cudaMemcpyHostToDevice then src must be a host array and dst must be a device array
How is GPU memory managed?

void *d_arr, *h_arr;
h_arr = … ; /* init host memory and data */

// Allocate memory on GPU; its address is stored in d_arr
cudaMalloc((void **)&d_arr, nbytes);

// Transfer data from host to device
cudaMemcpy(d_arr, h_arr, nbytes, cudaMemcpyHostToDevice);

// Transfer data from device to host
cudaMemcpy(h_arr, d_arr, nbytes, cudaMemcpyDeviceToHost);

// Free the allocated memory
cudaFree(d_arr);
CUDA Program Flow

• At its most basic, the flow of a CUDA program is as follows:
  1. Allocate GPU memory
  2. Populate GPU memory with inputs from the host
  3. Execute a GPU kernel on those inputs
  4. Transfer outputs from the GPU back to the host
  5. Free GPU memory
• Let's take a look at a simple example that manipulates data
AXPY Example with OpenMP: Multicore

• y = α·x + y
  – x and y are vectors of size n
  – α is a scalar
• Data (x, y and α) are shared
  – Parallelization is relatively easy (see the sketch below)
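The slide's source listing did not survive extraction; a minimal OpenMP version written from the description above (function and variable names assumed) would look like:

// y = a*x + y on the CPU; iterations are independent, so a parallel for suffices
void axpy_omp(int n, float a, float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}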
CUDA Program Flow

• AXPY is an embarrassingly parallel problem
  – How can vector addition be parallelized?
  – How can we map this to GPUs?
    • Each thread does one element

[Figure: vectors A and B added element-wise into C, one thread per element]
AXPY Offloading To a GPU using CUDA

The offloaded version follows five steps (the sketch below fills them in):
• Memory allocation on device
• Memcpy from host to device
• Launch parallel execution
• Memcpy from device to host
• Deallocation of device memory
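The code listing on this slide was not captured; a sketch of those five steps with assumed names (axpy_kernel, d_x, d_y) follows:

__global__ void axpy_kernel(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];            // each thread handles one element
}

void axpy_gpu(int n, float a, float *x, float *y) {
    float *d_x, *d_y;
    size_t nbytes = n * sizeof(float);

    cudaMalloc((void **)&d_x, nbytes);            // memory allocation on device
    cudaMalloc((void **)&d_y, nbytes);

    cudaMemcpy(d_x, x, nbytes, cudaMemcpyHostToDevice);   // memcpy from host to device
    cudaMemcpy(d_y, y, nbytes, cudaMemcpyHostToDevice);

    int threads_per_block = 256;                  // launch parallel execution
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    axpy_kernel<<<nblocks, threads_per_block>>>(n, a, d_x, d_y);

    cudaMemcpy(y, d_y, nbytes, cudaMemcpyDeviceToHost);   // memcpy from device to host

    cudaFree(d_x);                                // deallocation of device memory
    cudaFree(d_y);
}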
CUDA Program Flow

• Consider the workflow of the example vector addition vecAdd.cu:
  1. Allocate space for A, B, and C on the GPU
  2. Transfer the initial contents of A and B to the GPU
  3. Execute a kernel in which each thread sums Ai and Bi, and stores the result in Ci
  4. Transfer the final contents of C back to the host
  5. Free A, B, and C on the GPU
• Exercise: modify the example to compute C = A + B + C, and then A = B * C; we will need both C and A on the host side after the GPU computation.
• Compile and run on bridges:
  – https://passlab.github.io/CSCE569/resources/HardwareSoftware.html#interactive
  – Copy the gpu_code_examples folder from my home folder
    • cp -r ~yan/gpu_code_examples ~
  – $ nvcc -Xcompiler -fopenmp vectorAdd.cu
  – $ ./a.out
More Examples and Exercises

• Matvec:
  – Version 1: each thread computes one element of the final vector (a sketch follows below)
  – Version 2:
• Matmul in assignment #4
  – Version 1: each thread computes one row of the final matrix C
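The matvec source is in the course's gpu_code_examples folder; as a hedged sketch only (names and row-major layout assumed, not the course's exact code), Version 1 with one thread per output element could look like:

// y = A * x for an m-by-n row-major matrix A; one thread per output element
__global__ void matvec_v1(int m, int n, const float *A, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int col = 0; col < n; col++) {
            sum += A[row * n + col] * x[col];
        }
        y[row] = sum;
    }
}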
CUDA SDK Examples

• CUDA Programming Manual:
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide
• CUDA SDK Examples on bridges
  – module load gcc/5.3.0 cuda/8.0
  – export CUDA_PATH=/opt/packages/cuda/8.0
  – /opt/packages/cuda/8.0/samples
  • Copy to your home folder
    – cp -r /opt/packages/cuda/8.0/samples ~/CUDA_samples
  • Do a "make" in the folder, and it will build all the sources
  • Or go to a specific example folder and make; it will build only that binary
• Find ones you are interested in and run them to see
Inspecting CUDA Programs

• Debugging a CUDA program:
  – cuda-gdb debugging tool, like gdb
• Profiling a program to examine the performance
  – nvprof tool, like gprof
  – nvprof ./vecAdd
Manycore GPU Architectures and Programming: Outline

• Introduction
  – GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
  – Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and library
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
  – OpenACC and OpenMP
Storing Data on the CPU

• A memory hierarchy emulates a large amount of low-latency memory
  – Cache data from a large, high-latency memory bank in a small, low-latency memory bank

[Figure: CPU memory hierarchy: CPU → L1 Cache → L2 Cache → DRAM]
GPU Memory Hierarchy

[Figure: a SIMT Thread Group (Registers, Local Memory), within SIMT Thread Groups on an SM (On-Chip Shared Memory/Cache), within SIMT Thread Groups on a GPU (Global Memory, Constant Memory, Texture Memory)]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Storing Data on the GPU

• Registers (SRAM)
  – Lowest latency memory space on the GPU
  – Private to each CUDA thread
  – A constant pool of registers per SM is divided among threads in resident thread blocks
  – Architecture-dependent limit on the number of registers per thread
  – Registers are not explicitly used by the programmer; they are implicitly allocated by the compiler
  – The -maxrregcount compiler option allows you to limit the number of registers per thread
Storing Data on the GPU

• Shared Memory (SRAM)
  – Declared with the __shared__ keyword
  – Low-latency, high bandwidth
  – Shared by all threads in a thread block
  – Explicitly allocated and managed by the programmer; a manual L1 cache
  – Stored on-SM, in the same physical memory as the GPU L1 cache
  – On-SM memory is statically partitioned between L1 cache and shared memory
Storing Data on the GPU

[Figure: per-SM L1 Cache → L2 Cache → Global Memory]

• GPU Caches (SRAM)
  – Behaviour of GPU caches is architecture-dependent
  – Per-SM L1 cache stored on-chip
  – Per-GPU L2 cache stored off-chip, caches values for all SMs
  – Due to the parallelism of accesses, GPU caches do not follow the same LRU rules as CPU caches
Storing Data on the GPU

• Constant Memory (DRAM)
  – Declared with the __constant__ keyword
  – Read-only
  – Limited in size: 64KB
  – Stored in device memory (same physical location as Global Memory)
  – Cached in a per-SM constant cache
  – Optimized for all threads in a warp accessing the same memory cell
Storing Data on the GPU

• Texture Memory (DRAM)
  – Read-only
  – Stored in device memory (same physical location as Global Memory)
  – Cached in a texture-only on-SM cache
  – Optimized for 2D spatial locality (caches are commonly only optimized for 1D locality)
  – Explicitly used by the programmer
  – Special-purpose memory
Storing Data on the GPU

• Global Memory (DRAM)
  – Large, high-latency memory
  – Stored in device memory (along with constant and texture memory)
  – Can be declared statically with __device__
  – Can be allocated dynamically with cudaMalloc
  – Explicitly managed by the programmer
  – Optimized for all threads in a warp accessing neighbouring memory cells
Static Global Memory

• Static Global Memory has a fixed size throughout execution time:

__device__ float devData;

__global__ void checkGlobalVariable() {
    printf("devData has value %f\n", devData);
}

• Initialized using cudaMemcpyToSymbol:

cudaMemcpyToSymbol(devData, &hostData, sizeof(float));

• Fetched using cudaMemcpyFromSymbol:

cudaMemcpyFromSymbol(&hostData, devData, sizeof(float));
Dynamic Global Memory

• We have already seen dynamic global memory
  – cudaMalloc dynamically allocates global memory
  – cudaMemcpy transfers to/from global memory
  – cudaFree frees global memory allocated by cudaMalloc
• cudaMemcpy supports 4 types of transfer:
  – cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
• You can also memset global memory (example below):

cudaError_t cudaMemset(void *devPtr, int value, size_t count);
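For example, reusing the d_arr and nbytes names from the earlier listing (assumed here), a device buffer can be zeroed before use:

cudaMemset(d_arr, 0, nbytes);   // fill the first nbytes bytes of the allocation with 0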
Global Memory Access Patterns

• CPU caches are optimized for linear, iterative memory accesses
  – Cache lines ensure that accessing one memory cell brings neighbouring memory cells into cache
  – If an application exhibits good spatial or temporal locality (which many do), later references will also hit in cache

[Figure: CPU → Cache → System Memory]
Global Memory Access Patterns

• GPU caching is a more challenging problem
  – Thousands of threads cooperating on a problem
  – Difficult to predict the next round of accesses for all threads
• For efficient global memory access, GPUs instead rely on:
  – Large device memory bandwidth
  – Aligned and coalesced memory access patterns
  – Maintaining sufficient pending I/O operations to keep the memory bus saturated and hide global memory latency
Global Memory Access Patterns

• Achieving aligned and coalesced global memory accesses is key to optimizing an application's use of global memory bandwidth (a sketch contrasting the two patterns follows below)
  – Coalesced: the threads within a warp reference memory addresses that can all be serviced by a single global memory transaction (think of a memory transaction as the process of bringing a cache line into the cache)
  – Aligned: the global memory accesses by threads within a warp start at an address boundary that is an even multiple of the size of a global memory transaction
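As an illustration (not from the slides; kernel names are made up), the two copy kernels below read the same data with very different access patterns:

// Coalesced: consecutive threads in a warp touch consecutive elements, so a
// warp's 32 loads can be serviced by one or two memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch elements 'stride' apart, so a warp's loads
// scatter across many transactions and most loaded bytes go unused.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}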
Global Memory Access Patterns

• A global memory transaction is either 32 or 128 bytes
  – The size of a memory transaction depends on which caches it passes through
  – If L1 + L2: 128 bytes
  – If L2 only: 32 bytes
  – Which caches a global memory transaction passes through depends on GPU architecture and the type of access (read vs. write)
Global Memory Access Patterns

• Aligned and Coalesced Memory Access (w/ L1 cache)
  – 32-thread warp, 128-byte memory transaction
• With a 128-byte access, a single transaction is required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and Coalesced Memory Access (w/ L1 cache)
• With a 128-byte access, two memory transactions are required to load all requested bytes. Only half of the loaded bytes are used.
Global Memory Access Patterns

• Misaligned and Uncoalesced Memory Access (w/ L1 cache)
• With uncoalesced loads, many more bytes are loaded than requested
Global Memory Access Patterns

• Misaligned and Uncoalesced Memory Access (w/ L1 cache)
• One factor to consider with uncoalesced loads: while the efficiency of this access is very low, it may bring many cache lines into the L1/L2 cache which are used by later memory accesses. The GPU is flexible enough to perform well, even for applications that present suboptimal access patterns.
Global Memory Access Patterns

• Memory accesses that are not cached in the L1 cache are serviced by 32-byte transactions
  – This can improve memory bandwidth utilization
  – However, the L2 cache is device-wide, has higher latency than L1, and is still relatively small ⇒ many applications may take a performance hit if the L1 cache is not used for reads
Global Memory Access Patterns

• Aligned and Coalesced Memory Access (w/o L1 cache)
• With 32-byte transactions, four transactions are required and all of the loaded bytes are used
Global Memory Access Patterns

• Misaligned and Coalesced Memory Access (w/o L1 cache)
• With 32-byte transactions, extra memory transactions are still required to load all requested bytes, but the number of wasted bytes is likely reduced compared to 128-byte transactions.
Global Memory Access Patterns

• Misaligned and Uncoalesced Memory Access (w/o L1 cache)
• With uncoalesced loads, more bytes are loaded than requested, but with better efficiency than with 128-byte transactions
Global Memory Access Patterns

• Global Memory Writes are always serviced by 32-byte transactions
Global Memory and Special-Purpose Memory

• Global memory is widely useful and as easy to use as CPU DRAM
• Limitations
  – Easy to find applications with memory access patterns that are intrinsically poor for global memory
  – Many threads accessing the same memory cell ⇒ poor global memory efficiency
  – Many threads accessing sparse memory cells ⇒ poor global memory efficiency
• Special-purpose memory spaces to address these deficiencies in global memory
  – Specialized for different types of data, different access patterns
  – Give more control over data movement and data placement than CPU architectures do
Shared Memory

• Declared with the __shared__ keyword
• Low-latency, high bandwidth
• Shared by all threads in a thread block
• Explicitly allocated and managed by the programmer; a manual L1 cache
• Stored on-SM, in the same physical memory as the GPU L1 cache
• On-SM memory is statically partitioned between L1 cache and shared memory
Shared Memory Allocation

• Shared memory can be allocated statically or dynamically
• Statically Allocated Shared Memory
  – Size is fixed at compile time
  – Can declare many statically allocated shared memory variables
  – Can be declared globally or inside a device function
  – Can be multi-dimensional arrays

__shared__ int s_arr[256][256];
Shared Memory Allocation

• Dynamically Allocated Shared Memory
  – Size in bytes is set at kernel launch with a third kernel launch configuration parameter
  – Can only have one dynamically allocated shared memory array per kernel
  – Must be a one-dimensional array

__global__ void kernel(...) {
    extern __shared__ int s_arr[];
    ...
}

kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);
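A usage note (an assumption, not from the slide): shared_memory_bytes is just a byte count, typically derived from the block size, for example:

int threads_per_block = 256;
size_t shared_memory_bytes = threads_per_block * sizeof(int);  // one element of s_arr per thread
kernel<<<nblocks, threads_per_block, shared_memory_bytes>>>(...);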
Matvec using shared memory
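The listing for this slide was not captured in the transcript; a minimal sketch (assumed names, tile size, and row-major layout, not the course's exact code) that stages the vector x through shared memory tile by tile:

#define TILE 256   // must equal blockDim.x in this sketch

__global__ void matvec_shared(int m, int n, const float *A, const float *x, float *y) {
    __shared__ float x_tile[TILE];                 // one tile of x, reused by every row in the block
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    for (int start = 0; start < n; start += TILE) {
        int col = start + threadIdx.x;             // each thread loads one element of the tile
        x_tile[threadIdx.x] = (col < n) ? x[col] : 0.0f;
        __syncthreads();                           // the whole tile must be loaded before use

        int len = min(TILE, n - start);
        if (row < m) {
            for (int j = 0; j < len; j++) {
                sum += A[row * n + start + j] * x_tile[j];
            }
        }
        __syncthreads();                           // finish with this tile before overwriting it
    }
    if (row < m) y[row] = sum;
}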
Matrix Multiplication V1 and V2 in Assignment #4

• https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory
GPU Memory Hierarchy

[Figure: a SIMT Thread Group (Registers, Local Memory), within SIMT Thread Groups on an SM (On-Chip Shared Memory/Cache), within SIMT Thread Groups on a GPU (Global Memory, Constant Memory, Texture Memory)]

• More complex than the CPU memory
  – Many different types of memory, each with special-purpose characteristics
    • SRAM
    • DRAM
  – More explicit control over data movement
Constant Memory

• Declared with the __constant__ keyword
• Read-only
• Limited in size: 64KB
• Stored in device memory (same physical location as Global Memory)
• Cached in a per-SM constant cache
• Optimized for all threads in a warp accessing the same memory cell
Constant Memory

• As its name suggests, constant memory is best used for storing constants
  – Values which are read-only
  – Values that are accessed identically by all threads
• For example: suppose all threads are evaluating the equation

y = mx + b

for different values of x, but identical values of m and b
  – All threads would reference m and b with the same memory operation
  – This broadcast access pattern is optimal for constant memory (a sketch follows below)
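A minimal sketch of that pattern (only m and b come from the slide's equation; everything else is assumed):

__constant__ float m;   // slope, identical for every thread
__constant__ float b;   // intercept, identical for every thread

__global__ void eval_line(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = m * x[i] + b;   // the reads of m and b are broadcast to the whole warp
}

// Host side, before the launch:
//   cudaMemcpyToSymbol(m, &h_m, sizeof(float));
//   cudaMemcpyToSymbol(b, &h_b, sizeof(float));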
Constant Memory

• A simple 1D stencil
  – The target cell is updated based on its 8 neighbors, weighted by some constants c0, c1, c2, c3
Constant Memory

• constantStencil.cu contains an example 1D stencil that uses constant memory

__constant__ float coef[RADIUS + 1];

cudaMemcpyToSymbol(coef, h_coef, (RADIUS + 1) * sizeof(float));

__global__ void stencil_1d(float *in, float *out, int N) {
    ...
    for (int i = 1; i <= RADIUS; i++) {
        tmp += coef[i] * (smem[sidx + i] - smem[sidx - i]);
    }
}
CUDA Synchronization

• When using shared memory, you often must coordinate accesses by multiple threads to the same data
• CUDA offers synchronization primitives that allow you to synchronize among threads
CUDA Synchronization

__syncthreads
  – Synchronizes execution across all threads in a thread block
  – No thread in a thread block can progress past a __syncthreads before all other threads have reached it
  – __syncthreads ensures that all changes to shared and global memory by threads in this block are visible to all other threads in this block

__threadfence_block
  – All writes to shared and global memory by the calling thread are visible to all other threads in its block after this fence
  – Does not block thread execution
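As an illustration of why __syncthreads matters (assumed names, not from the slides), a block-wide sum in shared memory needs a barrier between every phase:

// Each block sums blockDim.x inputs into one output; blockDim.x must be a power of two.
__global__ void block_sum(const float *in, float *out) {
    extern __shared__ float sdata[];               // dynamically allocated shared memory
    int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                               // all loads must be visible before reducing

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();                           // each step must finish before the next begins
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Launched with the shared array size as the third configuration parameter:
//   block_sum<<<nblocks, threads_per_block, threads_per_block * sizeof(float)>>>(d_in, d_out);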
CUDA Synchronization

__threadfence
  – All writes to global memory by the calling thread are visible to all other threads in its grid after this fence
  – Does not block thread execution

__threadfence_system
  – All writes to global memory, page-locked host memory, and memory of other CUDA devices by the calling thread are visible to all other threads on all CUDA devices and all host threads after this fence
  – Does not block thread execution
Suggested Readings

1. Chapters 2, 4, 5 in Professional CUDA C Programming
2. Cliff Woolley. GPU Optimization Fundamentals. 2013. https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf
3. Mark Harris. Using Shared Memory in CUDA C/C++. http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
4. Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf