Compute Unified Device Architecture
A Highly Scalable Parallel Programming Framework
Submitted in partial fulfillment of the requirements for the Maryland high school diploma
Andrew “Shirley” Das Sarma (Calico Cannonballs McMullins), Blair Computational Methods 2009
Background: Why CUDA? Scientific Computing
• A large computer market
• Arithmetic-intensive
• Huge datasets
• Distributed
• Parallel
Background: Why CUDA? Moore’s Law
• Transistor counts double roughly every 24 months
• Slowing down?
• New tricks
  – Multicore
  – Multi-node
• Metrics
  – Transistors per circuit
  – Performance per unit cost
Background: Why CUDA? CPU vs. GPU
• CPUs optimized for general workloads
  – More instructions per second
  – Pipelining, lookahead branch prediction, etc.
• GPUs optimized for parallel calculations
  – 1 pixel shader = 1 thread
  – Lots of pixel shaders
  – Lots of arithmetic
  – On-card DRAM
In raw arithmetic throughput, GPUs surpass CPUs.
What is CUDA?
• GPGPU (not just graphics, or no graphics)
• Runs on CPU and GPU
• High-level language
  – Extension of C
  – FORTRAN coming soon
• One compiler
• Only NVIDIA so far
  – Tesla is NVIDIA's dedicated compute line
  – Larrabee is Intel's competing design, not a CUDA target
• Unfathomably cool
How it works
• C language extension
  – Language constructs
  – Keywords
• Low-overhead threads
• Independent blocks
• CPU or GPU: choose one
  – CPU good for sequential or non-numerical tasks
  – GPU good for highly parallel calculations
GPU block diagram
[Figure: GPU block diagram. Labeled components: Host, Input Assembler, Thread Execution Manager, texture units, eight Parallel Data Caches, load/store units, Global Memory.]
CUDA: A C extension
• Declspecs: __host__, __global__, __device__
• Built-in variables: blockIdx, threadIdx, etc.
• Intrinsics: __syncthreads()
• Runtime API
  – cudaMalloc()
  – cudaMemcpy()
  – etc.
• Kernel launch: kernel<<<blocks,threads>>>()
CUDA: A C extension
[Figure: compilation flow. Integrated source (foo.cu) goes into nvcc/cudacc (EDG C/C++ frontend, Open64 Global Optimizer). CPU host code (foo.cpp) goes to gcc / cl. GPU assembly (foo.s) goes through OCG to G80 SASS (foo.sass).]
Background: Pointers
• Pointer: a variable that holds the address of some other data in memory
• malloc(size_t sz) returns a pointer to sz bytes of available memory, or NULL on failure
• To allocate a 20-element int array:
    int * A = (int *) malloc(20*sizeof(int));
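The allocation pattern above can be sketched in plain C as follows. This is a minimal illustration, not code from the talk: the helper name `make_iota` is hypothetical, and the NULL check and the matching `free()` are additions that the one-line slide example leaves out.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an n-element int array on the heap and
   fill it with 0..n-1. Returns NULL if malloc fails; the caller owns the
   memory and must free() it. */
int *make_iota(size_t n) {
    int *A = (int *) malloc(n * sizeof(int));
    if (A == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        A[i] = (int) i;   /* A[i] means the same thing as *(A + i) */
    return A;
}
```

A caller would write `int *A = make_iota(20);`, use `A[0]` through `A[19]`, and finish with `free(A);`.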
Background: Threads
• Thread: a sequence of instructions
• Each core runs one thread at a time
  – Multicore: several threads at once
• Desktop computer has thousands of threads
  – Usually fewer than 4 cores
• GPU comfortably runs millions of threads
  – Hundreds of cores
CUDA execution model
• Arrays of parallel threads
• Each thread executes the same code
• Work determined by threadIdx, blockIdx, blockDim, gridDim
• Blocks: collections of threads
  – Threads in a block can cooperate and share fast local memory
  – No inter-block cooperation
• 1D, 2D, or 3D block/thread numbering
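To make the indexing concrete, here is a sequential C emulation of a 1D launch. It is only a sketch: the `ThreadCtx` struct and the function names are hypothetical stand-ins for CUDA's built-in variables, and a real GPU would run the loop bodies in parallel and in no guaranteed order.

```c
/* Hypothetical stand-in for blockIdx.x, threadIdx.x, blockDim.x, gridDim.x. */
typedef struct { int blockIdx, threadIdx, blockDim, gridDim; } ThreadCtx;

/* The "kernel": every logical thread runs this same code and uses its
   coordinates to pick the one element it is responsible for. */
static void mark_kernel(ThreadCtx t, int *hits) {
    int i = t.blockDim * t.blockIdx + t.threadIdx;  /* global index */
    hits[i] += 1;
}

/* Sequential emulation of mark_kernel<<<blocks, threads>>>(hits). */
void launch_mark(int blocks, int threads, int *hits) {
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threads; t++) {
            ThreadCtx ctx = { b, t, threads, blocks };
            mark_kernel(ctx, hits);
        }
}
```

With 3 blocks of 4 threads, each of the 12 elements of `hits` is visited exactly once, which is the property that lets each thread own one array element.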
CUDA execution model
[Figure 3.2: An example of CUDA thread organization. The host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid (Grid 1, Grid 2). Grid 1 contains Block (0, 0), Block (1, 0), Block (0, 1), and Block (1, 1). Block (1, 1) is expanded to show its threads, indexed in three dimensions from Thread (0,0,0) to Thread (3,1,0), with a second layer (0,0,1) to (3,0,1). Courtesy: NVIDIA.]
CUDA execution model
• All functions are declared __host__, __global__, or __device__
• Host: Runs on CPU, called from CPU
• Global: Runs on GPU, called from CPU
• Device: Runs on GPU, called from GPU
CUDA memory model
• Global memory
  – Faster than CPU memory
  – Slower than cache
  – Accessible by all threads
• Block shared memory
  – Small-ish, fast, shared by threads in a block
• Thread memory
  – Small, fast, local
• Texture memory
  – Small, fast, global
Example: SAXPY (Y = aX + Y)
    __host__ void SAXPYCPU(float * X, float * Y, float a, int N){
        for(int i=0; i<N; i++)
            Y[i] = a*X[i] + Y[i];
    }
    __global__ void SAXPYGPU(float * X, float * Y, float a){
        //one thread per element; the launch must cover exactly N elements
        int i = blockDim.x*blockIdx.x+threadIdx.x;
        Y[i] = a*X[i] + Y[i];
    }
Example: SAXPY
    #include <stdlib.h> //malloc, free
    __host__ int main() {
        int N = 1073741824; //2^30 ≈ 1 billion
        size_t sz = N * sizeof(float); //bytes we need
        float * h_X = (float *) malloc(sz); //allocate the
        float * h_Y = (float *) malloc(sz); //host memory
        /*some code to fill up h_X and h_Y*/
        float * d_X, * d_Y;
        cudaMalloc((void **)&d_X, sz); //allocate the
        cudaMalloc((void **)&d_Y, sz); //device memory
        //move the data onto the GPGPU
        cudaMemcpy(d_X, h_X, sz, cudaMemcpyHostToDevice);
        cudaMemcpy(d_Y, h_Y, sz, cudaMemcpyHostToDevice);
        //data is on the device; time to do some SAXPY
        int threadsPerBlock = 256;
        int blocks = N / threadsPerBlock; //exact, since 256 divides N
        SAXPYGPU<<<blocks, threadsPerBlock>>>(d_X, d_Y, 2.0f); //pass the device pointers
        cudaDeviceSynchronize(); //wait until done (replaces the deprecated cudaThreadSynchronize)
        cudaMemcpy(h_Y, d_Y, sz, cudaMemcpyDeviceToHost);
        cudaFree(d_X);
        cudaFree(d_Y); //we no longer need the device memory
        free(h_X);
        free(h_Y); //nor the host memory
        return 0;
    }
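The launch computes `blocks = N / threadsPerBlock`, which only covers every element when the block size divides N. A common way to handle other sizes, sketched here in plain C under that assumption, is to round the block count up and have each thread check its index. The names `blocks_for` and `saxpy_guarded` are hypothetical, and the loops emulate sequentially what the GPU would do in parallel.

```c
/* Number of blocks needed to cover n elements with `threads` threads per
   block. Rounds up, so the last block may be only partly used. */
int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;
}

/* Sequential emulation of a guarded SAXPY launch: indices past n are
   skipped, as an `if (i < n)` test inside the kernel would do. */
void saxpy_guarded(const float *X, float *Y, float a, int n, int threads) {
    int blocks = blocks_for(n, threads);
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threads; t++) {
            int i = threads * b + t;      /* same formula as the kernel */
            if (i < n)                    /* the bounds guard */
                Y[i] = a * X[i] + Y[i];
        }
}
```

For example, 1000 elements at 256 threads per block need 4 blocks, and the last 24 threads of the final block do nothing.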
Example: SAXPY
That was easy.
Example: 2D integration
Simpson 2D coefficient matrix:
$$\begin{pmatrix} 1 & 4 & 2 & \cdots & 4 & 1 \\ 4 & 16 & 8 & \cdots & 16 & 4 \\ 2 & 8 & 4 & \cdots & 8 & 2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 4 & 16 & 8 & \cdots & 16 & 4 \\ 1 & 4 & 2 & \cdots & 4 & 1 \end{pmatrix}$$
Our function: $f(x,y) = e^{xy}\,(x+y+\pi)^{-1/2}\,\sin(\log(x-y+\pi))$
Want $\iint f(x,y)\,dA$ over $|x|,|y| \le 1$
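As a reference point for the code that follows, here is a compact C sketch of composite Simpson integration in two dimensions. It assumes the 2D weight at each grid point is the product of the two 1D Simpson weights (1, 4, 2, ..., 4, 1) and that n is even. The names `simpson_w`, `simpson2d`, and `x2y2` are hypothetical, and the sketch is written for clarity rather than to match the talk's implementation.

```c
/* 1D composite Simpson weight for node i of n intervals (n even). */
static int simpson_w(int i, int n) {
    if (i == 0 || i == n) return 1;
    return (i % 2 == 0) ? 2 : 4;
}

/* Integrate f over [x0,xf] x [y0,yf] on an (n+1) x (n+1) grid, n even.
   Each point is weighted by the product of its two 1D weights, and the
   weighted sum is scaled by hx*hy/9. */
double simpson2d(double (*f)(double, double),
                 double x0, double xf, double y0, double yf, int n) {
    double hx = (xf - x0) / n, hy = (yf - y0) / n, sum = 0.0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            sum += simpson_w(i, n) * simpson_w(j, n)
                 * f(x0 + i * hx, y0 + j * hy);
    return sum * hx * hy / 9.0;
}

/* Test integrand with a known integral: x^2 y^2 over [-1,1]^2 is 4/9. */
double x2y2(double x, double y) { return x * x * y * y; }
```

Simpson's rule is exact for polynomials up to degree three in each variable, so `simpson2d(x2y2, -1, 1, -1, 1, 16)` should agree with 4/9 up to rounding. To integrate the function above, pass a function that returns `exp(x*y) / sqrt(x+y+pi) * sin(log(x-y+pi))` using `<math.h>`.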
Example: 2D integration
    __host__ int main(){
        int B = N/T; //(N+1)^2=points, T=threads, B=blocks
        size_t sz = B*N*sizeof(dtyp); //dtyp is typedef’d
        dtyp * d, *h = (dtyp *) malloc(sz);
        cudaMalloc((void **)&d, sz);
        dim3 Threads(T);
        dim3 Grid(B, N); //W=bound of integration
        S2DGPU<<<Grid, Threads>>>(-W, W, -W, W, d); //INVOKE
        cudaDeviceSynchronize(); //wait for it to finish
        cudaMemcpy(h, d, sz, cudaMemcpyDeviceToHost);
        cudaFree(d);
        dtyp u=0;
        for(int i=0; i<B*N; i++)
            u += h[i]; //sum the per-block results
        u += f2(W, W); //algorithm misses last point
        u *= (dtyp)4*W*W/((dtyp)9*N*N); //normalize; the cast keeps 9*N*N from overflowing int
        free(h);
        return 0;
    }
Example: 2D integration
    __host__ void S2DCPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp* a){
        *a=0;
        dtyp x=x0, y;
        for(int i=0; i<=N; i++){
            y = y0;
            for(int j=0; j<=N; j++){
                //c1, c2: point lies on an x or y boundary
                bool c1 = i==0||i==N, c2 = j==0||j==N;
                *a+=(c1?(c2?1:(j%2==0?2:4)):
                    (i%2==0?(c2?2:(j%2==0?4:8)):(c2?4:(j%2==0?8:16))))*f2(x,y);
                y += (yf-y0)/N;
            }
            x += (xf-x0)/N;
        }
    }
Example: 2D integration
    __global__ void S2DGPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp * a){
        int X = blockIdx.x*blockDim.x+threadIdx.x;
        int Y = blockIdx.y;
        dtyp x = x0+(xf-x0)*X/(gridDim.x*blockDim.x);
        dtyp y = y0+(yf-y0)*Y/gridDim.y;
        __shared__ dtyp u[T];
        bool evx = (X&1)==0, evy = (Y&1)==0;
        //each thread stores its weighted sample in shared memory
        u[threadIdx.x] = (X==0?(Y==0?1:(evy?2:4)):(evx?(Y==0?2:(evy?4:8)):
            (Y==0?4:(evy?8:16))))*F(x,y);
        //thread 0 of blocks 0 and 1 picks up the x=xf and y=yf edges
        if(threadIdx.x==0){
            if(blockIdx.x==0)
                u[threadIdx.x]+=(blockIdx.y==0?1:((blockIdx.y&1)==0?2:4))*F(xf,y);
            else if(blockIdx.x==1)
                u[threadIdx.x]+=(blockIdx.y==0?1:((blockIdx.y&1)==0?2:4))
                    *F(x0+(xf-x0)*blockIdx.y/gridDim.y, yf);
        }
        __syncthreads();
        //thread 0 sums the block's samples and writes one result per block
        if(threadIdx.x==0){
            for(int i=1; i<T; i++)
                u[0]+=u[i];
            a[blockIdx.x*gridDim.y+Y]=u[0];
        }
    }
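The kernel above has thread 0 add up all T shared-memory entries one after another. A commonly used alternative, which is not what the slide does, is a pairwise (tree) reduction that needs about log2(T) steps when the additions within each step run in parallel. The sketch below emulates it sequentially in plain C; the name `tree_sum` is hypothetical and T is assumed to be a power of two.

```c
/* In-place pairwise (tree) sum of u[0..T-1], T a power of two.
   Each pass of the outer loop is one step that the threads of a block
   could perform in parallel; on a GPU a barrier such as __syncthreads()
   would separate the steps. */
double tree_sum(double *u, int T) {
    for (int stride = T / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)   /* "thread" i */
            u[i] += u[i + stride];
    return u[0];
}
```

For T = 8 this takes three steps (strides 4, 2, 1) instead of seven serial additions by a single thread.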
Next-gen GPGPUs
NVIDIA Tesla S1070
– 960 cores @ 1.44 GHz
– 16 GB DRAM
– No more, no less
– 506 GB/s memory bandwidth
– 4000 GFLOPS
– 800 W (0.2 W/GFLOPS)
– $4,000 ($1/GFLOPS)
Intel Xeon 5500
– 4 cores @ 3.2 GHz
– Up to 192 GB DRAM*
– *Memory not included
– 64 GB/s memory bandwidth
– ~50 GFLOPS
– 130 W (2.6 W/GFLOPS)
– $2,300 ($46/GFLOPS)
Runtime data: 2D integration
dtyp    N      GPU 16  GPU 256  GPU 512  CPU 16  CPU 256  CPU 2048
double  1024   8       9        -        114     208      -
double  16384  1685    2047     -        25996   15192    17056
float   1024   8       9        10       113     210      -
float   16384  353     255      361      29949   23907    23009
Note: All times are in milliseconds.