![Page 1: Applications of GPU Computing - Rochester Institute …meseec.ce.rit.edu/722-projects/fall2011/1-2.pdfApplications of GPU Computing Alex Karantza 0306-722 Advanced Computer Architecture](https://reader034.vdocument.in/reader034/viewer/2022051801/5ada89407f8b9afc0f8ca653/html5/thumbnails/1.jpg)
Applications of GPU Computing Alex Karantza
0306-722 Advanced Computer Architecture Fall 2011
Outline
• Introduction
• GPU Architecture
▫ Multiprocessing
▫ Vector ISA
• GPUs in Industry
▫ Scientific Computing
▫ Image Processing
▫ Databases
• Examples and Benefits
Introduction
“GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee
Author of LINPACK
(As typified by NVIDIA CUDA)
GPU Architecture
• Parallel Coprocessor to conventional CPUs
▫ Implements a SIMD structure: multiple threads running the same code
• Grid of Blocks of Threads
▫ Thread local registers
▫ Block local memory and control
▫ Global memory
Grids, Blocks, and Threads
• Thread: runs on a Thread Processor
▫ Contains local registers and memory; scalar processor
• Block: runs on a Multiprocessor
▫ Shared memory and registers; shared control logic
• Grid: runs on the Device(s)
▫ Global memory, can be easily distributed across devices
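As a minimal sketch of how the three levels show up in CUDA C, the kernel below has each block reverse its own slice of an array. The names `reverseTiles`, `reverseOnGpu`, and `TILE` are hypothetical and chosen only for illustration; the point is that the index lives in a per-thread register, the tile lives in per-block shared memory, and the input and output arrays live in global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // threads per block; an assumed value for this sketch

// Each block reverses its own TILE-element slice of the input.
__global__ void reverseTiles(const int *in, int *out)
{
    __shared__ int tile[TILE];                     // shared by the threads of one block
    int local  = threadIdx.x;                      // per-thread value, held in a register
    int global = blockIdx.x * blockDim.x + local;  // this thread's position in the grid

    tile[local] = in[global];                      // global memory -> shared memory
    __syncthreads();                               // barrier across the block
    out[global] = tile[blockDim.x - 1 - local];    // shared memory -> global memory
}

// Hypothetical host helper; n must be a multiple of TILE.
void reverseOnGpu(const int *h_in, int *h_out, int n)
{
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    reverseTiles<<<n / TILE, TILE>>>(d_in, d_out);  // grid of n/TILE blocks
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

int main()
{
    const int n = 4 * TILE;
    static int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;
    reverseOnGpu(h_in, h_out, n);
    printf("%d %d %d\n", h_out[0], h_out[255], h_out[256]);  // prints "255 0 511"
    return 0;
}
```

Error checking is omitted to keep the sketch short; real code would check the return value of each runtime call.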
GPU Architecture
• Processors also implement vector instructions
▫ Vectors of length 2, 3, 4 of any fundamental type: integer, float, bits, predicate
▫ Instructions for conversion between vector, scalar
• To encourage uniform execution, rather than
branching for conditionals, use predicates
▫ All instructions can be conditionally executed based on
predicate registers
Vectors and Predicates
.global .v4 .f32 V; // a length-4 vector of floats
.shared .v2 .u16 uv; // a length-2 vector of unsigned 16-bit ints
.global .v4 .b8 v; // a length-4 vector of bytes
.reg .s32 a, b; // two 32-bit signed ints
.reg .pred p; // a predicate register
setp.lt.s32 p, a, b; // if a < b, set p
@p add.v4.f32 V, V, {1,0,0,0}; // if p, V.x = V.x + 1
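The same idea can be sketched one level up, in CUDA C. This is only an illustration under assumptions: `bumpIfLess` and `bumpOnGpu` are hypothetical names, `float4` is CUDA's built-in 4-wide vector type, and the conditional expression is written so that the compiler can emit a predicated or select instruction rather than a branch (whether it does is up to the compiler).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// For each element i: if a[i] < b[i], add 1 to V[i].x.
__global__ void bumpIfLess(float4 *V, const int *a, const int *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool p = a[i] < b[i];      // plays the role of setp.lt.s32
    float4 v = V[i];           // load the whole 4-float vector
    v.x += p ? 1.0f : 0.0f;    // conditional update without an explicit branch
    V[i] = v;
}

// Hypothetical host helper that runs the kernel on small host arrays.
void bumpOnGpu(float4 *h_V, const int *h_a, const int *h_b, int n)
{
    float4 *d_V; int *d_a, *d_b;
    cudaMalloc(&d_V, n * sizeof(float4));
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMemcpy(d_V, h_V, n * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);
    bumpIfLess<<<(n + 127) / 128, 128>>>(d_V, d_a, d_b, n);
    cudaMemcpy(h_V, d_V, n * sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(d_V); cudaFree(d_a); cudaFree(d_b);
}

int main()
{
    const int n = 4;
    int a[n] = {0, 5, 2, 9}, b[n] = {1, 1, 3, 9};
    float4 V[n];
    for (int i = 0; i < n; i++) V[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    bumpOnGpu(V, a, b, n);
    for (int i = 0; i < n; i++) printf("%.0f ", V[i].x);  // prints "1 0 1 0 "
    printf("\n");
    return 0;
}
```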
NSF Keeneland
360 Tesla 20-series GPUs
GPUs in Industry
• Many applications have been developed to use GPUs
for supercomputing in various fields
▫ Scientific Computing
CFD, Molecular Dynamics, Genome Sequencing,
Mechanical Simulation, Quantum Electrodynamics
▫ Image Processing
Registration, interpolation, feature detection, recognition,
filtering
▫ Data Analysis
Databases, sorting and searching, data mining
Major Categories of Algorithm
• 2D/3D filtering operations
• n-body simulations
• Parallel tree operations – searching/sorting
• All suited to GPUs because of data-parallel
requirements and uniform kernels
Computational Fluid Dynamics
• Simulate fluids in a discrete volume over time
• Involves solving the Navier-Stokes partial differential
equations iteratively on a grid
▫ Can be considered a filtering operation
• When parallelized on a GPU using multigrid solvers,
10x speedups have been reported
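To see why an iterative grid solve looks like a filter, here is a much simplified sketch: one Jacobi relaxation sweep for the Laplace equation, where every interior cell becomes the average of its four neighbours. This is an assumption-laden stand-in, not a Navier-Stokes or multigrid solver; `jacobiStep` and `relaxCenter` are hypothetical names, and the boundary condition (left edge held at 1, other edges at 0) is made up for the example.

```cuda
#include <cstdio>
#include <utility>
#include <cuda_runtime.h>

// One Jacobi sweep: each interior cell becomes the mean of its 4 neighbours.
__global__ void jacobiStep(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int i = y * w + x;
    if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
        out[i] = in[i];                      // boundary cells keep their value
        return;
    }
    out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - w] + in[i + w]);
}

// Hypothetical helper: relax a w x h grid and return the value near the centre.
float relaxCenter(int w, int h, int iters)
{
    size_t bytes = (size_t)w * h * sizeof(float);
    float *h_u = new float[w * h];
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            h_u[y * w + x] = (x == 0) ? 1.0f : 0.0f;  // left edge held at 1

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h_u, bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16), blocks((w + 15) / 16, (h + 15) / 16);
    for (int k = 0; k < iters; k++) {
        jacobiStep<<<blocks, threads>>>(d_a, d_b, w, h);
        std::swap(d_a, d_b);                 // latest result is always in d_a
    }
    cudaMemcpy(h_u, d_a, bytes, cudaMemcpyDeviceToHost);
    float c = h_u[(h / 2) * w + w / 2];
    delete[] h_u;
    cudaFree(d_a);
    cudaFree(d_b);
    return c;
}

int main()
{
    printf("centre value after 500 sweeps: %f\n", relaxCenter(32, 32, 500));
    return 0;
}
```

Every cell of a sweep is independent of the others, which is the property that lets the whole grid be updated in parallel.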
Molecular Dynamics
• Large set of particles with forces between them –
protein behavior, material simulation
• Calculating forces between particles can be done in
parallel for each particle
• Accumulation of forces can be implemented as
multilevel parallel sums
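The two steps above can be sketched as a pair of kernels. This is a toy under stated assumptions: particles sit on a 1D line, the force law is a softened inverse-square repulsion picked for illustration, and `pairForces`, `blockSum`, and `totalForce` are hypothetical names. The first kernel uses one thread per particle; the second sums values in two levels, a tree reduction in shared memory inside each block, then a final sum of the per-block results on the host.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128  // threads per block; must be a power of two for blockSum

// One thread per particle: accumulate the force exerted by every other particle.
__global__ void pairForces(const float *x, float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {
        float d = x[i] - x[j];
        float inv = rsqrtf(d * d + 1e-3f);   // softened 1/distance
        sum += d * inv * inv * inv;          // contributes 0 when j == i
    }
    f[i] = sum;
}

// Level 1 of the sum: tree reduction within each block using shared memory.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float s[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical helper: returns the sum of all per-particle forces.
float totalForce(const float *h_x, int n)
{
    int blocks = (n + THREADS - 1) / THREADS;
    float *d_x, *d_f, *d_partial;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_f, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    pairForces<<<blocks, THREADS>>>(d_x, d_f, n);
    blockSum<<<blocks, THREADS>>>(d_f, d_partial, n);

    float *h_partial = new float[blocks];
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;                      // level 2: sum the per-block results
    for (int b = 0; b < blocks; b++) total += h_partial[b];

    delete[] h_partial;
    cudaFree(d_x); cudaFree(d_f); cudaFree(d_partial);
    return total;
}

int main()
{
    const int n = 512;
    static float x[n];
    for (int i = 0; i < n; i++) x[i] = 0.1f * i;
    // Pairwise forces are equal and opposite, so the total should be near zero.
    printf("total force: %f\n", totalForce(x, n));
    return 0;
}
```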
Genetics
• Large strings of genome sequences must be searched
through to organize and identify samples
• GPUs enable multiple parallel queries to the
database to perform string matching
• Again, order-of-magnitude speedups have been reported
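As a rough sketch of parallel string matching, the kernel below gives every thread one candidate start position in the sequence and counts exact matches of a pattern. This is a brute-force illustration, not the alignment method of any particular tool; it parallelizes over positions for a single query rather than over many queries, and `matchAll` and `countMatches` are hypothetical names.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Each thread tests whether the pattern occurs at its own start position.
__global__ void matchAll(const char *text, int n, const char *pat, int m, int *hits)
{
    int start = blockIdx.x * blockDim.x + threadIdx.x;
    if (start + m > n) return;               // pattern would run off the end
    for (int k = 0; k < m; k++)
        if (text[start + k] != pat[k]) return;
    atomicAdd(hits, 1);                      // one more exact match found
}

// Hypothetical helper: counts (possibly overlapping) occurrences of pat in text.
int countMatches(const char *text, const char *pat)
{
    int n = (int)strlen(text), m = (int)strlen(pat);
    if (m == 0 || m > n) return 0;

    char *d_text, *d_pat;
    int *d_hits, h_hits = 0;
    cudaMalloc(&d_text, n);
    cudaMalloc(&d_pat, m);
    cudaMalloc(&d_hits, sizeof(int));
    cudaMemcpy(d_text, text, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pat, pat, m, cudaMemcpyHostToDevice);
    cudaMemset(d_hits, 0, sizeof(int));

    matchAll<<<(n + 255) / 256, 256>>>(d_text, n, d_pat, m, d_hits);

    cudaMemcpy(&h_hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_text); cudaFree(d_pat); cudaFree(d_hits);
    return h_hits;
}

int main()
{
    printf("%d\n", countMatches("GATTACAGATTACA", "GATTACA"));  // prints "2"
    return 0;
}
```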
Electrodynamics
• Simulation of electric fields, Coulomb forces
• Requires iterative solving of partial differential
equations
• Cell phone modeling applications have
reported 50x speedups using GPUs
Image Processing
• Medical imaging was an early adopter
▫ Registration of massive 3D voxel images
▫ Both the cost function for deformable registration and interpolation of results are filtering operations
• Generic feature detection, recognition, object extraction are all filters
• For object recognition, one can search a database of objects in parallel
• Running these algorithms off the CPU can allow real-time interaction
Data Analysis
• Huge databases for web services require instant
results for many simultaneous users
• Main memory has insufficient room; disk is too slow and
doesn’t allow parallel reads
• GPUs can split up the data and perform
fast searches, keeping their section
in memory
Example: Filtering Operation
• Many algorithms can be reduced to a filtering
operation. As an example, consider image convolution
for blurring
Kernel = Gaussian2D(size);
for (x,y) in Input {
    for (p,q) in Kernel {
        Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
    }
}
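A direct CUDA translation of the loop above might look like the following sketch, with one thread per output pixel. The assumptions are mine: a fixed 3x3 kernel with weights 1-2-1 (a small Gaussian-like blur), clamping at the image border, and the hypothetical names `convolve2D`, `blurOnGpu`, `R`, and `K`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define R 1              // kernel radius; assumed for this sketch
#define K (2 * R + 1)    // kernel width

__constant__ float c_Kernel[K * K];   // filter weights in constant memory

// One thread per output pixel; the outer (x,y) loop becomes the grid.
__global__ void convolve2D(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int q = -R; q <= R; q++)
        for (int p = -R; p <= R; p++) {
            int sx = min(max(x + p, 0), w - 1);   // clamp reads to the image
            int sy = min(max(y + q, 0), h - 1);
            sum += in[sy * w + sx] * c_Kernel[(q + R) * K + (p + R)];
        }
    out[y * w + x] = sum;
}

// Hypothetical host helper: blur a w x h image with a 3x3 kernel.
void blurOnGpu(const float *h_in, float *h_out, int w, int h)
{
    const float k[K * K] = {1 / 16.f, 2 / 16.f, 1 / 16.f,
                            2 / 16.f, 4 / 16.f, 2 / 16.f,
                            1 / 16.f, 2 / 16.f, 1 / 16.f};
    size_t bytes = (size_t)w * h * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpyToSymbol(c_Kernel, k, sizeof(k));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    dim3 threads(16, 16), blocks((w + 15) / 16, (h + 15) / 16);
    convolve2D<<<blocks, threads>>>(d_in, d_out, w, h);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

int main()
{
    const int w = 8, h = 8;
    float in[w * h], out[w * h];
    for (int i = 0; i < w * h; i++) in[i] = 2.0f;
    blurOnGpu(in, out, w, h);
    printf("%.1f\n", out[0]);   // a constant image is unchanged: prints "2.0"
    return 0;
}
```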
Example: Filtering Operation
• Many filters are separable; a quick optimization is to apply a 1D kernel in one pass per dimension, with the second pass reading the result of the first
Kernel = Gaussian1D(size);
for (x,y) in Input {
    for (p) in Kernel {
        Temp(x,y) += Input(x+p,y) * Kernel(p);
    }
}
for (x,y) in Temp {
    for (q) in Kernel {
        Output(x,y) += Temp(x,y+q) * Kernel(q);
    }
}
Example: Filtering Operation
• This is still O(2n²m) on a sequential processor (n × n image, kernel of length m)
• Each output pixel is independent, but shares spatially
local data and a constant kernel
UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU<<<blocks, threads>>>();
ConvolveRowsGPU<<<blocks, threads>>>();
DownloadGPU(Output, TEXTURE);
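The host-side steps above map onto real CUDA runtime calls roughly as in this sketch. Several things here are assumptions rather than the slide's method: the image is kept in plain global memory instead of texture memory, the 1D kernel is a fixed 3-tap 0.25/0.5/0.25 filter, borders are clamped, and `convolveRows`, `convolveColumns`, and `separableBlur` are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define R 1                              // kernel radius; assumed for this sketch
__constant__ float c_Kernel[2 * R + 1];  // 1D weights in constant memory

// Horizontal pass: each thread filters along x for one pixel.
__global__ void convolveRows(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int p = -R; p <= R; p++)
        sum += in[y * w + min(max(x + p, 0), w - 1)] * c_Kernel[p + R];
    out[y * w + x] = sum;
}

// Vertical pass: each thread filters along y for one pixel.
__global__ void convolveColumns(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int q = -R; q <= R; q++)
        sum += in[min(max(y + q, 0), h - 1) * w + x] * c_Kernel[q + R];
    out[y * w + x] = sum;
}

// Hypothetical host helper following the upload / two passes / download flow.
void separableBlur(const float *h_in, float *h_out, int w, int h)
{
    const float k[2 * R + 1] = {0.25f, 0.5f, 0.25f};
    size_t bytes = (size_t)w * h * sizeof(float);
    float *d_in, *d_tmp, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpyToSymbol(c_Kernel, k, sizeof(k));              // upload kernel
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // upload input

    dim3 threads(16, 16), blocks((w + 15) / 16, (h + 15) / 16);
    convolveColumns<<<blocks, threads>>>(d_in, d_tmp, w, h);  // pass 1
    convolveRows<<<blocks, threads>>>(d_tmp, d_out, w, h);    // pass 2 reads pass 1

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // download output
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
}

int main()
{
    const int w = 5, h = 5;
    float in[w * h] = {0}, out[w * h];
    in[2 * w + 2] = 1.0f;                    // single bright pixel in the centre
    separableBlur(in, out, w, h);
    // centre, edge neighbour, diagonal neighbour: prints "0.2500 0.1250 0.0625"
    printf("%.4f %.4f %.4f\n", out[2 * w + 2], out[1 * w + 2], out[1 * w + 1]);
    return 0;
}
```

The intermediate buffer `d_tmp` never leaves the device, so only the original input and final output cross the bus.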
Example: Filtering Operation
• Complexity remains the same; however, each multiply-accumulate
(MAC) instruction can be executed on as many processors as
are available, and memory can be accessed quickly
because of the assignment of blocks and texture
memory
• In practice, the overhead of uploading and
downloading from the GPU is far less than the
performance gained in the kernel
Example: Filtering Operation
__global__ void convolutionColumnsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
){
    // Image tile shared by the block; the + 1 pads each row of the tile
    __shared__ float s_Data[COLUMNS_BLOCKDIM_X]
        [(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) *
        COLUMNS_BLOCKDIM_Y + 1];
    //// *snip* Populate s_Data from d_Src
    __syncthreads();  // wait until the whole tile has been loaded
    #pragma unroll
    for(int i = COLUMNS_HALO_STEPS; i < COLUMNS_HALO_STEPS + COLUMNS_RESULT_STEPS; i++){
        float sum = 0;
        #pragma unroll
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            sum += c_Kernel[KERNEL_RADIUS - j] *
                s_Data[threadIdx.x][threadIdx.y + i * COLUMNS_BLOCKDIM_Y + j];
        // d_Dst is assumed to have been offset to this thread's pixel in the snipped code
        d_Dst[i * COLUMNS_BLOCKDIM_Y * pitch] = sum;
    }
}
Even More Fun
• Some of that overhead can be avoided when the
destination of the GPU’s data is graphics
• Texture memory can be shared between general
purpose computations and normal rendering
• For post-processing effects or visualizing particles, the
pixel/vertex data never needs to leave the GPU
Conclusions
Certain classes of problem appear in many different
fields, and involve highly data-parallel operations such
as filtering, sorting, or integration
By taking advantage of the architectural decisions behind
graphics processing units, such as their multiprocessing
and native vector operations, these problems can be
solved quickly and cheaply
References
• 1. Ziegler, Gernot. Introduction to the CUDA Architecture. [Online] 2009. http://www.cse.scitech.ac.uk/disco/workshops/200907/Day1_01_Intro_CUDA_Architecture.pdf.
• 2. NVIDIA Corporation. NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.1. 2007.
• 3. Göddeke, Dominik. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Berlin : Logos Verlag, 2010. 978-3-8325-2768-6.
• 4. John E Stone, James C Phillips, Peter L Freddolino, David J Hardy, Leonardo G Trabuco, and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 2007, 28:2618-2640.
• 5. Michael C Schatz, Cole Trapnell, Arthur L Delcher, and Amitabh Varshney. High-throughput sequence alignment using Graphics Processing Units. s.l. : BMC Bioinformatics, 2007.
• 6. ANSYS, Inc. ANSYS Unveils GPU Computing for Accelerated Engineering Simulations. [Online] 2010. http://investors.ansys.com/releasedetail.cfm?releaseid=509436.
• 7. Warburton, Tim. Parallel Numerical Methods for Partial Differential Equations. Rocky Mountain Mathematics Consortium. [Online] 2008. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html.
• 8. Ansorge, Richard. AIRWC : Accelerated Image Registration With CUDA . BSS Group, Cavendish Laboratory, University of Cambridge UK. 2008.
• 9. N. Cornelis, L. Van Gool. Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware. s.l. : CVPR 2008 Workshop, 2008.
• 10. Andrea DiBlas, Tim Kaldewey. Data Monster: Why graphics processors will transform database processing. IEEE Spectrum. [Online] 2009. http://spectrum.ieee.org/computing/software/data-monster/0.
• 11. Podlozhnyuk, Victor. Image Convolution with CUDA. [Online] 2007. http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/convolutionSeparable/doc/convolutionSeparable.pdf.
• 12. Goodnight, Nolan. CUDA/OpenGL Fluid Simulation. [Online] 2007. http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf.
Questions?