


Advanced Topics in Numerical Analysis: High Performance Computing

Intro to GPGPU

Georg Stadler, Dhairya Malhotra
Courant Institute, NYU

Spring 2019, Monday, 5:10–7:00PM, WWH #1302

April 8, 2019

1 / 20


Outline

Organization issues

Final projects

Computing on GPUs

2 / 20


Organization

Scheduling:
- Homework assignment #4 due next Monday

Topics today:
- Final project overview/discussion
- More GPGPU programming (several examples)
- Algorithms: image filtering (convolution), parallel scan, bitonic sort

Outlook for next week(s):
- Distributed memory programming (MPI)

3 / 20


Outline

Organization issues

Final projects

Computing on GPUs

4 / 20


Final projects

- Final projects! Pitch/discuss your final project to/with us. We’re available Tuesday (tomorrow) 5-6pm and Thursday 11-12:30 in WWH #1111 or over Slack.

- We would like to (more or less) finalize project groups and topics in the next week.

- Final projects are in teams of 1-3 people (2 preferred!)
- We posted suggestions for final projects. More ideas on the next slides. Also, take a look at the HPC projects we collected from the first homework assignment.

- Final project presentations (max 10 min each) in the week of May 20/21. You are also required to hand in a short paper with your results, as well as the git repo with the code.

5 / 20


Final projects

Final project examples (from example list):
- Parallel multigrid
- Image denoising
- Adaptive finite volumes
- Parallel k-means
- Fluid mechanics simulation
- Data partitioning using parallel octrees

6 / 20


Final projects

Final project examples (more examples):
- Parallelizing a DFT sub-calculation (Tkatchenko-Scheffler dispersion energies and forces)
- Parallelizing a neural network color transfer method for images
- Parallel all-pairs shortest paths via Floyd-Warshall
- Fast CUDA kernels for ResNet inference
- . . . Take an existing serious code and speed it up/parallelize it
- . . .

7 / 20


Outline

Organization issues

Final projects

Computing on GPUs

8 / 20


Review of Last Class

- CUDA programming model: GPU architecture, memory hierarchy, thread hierarchy.

- Shared memory: fast, low-latency, shared within a thread block, 48 KB - 128 KB (depending on compute capability)
  - avoid bank conflicts within a warp
- Synchronization
  - __syncthreads(): all threads in a block
  - __syncwarp(): all threads in a warp
- Reduction on GPUs

9 / 20


Hiding Latency

- All operations have latency.
- CPUs hide latency using out-of-order execution and branch prediction; they reduce the latency of memory accesses using caches.
- GPUs hide latency using parallelism:
  - execute warp-1 (threads 0-31)
  - when warp-1 stalls, start executing warp-2 (threads 32-63), and so on · · ·

10 / 20


Occupancy Calculator

Get resource usage for kernel functions (compiler flag: -Xptxas -v). Example:

# nvcc -std=c++11 -Xcompiler "-fopenmp" -Xptxas -v reduction.cu
ptxas info : Compiling entry function '_Z16reduction_kernelPdPKdl' for 'sm_30'
ptxas info : Function properties for _Z16reduction_kernelPdPKdl
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 8192 bytes smem, 344 bytes cmem[0]

Occupancy = (#-of-threads per SM) / (max-#-of-threads per SM)

- Calculate occupancy for your code:
  - https://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
  - web version: https://xmartlabs.github.io/cuda-calculator
- Improve occupancy to improve performance
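As a rough sketch of the arithmetic behind the occupancy calculator (the default per-SM limits below are assumptions, roughly sm_30-class; real values come from cudaGetDeviceProperties or the calculator itself):

```cpp
#include <algorithm>

// How many blocks of a kernel can be resident on one SM at once?
// Each resource (threads, registers, shared memory, block slots)
// imposes its own limit; the minimum wins.
// NOTE: the default per-SM limits are assumed, not queried values.
int resident_blocks(int threads_per_block, int regs_per_thread,
                    int smem_per_block,
                    int max_threads_sm = 2048, int regs_per_sm = 65536,
                    int smem_per_sm = 49152, int max_blocks_sm = 16) {
    int by_threads = max_threads_sm / threads_per_block;
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_block > 0 ? smem_per_sm / smem_per_block
                                        : max_blocks_sm;
    return std::min({by_threads, by_regs, by_smem, max_blocks_sm});
}
```

For the ptxas report above (28 registers, 8192 bytes smem) and a hypothetical 1024-thread block, this gives 2 resident blocks, i.e. 2048/2048 = 100% occupancy; here registers and the thread limit bind before shared memory does.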

11 / 20


Device Management (Multiple GPUs)

- Get number of GPUs: cudaGetDeviceCount(int *count)
- Set the current GPU: cudaSetDevice(int device)
- Get the current GPU: cudaGetDevice(int *device)
- Get GPU properties: cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

12 / 20


Streams

- execute multiple tasks in parallel; either on separate GPUs or on the same GPU.
- useful for executing several independent small tasks where each task does not have sufficient parallelism.

// create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// launch two kernels in parallel
kernel<<<1, 64, 0, stream1>>>();
kernel<<<1, 128, 0, stream2>>>();

// synchronize
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

13 / 20


Image Filtering

Convolution: read a k × k block of the image, multiply by the filter weights, and sum.

figure from: GPU Computing: Image Convolution - Jan Novak, Gabor Liktor, Carsten Dachsbacher
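A serial C++ sketch of this convolution (out-of-range pixels are treated as zero, one common halo convention; on the GPU, each output pixel would map to one thread):

```cpp
#include <vector>

// Direct k x k convolution: for every output pixel, read the surrounding
// k x k block, multiply by the filter weights and sum. k is assumed odd.
std::vector<float> convolve(const std::vector<float>& img, int w, int h,
                            const std::vector<float>& filt, int k) {
    int r = k / 2;                               // filter radius
    std::vector<float> out(w * h, 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int fy = 0; fy < k; ++fy)
                for (int fx = 0; fx < k; ++fx) {
                    int iy = y + fy - r, ix = x + fx - r;
                    if (iy >= 0 && iy < h && ix >= 0 && ix < w)  // zero halo
                        sum += img[iy * w + ix] * filt[fy * k + fx];
                }
            out[y * w + x] = sum;
        }
    return out;
}
```

Each output pixel re-reads k² inputs from memory, which is why shared-memory tiling pays off: a tile is read from global memory once and then reused by all threads of the block.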

14 / 20


Image Filtering

Using shared memory as a cache to minimize global memory reads:

- read a 32 × 32 block of the original image from main memory
- compute the convolution in shared memory
- write back the result sub-block (excluding the halo)

figure from: GPU Computing: Image Convolution - Jan Novak, Gabor Liktor, Carsten Dachsbacher

15 / 20


Sorting

Comparison-based sorting algorithms: bubble sort O(N²), sample sort O(N log N), merge sort O(N log N)

Bitonic merge sort O(N log² N)
- great for small to medium problem sizes.
- sorting networks: a simple, deterministic algorithm based on compare-and-swap.
- sequence of log N bitonic merge operations.

16 / 20



Bitonic Sort

Bitonic merge: O(N log N) cost for each merge operation

- divide-and-conquer algorithm on bitonic sequences.
- Bitonic sequence: a sequence that changes monotonicity exactly once.
- if the bitonic sequence is larger than the block size, read and write directly from global memory; otherwise read/write from shared memory.
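A serial C++ sketch of the full sorting network (N assumed to be a power of two). Every iteration of the innermost loop is an independent compare-and-swap, which is what makes the network map naturally onto GPU threads:

```cpp
#include <vector>
#include <algorithm>

// Bitonic merge sort as a sorting network: log N merge stages (k), each
// consisting of compare-and-swap rounds at halving distances (j).
// Sorts ascending, in place.
void bitonic_sort(std::vector<int>& a) {
    int n = (int)a.size();                   // assumed a power of two
    for (int k = 2; k <= n; k *= 2)          // length of bitonic sequences
        for (int j = k / 2; j > 0; j /= 2)   // compare-and-swap distance
            for (int i = 0; i < n; ++i) {    // one thread per i on the GPU
                int l = i ^ j;               // partner index
                if (l > i) {
                    bool up = (i & k) == 0;  // direction of this subsequence
                    if ((a[i] > a[l]) == up)
                        std::swap(a[i], a[l]);
                }
            }
}
```

The outer two loops give the log N merges of log-many rounds each, i.e. the O(N log² N) total cost from the previous slide.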

17 / 20


Parallel Scan (within thread-block)

Reduction tree (pairwise sums, leaves to root):

input:  -2  1  2  0 -2  0  1 -3  4 -4 -3  2 -5  4 -2  1
          -1    2     -2    -2     0    -1    -1    -1
             1          -4           -1          -2
                  -3                      -3
                             -6

Scan tree (root to leaves):

                             -6
                  -3                      -6
             1          -3           -4          -6
          -1    1     -1    -3     -3    -4    -5    -6
output: -2 -1  1  1 -1 -1  0 -3  1 -3 -6 -4 -9 -5 -7 -6

Construct the scan tree top-down: the right child copies its parent's value; the left child is the parent's value minus the sibling's value in the reduction tree. The leaves of the scan tree are the inclusive prefix sums of the input.
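The construction above can be sketched serially in C++ (input length assumed a power of two); on the GPU, each tree level is one parallel step over the threads of the block:

```cpp
#include <vector>

// Scan via a reduction tree (up-sweep) followed by a scan tree (down-sweep).
// Down-sweep rule: the right child copies its parent; the left child is the
// parent minus the right sibling's value in the reduction tree.
// The scan-tree leaves are the inclusive prefix sums of the input.
std::vector<int> tree_scan(const std::vector<int>& in) {
    int n = (int)in.size();                          // assumed power of two
    std::vector<std::vector<int>> red{in};           // red[0] = leaves
    for (int m = n / 2; m >= 1; m /= 2) {            // up-sweep: pairwise sums
        std::vector<int> level(m);
        for (int i = 0; i < m; ++i)
            level[i] = red.back()[2 * i] + red.back()[2 * i + 1];
        red.push_back(level);
    }
    std::vector<int> scan = red.back();              // root = total sum
    for (int d = (int)red.size() - 2; d >= 0; --d) { // down-sweep
        std::vector<int> next(red[d].size());
        for (int i = 0; i < (int)scan.size(); ++i) {
            next[2 * i + 1] = scan[i];                 // right child: copy parent
            next[2 * i] = scan[i] - red[d][2 * i + 1]; // left: parent - sibling
        }
        scan = next;
    }
    return scan;                                     // inclusive prefix sums
}
```

Applied to the input row of the figure, this reproduces the output row (-2, -1, 1, 1, ...), i.e. the inclusive scan of the input.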

18 / 20


Libraries

Optimized libraries:

- cuBLAS for linear algebra
- cuFFT for Fast Fourier Transforms
- cuDNN for Deep Neural Networks

cuBLAS Demo!

19 / 20


Summary

- Calculating occupancy: higher is better
  - useful for debugging performance bottlenecks
- Miscellaneous:
  - managing multiple GPUs
  - executing multiple streams in parallel
- Algorithms:
  - image filtering
  - parallel scan
  - bitonic sort
- Libraries: cuBLAS, cuFFT, cuDNN

20 / 20