Performance Measurement of Applications with GPU Acceleration using CUDA
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile
Computer and Information Science Department, Performance Research Laboratory, University of Oregon
ParCo 2009

Outline
- Motivation
- Performance perspectives
- Acceleration, asynchrony, and concurrency
- CPU-GPU execution scenarios
- Performance measurement for GPGPUs
- Accelerator performance measurement in the PGI compiler
- TAUcuda performance measurement operation and API
- TAUcuda tests and application case studies
- Conclusions and future work

Motivation
- Heterogeneous computing technology is more accessible
  - Multicore processors
  - Manycore accelerators (e.g., NVIDIA Tesla GPU)
  - High-performance processing engines (e.g., IBM Cell BE)
- Achieving the performance potential is challenging
  - Complexity of hardware operation and programming interface
  - CUDA created to help in GPU accelerator code development
- Few performance tools for parallel accelerated applications
  - Need to understand acceleration in the context of the whole program
  - Need integration of accelerator measurements in scalable parallel performance tools
- Focus on GPGPU performance measurement using CUDA

Heterogeneous Performance Perspective
- Heterogeneous applications can have concurrent execution
  - Main host path and external task paths
  - Want to capture performance for all execution paths
- External execution may be difficult or impossible to measure
  - Host creates a measurement view for the external entity
  - Maintains local and remote performance data
  - External entity may provide performance data to the host
- What perspective does the host have of the external entity?
  - Determines the semantics of the measurement data
- Existing parallel performance tools are CPU (host)-centric
  - Event-based sampling (not appropriate for accelerators)
  - Direct measurement (through instrumentation of events)

CUDA Performance Perspective
- CUDA enables programming of kernels for GPU acceleration
- GPU acceleration acts as an external task
- Performance measurement appears straightforward, but the execution model complicates it
  - Synchronous and asynchronous operation with respect to the host
  - Overlapping of data transfer and kernel execution
  - Multiple GPU devices and multiple streams per device
- Different acceleration kernels used in a parallel application
  - Multiple application sections
  - Multiple application threads/processes
- See performance in context: temporal, spatial, thread/process
- Two general approaches: synchronous and asynchronous

CPU-GPU Execution / Measurement Scenarios
- [Figure: synchronous and asynchronous execution scenarios]

Approach
- Consider use of NVIDIA PerfKit and the CUDA Profiler
  - PerfKit provides low-level data for the GPU driver interface; limited for use with the CUDA programming environment
  - CUDA Profiler provides extensive stream-level measurements; creates a post-mortem event trace of kernel operation on streams; difficult to merge with application performance data
- Goal is to produce profiles (traces) showing the distribution of accelerator performance with respect to application events
- Approach 1: force all measurements to be synchronous
  - Restricts CUDA usage, disallowing concurrent operation
  - Creates a new thread for every CUDA invocation
- Approach 2: develop a CUDA measurement mechanism
  - Merge with the TAU performance system

PGI Compiler for GPU (using CUDA)
- PGI accelerator compiler (PGI 9.x, C and Fortran, x64 Linux)
- Loop parallelization for acceleration on GPUs using CUDA
- Directive-based, presenting a GPGPU programming abstraction
- A compiler, not source translation; the generated CUDA code is hidden
- TAU measurement of PGI acceleration
  - Wrappers of the runtime system
  - Track runtime system events as seen from the host processor
  - Show source information associated with events: routine name; file name and source line number for the kernel; variable names in memory upload/download operations; grid sizes

Matrix Multiplication Profile (3000x3000, ~22 GF)
- [Profile screenshot]

CUDA Programming for GPGPU
- The PGI compiler represents a GPGPU programming abstraction
- The performance tool uses runtime system wrappers, essentially a synchronous call performance model!
- In general, programming of GPGPU devices in the CUDA environment is more complex:
  - Programming of multiple streams and GPU devices (multiple streams execute concurrently)
  - Programming of data transfers to/from the GPU device
  - Programming of GPU kernel code
  - Synchronization with streams
  - Stream event interface (illustrated in the sketch below)
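
The stream event interface mentioned in the last bullet is the mechanism that the TAUcuda measurement described next builds on. As a point of reference, here is a minimal sketch of that interface, assuming a hypothetical scale kernel, sizes, and stream setup that are not from the talk: work is queued asynchronously on a stream, begin/end events recorded in the same stream are timestamped by the GPU, and the host retrieves the interval either non-blockingly (cudaEventQuery) or blockingly (cudaEventSynchronize).

    // Sketch of the CUDA stream/event interface; the kernel, sizes, and
    // stream setup are illustrative assumptions, not code from the talk.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *d, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *h, *d;
        cudaMallocHost((void **)&h, n * sizeof(float));  // pinned memory allows async copies
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaEvent_t begin, end;                          // events carry GPU timestamps
        cudaEventCreate(&begin);
        cudaEventCreate(&end);

        cudaEventRecord(begin, stream);                  // begin event placed in the stream
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(end, stream);                    // end event placed after the work

        if (cudaEventQuery(end) == cudaErrorNotReady) {  // non-blocking check
            /* the CPU could do other useful work here */
            cudaEventSynchronize(end);                   // blocking retrieval
        }
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, begin, end);           // GPU-side interval between the events
        printf("stream interval: %.3f ms\n", ms);

        cudaEventDestroy(begin); cudaEventDestroy(end);
        cudaStreamDestroy(stream);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }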
TAU CUDA Performance Measurement (TAUcuda)
- Build on the CUDA stream event interface
  - Allows events to be placed in streams and processed; events are timestamped
  - The CUDA runtime reports GPU timing in the event structure
  - Events are reported back to the CPU when requested
  - Use begin and end events to calculate intervals
- Want to associate TAU event context with CUDA events
  - Get the top of the TAU event stack at begin (the TAU context)
- CUDA kernel invocations are asynchronous
  - The CPU does not see the actual CUDA end event
  - The CPU retrieves events in a non-blocking or blocking manner
- Want to capture waiting time

CPU-GPU Operation and TAUcuda Events
- [Diagram: CPU-GPU operation and TAUcuda events]

TAU CUDA Measurement API
- void tau_cuda_init(int argc, char **argv);
  - To be called when the application starts
  - Initializes data structures and checks GPU status
- void tau_cuda_exit();
  - To be called before any thread exits at the end of the application
  - All CUDA profile data is output for each thread of execution
- void *tau_cuda_stream_begin(char *event, cudaStream_t stream);
  - Called before the CUDA statements to be measured
  - Returns a handle which should be used in the end call
  - If the event is new, or the TAU context is new for the event, a new CUDA event profile object is created
- void tau_cuda_stream_end(void *handle);
  - Called immediately after the CUDA statements to be measured
  - The handle identifies the stream
  - Inserts a CUDA event into the stream

TAU CUDA Measurement API (2)
- vector<int> tau_cuda_update();
  - Checks for completed CUDA events on all streams
  - Non-blocking; returns the number completed on each stream
- int tau_cuda_update(cudaStream_t stream);
  - Same as tau_cuda_update(), but for a particular stream
  - Non-blocking; returns the number completed on the stream
- vector<int> tau_cuda_finalize();
  - Waits for all CUDA events to complete on all streams
  - Blocking; returns the number completed on each stream
- int tau_cuda_finalize(cudaStream_t stream);
  - Same as tau_cuda_finalize(), but for a particular stream
  - Blocking; returns the number completed on the stream
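
Taken together, the calls above suggest the following usage pattern. This is a minimal sketch assembled from the signatures listed on the two API slides; the header name tau_cuda.h, the my_kernel kernel, and the stream setup are assumptions for illustration, not code from the TAUcuda distribution.

    // Hypothetical TAUcuda usage; only the call signatures come from the slides.
    #include "tau_cuda.h"          // assumed header exposing the TAUcuda API
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *d, int n) {   // illustrative kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(int argc, char **argv) {
        tau_cuda_init(argc, argv);                 // once, when the application starts

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        float *d;
        cudaMalloc((void **)&d, 1024 * sizeof(float));

        // Bracket the CUDA statements to be measured; the handle ties the
        // begin/end CUDA events to the stream and to the current TAU context.
        void *h = tau_cuda_stream_begin((char *)"my_kernel", stream);
        my_kernel<<<4, 256, 0, stream>>>(d, 1024);
        tau_cuda_stream_end(h);                    // inserts the end event into the stream

        tau_cuda_update(stream);                   // non-blocking check for completed events
        tau_cuda_finalize(stream);                 // blocking wait before shutdown

        cudaFree(d);
        cudaStreamDestroy(stream);
        tau_cuda_exit();                           // outputs CUDA profile data per thread
        return 0;
    }

The begin/end pair corresponds to the begin and end events on the earlier TAUcuda slide, and the update/finalize calls implement the non-blocking and blocking retrieval described there.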
Scenario Results: One and Two Streams
- Ran simple CUDA experiments to validate TAUcuda
- Tesla S1070 test system
- [Profile screenshots]

Scenario Results: Two Devices, Two Contexts
- [Profile screenshots]

TAUcuda Compared to CUDA Profiler
- The CUDA Profiler is integrated in the CUDA runtime system
  - Captures time measures for GPGPU kernel and memory tasks
  - Creates a trace in memory and outputs it at the end of execution
- Can be used to verify TAUcuda
  - Slight time variation due to differences in mechanism

Case Study: TAUcuda in NAMD and ParFUM
- TAU integrated in Charm++ (ICPP 2009 paper)
- Charm++ applications
  - NAMD is a molecular dynamics application
  - Parallel Framework for Unstructured Meshing (ParFUM)
  - Both have been accelerated with CUDA
- Demonstration use of TAUcuda
  - Observe the effect of CUDA acceleration
  - Show scaling results for GPU cluster execution
- Experimental environments
  - Two S1070 GPU servers (University of Oregon)
  - AC cluster: 32 nodes, 4 Tesla GPUs per node (UIUC)

NAMD GPU Profile (Two GPU Devices)
- Test of TAUcuda with NAMD
- Two processes, one Tesla GPU for each
- [Figures: CPU profile, GPU profile (P0), GPU profile (P1)]

NAMD GPU Efficiency Gain (16 versus 32 GPUs)
- AC cluster: 16 and 32 processes
- dev_sum_forces: 50% improvement
- dev_nonbonded: 100% improvement
- [Table columns: Event, TAU Context, Device, Stream]

NAMD GPU Scaling (4 to 64 GPUs)
- Strong scaling by event and device number
- Good scaling for the non-bonded calculations
- Sum-forces calculations scale less well, but their overall contribution is small
- [Figure: scaling efficiency versus number of devices for non-bonded and sum-forces calculations]

ParFUM CUDA Speedup (Single CPU plus GPU)
- Problem size: 128 x 8 x 8 mesh
- With GPU acceleration, only 9 seconds are spent in CUDA kernels
- [Figure: speedup results]

Case Study: HMPP-TAU
- [Diagram: user application, HMPP runtime, HMPP CUDA codelet, and CUDA/TAUcuda layers; measurement captures user events, HMPP events, codelet events, CUDA stream events, and waiting information]

HMPP Data/Overlap Experiment
- [Trace screenshot showing TAUcuda events]

Conclusions and Future Work
- Heterogeneous parallel computing will challenge parallel performance technology
  - Must deal with diversity in hardware and software
  - Must deal with richer parallelism and concurrency
- Developed and demonstrated TAUcuda
  - TAU + CUDA measurement approach
  - Showed case studies and integration with HMPP
- Next: targeting OpenCL (TAUopenCL)
- Better merging of TAU and TAUcuda performance data
- Take advantage of other tools in the TAU toolset
  - Performance database (PerfDMF), data mining (PerfExplorer)
- Integration in application and heterogeneous environments