Reverse Time Migration on GMAC
Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)
NVIDIA GTC, 22nd of September, 2010
NVIDIA GPU Technology Conference – 22nd of September, 2010 2
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•RTM generates an image of the subsurface layers
•Uses traces recorded by sensors in the field
•RTM’s algorithm
1.Propagation of a modeled wave (forward in time)
2.Propagation of the recorded traces (backward in time)
3.Correlation of the forward and backward wavefields
• Last forward wavefield with the first backward wavefield
•FDTD methods are preferred to FFT-based methods
• 2nd-order finite differencing in time
• High-order finite differencing in space
└ RTM
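The finite-difference scheme above can be sketched in C; this is an illustrative 1D reduction with a 4th-order spatial stencil (the function name, coefficients, and boundary handling are assumptions, not the production RTM kernel):

```c
#include <stddef.h>

/* One FDTD time-step for the scalar wave equation in 1D:
 *   u_next = 2*u_curr - u_prev + (vel*dt)^2 * Laplacian(u_curr)
 * 2nd-order finite differencing in time, 4th-order in space.
 * The two points at each end act as halo/boundary and are not updated. */
void fdtd_step_1d(const float *u_prev, const float *u_curr, float *u_next,
                  const float *vel, float dt, float dx, size_t n)
{
    const float c0 = -5.0f / 2.0f;   /* 4th-order Laplacian coefficients */
    const float c1 =  4.0f / 3.0f;
    const float c2 = -1.0f / 12.0f;
    for (size_t i = 2; i + 2 < n; i++) {
        float lap = (c0 * u_curr[i]
                   + c1 * (u_curr[i - 1] + u_curr[i + 1])
                   + c2 * (u_curr[i - 2] + u_curr[i + 2])) / (dx * dx);
        float r = vel[i] * dt;
        u_next[i] = 2.0f * u_curr[i] - u_prev[i] + r * r * lap;
    }
}
```

The production code applies the same kind of update to 3D volumes on the GPU, with a higher-order stencil in space.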
Introduction
•BSC and Repsol: Kaleidoscope project
• Develop better algorithms/techniques for seismic imaging
• We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
•Due to the high computational power required, the project started a quest for the most suitable hardware
• PowerPC: scalability issues
• Cell: good performance (in production @ Repsol), difficult programmability
• FPGA: potentially best performance, programmability nightmare
• GPUs: 5x speedup vs Cell (GTX280), what about programmability?
└ Barcelona Supercomputing Center (BSC)
Outline
•Introduction
•Reverse Time Migration on CUDA
→General approach
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•We focus on the host-side part of the implementation
1.Avoid memory transfers between host and GPU memories
• Implement on the GPU as many computations as possible
2.Hide latency of memory transfers
• Overlap memory transfers and kernel execution
3.Take advantage of the PCIe full-duplex capabilities (Fermi)
• Overlap deviceToHost and hostToDevice memory transfers
└ General approach
Reverse Time Migration on CUDA └ General approach
[Diagram: forward pipeline: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression, Write to disk. Backward pipeline: 3D-Stencil, Absorbing Boundary Conditions, Traces insertion, Decompression, Read from disk. Correlation combines the forward and backward wavefields]
Reverse Time Migration on CUDA
•Data structures used in the RTM algorithm
• Read/Write structures
• 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
• State of the wavefield in previous time-steps to compute finite differences in time
• Some extra points in each direction at the boundaries (halos)
• Read-Only structures
• 3D volume of the same size as the wavefield
• Geophones’ recorded traces: time-steps x #geophones
└ General approach
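The halo padding described above can be made concrete with a hypothetical sizing/indexing helper (the halo width of 4, matching an 8th-order stencil, and all names are assumptions):

```c
#include <stddef.h>

/* A wavefield of nx*ny*nz interior points is stored with HALO extra
 * points on every side of every axis; HALO = 4 suits an 8th-order stencil. */
#define HALO 4

static size_t padded(size_t n) { return n + 2 * HALO; }

/* Total number of elements to allocate, halos included. */
size_t volume_elems(size_t nx, size_t ny, size_t nz)
{
    return padded(nx) * padded(ny) * padded(nz);
}

/* Linear index of interior point (x, y, z); (0, 0, 0) is the first
 * interior point, so every coordinate is offset by the halo width. */
size_t volume_idx(size_t nx, size_t ny, size_t x, size_t y, size_t z)
{
    return (z + HALO) * padded(nx) * padded(ny)
         + (y + HALO) * padded(nx)
         + (x + HALO);
}
```

For a 1000x1000x1000 volume this is roughly 4 GB per single-precision wavefield, which is why real-size problems exceed the memory of a single GPU.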
Reverse Time Migration on CUDA
•Data flow-graph (forward)
└ General approach
[Data flow-graph: 3D-Stencil, ABC, Source and Compress operate on the wavefields; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Simplified data flow-graph (forward)
└ General approach
[Data flow-graph: RTM Kernel and Compress operate on the wave-fields; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Control flow-graph (forward)
• RTM Kernel Computation
• Compress and transfer to disk
• deviceToHost + Disk I/O
• Performed every N steps
• Can run in parallel with the next compute steps
└ General approach
[Control flow-graph: Start, i = 0; loop while i < steps: RTM Kernel (GPU); if i % N == 0: toHost, Compress, Disk I/O (CPU); i++; End]
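The loop in the flow-graph can be sketched with POSIX threads. The snapshot hand-off below is a simplified stand-in: a counter replaces the real toHost, compress and fwrite work, and all names are illustrative:

```c
#include <pthread.h>

/* Skeleton of the forward loop: every N time-steps the wavefield is
 * snapshotted and handed to an I/O thread, so disk output overlaps
 * the next compute-only steps. */
#define STEPS 8
#define N 4

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int pending = 0, done = 0;
static int written = 0;               /* only touched by the I/O thread */

static void *io_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (pending == 0 && !done)
            pthread_cond_wait(&cv, &mtx);
        if (pending == 0 && done)
            break;
        pending--;
        pthread_mutex_unlock(&mtx);
        written++;                    /* stand-in for compress + fwrite */
        pthread_mutex_lock(&mtx);
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int run_forward(void)
{
    pthread_t io;
    pthread_create(&io, NULL, io_thread, NULL);
    for (int i = 1; i <= STEPS; i++) {
        /* launch RTM kernel for step i here (GPU) */
        if (i % N == 0) {             /* the toHost copy would go here */
            pthread_mutex_lock(&mtx);
            pending++;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&mtx);
        }
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    pthread_join(io, NULL);
    return written;
}
```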
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•GPU → Disk transfers are very time-consuming
•Transferring to disk can be overlapped with the next (compute-only) steps
└ Disk I/O
[Timelines: serialized: kernels K1 to K4 run, then toHost, Compress and Disk I/O delay K5. Overlapped: toHost, Compress and Disk I/O run on the CPU in parallel with kernels K5 to K8 on the GPU]
Reverse Time Migration on CUDA
•Single transfer: wait for all the data to be in host memory
•Multiple transfers: overlap deviceToHost transfers with disk I/O
• Double buffering
└ Disk I/O
[Timelines: single transfer: one full deviceToHost, then Disk I/O. Multiple transfers: each toHost chunk overlaps the Disk I/O of the previous chunk (double buffering)]
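The double-buffering pattern can be sketched as follows. The copy into the staging buffer stands in for an asynchronous deviceToHost transfer and the copy to out for the disk write, so this sequential version only shows the buffer rotation (all names illustrative):

```c
#include <string.h>

/* Double-buffered drain: while chunk k sits in one staging buffer being
 * "written", chunk k+1 is "transferred" into the other buffer. */
enum { CHUNK = 4 };

void drain_double_buffered(const char *src, char *out, int nchunks)
{
    char buf[2][CHUNK];
    int cur = 0;
    memcpy(buf[cur], src, CHUNK);                 /* prefetch first chunk */
    for (int k = 0; k < nchunks; k++) {
        if (k + 1 < nchunks)                      /* start "transfer" of k+1 */
            memcpy(buf[cur ^ 1], src + (k + 1) * CHUNK, CHUNK);
        memcpy(out + k * CHUNK, buf[cur], CHUNK); /* "disk write" of chunk k */
        cur ^= 1;                                 /* swap staging buffers */
    }
}
```

In the real code the next transfer is issued asynchronously (cudaMemcpyAsync on a stream) before the write of the current chunk, which is what makes the two actually overlap.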
Reverse Time Migration on CUDA
•CUDA-RT limitations
• GPU memory accessible by the owner host thread only
→deviceToHost transfers must be performed by the compute thread
└ Disk I/O
[Diagram: compute and I/O threads in the CPU address space; only the compute thread can access the GPU address space]
Reverse Time Migration on CUDA
•CUDA-RT Implementation (single transfer)
• CUDA streams must be used not to block GPU execution
→Intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!
└ Disk I/O
[Diagram: single transfer staged through an intermediate page-locked buffer in the CPU address space]
Reverse Time Migration on CUDA
•CUDA-RT Implementation (multiple transfers)
• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
→Lots of synchronization code in the compute thread
└ Disk I/O
[Diagram: multiple chunks staged through small page-locked buffers between the GPU and CPU address spaces]
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•But… wait, real-size problems require > 16GB of data!
•Volumes are split into tiles (along the Z-axis)
• 3D-Stencil introduces data dependencies
└ Domain decomposition
[Diagram: volume split along the Z-axis into domains D1 to D4]
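The dependency the stencil introduces between tiles can be sketched with a halo exchange between two Z-adjacent tiles. One halo plane per shared face is assumed here for brevity (the real code needs half the stencil order), and all names are illustrative:

```c
#include <string.h>

/* Two tiles of a volume split along Z, each storing one halo plane
 * above and below its interior planes. */
enum { PLANE = 4, TILE_NZ = 3 };   /* PLANE = nx*ny, sizes illustrative */

typedef struct {
    /* layout along Z: [halo_lo | TILE_NZ interior planes | halo_hi] */
    float data[(TILE_NZ + 2) * PLANE];
} Tile;

static float *plane(Tile *t, int z) { return t->data + z * PLANE; }

/* Copy each tile's boundary plane into its neighbor's halo. */
void exchange_halos(Tile *lo, Tile *hi)
{
    /* lo's top interior plane -> hi's low halo */
    memcpy(plane(hi, 0), plane(lo, TILE_NZ), PLANE * sizeof(float));
    /* hi's bottom interior plane -> lo's high halo */
    memcpy(plane(lo, TILE_NZ + 1), plane(hi, 1), PLANE * sizeof(float));
}
```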
Reverse Time Migration on CUDA
•Multi-node may be required to overcome memory capacity limitations
• Shared memory for intra-node communication
• MPI for inter-node communication
└ Domain decomposition
[Diagram: two nodes with four GPUs each; shared host memory for intra-node communication, MPI between nodes]
Reverse Time Migration on CUDA
•Data flow-graph (multi-domain)
└ Domain decomposition
[Data flow-graph: one RTM Kernel and Compress per domain, operating on the wave-fields of domains 1 and 2; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Control flow-graph (multi-domain)
• Boundary exchange every time-step
• Inter-domain communication blocks execution of the next steps!
└ Domain decomposition
[Control flow-graph: Start, i = 0; loop while i < steps: Kernel (GPU); toHost, Exchange, sync; if i % N == 0: Compress, Disk I/O (CPU); i++; End]
Reverse Time Migration on CUDA
•Boundary exchange every time-step is needed
└ Domain decomposition
[Timeline: kernels K1 to K7 with a boundary exchange (X) after every step; toHost, Compress and Disk I/O on the CPU]
Reverse Time Migration on CUDA
•Single-transfer exchange
• “Easy” to program, needs large page-locked buffers
•Multiple-transfer exchange to maximize PCI-Express utilization
• “Complex” to program, needs smaller page-locked buffers
└ Domain decomposition
[Timelines: single-transfer exchange: one large deviceToHost followed by one large hostToDevice per GPU. Multiple-transfer exchange: many small toH/toD chunks pipelined so deviceToHost and hostToDevice transfers overlap across GPU1 to GPU4]
Reverse Time Migration on CUDA
•CUDA-RT limitations
• Each host thread can only access the memory objects it allocates
└ Domain decomposition
[Diagram: separate CPU and per-GPU address spaces; each host thread sees only its own allocations]
Reverse Time Migration on CUDA
•CUDA-RT implementation (single-transfer exchange)
• Streams and page-locked memory buffers must be used
• Page-locked memory buffers can be too big
└ Domain decomposition
[Diagram: single-transfer exchange for GPU1 to GPU4 staged through large page-locked buffers in the CPU address space]
•CUDA-RT implementation (multiple-transfer exchange)
• Uses small page-locked buffers
• More synchronization code
•Too complex to be represented using PowerPoint!
•Very difficult to implement in real code!
└ Domain decomposition
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•Problem: boundary exchange blocks the execution of the following time-step
└ Overlapping computation and communication
[Timeline: kernels K1 to K7 with a boundary exchange (X) after every step; toHost, Compress and Disk I/O on the CPU]
Reverse Time Migration on CUDA
•Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
└ Overlapping computation and communication
[Timeline: each time-step split into a small boundary kernel k and a large interior kernel K; the exchange (X) of each step overlaps the interior kernel of the same step, and toHost, Compress and Disk I/O overlap later steps]
Reverse Time Migration on CUDA └ Overlapping computation and communication
•Approach: two-stage execution
• Stage 1: compute the wavefield points to be exchanged
Reverse Time Migration on CUDA └ Overlapping computation and communication
•Approach: two-stage execution
• Stage 2: Compute the remaining points while exchanging the boundaries
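The two-stage split can be sketched in one dimension. Here f stands in for the stencil update and B for the boundary width; the point is that the two stages together update exactly the same points as a single pass (illustrative, not the production kernels):

```c
/* Two-stage update of a tile: stage 1 computes only the B boundary
 * points that neighbors need, stage 2 the interior points (which would
 * run while the boundary is being exchanged). */
enum { NPTS = 16, B = 2 };

static float f(float x) { return 2.0f * x + 1.0f; }  /* stand-in kernel */

void stage1_boundary(const float *in, float *out)
{
    for (int i = 0; i < B; i++)           out[i] = f(in[i]);
    for (int i = NPTS - B; i < NPTS; i++) out[i] = f(in[i]);
}

void stage2_interior(const float *in, float *out)
{
    for (int i = B; i < NPTS - B; i++)    out[i] = f(in[i]);
}
```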
Reverse Time Migration on CUDA
•But two-stage execution requires more abstractions and code complexity
• An additional stream per domain
• We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→At this point the code is a complete mess!
• Requires 4 streams per domain, many page-locked buffers, lots of inter-thread synchronization
• Poor readability and maintainability
• Easy to introduce bugs
└ Overlapping computation and communication
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
→Features
• Code examples
•Reverse Time Migration on GMAC
•Conclusions
GMAC at a glance
•Library that enhances the host programming model of CUDA
•Freely available at http://code.google.com/p/adsm/
• Developed by BSC and UIUC
• NCSA license (BSD-like)
• Works on Linux and Mac OS X (Windows version coming soon)
•Presented in detail tomorrow at 9 am @ San Jose Ballroom
└ Introduction
GMAC at a glance
•Unified virtual address space for all the memories in the system
• Single allocation for shared objects
• Special API calls: gmacMalloc, gmacFree
• GPU memory allocated by a host thread is visible to all host threads
→Brings POSIX thread semantics back to developers
└ Features
[Diagram: CPU memory and GPU memory presented as one address space; shared data visible to both CPU and GPU, CPU data private]
GMAC at a glance
•Parallelism exposed via regular POSIX threads
• Replaces the explicit use of CUDA streams
• OpenMP support
•GMAC uses streams and page-locked buffers internally
• Concurrent kernel execution and memory transfers for free
└ Features
GMAC at a glance
•Optimized bulk memory operations via library interposition
• File I/O
• Standard I/O functions: fwrite, fread
• Automatic overlap of Disk I/O and hostToDevice and deviceToHost transfers
• Optimized GPU to GPU transfers via regular memcpy
• Enhanced versions of the MPI send/receive calls
└ Features
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
• Features
→Code examples
•Reverse Time Migration on GMAC
•Conclusions
GMAC at a glance
•Single allocation (and pointer) for shared objects
└ Examples
CUDA-RT:
    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<Dg, Db>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:
    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<Dg, Db>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }
GMAC at a glance
•Optimized support for bulk memory operations
└ Examples
(Same CUDA-RT and GMAC code as in the previous example; note that fread reads straight into the GMAC allocation, with no intermediate copy.)
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•CUDA-RT Implementation (multiple transfers)
• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
→Lots of synchronization code in the compute thread
└ Disk I/O
[Diagram: the compute thread stages multiple chunks between the GPU address space and page-locked CPU buffers]
Reverse Time Migration on GMAC
•GMAC implementation
• deviceToHost transfers performed by the I/O thread
• deviceToHost and Disk I/O transfers overlap for free
• Small page-locked buffers are used
└ Disk I/O (GMAC)
[Diagram: the I/O thread accesses GPU memory directly through the global address space]
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•CUDA-RT implementation (single-transfer exchange)
• Streams and page-locked memory buffers must be used
• Page-locked memory buffers can be too big
└ Domain decomposition (CUDA-RT)
[Diagram: single-transfer exchange for GPU1 to GPU4 staged through large page-locked buffers in the CPU address space]
•GMAC implementation (multiple-transfer exchange)
• Exchange of boundaries performed using a simple memcpy!
• Full PCIe utilization: internally GMAC performs several transfers and double buffering
Reverse Time Migration on GMAC └ Domain decomposition (GMAC)
[Diagram: GPU1 to GPU4 share a unified global address space]
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•No streams, no page-locked buffers, similar performance: ±2%
└ Overlapping computation and communication
CUDA-RT:
    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:
    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
→Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC └ Development cycle and debugging
[Diagram: GPU kernels: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression]
•CUDA-RT
• Start from a simple, correct sequential code
• Implement kernels one at a time and check correctness
• Two allocations per data structure
• Keep data consistency by hand (cudaMemcpy)
• To introduce modifications to any kernel
• Two allocations per data structure
• Keep data consistency by hand (cudaMemcpy)
Reverse Time Migration on GMAC
•GMAC
• Allocate objects with gmacMalloc
• Single pointer
• Use the same pointer both in the host code and in the GPU kernel implementations
• No copies
└ Development cycle and debugging
[Diagram: GPU kernels: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression]
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Conclusions
•Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
•CUDA has programmability issues
• CUDA provides a good language to expose data parallelism in the code to be run on the GPU
• The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
•GMAC eases the development of applications for GPU-based systems with no performance penalty
•A single programmer working part-time for 6 months produced the full RTM version (5x speedup over the previous Cell implementation)
Acknowledgements
•Barcelona Supercomputing Center
•Repsol
•Universitat Politècnica de Catalunya
•University of Illinois at Urbana-Champaign
Thank you!
Questions?