Reverse Time Migration on GMAC
Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)
NVIDIA GTC, 22nd of September, 2010
NVIDIA GPU Technology Conference – 22nd of September, 2010 2
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•RTM generates an image of the subsurface layers
•Uses traces recorded by sensors in the field
•RTM’s algorithm
1.Propagation of a modeled wave (forward in time)
2.Propagation of the recorded traces (backward in time)
3.Correlation of the forward and backward wavefields
• Last forward wavefield with the first backward wavefield
•FDTD methods are preferred to FFT-based methods
• 2nd-order finite differencing in time
• High-order finite differencing in space
└ RTM
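The finite-difference scheme above can be sketched in C; this is an illustrative 1D reduction with a 4th-order spatial stencil (the function name, coefficients, and boundary handling are assumptions, not the production RTM kernel):

```c
#include <stddef.h>

/* One FDTD time-step for the scalar wave equation in 1D:
 *   u_next = 2*u_curr - u_prev + (vel*dt)^2 * Laplacian(u_curr)
 * 2nd-order finite differencing in time, 4th-order in space.
 * The two points at each end act as halo/boundary and are not updated. */
void fdtd_step_1d(const float *u_prev, const float *u_curr, float *u_next,
                  const float *vel, float dt, float dx, size_t n)
{
    const float c0 = -5.0f / 2.0f;   /* 4th-order Laplacian coefficients */
    const float c1 =  4.0f / 3.0f;
    const float c2 = -1.0f / 12.0f;
    for (size_t i = 2; i + 2 < n; i++) {
        float lap = (c0 * u_curr[i]
                   + c1 * (u_curr[i - 1] + u_curr[i + 1])
                   + c2 * (u_curr[i - 2] + u_curr[i + 2])) / (dx * dx);
        float r = vel[i] * dt;
        u_next[i] = 2.0f * u_curr[i] - u_prev[i] + r * r * lap;
    }
}
```

The production code applies the same kind of update to 3D volumes on the GPU, with a higher-order stencil in space.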
Introduction
•BSC and Repsol: Kaleidoscope project
• Develop better algorithms/techniques for seismic imaging
• We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
•Due to the high computational power required, the project started a quest for the most suitable hardware
• PowerPC: scalability issues
• Cell: good performance (in production @ Repsol), difficult programmability
• FPGA: potentially best performance, programmability nightmare
• GPUs: 5x speedup vs Cell (GTX280), what about programmability?
└ Barcelona Supercomputing Center (BSC)
Outline
•Introduction
•Reverse Time Migration on CUDA
→General approach
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•We focus on the host-side part of the implementation
1.Avoid memory transfers between host and GPU memories
• Implement on the GPU as many computations as possible
2.Hide latency of memory transfers
• Overlap memory transfers and kernel execution
3.Take advantage of the PCIe full-duplex capabilities (Fermi)
• Overlap deviceToHost and hostToDevice memory transfers
└ General approach
Reverse Time Migration on CUDA └ General approach
[Diagram: forward pipeline: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression, Write to disk. Backward pipeline: 3D-Stencil, Absorbing Boundary Conditions, Traces insertion, Decompression, Read from disk. Correlation combines the forward and backward wavefields]
Reverse Time Migration on CUDA
•Data structures used in the RTM algorithm
• Read/Write structures
• 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
• State of the wavefield in previous time-steps to compute finite differences in time
• Some extra points in each direction at the boundaries (halos)
• Read-Only structures
• 3D volume of the same size as the wavefield
• Geophones’ recorded traces: time-steps x #geophones
└ General approach
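The halo padding described above can be made concrete with a hypothetical sizing/indexing helper (the halo width of 4, matching an 8th-order stencil, and all names are assumptions):

```c
#include <stddef.h>

/* A wavefield of nx*ny*nz interior points is stored with HALO extra
 * points on every side of every axis; HALO = 4 suits an 8th-order stencil. */
#define HALO 4

static size_t padded(size_t n) { return n + 2 * HALO; }

/* Total number of elements to allocate, halos included. */
size_t volume_elems(size_t nx, size_t ny, size_t nz)
{
    return padded(nx) * padded(ny) * padded(nz);
}

/* Linear index of interior point (x, y, z); (0, 0, 0) is the first
 * interior point, so every coordinate is offset by the halo width. */
size_t volume_idx(size_t nx, size_t ny, size_t x, size_t y, size_t z)
{
    return (z + HALO) * padded(nx) * padded(ny)
         + (y + HALO) * padded(nx)
         + (x + HALO);
}
```

For a 1000x1000x1000 volume this is roughly 4 GB per single-precision wavefield, which is why real-size problems exceed the memory of a single GPU.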
Reverse Time Migration on CUDA
•Data flow-graph (forward)
└ General approach
[Data flow-graph: 3D-Stencil, ABC, Source and Compress operate on the wavefields; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Simplified data flow-graph (forward)
└ General approach
[Data flow-graph: RTM Kernel and Compress operate on the wave-fields; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Control flow-graph (forward)
• RTM Kernel Computation
• Compress and transfer to disk
• deviceToHost + Disk I/O
• Performed every N steps
• Can run in parallel with the next compute steps
└ General approach
[Control flow-graph: Start, i = 0; loop while i < steps: RTM Kernel (GPU); if i % N == 0: toHost, Compress, Disk I/O (CPU); i++; End]
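The loop in the flow-graph can be sketched with POSIX threads. The snapshot hand-off below is a simplified stand-in: a counter replaces the real toHost, compress and fwrite work, and all names are illustrative:

```c
#include <pthread.h>

/* Skeleton of the forward loop: every N time-steps the wavefield is
 * snapshotted and handed to an I/O thread, so disk output overlaps
 * the next compute-only steps. */
#define STEPS 8
#define N 4

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int pending = 0, done = 0;
static int written = 0;               /* only touched by the I/O thread */

static void *io_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    for (;;) {
        while (pending == 0 && !done)
            pthread_cond_wait(&cv, &mtx);
        if (pending == 0 && done)
            break;
        pending--;
        pthread_mutex_unlock(&mtx);
        written++;                    /* stand-in for compress + fwrite */
        pthread_mutex_lock(&mtx);
    }
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int run_forward(void)
{
    pthread_t io;
    pthread_create(&io, NULL, io_thread, NULL);
    for (int i = 1; i <= STEPS; i++) {
        /* launch RTM kernel for step i here (GPU) */
        if (i % N == 0) {             /* the toHost copy would go here */
            pthread_mutex_lock(&mtx);
            pending++;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&mtx);
        }
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    pthread_join(io, NULL);
    return written;
}
```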
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•GPU → Disk transfers are very time-consuming
•Transferring to disk can be overlapped with the next (compute-only) steps
└ Disk I/O
[Timelines: serialized: kernels K1 to K4 run, then toHost, Compress and Disk I/O delay K5. Overlapped: toHost, Compress and Disk I/O run on the CPU in parallel with kernels K5 to K8 on the GPU]
Reverse Time Migration on CUDA
•Single transfer: wait for all the data to be in host memory
•Multiple transfers: overlap deviceToHost transfers with disk I/O
• Double buffering
└ Disk I/O
[Timelines: single transfer: one full deviceToHost, then Disk I/O. Multiple transfers: each toHost chunk overlaps the Disk I/O of the previous chunk (double buffering)]
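The double-buffering pattern can be sketched as follows. The copy into the staging buffer stands in for an asynchronous deviceToHost transfer and the copy to out for the disk write, so this sequential version only shows the buffer rotation (all names illustrative):

```c
#include <string.h>

/* Double-buffered drain: while chunk k sits in one staging buffer being
 * "written", chunk k+1 is "transferred" into the other buffer. */
enum { CHUNK = 4 };

void drain_double_buffered(const char *src, char *out, int nchunks)
{
    char buf[2][CHUNK];
    int cur = 0;
    memcpy(buf[cur], src, CHUNK);                 /* prefetch first chunk */
    for (int k = 0; k < nchunks; k++) {
        if (k + 1 < nchunks)                      /* start "transfer" of k+1 */
            memcpy(buf[cur ^ 1], src + (k + 1) * CHUNK, CHUNK);
        memcpy(out + k * CHUNK, buf[cur], CHUNK); /* "disk write" of chunk k */
        cur ^= 1;                                 /* swap staging buffers */
    }
}
```

In the real code the next transfer is issued asynchronously (cudaMemcpyAsync on a stream) before the write of the current chunk, which is what makes the two actually overlap.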
Reverse Time Migration on CUDA
•CUDA-RT limitations
• GPU memory accessible by the owner host thread only
→deviceToHost transfers must be performed by the compute thread
└ Disk I/O
[Diagram: compute and I/O threads in the CPU address space; only the compute thread can access the GPU address space]
Reverse Time Migration on CUDA
•CUDA-RT Implementation (single transfer)
• CUDA streams must be used not to block GPU execution
→Intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!
└ Disk I/O
[Diagram: single transfer staged through an intermediate page-locked buffer in the CPU address space]
Reverse Time Migration on CUDA
•CUDA-RT Implementation (multiple transfers)
• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
→Lots of synchronization code in the compute thread
└ Disk I/O
[Diagram: multiple chunks staged through small page-locked buffers between the GPU and CPU address spaces]
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•But… wait, real-size problems require > 16GB of data!
•Volumes are split into tiles (along the Z-axis)
• 3D-Stencil introduces data dependencies
└ Domain decomposition
[Diagram: volume split along the Z-axis into domains D1 to D4]
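The dependency the stencil introduces between tiles can be sketched with a halo exchange between two Z-adjacent tiles. One halo plane per shared face is assumed here for brevity (the real code needs half the stencil order), and all names are illustrative:

```c
#include <string.h>

/* Two tiles of a volume split along Z, each storing one halo plane
 * above and below its interior planes. */
enum { PLANE = 4, TILE_NZ = 3 };   /* PLANE = nx*ny, sizes illustrative */

typedef struct {
    /* layout along Z: [halo_lo | TILE_NZ interior planes | halo_hi] */
    float data[(TILE_NZ + 2) * PLANE];
} Tile;

static float *plane(Tile *t, int z) { return t->data + z * PLANE; }

/* Copy each tile's boundary plane into its neighbor's halo. */
void exchange_halos(Tile *lo, Tile *hi)
{
    /* lo's top interior plane -> hi's low halo */
    memcpy(plane(hi, 0), plane(lo, TILE_NZ), PLANE * sizeof(float));
    /* hi's bottom interior plane -> lo's high halo */
    memcpy(plane(lo, TILE_NZ + 1), plane(hi, 1), PLANE * sizeof(float));
}
```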
Reverse Time Migration on CUDA
•Multi-node may be required to overcome memory capacity limitations
• Shared memory for intra-node communication
• MPI for inter-node communication
└ Domain decomposition
[Diagram: two nodes with four GPUs each; shared host memory for intra-node communication, MPI between nodes]
Reverse Time Migration on CUDA
•Data flow-graph (multi-domain)
└ Domain decomposition
[Data flow-graph: one RTM Kernel and Compress per domain, operating on the wave-fields of domains 1 and 2; constant read-only data: velocity model, geophones’ traces]
Reverse Time Migration on CUDA
•Control flow-graph (multi-domain)
• Boundary exchange every time-step
• Inter-domain communication blocks execution of the next steps!
└ Domain decomposition
[Control flow-graph: Start, i = 0; loop while i < steps: Kernel (GPU); toHost, Exchange, sync; if i % N == 0: Compress, Disk I/O (CPU); i++; End]
Reverse Time Migration on CUDA
•Boundary exchange every time-step is needed
└ Domain decomposition
[Timeline: kernels K1 to K7 with a boundary exchange (X) after every step; toHost, Compress and Disk I/O on the CPU]
Reverse Time Migration on CUDA
•Single-transfer exchange
• “Easy” to program, needs large page-locked buffers
•Multiple-transfer exchange to maximize PCI-Express utilization
• “Complex” to program, needs smaller page-locked buffers
└ Domain decomposition
[Timelines: single-transfer exchange: one large deviceToHost followed by one large hostToDevice per GPU. Multiple-transfer exchange: many small toH/toD chunks pipelined so deviceToHost and hostToDevice transfers overlap across GPU1 to GPU4]
Reverse Time Migration on CUDA
•CUDA-RT limitations
• Each host thread can only access the memory objects it allocates
└ Domain decomposition
[Diagram: separate CPU and per-GPU address spaces; each host thread sees only its own allocations]
Reverse Time Migration on CUDA
•CUDA-RT implementation (single-transfer exchange)
• Streams and page-locked memory buffers must be used
• Page-locked memory buffers can be too big
└ Domain decomposition
[Diagram: single-transfer exchange for GPU1 to GPU4 staged through large page-locked buffers in the CPU address space]
•CUDA-RT implementation (multiple-transfer exchange)
• Uses small page-locked buffers
• More synchronization code
•Too complex to be represented using PowerPoint!
•Very difficult to implement in real code!
└ Domain decomposition
Outline
•Introduction
•Reverse Time Migration on CUDA
• General approach
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Reverse Time Migration on CUDA
•Problem: boundary exchange blocks the execution of the following time-step
└ Overlapping computation and communication
[Timeline: kernels K1 to K7 with a boundary exchange (X) after every step; toHost, Compress and Disk I/O on the CPU]
Reverse Time Migration on CUDA
•Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
└ Overlapping computation and communication
[Timeline: each time-step split into a small boundary kernel k and a large interior kernel K; the exchange (X) of each step overlaps the interior kernel of the same step, and toHost, Compress and Disk I/O overlap later steps]
Reverse Time Migration on CUDA └ Overlapping computation and communication
•Approach: two-stage execution
• Stage 1: compute the wavefield points to be exchanged
Reverse Time Migration on CUDA └ Overlapping computation and communication
•Approach: two-stage execution
• Stage 2: Compute the remaining points while exchanging the boundaries
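The two-stage split can be sketched in one dimension. Here f stands in for the stencil update and B for the boundary width; the point is that the two stages together update exactly the same points as a single pass (illustrative, not the production kernels):

```c
/* Two-stage update of a tile: stage 1 computes only the B boundary
 * points that neighbors need, stage 2 the interior points (which would
 * run while the boundary is being exchanged). */
enum { NPTS = 16, B = 2 };

static float f(float x) { return 2.0f * x + 1.0f; }  /* stand-in kernel */

void stage1_boundary(const float *in, float *out)
{
    for (int i = 0; i < B; i++)           out[i] = f(in[i]);
    for (int i = NPTS - B; i < NPTS; i++) out[i] = f(in[i]);
}

void stage2_interior(const float *in, float *out)
{
    for (int i = B; i < NPTS - B; i++)    out[i] = f(in[i]);
}
```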
Reverse Time Migration on CUDA
•But two-stage execution requires more abstractions and code complexity
• An additional stream per domain
• We already have 1 to launch kernels, 1 to overlap transfers to disk, 1 to exchange boundaries
→At this point the code is a complete mess!
• Requires 4 streams per domain, many page-locked buffers, lots of inter-thread synchronization
• Poor readability and maintainability
• Easy to introduce bugs
└ Overlapping computation and communication
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
→Features
• Code examples
•Reverse Time Migration on GMAC
•Conclusions
GMAC at a glance
•Library that enhances the host programming model of CUDA
•Freely available at http://code.google.com/p/adsm/
• Developed by BSC and UIUC
• NCSA license (BSD-like)
• Works on Linux and Mac OS X (Windows version coming soon)
•Presented in detail tomorrow at 9 am @ San Jose Ballroom
└ Introduction
GMAC at a glance
•Unified virtual address space for all the memories in the system
• Single allocation for shared objects
• Special API calls: gmacMalloc, gmacFree
• GPU memory allocated by a host thread is visible to all host threads
→Brings POSIX thread semantics back to developers
└ Features
[Diagram: CPU memory and GPU memory presented as one address space; shared data visible to both CPU and GPU, CPU data private]
GMAC at a glance
•Parallelism exposed via regular POSIX threads
• Replaces the explicit use of CUDA streams
• OpenMP support
•GMAC uses streams and page-locked buffers internally
• Concurrent kernel execution and memory transfers for free
└ Features
GMAC at a glance
•Optimized bulk memory operations via library interposition
• File I/O
• Standard I/O functions: fwrite, fread
• Automatic overlap of Disk I/O and hostToDevice and deviceToHost transfers
• Optimized GPU to GPU transfers via regular memcpy
• Enhanced versions of the MPI send/receive calls
└ Features
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
• Features
→Code examples
•Reverse Time Migration on GMAC
•Conclusions
GMAC at a glance
•Single allocation (and pointer) for shared objects
└ Examples
CUDA-RT:
    void compute(FILE *file, int size)
    {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<Dg, Db>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:
    void compute(FILE *file, int size)
    {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<Dg, Db>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }
GMAC at a glance
•Optimized support for bulk memory operations
└ Examples
(Same CUDA-RT and GMAC code as in the previous example; note that fread reads straight into the GMAC allocation, with no intermediate copy.)
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
→Disk I/O
• Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•CUDA-RT Implementation (multiple transfers)
• Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
→Lots of synchronization code in the compute thread
└ Disk I/O
[Diagram: the compute thread stages multiple chunks between the GPU address space and page-locked CPU buffers]
Reverse Time Migration on GMAC
•GMAC implementation
• deviceToHost transfers performed by the I/O thread
• deviceToHost and Disk I/O transfers overlap for free
• Small page-locked buffers are used
└ Disk I/O (GMAC)
[Diagram: the I/O thread accesses GPU memory directly through the global address space]
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
→Domain decomposition
• Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•CUDA-RT implementation (single-transfer exchange)
• Streams and page-locked memory buffers must be used
• Page-locked memory buffers can be too big
└ Domain decomposition (CUDA-RT)
[Diagram: single-transfer exchange for GPU1 to GPU4 staged through large page-locked buffers in the CPU address space]
•GMAC implementation (multiple-transfer exchange)
• Exchange of boundaries performed using a simple memcpy!
• Full PCIe utilization: internally GMAC performs several transfers and double buffering
Reverse Time Migration on GMAC └ Domain decomposition (GMAC)
[Diagram: GPU1 to GPU4 share a unified global address space]
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
→Overlapping computation and communication
• Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC
•No streams, no page-locked buffers, similar performance: ±2%
└ Overlapping computation and communication
CUDA-RT:
    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for

GMAC:
    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... Update pointers
    end for
Outline
•Introduction
•GMAC at a glance
•Reverse Time Migration on GMAC
• Disk I/O
• Domain decomposition
• Overlapping computation and communication
→Development cycle and debugging
•Conclusions
Reverse Time Migration on GMAC └ Development cycle and debugging
[Diagram: GPU kernels: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression]
•CUDA-RT
• Start from a simple, correct sequential code
• Implement kernels one at a time and check correctness
• Two allocations per data structure
• Keep data consistency by hand (cudaMemcpy)
• To introduce modifications to any kernel
• Two allocations per data structure
• Keep data consistency by hand (cudaMemcpy)
Reverse Time Migration on GMAC
•GMAC
• Allocate objects with gmacMalloc
• Single pointer
• Use the same pointer both in the host code and in the GPU kernel implementations
• No copies
└ Development cycle and debugging
[Diagram: GPU kernels: 3D-Stencil, Absorbing Boundary Conditions, Source insertion, Compression]
Outline
•Introduction
•Reverse Time Migration on CUDA
•GMAC at a glance
•Reverse Time Migration on GMAC
•Conclusions
Conclusions
•Heterogeneous systems based on GPUs are currently the most appropriate to implement RTM
•CUDA has programmability issues
• CUDA provides a good language to expose data parallelism in the code to be run on the GPU
• The host-side interface provided by the CUDA-RT makes it difficult to implement even some basic optimizations
•GMAC eases the development of applications for GPU-based systems with no performance penalty
•A single programmer working part-time for 6 months produced the full RTM version (5x speedup over the previous Cell implementation)
Acknowledgements
•Barcelona Supercomputing Center
•Repsol
•Universitat Politècnica de Catalunya
•University of Illinois at Urbana-Champaign
Thank you!
Questions?