TRANSCRIPT (printed 8/13/2019, 31 slides)
CUDA Image Registration, 29 Oct 2008, Richard Ansorge
Medical Image Registration
A Quick Win
Richard Ansorge
-
The problem
CT, MRI, PET and Ultrasound produce 3D volume images.
Typically 256 x 256 x 256 = 16,777,216 image voxels.
Combining modalities (intermodality) gives extra information.
Repeated imaging over time with the same modality, e.g. MRI (intramodality), is equally important.
The images have to be spatially registered.
-
Example: brain lesion
[Figure: CT, MRI and PET images of the same lesion]
-
PET-MR Fusion
The PET image shows metabolic activity.
This complements the MR structural information.
-
Registration Algorithm
[Flowchart: transform Im B to match Im A; compute cost function; if the fit is good, done; otherwise update the transform parameters and repeat]
NB Cost function calculation dominates for 3D images and is inherently parallel.
-
Transformations
A general affine transform has 12 parameters:

    [ a11 a12 a13 a14 ]
    [ a21 a22 a23 a24 ]
    [ a31 a32 a33 a34 ]
    [  0   0   0   1  ]

Polynomial transformations can be useful for e.g. pin-cushion type distortions:

    x' = a11*x + a12*y + a13*z + a14 + b1*x^2 + b2*x*y + b3*y^2 + b4*z^2 + b5*x*z + b6*y*z
    (and similarly for y' and z')

Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular, but very computationally demanding.
-
We tried this before: 6-parameter rigid registration, done 8 years ago.
[Chart: Time/secs (left axis, 0-1200) and Speedup Factor (right axis, 0-64) vs Number of Processors (0-64); series: SR2201, PC 333MHz, Speedup, perfect scaling]
-
Now - Desktop PC - Windows XP
Needs 400 W power supply
-
Free Software: CUDA & Visual C++ Express
-
Visual C++ SDK in action
-
Visual C++ SDK in action
-
Architecture
-
9600 GT Device Query
Current GTX 280 has 240 cores!
-
Matrix Multiply from SDK
NB using 4-byte floats
-
Matrix Multiply (from SDK)
[Chart: GPU speedup vs CPU for NxN matrix multiply, N from 0 to 6144, speedup axis 0-400; average speedup shown]
-
Matrix Multiply (from SDK)
[Chart: GPU speedup vs CPU for NxN matrix multiply, N from 0 to 6144, speedup axis 0-800; speedup and average speedup shown]
-
Matrix Multiply (from SDK)
[Chart: GPU vs CPU for NxN matrix multiply, N from 0 to 6144; left axis speedup 0-800, right axis speed 0-40, plotted as speedup, CPU mads/100 ns and GPU mads/ns]
-
Image Registration
CUDA Code
-
#include <cutil_math.h>        // include truncated in transcript; the SDK's cutil_math.h (float4, dot) is assumed

texture<float,3> tex1;         // Target Image in texture (declaration truncated; 3D float texture assumed)
__constant__ float c_aff[16];  // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;  // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y;  // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                                // start with slice zero
    float4 v  = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
    float tx = dot(r0,v);      // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost   = 0.0f;
    uint is    = iy*nx+ix;
    uint istep = nx*ny;
    for (int iz=0; iz<nz; iz++) {  // loop over z-slices; body truncated in transcript
        ...
    }
}
-
Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16;        // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;
// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float));          // passed as b to kernel
bufflen = w2*h2;
Array1D shbuff = Array1D(bufflen);
shbuff.Zero();
hbuff = shbuff.v;
cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float));    // passed as s to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);
e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);                 // translations
r = make_float3(0.0f);                 // rotations
s = make_float3(1.0f,1.0f,1.0f);       // scale factors
t = make_float3(0.0f);                 // tans of shears
...
-
Calling the Kernel
double nr_costfun(Array1D &a)
{
    static Array2D affine = Array2D(4,4);   // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit,affine,a);    // convert to 4x4 matrix of floats
    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));    // load constant mem
    d_costfun<<<gridSize,blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed");       // OK?
    cudaThreadSynchronize();                // make sure all done
    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);
    for (int iy=0; iy<   // final host-side reduction; rest truncated in transcript
-
Example Run (240x256x176 images)
C:\>airwc
airwc v2.5 Usage: AirWc opts(12rtdgsf)
C:\>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000
Using device 0: GeForce 9600 GT
Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:127946102
Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187
timer 0 1890 ms
timer 1 0 ms
timer 2 3849 ms
timer 3 448 ms
timer 4 0 ms
Total 6.187 secs
Final Transformation:
0.944702 -0.184565 0.017164 40.637428
0.301902 0.866726 -0.003767 -38.923237
-0.028792 -0.100618 0.990019 18.120852
0.000000 0.000000 0.000000 1.000000
Final rots and shifts
6.096217 -0.156668 -19.187197
-0.012378 0.072203 0.122495
scales and shears
0.952886 0.912211 0.995497
0.150428 -0.101673 0.009023
-
Desktop 3D Registration
Registration with CUDA: 6 seconds
Registration with FLIRT 4.1: 8.5 minutes
-
Comments
This is actually already very useful. Almost interactive (add visualisation).
Further speedups possible:
  Faster card
  Smarter optimiser
  Overlap IO and kernel execution
  Tweak CUDA code
Extend to non-linear local registration.
-
Intel Larrabee?
Figure 1: Schematic of the Larrabee many-core architecture: the number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Porting from CUDA to Larrabee should be easy.
-
Thank you