TRANSCRIPT (printed 8/13/2019, 31 slides)
CUDA Image Registration, 29 Oct 2008, Richard Ansorge
Medical Image Registration
A Quick Win
Richard Ansorge
-
The problem
CT, MRI, PET and Ultrasound produce 3D volume images.
Typically 256 x 256 x 256 = 16,777,216 image voxels.
Combining modalities (intermodality) gives extra information.
Repeated imaging over time with the same modality, e.g. MRI (intramodality), is equally important.
The images have to be spatially registered.
-
Example: brain lesion
[Figure: CT, MRI and PET images of the same lesion]
-
PET-MR Fusion
The PET image shows metabolic activity.
This complements the MR structural information.
-
Registration Algorithm
[Flowchart: transform Im B to match Im A; compute cost function; if the fit is good, done; otherwise update the transform parameters and repeat]
NB Cost function calculation dominates for 3D images and is inherently parallel.
-
Transformations
A general affine transform has 12 parameters:

    [ a11 a12 a13 a14 ]
    [ a21 a22 a23 a24 ]
    [ a31 a32 a33 a34 ]
    [  0   0   0   1  ]

Polynomial transformations can be useful for e.g. pin-cushion type distortions:

    x' = a11*x + a12*y + a13*z + a14 + b1*x^2 + b2*x*y + b3*y^2 + b4*z^2 + b5*x*z + b6*y*z
    (and similarly for y' and z')

Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular, but very computationally demanding.
-
We tried this before: 6-parameter rigid registration, done 8 years ago.
[Chart: Time/secs (left axis, 0-1200) and Speedup Factor (right axis, 0-64) vs Number of Processors (0-64); series: SR2201, PC 333MHz, Speedup, perfect scaling]
-
Now - Desktop PC - Windows XP
Needs 400 W power supply
-
Free Software: CUDA & Visual C++ Express
-
Visual C++ SDK in action
-
Visual C++ SDK in action
-
Architecture
-
9600 GT Device Query
Current GTX 280 has 240 cores!
-
Matrix Multiply from SDK
NB using 4-byte floats
-
Matrix Multiply (from SDK)
[Chart: GPU speedup vs CPU for NxN matrix multiply, N from 0 to 6144, speedup axis 0-400; average speedup shown]
-
Matrix Multiply (from SDK)
[Chart: GPU speedup vs CPU for NxN matrix multiply, N from 0 to 6144, speedup axis 0-800; speedup and average speedup shown]
-
Matrix Multiply (from SDK)
[Chart: GPU vs CPU for NxN matrix multiply, N from 0 to 6144; left axis speedup 0-800, right axis speed 0-40, plotted as speedup, CPU mads/100 ns and GPU mads/ns]
-
Image Registration
CUDA Code
-
#include <cutil_math.h>        // include truncated in transcript; the SDK's cutil_math.h (float4, dot) is assumed

texture<float,3> tex1;         // Target Image in texture (declaration truncated; 3D float texture assumed)
__constant__ float c_aff[16];  // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;  // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y;  // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                                // start with slice zero
    float4 v  = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
    float tx = dot(r0,v);      // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost   = 0.0f;
    uint is    = iy*nx+ix;
    uint istep = nx*ny;
    for (int iz=0; iz<nz; iz++) {  // loop over z-slices; body truncated in transcript
        ...
    }
}
-
Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16;        // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;
// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float));          // passed as b to kernel
bufflen = w2*h2;
Array1D shbuff = Array1D(bufflen);
shbuff.Zero();
hbuff = shbuff.v;
cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float));    // passed as s to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);
e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);                 // translations
r = make_float3(0.0f);                 // rotations
s = make_float3(1.0f,1.0f,1.0f);       // scale factors
t = make_float3(0.0f);                 // tans of shears
...
-
Calling the Kernel
double nr_costfun(Array1D &a)
{
    static Array2D affine = Array2D(4,4);   // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit,affine,a);    // convert to 4x4 matrix of floats
    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));    // load constant mem
    d_costfun<<<gridSize,blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed");       // OK?
    cudaThreadSynchronize();                // make sure all done
    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);
    for (int iy=0; iy<   // final host-side reduction; rest truncated in transcript
-
Example Run (240x256x176 images)
C:\>airwc
airwc v2.5 Usage: AirWc opts(12rtdgsf)
C:\>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000
Using device 0: GeForce 9600 GT
Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:127946102
Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187
timer 0 1890 ms
timer 1 0 ms
timer 2 3849 ms
timer 3 448 ms
timer 4 0 ms
Total 6.187 secs
Final Transformation:
0.944702 -0.184565 0.017164 40.637428
0.301902 0.866726 -0.003767 -38.923237
-0.028792 -0.100618 0.990019 18.120852
0.000000 0.000000 0.000000 1.000000
Final rots and shifts
6.096217 -0.156668 -19.187197
-0.012378 0.072203 0.122495
scales and shears
0.952886 0.912211 0.995497
0.150428 -0.101673 0.009023
-
Desktop 3D Registration
Registration with CUDA: 6 seconds
Registration with FLIRT 4.1: 8.5 minutes
-
Comments
This is actually already very useful. Almost interactive (add visualisation).
Further speedups possible:
  Faster card
  Smarter optimiser
  Overlap IO and kernel execution
  Tweak CUDA code
Extend to non-linear local registration.
-
Intel Larrabee?
Figure 1: Schematic of the Larrabee many-core architecture: the number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Porting from CUDA to Larrabee should be easy.
-
Thank you