Richard Ansorge


  • Slide 1/31

    CUDA Image Registration, 29 Oct 2008, Richard Ansorge

    Medical Image Registration: A Quick Win

    Richard Ansorge

  • Slide 2/31

    The problem

    CT, MRI, PET and Ultrasound produce 3D volume images.

    Typically 256 x 256 x 256 = 16,777,216 image voxels.

    Combining modalities (intermodality) gives extra information. Repeated imaging over time with the same modality, e.g. MRI (intramodality), is equally important.

    Have to spatially register the images.
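    As a side note (my arithmetic, not the slides'), the raw storage such a volume needs is easy to check:

    ```cpp
    #include <cstdio>

    int main() {
        // A typical 256^3 volume, one 4-byte float per voxel
        const long long n = 256LL * 256 * 256;      // voxel count
        const long long bytes = n * sizeof(float);  // raw storage
        printf("%lld voxels, %lld MiB per volume\n", n, bytes >> 20);
        // Registration touches two such volumes (source and target)
        // per cost evaluation, so the working set is ~128 MiB.
        return 0;
    }
    ```

    This prints 16777216 voxels and 64 MiB, which is why per-voxel work dominates the run time.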

  • Slide 3/31

    Example: brain lesion

    CT MRI PET

  • Slide 4/31

    PET-MR Fusion

    The PET image shows metabolic activity. This complements the MR structural information.

  • Slide 5/31

    Registration Algorithm

    Transform Im B to match Im A, compute the cost function, then test for a good fit: if yes, done; if no, update the transform parameters and repeat.

    NB Cost function calculation dominates for 3D images and is inherently parallel.
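    The transform / cost / update loop on this slide can be sketched in plain C++ with a toy one-parameter transform and an abs-difference cost (purely illustrative; the names and the crude step-halving optimiser are my own, not the author's code):

    ```cpp
    #include <cmath>
    #include <cstdio>

    // Toy 1D "images": B is A shifted by 3 units; cost = summed abs-difference.
    static double cost(double shift) {
        double c = 0.0;
        for (int x = 0; x < 100; ++x) {
            double a = std::sin(0.1 * x);                  // image A
            double b = std::sin(0.1 * (x - 3.0 + shift));  // image B, transformed
            c += std::fabs(a - b);
        }
        return c;
    }

    int main() {
        // Crude optimiser: step the one parameter while it improves the cost,
        // halving the step otherwise - the "update transform parameters" box.
        double shift = 0.0, step = 1.0;
        double best = cost(shift);
        while (step > 1e-3) {                              // "good fit?" test
            double up = cost(shift + step), dn = cost(shift - step);
            if (up < best)      { shift += step; best = up; }
            else if (dn < best) { shift -= step; best = dn; }
            else step *= 0.5;                              // shrink search step
        }
        printf("recovered shift ~ %.3f, cost %.4f\n", shift, best);
        return 0;
    }
    ```

    The search recovers the true shift of 3. In the talk the cost over all voxels is the expensive part, which is what moves to the GPU.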

  • Slide 6/31

    Transformations

    General affine transform has 12 parameters:

        [ a11 a12 a13 a14 ]
        [ a21 a22 a23 a24 ]
        [ a31 a32 a33 a34 ]
        [  0   0   0   1  ]

    Polynomial transformations can be useful for e.g. pin-cushion type distortions:

        x' = a11*x + a12*y + a13*z + a14 + b1*x^2 + b2*xy + b3*y^2 + b4*z^2 + b5*xz + b6*yz

    (and similarly for y' and z').

    Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
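    Applying such a 4x4 affine matrix to a voxel coordinate in homogeneous form is just three dot products (the bottom row is implicit); a minimal CPU sketch, assuming row-major layout (illustrative only, not the author's code):

    ```cpp
    #include <cstdio>

    // Apply a 4x4 affine matrix (row-major, last row 0 0 0 1) to (x,y,z,1).
    static void affine_apply(const float m[16], float x, float y, float z,
                             float out[3]) {
        out[0] = m[0]*x + m[1]*y + m[2]*z  + m[3];
        out[1] = m[4]*x + m[5]*y + m[6]*z  + m[7];
        out[2] = m[8]*x + m[9]*y + m[10]*z + m[11];
    }

    int main() {
        // Pure translation by (5, -2, 1): identity rotation part.
        const float m[16] = {1,0,0,5,  0,1,0,-2,  0,0,1,1,  0,0,0,1};
        float p[3];
        affine_apply(m, 10.f, 10.f, 10.f, p);
        printf("(%g, %g, %g)\n", p[0], p[1], p[2]);  // (15, 8, 11)
        return 0;
    }
    ```

    The kernel shown later in the talk does exactly this per voxel, with the matrix held in constant memory.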

  • Slide 7/31

    We tried this before: 6-parameter rigid registration, done 8 years ago

    [Chart: registration time (secs, 0-1200) and speedup factor (0-64) vs number of processors (0-64), for the SR2201 and a 333 MHz PC; measured speedup plotted against perfect scaling.]

  • Slide 8/31

    Now - Desktop PC - Windows XP

    Needs 400 W power supply

  • Slide 9/31

    Free Software: CUDA & Visual C++ Express

  • Slide 10/31

    Visual C++ SDK in action

  • Slide 11/31

    Visual C++ SDK in action

  • Slide 12/31

    Architecture

  • Slide 13/31

    9600 GT Device Query

    Current GTX 280 has 240 cores!

  • Slide 14/31

    Matrix Multiply from SDK

    NB using 4-byte floats

  • Slide 15/31

    Matrix Multiply (from SDK)

    [Chart: GPU vs CPU speedup for NxN matrix multiply, N = 0 to 6144; GPU speedup axis 0-400; series: average speedup.]

  • Slide 16/31

    Matrix Multiply (from SDK)

    [Chart: GPU vs CPU speedup for NxN matrix multiply, N = 0 to 6144; GPU speedup axis 0-800; series: speedup, average speedup.]

  • Slide 17/31

    Matrix Multiply (from SDK)

    [Chart: GPU vs CPU speedup for NxN matrix multiply, N = 0 to 6144; left axis GPU speedup 0-800, right axis arithmetic rate 0-40 in mads (multiply-adds); series: speedup, CPU mads/100 ns, GPU mads/ns.]

  • Slide 18/31

    Image Registration

    CUDA Code

  • Slide 19/31

    #include ...                       // header name lost in transcript
    texture tex1;                      // Target Image in texture (template args lost in transcript)
    __constant__ float c_aff[16];      // 4x4 Affine transform

    // Function arguments are image dimensions and pointers to output buffer b
    // and Source Image s. These buffers are in device memory
    __global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
    {
        int ix = blockIdx.x*blockDim.x + threadIdx.x;  // Thread ID matches
        int iy = blockIdx.y*blockDim.y + threadIdx.y;  // Source Image x-y
        float x = (float)ix;
        float y = (float)iy;
        float z = 0.0f;                                // start with slice zero
        float4 v  = make_float4(x,y,z,1.0f);
        float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
        float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
        float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
        float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
        float tx = dot(r0,v);          // Matrix Multiply using dot products
        float ty = dot(r1,v);
        float tz = dot(r2,v);
        float source = 0.0f;
        float target = 0.0f;
        float cost   = 0.0f;
        uint is    = iy*nx+ix;
        uint istep = nx*ny;
        for (int iz = 0; iz < ...      // loop body truncated in transcript
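    The kernel's slice loop is cut off in this transcript. Judging from the setup (one thread per (ix,iy) column, istep = nx*ny, per-thread partial sums written to b), a plausible CPU reconstruction of what the loop computes is the following; the nearest-neighbour sampling and abs-difference cost are assumptions inferred from the run log later in the talk, not recovered code:

    ```cpp
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // CPU sketch of the kernel's per-thread loop over z-slices: transform each
    // source voxel by the affine matrix, sample the target there, accumulate an
    // abs-difference cost per (ix,iy) column. Nearest-neighbour lookup stands
    // in for the texture-hardware interpolation the GPU version would use.
    float costfun_cpu(int nx, int ny, int nz,
                      const std::vector<float>& target,   // plays "tex1"
                      const std::vector<float>& source,   // plays "s"
                      const float aff[16]) {
        double total = 0.0;
        for (int iy = 0; iy < ny; ++iy)
            for (int ix = 0; ix < nx; ++ix) {
                float cost = 0.0f;
                for (int iz = 0; iz < nz; ++iz) {
                    // transformed position of source voxel (ix,iy,iz)
                    float tx = aff[0]*ix + aff[1]*iy + aff[2]*iz  + aff[3];
                    float ty = aff[4]*ix + aff[5]*iy + aff[6]*iz  + aff[7];
                    float tz = aff[8]*ix + aff[9]*iy + aff[10]*iz + aff[11];
                    int jx = (int)std::lroundf(tx), jy = (int)std::lroundf(ty),
                        jz = (int)std::lroundf(tz);
                    float tgt = 0.0f;               // outside volume -> 0
                    if (jx >= 0 && jx < nx && jy >= 0 && jy < ny &&
                        jz >= 0 && jz < nz)
                        tgt = target[(jz*ny + jy)*nx + jx];
                    float src = source[(iz*ny + iy)*nx + ix];
                    cost += std::fabs(src - tgt);   // abs-difference cost
                }
                total += cost;                      // host does the final sum
            }
        return (float)total;
    }

    int main() {
        const int n = 8;
        std::vector<float> t(n*n*n), s(n*n*n);
        for (int i = 0; i < n*n*n; ++i) t[i] = s[i] = (float)(i % 7);
        const float ident[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        printf("identity cost = %g\n", costfun_cpu(n,n,n,t,s,ident)); // 0
        return 0;
    }
    ```

    With the identity transform on identical images the cost is exactly zero, a handy sanity check for any implementation of this kernel.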


  • Slide 25/31

    Host Code Initialization Fragment

    ...
    blockSize.x = blockSize.y = 16;        // multiples of 16 a VERY good idea
    gridSize.x = (w2+15) / blockSize.x;
    gridSize.y = (h2+15) / blockSize.y;
    // allocate working buffers, image is W2 x H2 x D2
    cudaMalloc((void**)&dbuff, w2*h2*sizeof(float));        // passed as b to kernel
    bufflen = w2*h2;
    Array1D shbuff = Array1D(bufflen);
    shbuff.Zero();
    hbuff = shbuff.v;
    cudaMalloc((void**)&dnewbuff, w2*h2*d2*sizeof(float));  // passed as s to kernel
    cudaMemcpy(dnewbuff, vol2, w2*h2*d2*sizeof(float), cudaMemcpyHostToDevice);
    e = make_float3((float)w2/2.0f, (float)h2/2.0f, (float)d2/2.0f); // fixed rotation origin
    o = make_float3(0.0f);             // translations
    r = make_float3(0.0f);             // rotations
    s = make_float3(1.0f, 1.0f, 1.0f); // scale factors
    t = make_float3(0.0f);             // tans of shears
    ...
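    The gridSize lines above use the standard round-up division idiom, so an image whose width is not a multiple of 16 still gets a block for its partial edge tile; in isolation:

    ```cpp
    #include <cstdio>

    // Round-up integer division: number of blockDim-wide blocks covering w.
    static int blocks_needed(int w, int blockDim) {
        return (w + blockDim - 1) / blockDim;  // same as (w+15)/16 for blockDim=16
    }

    int main() {
        printf("%d %d %d\n",
               blocks_needed(256, 16),   // 16 (exact fit)
               blocks_needed(240, 16),   // 15
               blocks_needed(241, 16));  // 16 (one partial block at the edge)
        return 0;
    }
    ```

    Threads in the partial block that fall outside the image must be guarded in the kernel (or, as here, the buffers are sized to the padded grid).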

  • Slide 26/31

    Calling the Kernel

    double nr_costfun(Array1D &a)
    {
        static Array2D affine = Array2D(4,4);  // a holds current transformation
        double sum = 0.0;
        make_affine_from_a(nr_fit, affine, a); // convert to 4x4 matrix of floats
        cudaMemcpyToSymbol(c_aff, affine.v[0], 4*4*sizeof(float)); // load constant mem
        d_costfun<<<gridSize,blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel (launch config reconstructed)
        CUT_CHECK_ERROR("kernel failed");      // OK?
        cudaThreadSynchronize();               // make sure all done
        // copy partial sums from device to host
        cudaMemcpy(hbuff, dbuff, bufflen*sizeof(float), cudaMemcpyDeviceToHost);
        for (int iy = 0; iy < ...              // summation loop truncated in transcript
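    The final loop is truncated in the transcript. Given that bufflen = w2*h2 partial sums were just copied into hbuff, it presumably finishes along these lines (a reconstruction, not the original code):

    ```cpp
    #include <cassert>
    #include <vector>

    // Sketch of the truncated host-side finish: add the per-(ix,iy) partial
    // sums the kernel left in hbuff into one double-precision total.
    double finish_cost(const std::vector<float>& hbuff, int w2, int h2) {
        double sum = 0.0;
        for (int iy = 0; iy < h2; ++iy)
            for (int ix = 0; ix < w2; ++ix)
                sum += hbuff[iy*w2 + ix];
        return sum;
    }

    int main() {
        std::vector<float> buf(4*3, 1.5f);       // 12 partial sums of 1.5
        assert(finish_cost(buf, 4, 3) == 18.0);  // 12 * 1.5
        return 0;
    }
    ```

    Accumulating in double on the host avoids the precision loss a single float sum over millions of voxels would suffer.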

  • Slide 27/31

    Example Run (240x256x176 images)

    C:>airwc

    airwc v2.5 Usage: AirWc opts(12rtdgsf)

    C:>airwc sb1 sb2 junk 1f

    NIFTI Header on File sb1.nii

    converting short to float 0 0.000000

    NIFTI Header on File sb2.nii

    converting short to float 0 0.000000

    Using device 0: GeForce 9600 GT

    Initial correlation 0.734281

    using cost function 1 (abs-difference)

    using cost function 1 (abs-difference)

    Amoeba time: 4297, calls 802, cost:127946102

    Cuda Total time 4297, Total calls 802

    File dofmat.mat written
    Nifti file junk.nii written, bswop=0

    Full Time 6187

    timer 0 1890 ms

    timer 1 0 ms

    timer 2 3849 ms

    timer 3 448 ms

    timer 4 0 ms

    Total 6.187 secs

    Final Transformation:

    0.944702 -0.184565 0.017164 40.637428

    0.301902 0.866726 -0.003767 -38.923237

    -0.028792 -0.100618 0.990019 18.120852

    0.000000 0.000000 0.000000 1.000000

    Final rots and shifts

    6.096217 -0.156668 -19.187197

    -0.012378 0.072203 0.122495

    scales and shears

    0.952886 0.912211 0.995497

    0.150428 -0.101673 0.009023

  • Slide 28/31

    Desktop 3D Registration

    Registration with CUDA: 6 seconds

    Registration with FLIRT 4.1: 8.5 minutes

  • Slide 29/31

    Comments

    This is actually already very useful. Almost interactive (add visualisation).

    Further speedups possible:

    Faster card

    Smarter optimiser

    Overlap I/O and kernel execution

    Tweak CUDA code

    Extend to non-linear local registration

  • Slide 30/31

    Intel Larrabee?

    Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.

    Porting from CUDA to Larrabee should be easy.

  • Slide 31/31

    Thank you