implementation of voxel volume projection operators using cuda applications to iterative cone-beam...

17
Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Upload: rebekah-sherwood

Post on 28-Mar-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Implementation of Voxel Volume Projection Operators Using

CUDAApplications to Iterative Cone-Beam

CT reconstruction

Page 2: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Three questions to keep in mind during the presentation1. Name three differences between the GPU and

CPU contributing to the observed performance improvements.

2. Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems!

3. Suppose that each thread in a CUDA program is responsible for writing to one unique output element. Give one possible solution for handling ”odd sized” output data.

Page 3: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

What are voxel volume projectors?

• Object: f• Projection data: p (line integrals through f)• Forward projection: P, mapping f to p• Backprojection: PT, mapping p to f

X-ray source

Detector

f

p

Page 4: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

The tomography problem

• Given p, find an image/volume f, such that the difference between Pf and p is small in some sense, e.g. z(f)=||Pf-p||2 attains a minimum.

• Steepest descent approach: – Gradient: z(f)=2PT(Pf-p)– Update step: fk+1=fk-2PT(Pfk-p)

• We use this method only for illustration of why P and PT are needed. In practice, faster and better regularized methods are needed.

Page 5: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Why implement P and PT on the GPU?

• Reasonable sizes of f and p are around 2GB, implying a size of 2GBx2GB for P and PT.

• Although these matrices are sparse, the number of nonzero elements is approximately 2000GB, i.e., matrix-vector products involving these matrices are computationally demanding.

• The GPU offers many solutions that help speeding up the computations:– Massive parallelism– Hardware linear interpolation– Very high memory bandwidth– Texture caches ”optimized for 2D locality”

Page 6: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

The Joseph forward projector [1]

• Illustration in the 2D case:

• Generalization to 3D is straightforward. Instead of linear intepolation, bilinear interpolation is used.

[1] Joseph, P. M., An improved algorithm for reprojecting rays through pixel images, IEEE Transactions on Medical Imaging, 1982, MI-1, 192-196

L

x0

x1

x2

x3

pi= L (x0+x1+x2+x3)

Page 7: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Implementation sketch for P

• We choose to let one thread correspond to one ray.

• For each ray/thread:1. Determine the ray origin and direction.2. Determine the ray and voxel volume intersection.3. Step along the ray and accumulate values from

the voxel volume, using the 3D-texture cache.4. Multiply with L and store to output element.

Page 8: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Handling of ”odd” detector sizes

• The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2).

• To handle the case with detector dimensions not divisible with the block size, conditional statements were used:

• Although this reduces efficiency due to divergent programs, the reduction is small for detectors with reasonably large number of channels and rows.

// Return if outside detectorif (rowIx>=Nrows) return;if (chIx>=Nch) return;

Page 9: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Implementation details 1Calc. of source and detector positions

// Calculate focus center and displacement vectorsfloat alpha = theta + D_ALPHA(chIx);float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0+D_Z(chIx));float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);float3 fh = make_float3(-cos(alpha)*sin(aAngle), -sin(alpha)*sin(aAngle), cos(aAngle));

// Calculate detector center and displacement vectorsfloat3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));float3 dw = make_float3(sin(theta), -cos(theta), 0.f);float3 dh = make_float3(0.f, 0.f, 1.f);

// Calculate focus position in texture index coordinatesfloat3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;

// Calculate detector position in texture index coordinatesfloat3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;

// Create ray structRay ray;ray.o = f;ray.d = normalize(d - f);

Code based on NVIDIA CUDA SDK ”volumeRender”

Page 10: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Implementation details 2Accumulation of contributions

// Accumulate contributions along rayfloat dValue = 0;float t = tnear;for(int i=0; i<dimensions.x; i++) { // Calculate current position float3 pos = ray.o + ray.d*t + 0.5f; // texture origin: (0.5f, 0.5f, 0.5f)

// read from 3D texture dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);

// increase t t += tstep; if (t >= tfar) break;}

// Update detector data valuedValue *= length((ray.d*spacing))*tstep;projectionData[rowIx*Nch+chIx] += dValue;

Code based on NVIDIA CUDA SDK ”volumeRender”

Page 11: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Experiments – forward projection (P)

• Input data dimensions: 672x24x3000 floats• Output data dimensions: 512x512x257 floats• Only very small differences in accuracy, without

practical implications, occur between CPU and GPU implementations.

• Calculation times– CPU: 2500 s– GPU: 47 s

• Speed up factor: approximately 50x• For a larger collection of problems, speedups between

20x and 50x have been observed.

Page 12: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Efficient implementation of the exact adjoint operator PT is much trickier, why?• 2D illustration of the procedure:

• Same interpolation coefficients as for P, but with scattered accumulation instead of reading.

• No hardware interpolation. No 2D/3D textures.• New parallelization setup is needed. One ray/one

thread leads to corrupted results.

pi

Page 13: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

One slow but exact implementation for the exact transpose of P• Let one thread represent only one accumulation from a ray,

i.e.,– calculation of position in voxel volume.– calculation and multiplication with interpolation coefficients.– accumulation to the 4 voxels closest to the active ray/voxel

volume plane intersection.• Very short, and very many rays threads.• One thread block represents a number of rays, separated

so that conflicting updates do not occur:

Block 0 Block 1 Block 2 ...

Page 14: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Experiments – backprojection (PT)

• Input data dimensions: 512x512x257 floats• Output data dimensions: 672x24x3000 floats• Calculation times– CPU: 2700 s– GPU: 700 s

• Speed up factor: approximately 4x• For a larger collection of problems, speedups

between 4x and 8x have been observed.

Page 15: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Approximate implementation of the adjoint operator PT

• Bilinear interpolation on the detector.• 2D-illustration:

• Parallelization is now accomplished by letting one pixel correspond to one thread.

• This method is approximate since the interpolation coefficients generally are different from those of PT.

Page 16: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Performance comparisonApproximate versus exact PT operator• Calculation times

– CPU PT : 2700s– GPU PT (exact): 700s– GPU PT (approximate): 60s

• Axial images of Head phantom reconstructions (25HU-75HU):

Exact Approximate

Page 17: Implementation of Voxel Volume Projection Operators Using CUDA Applications to Iterative Cone-Beam CT reconstruction

Conclusions and future research• For operations such as P and PT, speedups of 4x to 50x can be

obtained.• The amount of speedup is highly dependent on

– The possibility to efficiently read memory (coalesced or by the use of texture caches / constant memory).

– The possibility to use hardware interpolation– Complexity of the kernel. Using too many registers slows down the

program. Lookup tables in constant memory can help reducing the amount of registers.

• Although scattered write is supported by CUDA, for these problemes, it did not seem good to use it from a performance point of view. Note that this prohibits using cudaArray textures.

• Remark: Even if the approximate PT operator give a bad result in the example given here, there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.