
Generation of planar radiographs from 3D anatomical models using the GPU

André dos Santos Cardoso
Supervisor: Jorge M. G. Barbosa

University of Porto
Faculty of Engineering of University of Porto

11th February, 2011

André Cardoso [email protected] DRR Synthesis Algorithms

Contents

Introduction and Context

CUDA Platform

Input Data

Pre-Processing Steps

Developed Algorithms

Conclusion





DRRs

• Digitally Reconstructed Radiographs – DRRs
• Artificial radiographs taken from vertebrae models

Figure: L3 Vertebra, frontal DRR
Figure: L3 Vertebra, lateral DRR


DRRs – Why?

• Shape recovery of human spine
◦ 100s of DRRs per second

• Scoliosis Evaluation
◦ Alternative to MRIs and CTs


Project’s Objective

Build Fast DRR Algorithms
• Common bottleneck!
◦ Applications in the medical area demand high throughput

• Take advantage of new GPUs and APIs
◦ Common workstations could do the job!


Existing Solution – GLSL

• GLSL implementation – multi-pass working solution
• Depth Peeling Based – Cass Everitt, Interactive Order-Independent Transparency

• Let’s try to enhance its performance!!


Algorithm Concepts

Figure: X-ray source, object with ray intersection points P1 to P4, image plane
Figure: Problem! Potential artifact generation on the object

• Each ray traverses the object
◦ Energy is attenuated
◦ PixelColor = exp((||P2 − P1|| + ||P4 − P3||) × AttenuationFactor)

• Common edges may lead to artifact generation!
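The attenuation formula above can be sketched in a few lines of Python. This is only an illustration, not the presented implementation: the function name `pixel_color` and the convention that hit points arrive already ordered along the ray as entry/exit pairs are assumptions. The slide formula shows no explicit minus sign, so the example passes a negative attenuation factor to obtain darker values for longer paths.

```python
import math


def pixel_color(hit_points, attenuation_factor):
    """Evaluate exp((|P2 - P1| + |P4 - P3| + ...) * attenuation_factor).

    hit_points: 3D intersection points P1, P2, ... ordered along the ray,
    assumed to alternate between entering and leaving the object.
    """
    if len(hit_points) % 2 != 0:
        raise ValueError("expected an even number of entry/exit points")

    path_inside = 0.0
    # Pair the points as (P1, P2), (P3, P4), ... and sum the segment lengths.
    for entry, exit_ in zip(hit_points[0::2], hit_points[1::2]):
        path_inside += math.dist(entry, exit_)

    return math.exp(path_inside * attenuation_factor)


# Hypothetical ray along +z that crosses the object twice (1 + 3 units inside).
points = [(0, 0, 1), (0, 0, 2), (0, 0, 4), (0, 0, 7)]
value = pixel_color(points, attenuation_factor=-0.5)
```

An odd number of hit points is rejected here; on a real mesh that situation is one symptom of the shared-edge artifacts mentioned above.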




CUDA Platform

• Compute Unified Device Architecture
◦ Parallel Computing Architecture
◦ Exposes GPU functions and memory
◦ SIMT execution model
◦ Allows hierarchical configuration of threads

• Cheap threads, dozens/hundreds of cores
◦ Thousands of concurrent threads!

• GeForce GT 240
◦ 96 cores
◦ 12288 active threads
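To make the hierarchical thread configuration concrete, the following Python sketch emulates how a 2D grid of thread blocks could be mapped onto image pixels. The 16 × 16 block size and the helper names are assumptions chosen for illustration; the 266 × 138 image size is the L3 vertebra PA view quoted elsewhere in these slides.

```python
import math


def grid_size(width, height, block=(16, 16)):
    """Number of blocks needed to cover the image in each dimension."""
    return math.ceil(width / block[0]), math.ceil(height / block[1])


def pixel_of(block_idx, thread_idx, block=(16, 16)):
    """Global pixel coordinate, as blockIdx * blockDim + threadIdx would give."""
    return (block_idx[0] * block[0] + thread_idx[0],
            block_idx[1] * block[1] + thread_idx[1])


def active_pixels(width, height, block=(16, 16)):
    """Count the threads whose pixel falls inside the image."""
    gx, gy = grid_size(width, height, block)
    count = 0
    for by in range(gy):
        for bx in range(gx):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    u, v = pixel_of((bx, by), (tx, ty), block)
                    # Threads past the image border must do nothing.
                    if u < width and v < height:
                        count += 1
    return count


grid = grid_size(266, 138)
```

Because the grid is rounded up to whole blocks, some threads fall outside the image and need a bounds check before touching pixel data.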


CUDA Platform – Threading and Memory





Inputs for Our Algorithms

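• Geometry file – the vertebrae models

• Camera Calibration Matrix

Figure: Pinhole Model

\[
C = \begin{bmatrix} \alpha_u & \lambda & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\qquad
P = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\qquad
K = \begin{bmatrix} R & t \\ 0_3^T & 1 \end{bmatrix}
\]

\[
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = C \cdot P \cdot K \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\]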
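As a worked illustration of the pinhole projection s·[u v 1]ᵀ = C·P·K·[X Y Z 1]ᵀ, here is a minimal pure-Python sketch. The function names and the parameter values in the example are assumptions, not values taken from the project's calibration data.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def project(point, alpha_u, alpha_v, skew, u0, v0, f, R, t):
    """Project a 3D point to pixel coordinates (u, v) with C.P.K."""
    C = [[alpha_u, skew, u0],
         [0.0, alpha_v, v0],
         [0.0, 0.0, 1.0]]
    P = [[f, 0.0, 0.0, 0.0],
         [0.0, f, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
    # K stacks the rotation R (3x3) and translation t (3x1) over [0 0 0 1].
    K = [list(R[0]) + [t[0]],
         list(R[1]) + [t[1]],
         list(R[2]) + [t[2]],
         [0.0, 0.0, 0.0, 1.0]]

    M = mat_mul(mat_mul(C, P), K)
    X = [[point[0]], [point[1]], [point[2]], [1.0]]
    su, sv, s = (row[0] for row in mat_mul(M, X))
    # Divide by the homogeneous scale s to obtain pixel coordinates.
    return su / s, sv / s


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project((1.0, 2.0, 10.0), alpha_u=100.0, alpha_v=100.0, skew=0.0,
               u0=50.0, v0=40.0, f=1.0, R=identity, t=[0.0, 0.0, 0.0])
```

With an identity rotation and zero translation the camera frame coincides with the world frame, which makes the result easy to check by hand.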





Pre-Processing Steps

1. 2D Bounding Box
2. (Projection Source)
3. Ray Direction (for each pixel)
◦ R(t) = O + tD
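The pre-processing steps above could look roughly like the sketch below. It assumes the model vertices have already been projected to pixel coordinates and that the ray target is a known 3D point on the image plane; all function names are hypothetical.

```python
import math


def bounding_box_2d(projected, width, height):
    """Pixel-aligned bounding box of projected vertices, clamped to the image."""
    us = [p[0] for p in projected]
    vs = [p[1] for p in projected]
    u_min = max(0, math.floor(min(us)))
    v_min = max(0, math.floor(min(vs)))
    u_max = min(width - 1, math.ceil(max(us)))
    v_max = min(height - 1, math.ceil(max(vs)))
    return u_min, v_min, u_max, v_max


def ray_direction(origin, target):
    """Unit direction D from the projection source O towards a target point."""
    d = [t - o for o, t in zip(origin, target)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def ray_point(origin, direction, t):
    """Evaluate R(t) = O + tD."""
    return [o + t * d for o, d in zip(origin, direction)]


box = bounding_box_2d([(10.2, 5.7), (30.9, 40.1), (-3.0, 12.0)], 32, 64)
D = ray_direction((0.0, 0.0, 0.0), (0.0, 3.0, 4.0))
hit = ray_point((0.0, 0.0, 0.0), D, 5.0)
```

Restricting later work to the bounding box means only pixels that can actually see the model need a ray.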





Image Order Approach

• Ray Casting!

1 Thread for Each Pixel
• Thread ⇐⇒ Ray
• Thread loops over ALL triangles
◦ Tests intersections between ray and triangle
◦ Accumulates distances to source along ray path
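A sequential Python sketch of what one such thread does for its ray is given below. The slides do not say which ray/triangle test was used; the Möller–Trumbore test here is a swapped-in, commonly used choice, and the function names are assumptions. The direction is assumed to be a unit vector so that t is a distance from the source.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_t(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(tvec, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def thickness_along_ray(origin, direction, triangles):
    """Loop over ALL triangles, then sum the entry/exit distance pairs."""
    hits = []
    for v0, v1, v2 in triangles:
        t = ray_triangle_t(origin, direction, v0, v1, v2)
        if t is not None:
            hits.append(t)
    hits.sort()
    return sum(b - a for a, b in zip(hits[0::2], hits[1::2]))


front = ((-1, -1, 1), (3, -1, 1), (-1, 3, 1))
back = ((-1, -1, 3), (3, -1, 3), (-1, 3, 3))
thickness = thickness_along_ray((0, 0, 0), (0, 0, 1), [front, back])
```

Note that this inclusive test reports a hit for both triangles when a ray passes exactly through their shared edge, which is one plausible way the artifacts mentioned on the slides can arise.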


Image Order Approach – Problems

1. Many threads looping over many triangles

2. Useless intersection tests – heavy operations!

3. Artifacts – hard to take care of!

L3 Vertebra Model
• 776 vertices, 1552 triangles
• PA perspective: 266 × 138 pixels = 36708 threads
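The cost of the brute-force loop follows directly from the figures above. The short calculation below is only arithmetic on the quoted numbers, assuming every thread tests every triangle with no early exit.

```python
width, height = 266, 138      # PA view bounding box, in pixels
triangles = 1552              # L3 vertebra model

threads = width * height              # one thread (ray) per pixel
intersection_tests = threads * triangles
```

That is roughly 57 million ray/triangle tests for a single small DRR, most of which cannot hit anything.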




Image Order Approach – Results

• L3 vertebra model
• PA camera – 265 × 137 pixels

• GPU time only!
• Incomplete implementation

SLOW!


Object Order Approach

• Ray Casting!
• Threads spawned for each triangle
◦ Reverses the approach of the former algorithm!

1 Thread for Each Triangle
• Thread loops over each pixel covered by the triangle bounding box
◦ Tests intersections between ray and triangle
◦ Accumulates distances to source along ray path
• Concurrency problems!
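The object-order loop can be sketched sequentially as follows. For brevity the triangles are assumed to be already projected to 2D, and a 2D inside test on pixel centres stands in for the ray/triangle intersection test of the slides; the names are hypothetical.

```python
import math


def edge(a, b, p):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def pixels_covered(tri, width, height):
    """Pixels inside the triangle's bounding box whose centre is covered."""
    a, b, c = tri
    us = [a[0], b[0], c[0]]
    vs = [a[1], b[1], c[1]]
    u_lo, u_hi = max(0, math.floor(min(us))), min(width - 1, math.ceil(max(us)))
    v_lo, v_hi = max(0, math.floor(min(vs))), min(height - 1, math.ceil(max(vs)))
    covered = []
    for v in range(v_lo, v_hi + 1):
        for u in range(u_lo, u_hi + 1):
            centre = (u + 0.5, v + 0.5)
            w0, w1, w2 = edge(a, b, centre), edge(b, c, centre), edge(c, a, centre)
            # Accept either winding order.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((u, v))
    return covered


def object_order(triangles, width, height):
    """One 'thread' per triangle; each appends its id to the pixels it covers."""
    hits = {}
    for tri_id, tri in enumerate(triangles):
        for pixel in pixels_covered(tri, width, height):
            hits.setdefault(pixel, []).append(tri_id)
    return hits


tri = ((0, 0), (4, 0), (0, 4))
hits = object_order([tri, tri], 8, 8)
```

In this sequential version the per-pixel lists are trivially safe. On the GPU several triangle threads may write to the same pixel at once, which is the concurrency problem named above.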


Object Order Approach – Problems

1. Concurrency problems on pixel data.
◦ Fang Liu et al., FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects

2. Still many intersection tests

3. Artifacts still hard to avoid or correct

Figure: concurrent threads writing to the pixel buffer, each obtaining its slot with int index = atomicInc(sharedCounter);




Object Order Approach – Results

• L3 vertebra model
• PA camera – 265 × 137 pixels

• GPU time only!
• Incomplete implementation

SLOW!


Multi-depth Approach - Principle

Assume a Simplification
• Discard the Euclidean distance between intersections!
• Consider only the distance between fragments, along the depth axis!!

Figure: source, points P1, P2, P'1 and P'2, and depths d1 and d2
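The simplification can be illustrated by comparing the two measures for one pair of fragments. This sketch assumes the depth axis is the z axis of the camera frame; the function names and the sample points are made up for illustration.

```python
import math


def euclidean_thickness(p1, p2):
    """Exact distance between the two intersection points."""
    return math.dist(p1, p2)


def depth_thickness(p1, p2):
    """Simplified measure: difference of the depths only (z assumed as depth axis)."""
    return abs(p2[2] - p1[2])


p1, p2 = (0.0, 0.0, 10.0), (3.0, 0.0, 14.0)
exact = euclidean_thickness(p1, p2)
approx = depth_thickness(p1, p2)
```

The two values agree for rays parallel to the depth axis and differ more as the ray becomes oblique, so the depth-only measure never exceeds the Euclidean one.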


Multi-depth Approach - Pipeline

• Rasterization done using Scanline + Bresenham algorithm
◦ Filling convention avoids artifacts :) !

• Interpolation in integer interval
◦ \( \mathrm{Depth} = \dfrac{Z - Z_{min}}{Z_{max} - Z_{min}} \times \mathrm{INT\_MAX} \)

• Saving depth in the pixel array raises concurrency problems (again)!
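The integer depth mapping above can be sketched as follows. INT_MAX is taken to be the 32-bit signed maximum and the fractional part is truncated; both are assumptions, as the slides do not state the rounding rule.

```python
INT_MAX = 2**31 - 1  # assumed 32-bit signed integer range


def quantize_depth(z, z_min, z_max):
    """Map a depth in [z_min, z_max] to an integer in [0, INT_MAX]."""
    if z_max <= z_min:
        raise ValueError("z_max must be greater than z_min")
    # int() truncates towards zero, matching a float-to-int cast.
    return int((z - z_min) / (z_max - z_min) * INT_MAX)


near = quantize_depth(10.0, 10.0, 20.0)
middle = quantize_depth(15.0, 10.0, 20.0)
far = quantize_depth(20.0, 10.0, 20.0)
```

Integer depths are what make the integer atomic operations used in the following slides applicable.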


Multi-depth Approach - Depth Array Ordering

atomicMin inserts in the right place

initializeDepthArrays(MAX_INTEGER)
Znew ← interpolateDepth()
for i = 0 to DEPTH_ARRAY_SIZE − 1 do
    Zold ← atomicMin(&(getPixelDepthArray(u, v, i)), Znew)
    if Zold == MAX_INTEGER then
        break
    end if
    Znew ← max(Znew, Zold)
end for

• Fang Liu et al., FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects
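The insertion loop can be traced with a sequential Python emulation. Here `atomic_min` only mimics the return-old-value behaviour of CUDA's `atomicMin`; it is not atomic, and the helper names are assumptions.

```python
MAX_INTEGER = 2**31 - 1  # sentinel marking an empty slot


def atomic_min(array, index, value):
    """Emulate atomicMin: store min(old, value) and return the old value."""
    old = array[index]
    array[index] = min(old, value)
    return old


def insert_depth(depth_array, z_new):
    """Insert z_new so that depth_array stays sorted in ascending order.

    Returns False when the array is full and the largest depth is dropped.
    """
    for i in range(len(depth_array)):
        z_old = atomic_min(depth_array, i, z_new)
        if z_old == MAX_INTEGER:
            return True           # the slot was empty, insertion is done
        # Carry the larger of the two values on to the next slot.
        z_new = max(z_new, z_old)
    return False


depths = [MAX_INTEGER] * 4
for z in (5, 2, 9, 1):
    insert_depth(depths, z)
```

Each fragment may touch every slot of its pixel's array, which is why more stored depths mean more atomic operations.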


Multi-depth Approach - Results

• Best time:
◦ 202 × 132 pixels
◦ GPU + CPU time!
◦ Performance with and without DRR transfer to host!

BETTER!

Multi-depth Optimization

• Multi-depth allows for an ordered set of depths
◦ More depths =⇒ more atomicMin() calls

We can postpone depth ordering...

index ← atomicInc(&counter, INT_MAX)
depthArray[index] ← Znew // RAW-hazard free!!!!

• depthArray has all the depth values;
◦ Ordering can be done in a post-processing kernel!!!
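The deferred-ordering idea can be sketched on the host with Python threads. A lock-protected counter stands in for `atomicInc` (the wrap-around at the limit that CUDA's `atomicInc` performs is not modelled), and a plain sort stands in for the post-processing kernel; the class and function names are assumptions.

```python
import threading


class AtomicCounter:
    """Lock-based stand-in for a GPU atomic counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        """Return the old value and add one, as atomicInc would."""
        with self._lock:
            old = self._value
            self._value += 1
            return old

    @property
    def value(self):
        with self._lock:
            return self._value


def store_depths(depths, capacity):
    """Each thread reserves a unique slot, then writes its depth unordered."""
    counter = AtomicCounter()
    depth_array = [None] * capacity

    def worker(z_new):
        index = counter.increment()   # unique index, so no two writes collide
        depth_array[index] = z_new

    threads = [threading.Thread(target=worker, args=(z,)) for z in depths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return depth_array[:counter.value]


stored = store_depths([7, 3, 9, 1], capacity=8)
ordered = sorted(stored)   # ordering postponed to a post-processing step
```

Each fragment now costs a single atomic operation, and the order in which slots are filled does not matter because sorting happens afterwards.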


Multi-depth Optimization

Figure: concurrent threads writing to the pixel buffer, each obtaining its slot with int index = atomicInc(sharedCounter);


Multi-depth Optimization – Results

• A-buffer Scheme Versus GLSL Solution
• 202 × 132 pixels


Multi-depth Optimization – Results

Better than Current Solution





Conclusion

• CUDA implementations for DRR extraction
◦ Both pre-processing and main computation tasks
◦ Artifact-free

• Single geometry pass
• Shared memory model
◦ May be adapted to other technologies

• Final implementation shows better performance than GLSL


Future Work

There’s a Big Chart to Fill Up...


Future Work

• Still some artifacts
• Memory operation optimizations
• Comparisons with other implementations, other geometry models

• Build a DRR generation library
◦ possibly an open-source project

• Participation in IJUP’11
• Paper preparation for VIPIMAGE 2011. Abstract deadline: 15th March.


Thank You for Listening!
Ask Away!

