gpus in ct reconstruction - meetupfiles.meetup.com/1774957/hpc meetup - gpus in ct...
TRANSCRIPT
![Page 1: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/1.jpg)
GPUs in CT Reconstruction
Logan Johnson
![Page 2: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/2.jpg)
Agenda
• Introduction
• CT Essentials
• Forward Projection
• GPU Programming 101
• GPU Optimization of Forward Projector
![Page 3: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/3.jpg)
This Guy
Professional
• BS Bioengineering - Clemson University (2009)
• 5 years at GE Healthcare, CT Recon
• Just started at NeuroLogica, Mobile CT Recon
• Algorithm design and optimization
– CPU, GPU, and Xeon Phi architectures
– CUDA and OpenCL
Unprofessional
• Runner, writer, and digital artist
• Lover of “coffeine” and scotch
Glenfiddich distillery in Dufftown, Scotland
![Page 4: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/4.jpg)
CT ESSENTIALS
![Page 5: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/5.jpg)
What is CT?
Great for:
• 3D Imaging
• Trauma/ER
• Cardiac
• Perfusion
• Hard tissues
• Guided surgery
Biggest drawback:
• Irradiates patient (and potential use of contrast agent)
![Page 6: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/6.jpg)
What is CT really?
https://www.youtube.com/watch?v=2CWpZKuy-NE
![Page 7: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/7.jpg)
CT Reconstruction in a Nutshell
SCAN → RAW PROJECTIONS → CORRECT → SINOGRAM → RECONSTRUCT (FBP or Iterative) → IMAGES
![Page 8: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/8.jpg)
Filtered Back Projection
Fourier + Radon transform based algorithm
A CT scan is like a Radon transform of a patient. The goal is to apply the inverse Radon transform (FBP) to recover the anatomy.
F BP
![Page 9: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/9.jpg)
Core FBP Reconstruction Math-magics
Step: Simplified Math
• Raw View: vout = raw scanner data
• Calibration: vout = vin * gain + offsets
• Beer’s Law: vout = -ln(vin/vref)
• Filter: vout = conv1D(vin, rampFilter)
• Rebinning: vout = interp2D(vin-100, vin, vin+100)
Generally, core steps are easily parallelizable algorithms and projections can be processed independently of one another (except rebin).
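The per-projection steps can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names, the three-tap stand-in kernel, and the sample values are hypothetical, and the kernel is not the ramp filter of any real scanner. Rebinning is left out because it needs neighbouring views.

```python
import math

def calibrate(v_in, gain, offset):
    """Calibration: vout = vin * gain + offsets, element by element."""
    return [v * g + o for v, g, o in zip(v_in, gain, offset)]

def beers_law(v_in, v_ref):
    """Beer's Law: vout = -ln(vin / vref)."""
    return [-math.log(v / v_ref) for v in v_in]

def conv1d(v_in, kernel):
    """Filter: same-length 1D convolution, zero padded at the edges."""
    half = len(kernel) // 2
    out = []
    for i in range(len(v_in)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + half - k  # convolution flips the kernel
            if 0 <= j < len(v_in):
                acc += w * v_in[j]
        out.append(acc)
    return out

def process_projection(raw, gain, offset, v_ref, kernel):
    """Chain the steps for one projection; no other projection is needed."""
    return conv1d(beers_law(calibrate(raw, gain, offset), v_ref), kernel)

# Hypothetical values: unit gain, zero offset, identity "filter".
projection = process_projection(
    raw=[1.0, 2.0, 4.0],
    gain=[1.0, 1.0, 1.0],
    offset=[0.0, 0.0, 0.0],
    v_ref=4.0,
    kernel=[0.0, 1.0, 0.0],
)
```

Because each call touches a single projection, many projections could be handed to independent workers, which is the property that makes these steps a good fit for a GPU.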
![Page 10: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/10.jpg)
Back Projection
Final step is Back Projection, which is also easily parallelizable but requires many projections.
![Page 11: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/11.jpg)
More Reasons for using GPU
Reasons:
• Off-the-shelf technology = cost savings
• Much better performance than x86/64
• Easier to program/develop than FPGA
• Floating point performance > FPGA
Drawbacks:
• Short GPU life cycle = more cost in V&V, inventory
Full-body scan of 6’ patient ready in < 5 minutes
![Page 12: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/12.jpg)
Iterative Reconstruction
Improvements in HPC technology enable more sophisticated reconstruction algorithms
GE Veo Model Based Iterative Reconstruction (MBIR) on BladeCenter
![Page 13: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/13.jpg)
Iterative Reconstruction
Vendor algorithms shown: GE Veo, Siemens IRIS, Siemens SAFIRE, Philips iDose
Algorithms are generally much more complex than FBP, therefore slower
![Page 14: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/14.jpg)
Iterative Reconstruction
You get what you compute for.
![Page 15: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/15.jpg)
Iterative Reconstruction
Next big challenge in CT imaging for GPUs – Veo quality at SAFIRE/iDose speeds
![Page 16: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/16.jpg)
FORWARD PROJECTION
![Page 17: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/17.jpg)
What is Forward Projection?
SCAN → CORRECTED PROJECTIONS
Forward Project → RE-PROJECTIONS
Forward projection is like simulating a CT scan. The inputs to this simulation are CT images. The re-projections should be similar to the original corrected projections from which those input images were reconstructed.
![Page 18: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/18.jpg)
What is Forward Projection?
![Page 19: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/19.jpg)
Modeling X-Ray Transmission
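Real System
$-\ln\left(\frac{\lambda_{out}}{\lambda_{in}}\right) = \sum_i a_i l_i$
Intensity, $\lambda$, decreases as beam passes through object ($\lambda_{in}$ entering, $\lambda_{out}$ leaving)
FP with CT Image Input
$\Sigma_{out} = \sum_i a_i l_i$
Sum of attenuations, $\Sigma$, increases as ray passes through image (from $\Sigma_{in}$ to $\Sigma_{out}$)
Beer (-Lambert)’s Law!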
![Page 20: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/20.jpg)
Modeling X-Ray Transmission
$\Sigma_{out} = \sum_i a_i l_i$
Summing attenuation values
For each row, compute attenuation by interpolating between pixels at intersection with ray. Add these to an accumulator, and multiply the result by the geometric scaling factor, 𝑙𝑖, since this value is constant for all rows for this particular ray.
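Example with two adjacent rows, pixel values (5, 1) and (3, 5):
$a_n = 5 \cdot 0.5 + 1 \cdot 0.5 = 3$
$a_{n+1} = 3 \cdot 0.2 + 5 \cdot 0.8 = 4.6$
$l_i = \frac{h}{\sin(\theta)}$, with $h$ and $\theta$ as labelled in the diagram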
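The row-by-row accumulation described above can be reproduced with a small Python sketch. The function names are hypothetical, and the crossing positions (0.5 and 0.8 of the way between the two pixels) are read off the worked numbers rather than computed from a real ray.

```python
import math

def interpolate_row(row, x):
    """Linearly interpolate a row of pixel values at fractional column x."""
    j = int(math.floor(x))
    j = max(0, min(j, len(row) - 2))  # keep j and j + 1 inside the row
    w = x - j
    return row[j] * (1.0 - w) + row[j + 1] * w

def project_rows(rows, crossings, h, theta):
    """Sum one interpolated sample per row, then scale once by h / sin(theta)."""
    total = 0.0
    for row, x in zip(rows, crossings):
        total += interpolate_row(row, x)
    return total * h / math.sin(theta)

a_n = interpolate_row([5.0, 1.0], 0.5)    # 5 * .5 + 1 * .5
a_n1 = interpolate_row([3.0, 5.0], 0.8)   # 3 * .2 + 5 * .8

# A ray at 90 degrees crosses each row over exactly one row height h.
line_integral = project_rows([[5.0, 1.0], [3.0, 5.0]], [0.5, 0.8],
                             h=1.0, theta=math.pi / 2)
```

The scaling factor is applied once after the loop because it is the same for every row the ray crosses, so the inner loop only interpolates and adds.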
![Page 21: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/21.jpg)
Modeling X-Ray Transmission
“Walking” across just rows: Just two samples?
“Walking” across rows OR columns: That’s more like it.
Choose between sampling patterns with “if |cos(𝜃)| > |sin(𝜃)|”, where 𝜃 is the ray angle
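A minimal sketch of that choice, assuming the ray angle is measured from the image x-axis so that a mostly horizontal ray (|cos| > |sin|) steps across columns. The function name and the convention are assumptions; the slides only give the comparison itself.

```python
import math

def choose_walk(theta, h=1.0):
    """Pick the walking direction and the path length per step for a ray."""
    c = abs(math.cos(theta))
    s = abs(math.sin(theta))
    if c > s:
        # Mostly horizontal ray: it crosses every column, so walk columns.
        return "columns", h / c
    # Mostly vertical ray: it crosses every row, so walk rows.
    return "rows", h / s

horizontal = choose_walk(0.0)
vertical = choose_walk(math.pi / 2)
shallow = choose_walk(math.radians(10.0))
```

With this rule the step length never exceeds about 1.41 pixel sizes, so a shallow ray gets one sample per column instead of the handful of samples it would get from walking rows only.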
![Page 22: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/22.jpg)
Modeling a CT Scanner
[Diagram labels: X-Ray Source (Tube); source to isocenter; source to detector; detector rows; detector channels; detector channel radial width; detector row width]
X-ray source and detector rotate around the isocenter. Detector channels are equiangularly spaced with respect to the source. Rows are all the same width.
CT Detector
![Page 23: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/23.jpg)
Modeling a CT Scanner
One rotation, 21 equally spaced views, View 0 through View 20 (not a realistic scan)
[Diagram: view angle axis from -180° to 180°]
Two key parameters – views (exposures) per rotation and rotation speed.
![Page 24: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/24.jpg)
Ray Driven Cone Beam
Forward Projection
One rotation per N seconds, M equally spaced views. Want to compute projection for each ray at each view location.
[Diagram: View 0, View 1, View 2, … View M-2, View M-1, View M around the rotation (ROTATE). In-plane geometry (x, y) and out-of-plane geometry (-z to +z). Labels: ray channel direction; ray row direction; walk across IMAGE COLUMNS; walk across IMAGE ROWS.]
3D Ray Tracing!
Total output elements = rows * channels * views
![Page 25: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/25.jpg)
GPU PROGRAMMING 101
![Page 26: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/26.jpg)
Programming GPUs
CUDA
• Compute Unified Device Architecture
• NVidia proprietary
• GPU only
• Block size and grid size
OpenCL
• Open Computing Language
• Khronos Group open standard
• AMD, NVidia, Intel, Altera, Xilinx
• GPU, CPU, Phi, FPGA, others (?)
• Global work size and work group size
Very similar paradigms, and both are C/C++ APIs. Comparing CUDA to OpenCL is like comparing Java to C++.
![Page 27: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/27.jpg)
CUDA Programming Model
Key concepts:
• SIMT – Single Instruction Multiple Threads
• 32 threads / warp
• Threads are grouped into blocks
• One warp's worth of threads is executed in parallel per compute unit
• Each warp executes the same instruction at the same time – lock-step execution
• Branch divergence occurs when threads within a warp choose different logical paths
![Page 28: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/28.jpg)
Architecture
NVidia Maxwell (GM204)
32 cores/SM for 1 warp
![Page 29: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/29.jpg)
Architecture
NVidia Maxwell (GM204)
32 cores/SM for 1 warp
298 mm²
![Page 30: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/30.jpg)
Memory Architecture
Access latency (values from the diagram): 1 cycle, 1 to 32 cycles, 400 to 600 cycles
Avoid global memory accesses, try to use shared memory.
![Page 31: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/31.jpg)
Performance Optimization
Tools
• NVidia NVVP
• AMD CodeXL
Knowledge
• GPU Gems
• AMD/NVidia Programming Guides
Experience
Creativity (borderline madness)
![Page 32: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/32.jpg)
GPU OPTIMIZATION OF FORWARD PROJECTOR
![Page 33: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/33.jpg)
Introduction
![Page 34: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/34.jpg)
Experimental Setup
System Configuration
• CUDA 6.5
• NVidia K20m
• Visual Studio 2012
Projector Configuration
• Joseph et al. 1982 Projection Model
• 32 rows
• 800 views per rotation
• 1 rotation per second
• 32 mm/s movement in Z
• RTK 12 CPU Reference – 1473 seconds
CT Scan Case
• NIH-NLM Visible Human Body Project, Frozen Female
• 512 (x) 512 (y) 1784 (z) image matrix size
Reconstruction Toolkit (RTK) by Creatis, MGH, et al. also contains an excellent example of this algorithm in CUDA.
![Page 35: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/35.jpg)
Performance Goal
Acquires 1 rotation per second
So…
Processing at least 1 rotation per second will ensure FP is not the pipeline bottleneck
Have a performance goal before you begin designing! (Even if it’s roughly 1400x)
![Page 36: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/36.jpg)
Naïve Implementation
Priorities:
• Needs to produce correct results
• Write GPU friendly code
• Avoid big if conditions
Output-driven parallelism (one thread / output element): threads t0, t1, t2, t3 map to output elements d0, d1, d2, d3
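One way to picture output-driven parallelism is as a mapping from a flat thread index to one output element. The sketch below assumes channels vary fastest, then rows, then views; that ordering, the function name, and the channel count are illustrative assumptions, not taken from the kernel in the talk.

```python
def output_index_to_coords(idx, rows, channels):
    """Map a flat output (thread) index to (view, row, channel)."""
    view, remainder = divmod(idx, rows * channels)
    row, channel = divmod(remainder, channels)
    return view, row, channel

# 32 rows and 800 views come from the experiment setup; 4 channels is a
# deliberately tiny, hypothetical value to keep the numbers readable.
rows, channels, views = 32, 4, 800
total_outputs = rows * channels * views

first = output_index_to_coords(0, rows, channels)
last = output_index_to_coords(total_outputs - 1, rows, channels)
```

Each index owns exactly one output value, so no two threads ever write to the same location and no synchronization between them is needed.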
![Page 37: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/37.jpg)
Trilinear Interpolation
Z-coordinate computation
Weight, write final result to global memory once
Kernel Source Code
The Inner Loop – executed at most 512 times!
Somewhat redundant, but good for prototyping
![Page 38: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/38.jpg)
Determine if walking across rows or columns
Compute ray change in x and y accordingly
Compute line integral weighting
Kernel Source Code
Projection loops don’t need to be inside if condition.
1. Avoids unnecessary and costly warp divergence
2. Eliminates duplicate code
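The structure can be sketched in 2D Python: the branch only picks parameters, and a single shared loop does the projection. This is a simplified stand-in, not the kernel from the talk; the function name, the pixel-center convention, and the use of small accessor functions in place of index strides are all assumptions.

```python
import math

def forward_project_ray(image, x0, y0, theta, h=1.0):
    """Line integral along the ray through (x0, y0) at angle theta.

    image is a list of rows; pixel centers sit at integer (column, row)
    coordinates. Samples that fall outside the image contribute nothing.
    """
    n_rows, n_cols = len(image), len(image[0])
    c, s = math.cos(theta), math.sin(theta)

    # The branch selects parameters only; no loop lives inside it.
    if abs(c) > abs(s):
        n_steps, n_across = n_cols, n_rows
        slope, start, origin = s / c, y0, x0
        weight = h / abs(c)
        fetch = lambda step, k: image[k][step]   # walk columns
    else:
        n_steps, n_across = n_rows, n_cols
        slope, start, origin = c / s, x0, y0
        weight = h / abs(s)
        fetch = lambda step, k: image[step][k]   # walk rows

    total = 0.0
    for step in range(n_steps):                  # one loop for both cases
        pos = start + (step - origin) * slope
        if pos < 0.0 or pos > n_across - 1:
            continue
        k = min(int(pos), n_across - 2)
        w = pos - k
        total += fetch(step, k) * (1.0 - w) + fetch(step, k + 1) * w
    return total * weight

ones = [[1.0] * 4 for _ in range(4)]
across = forward_project_ray(ones, x0=0.0, y0=1.5, theta=0.0)
down = forward_project_ray(ones, x0=2.0, y0=0.0, theta=math.pi / 2)
```

In a GPU kernel the same idea would typically select step increments or strides before the loop, so threads that chose different directions still execute the same loop body.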
![Page 39: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/39.jpg)
Results of Naïve Implementation
One rotation of data in a blazing 433 seconds!!!
For this much of the anatomy
![Page 40: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/40.jpg)
Performance Profiling
NVidia Visual Profiler
Very basic profiling on a 7 minute application took overnight to complete. Try running a smaller but representative case.
![Page 41: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/41.jpg)
Performance Profiling
Complete profiling took 10 minutes for 16 of 800 views. Overflow issues still persist, but there is sufficient information to begin optimizing.
![Page 42: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/42.jpg)
Guided Performance Analysis
Helpful tool to run the most relevant profiling experiments for your kernel. Took five minutes.
![Page 43: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/43.jpg)
Register Usage
Registers/thread mostly driven by number of variables in kernel
Executive summary on kernel performance
![Page 44: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/44.jpg)
Register Usage
This function does ~30 loads and hits peak register usage.
nvdisasm gives some insight into what is using up all the registers
![Page 45: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/45.jpg)
Register Usage
Perhaps this huge structure (~30 elements) is causing a lot of register spilling in the innermost loop?
![Page 46: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/46.jpg)
Register Usage
Remove covertImageCoordinatesToSpace from inner loop with algebraic factorization
Before: 433 s/rotation, 67 registers
After: 27.8 s/rotation, 65 registers
Yet we still need 60+ registers.
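The talk does not show what covertImageCoordinatesToSpace computes, so the sketch below uses a hypothetical stand-in (origin plus index times spacing) purely to illustrate the kind of factorization involved: when the image index advances linearly along the ray, the converted position does too, and the conversion can be done once per ray instead of once per sample.

```python
def image_to_space(ix, iy, origin, spacing):
    """Hypothetical stand-in for a per-sample coordinate conversion."""
    return (origin[0] + ix * spacing[0], origin[1] + iy * spacing[1])

def positions_naive(n, ix0, iy0, dix, diy, origin, spacing):
    """Call the conversion for every sample inside the loop."""
    return [image_to_space(ix0 + k * dix, iy0 + k * diy, origin, spacing)
            for k in range(n)]

def positions_factored(n, ix0, iy0, dix, diy, origin, spacing):
    """Convert the start point and the step once, then just add in the loop."""
    x0, y0 = image_to_space(ix0, iy0, origin, spacing)   # once per ray
    dx, dy = dix * spacing[0], diy * spacing[1]           # once per ray
    return [(x0 + k * dx, y0 + k * dy) for k in range(n)]

args = dict(n=512, ix0=3.0, iy0=0.0, dix=0.25, diy=1.0,
            origin=(-250.0, -250.0), spacing=(0.5, 0.5))
naive = positions_naive(**args)
factored = positions_factored(**args)
```

Both versions produce the same positions up to rounding, but the factored loop no longer needs the full set of conversion parameters inside the innermost loop.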
![Page 47: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/47.jpg)
Register Usage
Since we’re optimizing the inner loop….
27.8 s/rotation 65 registers
16.2 s/rotation 74 registers
Simplified calculations and introduced pitched memory (more on this later). What else changed that could have driven up register usage?
![Page 48: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/48.jpg)
Register Usage
Went from a pointer to a struct
Passing big structs by value, not by reference, to CUDA kernel is apparently a bad idea.
27.8 s/rotation, 65 registers
16.2 s/rotation, 74 registers
12.5 s/rotation, 48 registers
![Page 49: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/49.jpg)
Occupancy
10.3 s/rotation 12.5 s/rotation
Changing block size (for this algorithm) is simple and can quickly yield improvements in device utilization. Using shared memory might make such tweaks more challenging.
64 threads/block 128 threads/block
![Page 50: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/50.jpg)
Occupancy
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
What can we do to further improve on 63% occupancy?
(Occupancy = active warps per SM / maximum warps per SM * 100%)
But is it worth it?
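A rough estimate of occupancy, taken here as active warps per SM divided by the maximum warps per SM, can be made from register usage and block size alone. The sketch below is a simplified model: the default limits are assumed values for a compute capability 3.5 device such as the K20m, and it ignores shared memory and register allocation granularity, so treat the results as approximate.

```python
def occupancy_percent(regs_per_thread, threads_per_block,
                      regs_per_sm=65536, max_warps_per_sm=64,
                      max_blocks_per_sm=16, warp_size=32):
    """Estimate occupancy (%) limited by registers, warps, and block count."""
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    resident_blocks = min(regs_per_sm // regs_per_block,
                          max_warps_per_sm // warps_per_block,
                          max_blocks_per_sm)
    return 100.0 * resident_blocks * warps_per_block / max_warps_per_sm

with_128_threads = occupancy_percent(regs_per_thread=48, threads_per_block=128)
with_64_threads = occupancy_percent(regs_per_thread=48, threads_per_block=64)
```

Under these assumptions, 48 registers per thread with 128 threads per block gives 62.5%, close to the 63% quoted above, and the limiting resource is the register file. With 64 threads per block the block-count limit takes over and the estimate drops to 50%.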
![Page 51: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/51.jpg)
Removing Expensive Instructions
IPC = Instructions Per Clock. Higher is faster. Expected values are from the CUDA Programming Guide; measured values are from my laptop, a Quadro K4100M (3.0).

| Operation | Expected IPC | Measured IPC |
| --- | --- | --- |
| float32 add/multiply | 6 | 5.26 |
| float32 divide | ? | 3.41 |
| float32 rsqrtf() | 1 | 1.2 |
| float32 1.0f/rsqrtf() | ? | 1.1 |
| float32 sqrtf() | ? | 1.08 |
| int32 add | 5 | 4.01 |
| int32 multiply | 1 | 1.09 |
10.3 s/ rotation, 48 registers per thread
9.00 s/rotation, 39 registers per thread
Simple factorization removed 512 sqrt computations / thread, some less expensive multiplications, and some variables
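The exact expression that was factored is not shown, so the following is a hypothetical illustration of the pattern: if every sample along a ray is multiplied by the same square-root term, the term can be pulled out of the loop and evaluated once per ray.

```python
import math

def weighted_sum_naive(samples, dx, dy):
    """Evaluate the square root for every sample in the inner loop."""
    total = 0.0
    for a in samples:
        total += a * math.sqrt(dx * dx + dy * dy)
    return total

def weighted_sum_factored(samples, dx, dy):
    """Sum first, then apply the square root once per ray."""
    total = 0.0
    for a in samples:
        total += a
    return total * math.sqrt(dx * dx + dy * dy)

samples = [0.01 * (i % 7) for i in range(512)]   # made-up attenuation values
naive_sum = weighted_sum_naive(samples, dx=0.3, dy=1.0)
factored_sum = weighted_sum_factored(samples, dx=0.3, dy=1.0)
```

For a 512-sample ray this replaces 512 square roots and 512 multiplications with one of each, at the cost of a slightly different rounding order.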
![Page 52: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/52.jpg)
Assessing Our Progress
Forward Projector Performance (time per rotation [s], log scale from 1 to 1000)
• Naïve: 433
• removed struct from loop: 27.78
• removed clamping: 16.2
• struct pointer: 12.5
• block size optimization: 10.3
• sqrt removal: 8.995
About 50x faster, but still need another 10x
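The arithmetic behind that summary is easy to check from the charted timings. The stage labels and times below are the ones on the chart; the 1 second per rotation target is the performance goal stated earlier.

```python
timings = [
    ("naive", 433.0),
    ("removed struct from loop", 27.78),
    ("removed clamping", 16.2),
    ("struct pointer", 12.5),
    ("block size optimization", 10.3),
    ("sqrt removal", 8.995),
]
target_seconds = 1.0  # goal: process one rotation per second

# Speedup of each stage relative to the stage before it.
step_speedups = [
    (name, previous / current)
    for (_, previous), (name, current) in zip(timings, timings[1:])
]

overall_speedup = timings[0][1] / timings[-1][1]
remaining_speedup = timings[-1][1] / target_seconds

for name, factor in step_speedups:
    print(f"{name}: {factor:.2f}x")
print(f"overall: {overall_speedup:.1f}x, still needed: {remaining_speedup:.1f}x")
```

The overall factor comes out a little under 50x, and roughly 9x more is needed to reach one rotation per second.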
![Page 53: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/53.jpg)
First Pass Optimization
This might be a good point to profile an entire rotation
Where we started 433 seconds / rotation
Where we arrived 9 seconds / rotation
![Page 54: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/54.jpg)
High-level Profile for Full Rotation
16 Projections (.5 s/rotation)
800 Projections (9 s/rotation)
The 16 projection experiment isn’t representative of the full experiment. Why the 18x difference?
![Page 55: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/55.jpg)
What’s different?
One rotation per N seconds, M equally spaced views. Want to compute projection for each ray at each view location.
[Diagram: View 0, View 1, View 2, … View M-2, View M-1, View M around the rotation (ROTATE). In-plane geometry (x, y) and out-of-plane geometry (-z to +z). Labels: ray channel direction; ray row direction; walk across IMAGE COLUMNS; walk across IMAGE ROWS.]
3D Ray Tracing!
Processing more views means moving further in Z and changing rotation angles.
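To see how far a given view index moves the geometry, the sketch below computes the time, gantry angle, and table position of a view. The defaults are the projector configuration from the experimental setup (800 views per rotation, 1 rotation per second, 32 mm/s in Z); the function name and the zero starting angle and position are assumptions.

```python
import math

def view_geometry(view, views_per_rotation=800, seconds_per_rotation=1.0,
                  table_speed_mm_s=32.0, start_angle=0.0, start_z_mm=0.0):
    """Return (time [s], gantry angle [rad], table position [mm]) of a view."""
    t = view * seconds_per_rotation / views_per_rotation
    angle = (start_angle
             + 2.0 * math.pi * view / views_per_rotation) % (2.0 * math.pi)
    z = start_z_mm + table_speed_mm_s * t
    return t, angle, z

t16, angle16, z16 = view_geometry(16)
t400, angle400, z400 = view_geometry(400)
```

With these settings the first 16 views span only 7.2 degrees of rotation and 0.64 mm of table travel, while view 400 sits half a rotation and 16 mm further along, so a short test run samples a very narrow slice of the geometry.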
![Page 56: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/56.jpg)
A Little Design of Experiment
[Figure: Adjusting Total Number of Views. Execution Time [ms] and % load efficiency vs. Number of Views.]
[Figure: Adjusting Total Number of Views. Normalized Execution Time [ms] and % load efficiency vs. Number of Views.]
If table position and gantry angle are held constant, the number of views has an expected linear impact on performance.
![Page 57: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/57.jpg)
A Little Design of Experiment
[Figure: Adjusting First View Location. Execution Time [ms] and % load efficiency vs. First View Location [mm].]
[Figure: Adjusting Initial View Angle. Execution Time [ms] and % load efficiency vs. First View Angle [radians].]
Adjusting table position or gantry angle with a fixed number of views causes performance loss. Why?
![Page 58: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/58.jpg)
First View Location
[Figure: Adjusting First View Location. Execution Time [ms] and % load efficiency vs. First View Location [mm], with first-view positions 1 through 6 marked.]
The original 16 view test case (at position 1) wasn’t projecting much – many of its rays were completely outside of the image volume.
Positions 4-6 are more representative of actual performance.
First View
![Page 59: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/59.jpg)
Initial Rotation Angle
[Figure: Adjusting Initial View Angle. Execution Time [ms] and % load efficiency vs. First View Angle [radians].]
Load efficiency and execution time vary drastically with rotation angle. nvvp suggests that we check if our memory accesses are coalesced.
![Page 60: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/60.jpg)
0 1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48
Memory Coalescing 101
For each row (row 0, row 1, row 2, row 3, …), each thread will in parallel project a pixel.
• Threads will project each row in parallel
• For row 2, the threads will collectively need to read memory elements 15, 16, 17, 18, and 19 at the same time.
• Since these elements are adjacent, the access is said to be coalesced.
• How coalesced depends on alignment, the total number of bytes read, etc.
• Best case, these elements can be read in one transaction
![Page 61: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/61.jpg)
0 1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48
Memory Not Coalescing 101
• Threads will project each column in parallel.
• For column 4, the threads will collectively need to read memory elements 11, 18, 25, 32, and 39 at the same time.
• Since these elements are NOT adjacent, the accesses are likely not coalesced.
• How not coalesced depends on alignment, how far apart the elements are, etc.
• Worst case, these elements will be read in five transactions.
[Figure: the 7 x 7 grid of memory elements with its columns labeled column 6, column 5, column 4, column 3, …. For each column, each thread will in parallel project a pixel.]
The projector rotates 360 degrees, so our accesses will have periodically bad efficiency!
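The two coalescing slides can be checked with a little host-side arithmetic. The sketch below is not GPU code and not the projector from this talk: it only counts how many aligned memory segments a group of simultaneous reads touches. The helper names (`transactions`, `row_read`, `column_read`), the 512-wide image, and the segment size of 32 floats (128 bytes) are all assumptions made for illustration; real transaction sizes depend on the GPU generation.

```python
def transactions(indices, elems_per_segment):
    """Count the distinct aligned memory segments touched by one group of reads."""
    return len({i // elems_per_segment for i in indices})


def row_read(row, cols, width):
    """Flat row-major indices read when threads walk along one row."""
    return [row * width + c for c in cols]


def column_read(col, rows, width):
    """Flat row-major indices read when threads walk along one column."""
    return [r * width + col for r in rows]


# Toy 7 x 7 image from the slides: five threads reading inside row 2 / column 4.
toy_row = row_read(2, range(1, 6), 7)      # [15, 16, 17, 18, 19]
toy_col = column_read(4, range(1, 6), 7)   # [11, 18, 25, 32, 39]

# Hypothetical 512-wide image, assuming 32 floats per aligned segment.
big_row = row_read(2, range(1, 6), 512)
big_col = column_read(4, range(1, 6), 512)

print("row read   :", transactions(big_row, 32), "segment(s)")
print("column read:", transactions(big_col, 32), "segment(s)")
```

With these assumed sizes the row read fits in a single segment while the column read touches five, which matches the best and worst cases on the slides. On the toy image with an assumed 8-element segment, the row read straddles a segment boundary, which is the alignment caveat in the bullets.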
![Page 62: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/62.jpg)
Some Thoughts on Design of Experiments
[Figure: Adjusting First View Location. Execution time [ms] and % load efficiency plotted against first view location [mm], from 0 to 100.]
[Figure: Adjusting Initial View Angle. Execution time [ms] and % load efficiency plotted against first view angle [radians], from 0 to 8.]
But what are we going to do about that coalescing problem?
Make sure to test all key variables while optimizing to save on embarrassment later on.
![Page 63: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/63.jpg)
Revisiting the sampling problem
“Walking” across just rows
“Walking” across rows OR columns
Choose between sampling patterns with “if |cos(𝜃)| > |sin(𝜃)|”, where 𝜃 is the ray angle.
Just two samples?
That’s more like it.
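A minimal sketch of that test, assuming 𝜃 is measured from the x axis so that the ray direction is (cos 𝜃, sin 𝜃). The function names `walk_axis` and `unit_step` are hypothetical, and the slides do not say which branch the talk calls "rows" and which "columns", so the sketch labels the branches by axis instead.

```python
import math


def walk_axis(theta):
    """Pick the axis to step along: 'x' for mostly horizontal rays, else 'y'."""
    return "x" if abs(math.cos(theta)) > abs(math.sin(theta)) else "y"


def unit_step(theta):
    """Per-sample step (dx, dy) that advances exactly one pixel on the walk axis."""
    c, s = math.cos(theta), math.sin(theta)
    if abs(c) > abs(s):
        # Mostly horizontal: one column per sample, at most one row of drift.
        return math.copysign(1.0, c), s / abs(c)
    # Mostly vertical: one row per sample, at most one column of drift.
    return c / abs(s), math.copysign(1.0, s)


for theta in (0.0, 0.5, 1.2, math.pi / 2):
    print(f"theta={theta:.2f}  walk along {walk_axis(theta)}  step={unit_step(theta)}")
```

Because the step along the other axis never exceeds one pixel, each sample falls between just two neighbouring pixels, which is the "just two samples" observation on the slide.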
![Page 64: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/64.jpg)
Transposed Matrix
Instead of walking across columns, walk across rows of a transposed image
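The idea can be illustrated on the toy 7 x 7 image from the coalescing slides. This is a host-side sketch only; `transpose_flat` is a hypothetical helper, and the talk does not show how or where the transposed copy was actually built.

```python
def transpose_flat(image, n_rows, n_cols):
    """Return the transpose of a flat row-major image, also flat and row-major."""
    return [image[r * n_cols + c] for c in range(n_cols) for r in range(n_rows)]


n = 7
image = list(range(n * n))              # element value == its flat index
transposed = transpose_flat(image, n, n)

# Column 4, rows 1..5 of the original: strided reads.
strided = [r * n + 4 for r in range(1, 6)]
# Row 4, columns 1..5 of the transposed copy: adjacent reads of the same data.
adjacent = [4 * n + c for c in range(1, 6)]

print("original indices  :", strided)
print("transposed indices:", adjacent)
print("values read       :", [transposed[i] for i in adjacent])
```

The transposed copy costs extra memory, but it turns the strided column walk into an adjacent row walk over the same values.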
![Page 65: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/65.jpg)
Improvement using transposed matrix
Was: 9 seconds/rotation. Now: 2.97 seconds/rotation.
32 registers – disabled debugging features, now 100% occupancy
Another way to deal with overflow problems is to break up the whole experiment into parts!
![Page 66: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/66.jpg)
Tweaking Block Size
Was: 2.97 seconds/rotation. Now: 1.8 seconds/rotation.
32 registers – disabled debugging features
Changed from [16, 8, 1] to [16, 1, 8]
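Both block shapes hold 128 threads, so the change only alters which threads end up in the same warp. The sketch below uses the standard CUDA linearization (thread ID = x + y·Dx + z·Dx·Dy, 32 consecutive IDs per warp) to list the members of one warp. `warp_members` is a hypothetical helper, and what the y and z dimensions mean in this projector is not stated on the slide.

```python
WARP_SIZE = 32


def warp_members(block_dim, warp):
    """List the (x, y, z) thread indices that make up one warp of a block."""
    dx, dy, dz = block_dim
    total = dx * dy * dz
    members = []
    for tid in range(warp * WARP_SIZE, min((warp + 1) * WARP_SIZE, total)):
        x = tid % dx
        y = (tid // dx) % dy
        z = tid // (dx * dy)
        members.append((x, y, z))
    return members


for shape in ((16, 8, 1), (16, 1, 8)):
    first = warp_members(shape, 0)
    ys = sorted({y for _, y, _ in first})
    zs = sorted({z for _, _, z in first})
    print(f"block {shape}: warp 0 spans y={ys}, z={zs}")
```

With [16, 8, 1] a warp covers two y values; with [16, 1, 8] it covers two z values instead, so the same 32 threads touch a different slab of the problem and therefore a different memory footprint.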
![Page 67: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/67.jpg)
Taking Tally
Forward Projector Performance (time per rotation [s], log scale from 1 to 1000)
• Naïve: 433
• Removed struct from loop: 27.78
• Removed clamping: 16.2
• Struct pointer: 12.5
• Block size optimization: 10.3
• sqrt removal: 8.995
• Transposed matrix: 2.97
• Block size optimization: 1.8
Let’s fix the first view location for the original benchmark.
![Page 68: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/68.jpg)
Taking Correct Tally
Forward Projector Performance (time per rotation [s], log scale from 1 to 1000)
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
Off by 2x. What next?
Note on performance linearity: 16 views -> 2.1 s/rotation, 800 views -> 2.0 s/rotation
![Page 69: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/69.jpg)
Guided Profile Analysis
nvvp says latency is the bottleneck
![Page 70: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/70.jpg)
Guided Profile Analysis
Now nvvp is telling us that occupancy is the bottleneck.
![Page 71: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/71.jpg)
Guided Profile Analysis
The profiler is giving us the runaround. Guess it doesn’t know how to improve performance.
![Page 72: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/72.jpg)
Unguided Profile Analysis
The innermost loop is essentially 3D interpolation. What can be done to accelerate these computations?
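To make the cost of that inner loop concrete, here is a generic trilinear interpolation written out by hand. It is a textbook sketch, not the kernel from this talk; the flat x-fastest volume layout, the clamped border handling, and the name `trilinear` are assumptions.

```python
import math


def trilinear(volume, dims, x, y, z):
    """Interpolate a flat, x-fastest volume at a fractional (x, y, z) position."""
    nx, ny, nz = dims
    x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = x - x0, y - y0, z - z0

    def voxel(i, j, k):
        # Clamp indices so samples at the edge reuse the border voxel.
        i = min(max(i, 0), nx - 1)
        j = min(max(j, 0), ny - 1)
        k = min(max(k, 0), nz - 1)
        return volume[(k * ny + j) * nx + i]

    result = 0.0
    # Eight neighbouring voxels, each weighted by the product of three 1D weights.
    for dk, wz in ((0, 1 - fz), (1, fz)):
        for dj, wy in ((0, 1 - fy), (1, fy)):
            for di, wx in ((0, 1 - fx), (1, fx)):
                result += wx * wy * wz * voxel(x0 + di, y0 + dj, z0 + dk)
    return result


cube = [float(v) for v in range(8)]      # 2 x 2 x 2 test volume
print(trilinear(cube, (2, 2, 2), 0.5, 0.5, 0.5))
```

Every sample needs eight memory reads, eight weight products, and three rounds of bounds checks, which is why this loop is a natural target for dedicated hardware.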
![Page 73: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/73.jpg)
Texture Memory
Hardware-accelerated 2D/3D interpolation, with interpolation weights held to 8 fractional bits
Morton-ordering-like schemes are used in texture hardware
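For readers who have not met Morton (Z-order) indexing, a minimal 2D sketch follows. It shows the general bit-interleaving idea only; the actual memory layout inside texture hardware is not documented here, and `morton2d` with x in the even bit positions is an assumed convention.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions) into one index."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code


# Print the Morton index of every pixel in a 4 x 4 tile.
for y in range(4):
    print(" ".join(f"{morton2d(x, y):2d}" for x in range(4)))
```

Pixels that are close together in 2D tend to stay close together in memory under this ordering, so a small interpolation neighbourhood is cheap to fetch whichever direction the ray is travelling.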
![Page 74: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/74.jpg)
Texture Memory
Texture hardware handles both interpolation computations and boundary checking.
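A 1D software model of what the texture unit takes over, as a rough sketch. It assumes unnormalized coordinates with texel centers at +0.5, clamp-to-edge addressing, and an interpolation weight rounded to 8 fractional bits; the exact rounding used by the hardware is an assumption here, and `tex1d_linear` is a hypothetical name rather than a CUDA API.

```python
import math


def tex1d_linear(data, x, frac_bits=8):
    """Model a linearly filtered, clamped 1D texture fetch at coordinate x."""
    xb = x - 0.5                      # texel centers sit at i + 0.5
    i = math.floor(xb)
    alpha = xb - i
    # Quantize the weight to a fixed number of fractional bits (assumed rounding).
    scale = 1 << frac_bits
    alpha = round(alpha * scale) / scale

    def fetch(k):
        # Clamp addressing: out-of-range reads return the nearest edge texel.
        return data[min(max(k, 0), len(data) - 1)]

    return (1 - alpha) * fetch(i) + alpha * fetch(i + 1)


texels = [10.0, 20.0]
for x in (-3.0, 0.5, 0.8, 1.0, 1.5):
    print(x, tex1d_linear(texels, x))
```

The quantized weight means results can differ slightly from a full-precision software interpolation, so comparing outputs against the original projector is worthwhile.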
![Page 75: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/75.jpg)
Improvement using textures
0.475 seconds/rotation; 0.429 seconds/rotation with another block size tweak; 0.64 seconds/rotation including transfer times.
VICTORY
![Page 76: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/76.jpg)
So Sweet
Forward Projector Performance (time per rotation [s], log scale from 0.1 to 1000)
• Naïve: 733
• Removed struct from loop: 40.5
• Removed clamping: 22.5
• Struct pointer: 16.4
• sqrt removal + block size opt.: 15.7
• Transposed matrix: 3.34
• Block size optimization: 2.02
• Image textures: 0.64
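The chart is on a log scale, so the size of each step is easier to see as a ratio. This short script uses only the times from the chart; the variable names are illustrative.

```python
tally = [
    ("Naive", 733.0),
    ("Removed struct from loop", 40.5),
    ("Removed clamping", 22.5),
    ("Struct pointer", 16.4),
    ("sqrt removal + block size opt.", 15.7),
    ("Transposed matrix", 3.34),
    ("Block size optimization", 2.02),
    ("Image textures", 0.64),
]

baseline = tally[0][1]
previous = baseline
for name, seconds in tally:
    step = previous / seconds          # gain over the previous stage
    total = baseline / seconds         # gain over the naive kernel
    print(f"{name:32s} {seconds:7.2f} s   step x{step:6.2f}   total x{total:8.1f}")
    previous = seconds

overall = baseline / tally[-1][1]
```

The largest single step is the first one (removing the struct from the loop), and the overall ratio from naive to textures works out to roughly 1145x.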
![Page 77: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/77.jpg)
Verify Outputs
Difference between original and fully optimized sinogram output (reformatted views)
Same results as naïve within +/- 0.7%, but in about 0.6 seconds instead of 733 seconds. Also ~2800x faster than the “reference” CPU implementation! (I think something is wrong with it.)
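One way such a check could be scripted, as a sketch. The slide does not say how the 0.7% figure was normalized, so dividing by the peak of the reference sinogram is an assumption, as are the function names and the tiny made-up sample values.

```python
def max_relative_difference(reference, candidate):
    """Largest absolute difference, as a fraction of the reference peak value."""
    peak = max(abs(v) for v in reference)
    worst = max(abs(c - r) for r, c in zip(reference, candidate))
    return worst / peak


def within_tolerance(reference, candidate, tolerance=0.007):
    """True when the candidate stays within the given fraction of the reference."""
    return max_relative_difference(reference, candidate) <= tolerance


# Illustrative values only, not data from the talk.
naive = [1.0, 2.0, 4.0]
optimized = [1.0, 2.02, 4.0]
print(max_relative_difference(naive, optimized), within_tolerance(naive, optimized))
```

Running a comparison like this after every optimization step, rather than only at the end, makes it easier to tell which change introduced a difference.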
![Page 78: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/78.jpg)
Further GPU Optimization Reading
• Asynchronous compute and transfer
• Shared memory
• Multiple GPUs
![Page 79: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels](https://reader034.vdocument.in/reader034/viewer/2022042923/5f70ee712710224556169659/html5/thumbnails/79.jpg)