Massively Parallel GPU-friendly Algorithms for PET
Szirmay-Kalos László, http://cg.iit.bme.hu
Budapest University of Technology and Economics
(GP)GPU: CUDA (OpenCL)
[Figure: GPU architecture with multiprocessors 1 to N; each multiprocessor contains scalar processors 1 to M, shared memory, an instruction execution unit, and a cache, above the common device memory]
• Massively parallel: #threads > $10^4$
• Independence: synchronization and write collisions should be avoided
• SIMD: conditional statements are not welcome
• Coalesced memory access
Maximum-likelihood reconstruction
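[Figure: voxels $x_1$, $x_2$, $x_3$, $x_4$ and LOR 1 with system matrix elements $A_{11}$, $A_{12}$, $A_{13}$, $A_{14}$]
Expected number of hits:
$\tilde{y} = A \cdot x$
$\tilde{y}_1 = A_{11} x_1 + A_{12} x_2 + A_{13} x_3 + A_{14} x_4$
Maximize the probability of the measurement data y:
$x = \arg\max_x \Pr\{\, y \mid \tilde{y}(x) \,\}$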
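A minimal sketch of the two quantities on this slide: the expected hits as a matrix-vector product, and the probability of the measurement given those expected hits. The Poisson model for the counts is an assumption (it is the usual choice in emission tomography, but the slide does not name it), and the function names and numbers are made up for illustration.

```python
import math

def forward_project(A, x):
    """Expected hits per LOR: y_tilde[L] = sum over voxels V of A[L][V] * x[V]."""
    return [sum(a * xv for a, xv in zip(row, x)) for row in A]

def poisson_log_likelihood(y, y_tilde):
    """log Pr{y | y_tilde}, assuming independent Poisson counts per LOR."""
    return sum(yl * math.log(yt) - yt - math.lgamma(yl + 1)
               for yl, yt in zip(y, y_tilde))

# One LOR that sees four voxels (illustrative numbers only).
A = [[0.1, 0.2, 0.3, 0.4]]
x = [10.0, 20.0, 30.0, 40.0]
y_tilde = forward_project(A, x)   # A11*x1 + A12*x2 + A13*x3 + A14*x4
```

Under this model the likelihood of a measured count is highest when the expected value equals the measurement, which is what the maximization searches for over x.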
Computational challenges
• Numbers of LORs and voxels: hundreds of millions!
• System matrix A: $10^{16}$ elements (petabytes)
– Probability that a positron of a voxel is detected by a LOR
– Patient dependent
– Not sparse if accurate simulation is needed
– Do not store, estimate on-the-fly
• Matrix elements are high dimensional integrals
– Monte Carlo quadrature
– Reuse of computation
– High performance (parallel) computational platform
• Minimize the effect of estimation error
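The storage estimate can be checked with a few lines of arithmetic. The counts below are rounded to the order of magnitude quoted on the slide, and the 4-byte element size is an assumption (32-bit floats).

```python
num_lors = 10**8                 # order of magnitude: hundreds of millions of LORs
num_voxels = 10**8               # same order of magnitude for voxels
elements = num_lors * num_voxels             # one matrix element per (LOR, voxel) pair
bytes_per_element = 4                        # assumption: 32-bit floating point
petabytes = elements * bytes_per_element / 1e15
```

With these assumptions the full matrix has $10^{16}$ elements and would need tens of petabytes, which is why the elements are estimated on the fly instead of being stored.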
Direct physical simulation
Input-driven, scattering-type algorithm
• Thread = photon
• Photons scatter a different number of times
• The same detector is hit: write collision
• Random memory access
• Cannot mimic the detectors
[Figure: paths traced from the source toward the detectors]
GPU-friendly approach
Output-driven, gathering-type algorithm
• Thread = importon
• SIMD: grouping importons
• No write collision: LOR-driven
• Cannot mimic the source
[Figure: paths traced from the detectors toward the source]
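The difference between the scattering and gathering patterns can be sketched on the CPU. This is only an illustration of the memory access pattern, not of the physics: `photon_hits` is a hypothetical list of (detector index, weight) pairs, and each loop iteration stands for what one GPU thread would do.

```python
def scatter(photon_hits, num_detectors):
    """Input-driven: one thread per photon, each adds to the detector it hits.
    Two photons hitting the same detector write to the same element."""
    counts = [0.0] * num_detectors
    for detector, weight in photon_hits:
        counts[detector] += weight      # in parallel code this needs an atomic add
    return counts

def gather(photon_hits, num_detectors):
    """Output-driven: one thread per detector, each sums its own contributions
    and writes only its own output element, so no write collision can occur."""
    return [sum(w for d, w in photon_hits if d == detector)
            for detector in range(num_detectors)]

hits = [(0, 1.0), (2, 0.5), (0, 2.0)]   # detector 0 is hit twice
```

Both functions return the same totals; only the gathering form lets every thread own its output location.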
Direct gamma photon contribution
[Figure: a voxel between detector modules 1 and 2, crossed by a LOR]
5D integration: $\int\!\!\int x(\vec{v})\,\mathrm{d}\omega\,\mathrm{d}v$
• Accuracy for a given number of samples
• Cost of a sample
Output- or LOR-driven sampling
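[Figure: a LOR between detector modules 1 and 2, with labels $l$, $u$, and $w$]
[Equation: sampling density $g_1(\cdot, l)$, given in terms of $N_{\text{ray}}$, $D^2$, $G(u, w)$, and $l$]
Pros:
• Gathering
• Thread coherence
• Texture coherence
• Uniform on detectors
• Low-cost samples due to reuse
Cons:
• Cannot mimic activity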
Input or voxel-driven sampling
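[Figure: a voxel and a LOR between detector modules 1 and 2, with labels $l$ and $u$]
[Equation: sampling density $g_2(\cdot, l)$, given in terms of $N_{\text{voxel}}$, the activity $x$, a cosine term, $l^2$, $u$, and $D^2$]
Pros:
• Can mimic activity
Cons:
• Write collisions
• Less coherence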
Multiple Importance Sampling
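[Figure: the samples of the two methods in the same voxel, added together]
$g(\cdot, l) = g_1(\cdot, l) + g_2(\cdot, l)$
where $g_1$ is the LOR-driven (ray) and $g_2$ the voxel-driven sampling density.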
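The idea of dividing every sample by the sum of the densities of all sampling methods (the balance heuristic of multiple importance sampling) can be sketched on a one-dimensional integral. The integrand, the two densities, and the function name are assumptions chosen for illustration; they stand in for the LOR-driven and voxel-driven samplers and are not the PET densities themselves.

```python
import math
import random

def mis_estimate(f, n1, n2, rng):
    """Estimate the integral of f over [0, 1] with two sampling strategies.
    Strategy 1 draws x uniformly (density p1 = 1), strategy 2 draws x with
    density p2 = 2x. Every sample is weighted by 1 / (g1 + g2), where
    g_i = n_i * p_i is the sample density produced by strategy i."""
    samples = [rng.random() for _ in range(n1)]                 # p1(x) = 1
    samples += [math.sqrt(rng.random()) for _ in range(n2)]     # p2(x) = 2x
    total = 0.0
    for x in samples:
        g = n1 * 1.0 + n2 * 2.0 * x     # combined density g = g1 + g2
        total += f(x) / g
    return total

rng = random.Random(1)
estimate = mis_estimate(lambda x: x * x, 2000, 2000, rng)   # exact value is 1/3
```

Because the denominator is the combined density, a sample is never divided by a small density of one method alone when the other method covers that region well.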
Input-driven scattered photon transport
• Monte Carlo simulation:
– Free path
– Absorption?
– Scattering direction
[Figure: photon path from the source through scattering to absorption, with voxel fetches along the path]
Ray marching
• Complexity grows with the resolution
• Slow in high-resolution, low-density media
$1 - \exp\left(-\sum_i \sigma_i \Delta s\right) = u$
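Free path sampling by ray marching can be sketched as follows: the cross sections are accumulated step by step until the optical depth reaches the value that corresponds to the random number u. The function name, the list-based medium, and the linear interpolation inside the last step are assumptions made for this sketch.

```python
import math

def free_path_ray_marching(sigma, ds, u):
    """Return the distance of the next collision, or None if the photon leaves
    the volume. sigma[i] is the cross section in step i, ds the step length,
    u a uniform random number in [0, 1)."""
    target = -math.log(1.0 - u)      # solve 1 - exp(-optical_depth) = u
    optical_depth = 0.0
    for i, s in enumerate(sigma):
        step = s * ds
        if step > 0.0 and optical_depth + step >= target:
            # sigma is treated as constant inside the step
            return i * ds + (target - optical_depth) / s
        optical_depth += step
    return None

# Homogeneous medium: the collision is where sigma * distance = -ln(1 - u).
distance = free_path_ray_marching([2.0] * 100, 0.1, 1.0 - math.exp(-1.0))
```

The loop length is proportional to the number of voxels along the ray, which is the reason the cost grows with the resolution even when the medium is almost empty.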
Mix virtual particles to obtain a density that can be solved analytically
[Figure: a photon in the real material mixed with virtual particles; a collision is either real or virtual]
$4096^3$ effective resolution
64 billion sample points
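Mixing in virtual particles is commonly known as Woodcock (delta) tracking; the sketch below assumes that this is the technique meant here. The medium is padded to a constant cross section `sigma_max`, free paths are sampled analytically in that homogeneous mixture, and each collision is then classified as real or virtual. The names and the one-dimensional setup are assumptions for illustration.

```python
import math
import random

def free_path_woodcock(sigma_at, sigma_max, length, rng):
    """Sample a collision distance in [0, length), or return None on escape.
    sigma_at(t) is the real cross section at distance t and must not exceed
    the bound sigma_max anywhere along the ray."""
    t = 0.0
    while True:
        # analytic free path in the homogeneous real + virtual mixture
        t += -math.log(1.0 - rng.random()) / sigma_max
        if t >= length:
            return None
        # real collision with probability sigma / sigma_max, otherwise virtual
        if rng.random() * sigma_max < sigma_at(t):
            return t

rng = random.Random(7)
paths = [free_path_woodcock(lambda t: 1.0, 4.0, 1e9, rng) for _ in range(20000)]
mean_path = sum(paths) / len(paths)     # real cross section 1 gives mean free path 1
```

The number of medium fetches depends on `sigma_max` and the path length, not on the voxel resolution, so empty regions are crossed in a few large steps.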
Output-driven single-scattered photon transport with reuse
1. Scattering points
2. Ray marching between scattering points and detectors
3. Combination
[Figure: one panel per step, with labels $s_1$, $s_2$, $z_1$, $z_2$]
Multiple Importance Sampling
• LOR-driven unscattered contribution: 0 scattering
• Voxel-driven unscattered contribution: 0 scattering
• Photon transport: 0, 1, 2, … scattering
• LOR-driven single scatter: 1 scattering
The estimates are combined with MIS.
Detector response
[Figure: detector model with labels photon, scattering, absorption, electronics, hits]
Problem: The domain is 4 dimensional.
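[Figure: iteration loop between the voxels and the measured values; annotations: Monte Carlo simulation or simplification; random or deterministic estimation]
• Physics simulation gives the expected values: $\tilde{y}^{(n)} = A \cdot x^{(n)}$
• Ratio of measured and computed values: $y / \tilde{y}^{(n)}$
• Back projection gives the correction of the voxels:
$\dfrac{x^{(n+1)}}{x^{(n)}} = \dfrac{A^T \dfrac{y}{\tilde{y}^{(n)}}}{A^T \mathbf{1}}$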
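One pass of the loop (forward projection, ratio, back projection, correction) can be sketched with a small dense matrix. In practice the matrix is never stored, as noted earlier; the explicit `A`, the function name, and the test data are assumptions for illustration only.

```python
def ml_em_step(A, x, y):
    """One iteration: x_new[V] = x[V] * (A^T (y / y_tilde))[V] / (A^T 1)[V]."""
    num_lors, num_voxels = len(A), len(x)
    # forward projection: expected values y_tilde = A x
    y_tilde = [sum(A[L][V] * x[V] for V in range(num_voxels))
               for L in range(num_lors)]
    # ratio of measured and computed values
    ratio = [y[L] / y_tilde[L] if y_tilde[L] > 0 else 0.0
             for L in range(num_lors)]
    new_x = []
    for V in range(num_voxels):
        back = sum(A[L][V] * ratio[L] for L in range(num_lors))   # A^T (y / y_tilde)
        sens = sum(A[L][V] for L in range(num_lors))              # A^T 1
        new_x.append(x[V] * back / sens if sens > 0 else 0.0)
    return new_x

A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # 3 LORs, 2 voxels (made-up numbers)
y = [2.0, 4.0, 3.0]                          # noise-free data of activity [2, 4]
x = [1.0, 1.0]
for _ in range(50):
    x = ml_em_step(A, x, y)
```

With noise-free, consistent data the iteration settles at the activity that generated the measurement; that activity is also a fixed point because every ratio is then 1.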
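Backprojection with unbiased forward projection
$\dfrac{x^{(n+1)}}{x^{(n)}} = \dfrac{A^T \dfrac{y}{\tilde{y}^{(n)}}}{A^T \mathbf{1}}$
Ratio of measured and computed values: $y_L / \hat{y}_L$ compared with $y_L / \tilde{y}_L$
[Figure: contribution to $x_V$ for samples $\hat{y}_L$ around $\tilde{y}_L$ and the measured $y_L$; one sample is marked as an outlier]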
Reduce bias and outliers
• Averaging iteration: $\tilde{y}_L^{(n)} = (1 - \tau_n)\,\tilde{y}_L^{(n-1)} + \tau_n\,\hat{y}_L$, with $\tau_n \propto 1/n$
• Metropolis iteration: ignore outliers randomly. Acceptance with probability $a_L = \min\{\hat{y}_L / \tilde{y}_L^{(n)},\ 1\}$
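The two updates can be sketched per LOR as below. This is a sketch under stated assumptions: the averaging weight is taken as exactly 1/n (a running mean), the Metropolis step compares the new random estimate with the currently stored value, and both function names are hypothetical.

```python
import random

def averaging_update(y_prev, y_hat, n):
    """Averaging iteration: blend the new estimate y_hat into the stored value.
    Assumes tau_n = 1/n, so the stored value is the running mean of the estimates."""
    tau = 1.0 / n
    return (1.0 - tau) * y_prev + tau * y_hat

def metropolis_update(y_current, y_hat, rng):
    """Metropolis iteration: accept y_hat with probability min(y_hat / y_current, 1).
    Estimates much smaller than the current value, which would produce outlier
    ratios y / y_hat, are therefore rejected most of the time."""
    accept = min(y_hat / y_current, 1.0) if y_current > 0 else 1.0
    return y_hat if rng.random() < accept else y_current

y_avg = 0.0
for n, y_hat in enumerate([4.0, 2.0, 6.0], start=1):
    y_avg = averaging_update(y_avg, y_hat, n)    # running mean of the estimates
```

Averaging reduces the variance of the denominator before the ratio is formed, while the Metropolis rule keeps a larger, previously accepted value instead of an unusually small new one.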