![Page 1: HC20.24.250.CUDA Application Development Experience...Compute Q Acquire Data Compute FHd Find ρ More than 99.5% of time Haldar, et al, “Anatomically-constrained reconstruction from](https://reader035.vdocument.in/reader035/viewer/2022070807/5f053c3f7e708231d411f435/html5/thumbnails/1.jpg)
[HKR HotChips-2007]
6-12 weeks
[Figure: three reconstruction paths. (a) Cartesian scan data, FFT. (b) Gridding. (c) Spiral scan data, iterative reconstruction.]
Spiral scan data + iterative reconstruction: a fast scan reduces artifacts, and iterative reconstruction increases SNR. Reconstruction requires a lot of computation.
Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago
Reconstruction steps: Compute Q; Acquire Data; Compute Fᴴd; Find ρ
More than 99.5% of time
Haldar, et al, “Anatomically-constrained reconstruction from noisy data,” MR in Medicine.
Reconstruction of a 64³ image used to take days!
Performance: 128 GFLOPS
Time: 1.2 minutes
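/* Baseline loop nest for computing Q: m indexes the (kx, ky, kz) samples,
   n indexes the (x, y, z) points */
for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m]
         + iPhi[m]*iPhi[m];
  for (n = 0; n < N; n++) {
    exp = 2*PI*(kx[m]*x[n] +
                ky[m]*y[n] +
                kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}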
/* Loop interchange: n is now the outer loop and m the inner loop */
for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    phi[m] = rPhi[m]*rPhi[m]
           + iPhi[m]*iPhi[m];
    exp = 2*PI*(kx[m]*x[n] +
                ky[m]*y[n] +
                kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}
/* Loop fission: phi[m] is computed once, in its own loop */
for (m = 0; m < M; m++) {
  phi[m] = rPhi[m]*rPhi[m]
         + iPhi[m]*iPhi[m];
}
for (n = 0; n < N; n++) {
  for (m = 0; m < M; m++) {
    exp = 2*PI*(kx[m]*x[n] +
                ky[m]*y[n] +
                kz[m]*z[n]);
    rQ[n] += phi[m]*cos(exp);
    iQ[n] += phi[m]*sin(exp);
  }
}
/* First of 32 chunks of the m loop */
for (m = 0; m < M/32; m++) {
  exp = 2*PI*(kx[m]*x[n] +
              ky[m]*y[n] +
              kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exp);
  iQ[n] += phi[m]*sin(exp);
}
/* Last of 32 chunks of the m loop */
for (m = 31*M/32; m < 32*M/32; m++)
{
  exp = 2*PI*(kx[m]*x[n] +
              ky[m]*y[n] +
              kz[m]*z[n]);
  rQ[n] += phi[m]*cos(exp);
  iQ[n] += phi[m]*sin(exp);
}
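/* Each thread accumulates one element n of Q over the samples [startM, endM).
   TPB is the number of threads per block. */
__global__ void Q(float *x, float *y, float *z, float *rQ, float *iQ,
                  float *kx, float *ky, float *kz, float *phi,
                  int startM, int endM)
{
  int n = blockIdx.x*TPB + threadIdx.x;
  for (int m = startM; m < endM; m++) {
    float exp = 2*PI*(kx[m]*x[n]
                    + ky[m]*y[n]
                    + kz[m]*z[n]);
    rQ[n] += phi[m] * cos(exp);
    iQ[n] += phi[m] * sin(exp);
  }
}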
A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995).
Increase in per-thread performance, but fewer threads: lower overall performance
8X
108X 228X 357X
• Programmers are doing too much heavy lifting
• Too many memory organizational details are exposed to the programmers
Sum of Absolute Differences
S. Ryoo, et al, “Program Optimization Space Pruning for a Multithreaded GPU,” ACM/IEEE CGO, April 2008.
Sum of Absolute Differences
By selecting only Pareto-optimal points, we pruned the search space by 98% and still found the optimal configuration.
S. Ryoo, et al, “Program Optimization Space Pruning for a Multithreaded GPU,” ACM/IEEE CGO, April 2008.
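The pruning step can be sketched as a plain Pareto filter: a configuration is kept only if no other configuration is at least as good on every metric and strictly better on one. The C sketch below assumes two placeholder metrics per configuration where higher is better; the slide does not name the metrics, and `Config`, `dominated_by`, and `pareto_front` are hypothetical names.

```c
#include <stddef.h>

/* One candidate configuration scored by two placeholder metrics.
   Assumption: for both metrics, higher is better. */
typedef struct {
    float a;
    float b;
} Config;

/* Returns 1 if q is at least as good as p on both metrics
   and strictly better on at least one. */
static int dominated_by(Config p, Config q)
{
    return q.a >= p.a && q.b >= p.b && (q.a > p.a || q.b > p.b);
}

/* Writes the indices of the Pareto-optimal configurations into out
   (which must have room for n entries) and returns how many were kept. */
size_t pareto_front(const Config *c, size_t n, size_t *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominated_by(c[i], c[j]))
                dominated = 1;
        if (!dominated)
            out[kept++] = i;
    }
    return kept;
}
```

Only the configurations returned by `pareto_front` would then need to be compiled and timed. This quadratic scan is chosen for clarity and is not a claim about how the cited work implements the search.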
IA multi-core & Larrabee; NVIDIA GPU
NVIDIA SDK 1.1; MCUDA/OpenMP; CUDA-lite; CUDA-tune; CUDA-auto
• 1st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers
• Parameterized CUDA programming using auto-tuning and optimization space pruning
• Locality annotation programming to eliminate need for explicit management of memory types and data transfers
• Implicitly parallel programming with data structure and function property annotations to enable auto parallelization
S.S. Stone, et al, “Accelerating Advanced MRI Reconstruction using GPUs,” ACM Computing Frontier Conference 2008, Italy, May 2008.
10 kernels, less than 1.5 min after acceleration
Matrixmul(A[ ], B[ ], C[ ])
{
  __shared__ Asub[ ][ ], Bsub[ ][ ];
  int a,b,c;
  float Csub;
  int k;
  …
  for(…)
  {
    Asub[tx][ty] = A[a];
    Bsub[tx][ty] = B[b];
    __syncthreads();   /* wait until the whole tile is loaded */
    for( k = 0; k < blockDim.x; k++ )
      Csub += Asub[ty][k] * Bsub[k][tx];   /* multiply-accumulate */
    __syncthreads();   /* wait before the tile is overwritten */
  }
  …
}
Matrixmul(A[ ], B[ ], C[ ])
{
  __shared__ Asub[ ][ ], Bsub[ ][ ];
  int a,b,c;
  float Csub;
  int k;
  …
  for(…)
  {
    /* Thread loop over the code before the first barrier */
    for(ty=0; ty < blockDim.y; ty++)
      for(tx=0; tx < blockDim.x; tx++)
      {
        Asub[tx][ty] = A[a];
        Bsub[tx][ty] = B[b];
      }
    /* Thread loop over the code between the two barriers */
    for(ty=0; ty < blockDim.y; ty++)
      for(tx=0; tx < blockDim.x; tx++)
      {
        for( k = 0; k < blockDim.x; k++ )
          Csub += Asub[ty][k] * Bsub[k][tx];
      }
  }
  …
}
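The thread-loop idea can be tried out in plain C. The sketch below is a self-contained tiled matrix multiply in which explicit `ty`/`tx` loops take the place of the hardware threads, and the loop nest is split at the point where the barrier used to be. It is an illustration under stated assumptions, not the output of any translator: `matmul_thread_loops` and `TILE` are hypothetical names, matrices are square and row-major with `n` a multiple of `TILE`, and the per-thread accumulator `Csub` is widened to one slot per logical thread so that its value survives between thread loops.

```c
#define TILE 2   /* illustrative tile (thread block) edge */

/* C = A * B for n x n row-major matrices, with n a multiple of TILE.
   Each (by, bx) iteration plays the role of one thread block. */
void matmul_thread_loops(const float *A, const float *B, float *C, int n)
{
    for (int by = 0; by < n/TILE; by++) {
        for (int bx = 0; bx < n/TILE; bx++) {
            float Asub[TILE][TILE], Bsub[TILE][TILE];
            float Csub[TILE][TILE] = {{0}};   /* one accumulator per logical thread */

            for (int t = 0; t < n/TILE; t++) {
                /* Thread loop 1: the code that ran before the first barrier. */
                for (int ty = 0; ty < TILE; ty++)
                    for (int tx = 0; tx < TILE; tx++) {
                        Asub[ty][tx] = A[(by*TILE + ty)*n + t*TILE + tx];
                        Bsub[ty][tx] = B[(t*TILE + ty)*n + bx*TILE + tx];
                    }
                /* Thread loop 2: the code that ran between the two barriers. */
                for (int ty = 0; ty < TILE; ty++)
                    for (int tx = 0; tx < TILE; tx++)
                        for (int k = 0; k < TILE; k++)
                            Csub[ty][tx] += Asub[ty][k] * Bsub[k][tx];
            }

            /* Each logical thread writes its own result element. */
            for (int ty = 0; ty < TILE; ty++)
                for (int tx = 0; tx < TILE; tx++)
                    C[(by*TILE + ty)*n + bx*TILE + tx] = Csub[ty][tx];
        }
    }
}
```

Finishing the first thread loop before starting the second gives the same ordering guarantee that `__syncthreads()` gave: every tile element is loaded before any logical thread reads it.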
• Consistent speed-up over hand-tuned single-thread code
• Best optimizations for GPU and CPU not always the same
*Over hand-optimized CPU
**Intel MKL, multi-core execution
Thank you! Any questions?