Modern GPUs (Graphics Processing Units)
Powerful data‐parallel computation platform: high computation density, high memory bandwidth, relatively low cost.
NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 1.5 GB memory, 200 GB/s bandwidth, $499.
GPU for Scientific Computing
[NVIDIA]
Department of Computer Science
GPU‐based QUIC‐SVD Algorithm
Introduction
• Singular Value Decomposition (SVD) of large matrices
• Low‐rank approximation
• QUIC‐SVD: fast SVD using cosine trees, by Michael Holmes, Alexander Gray, and Charles Isbell, NIPS 2008
QUIC‐SVD: Fast SVD using Cosine Trees [Holmes et al. 2008]
QUIC‐SVD: Fast SVD using Cosine Trees [Holmes et al. 2008]
Start from a root node that owns the entire matrix.
QUIC‐SVD: Fast SVD using Cosine Trees
Select a pivot row by sampling from the squared‐length distribution of the rows; compute the inner product of every row with the pivot row.
[Holmes et al. 2008]
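A minimal sketch of these two steps on plain Python lists (the helper names `select_pivot` and `inner_products` are mine, not from the paper):

```python
import random

def select_pivot(rows):
    """Sample a pivot row index with probability proportional to
    each row's squared length (the slide's length-square distribution)."""
    weights = [sum(x * x for x in r) for r in rows]  # squared lengths
    total = sum(weights)
    t = random.uniform(0.0, total)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= t:
            return i
    return len(rows) - 1  # guard against floating-point edge cases

def inner_products(rows, pivot):
    """Inner product of every row with the pivot row; each output
    element is independent, which is what makes this GPU-friendly."""
    p = rows[pivot]
    return [sum(a * b for a, b in zip(r, p)) for r in rows]
```

On the GPU, `inner_products` becomes a single matrix–vector product over the node's rows.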
QUIC‐SVD: Fast SVD using Cosine Trees
The inner‐product results determine how the rows are partitioned into two subsets, each represented by a child node.
[Holmes et al. 2008]
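One plausible split rule, sketched below: send each row to whichever extreme inner‐product value it is closer to (the paper's exact criterion may differ; this only illustrates partitioning on the inner products):

```python
def split_rows(dots):
    """Partition row indices into two subsets based on their inner
    products with the pivot: each row joins the child whose extreme
    (min or max) inner-product value it is nearer to."""
    lo, hi = min(dots), max(dots)
    left, right = [], []
    for i, d in enumerate(dots):
        (left if abs(d - lo) <= abs(d - hi) else right).append(i)
    return left, right
```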
QUIC‐SVD: Fast SVD using Cosine Trees
Compute the row mean (centroid) of each subset; add it to the current basis set (kept orthogonalized).
[Holmes et al. 2008]
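A sketch of this step, using a plain Gram Schmidt pass for the "keep orthogonalized" part (later slides discuss better‐behaved variants; `add_to_basis` is an illustrative name):

```python
def centroid(rows):
    """Row mean (centroid) of a subset of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def add_to_basis(basis, v, eps=1e-12):
    """Orthogonalize v against the current basis and append it,
    unless it is (numerically) already in the span."""
    w = list(v)
    for u in basis:
        uu = sum(a * a for a in u)
        c = sum(a * b for a, b in zip(w, u)) / uu  # projection coefficient
        w = [a - c * b for a, b in zip(w, u)]
    if sum(a * a for a in w) > eps:
        basis.append(w)
    return basis
```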
QUIC‐SVD: Fast SVD using Cosine Trees [Holmes et al. 2008]
Repeat until the estimated error falls below a threshold. The final basis set then provides a good low‐rank approximation.
GPU‐based Implementation of QUIC‐SVD
Most computation time is spent on splitting nodes. Computationally expensive steps:
• Computing vector inner products
• Computing row means (centroids)
• Basis orthogonalization
Key: find enough computation to keep the GPU busy!
Parallelize Vector Inner Products and Row Means
Compute all inner products and row means in parallel. There are enough row vectors to keep the GPU busy.
Parallelize Gram Schmidt Orthogonalization
Classical Gram Schmidt: assume vectors $u_1, u_2, \ldots, u_n$ are already orthogonalized. To orthogonalize a new vector $v$ with respect to them:

$$u_{n+1} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \frac{\langle v, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2 - \cdots - \frac{\langle v, u_n \rangle}{\langle u_n, u_n \rangle} u_n$$

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability due to rounding errors.
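A sketch of one classical Gram Schmidt step; note that every projection coefficient is computed from the original $v$, so all of them can be evaluated in parallel:

```python
def classical_gram_schmidt_step(basis, v):
    """Classical Gram Schmidt: all projection coefficients are taken
    against the original v (independent, hence parallelizable), then
    subtracted; rounding errors accumulate across the subtractions."""
    coeffs = [sum(a * b for a, b in zip(v, u)) / sum(a * a for a in u)
              for u in basis]          # all coefficients in parallel
    w = list(v)
    for c, u in zip(coeffs, basis):    # subtract the projections
        w = [a - c * b for a, b in zip(w, u)]
    return w
```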
Parallelize Gram Schmidt Orthogonalization
Modified Gram Schmidt:
This has good numerical stability, but is hard to parallelize, as the computation is sequential!
$$v^{(1)} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1, \qquad v^{(2)} = v^{(1)} - \frac{\langle v^{(1)}, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2, \qquad v^{(3)} = v^{(2)} - \frac{\langle v^{(2)}, u_3 \rangle}{\langle u_3, u_3 \rangle} u_3, \ \ldots$$

$$u_{n+1} = v^{(n-1)} - \frac{\langle v^{(n-1)}, u_n \rangle}{\langle u_n, u_n \rangle} u_n$$
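The same step in modified form, sketched below; each coefficient now depends on the vector updated in the previous iteration, which is exactly the dependency chain that blocks parallelism:

```python
def modified_gram_schmidt_step(basis, v):
    """Modified Gram Schmidt: project against one basis vector at a
    time, updating the working vector in between. Numerically stabler
    than the classical form, but inherently sequential."""
    w = list(v)
    for u in basis:
        c = sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
        w = [a - c * b for a, b in zip(w, u)]  # update before next step
    return w
```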
Parallelize Gram Schmidt Orthogonalization
Our approach: Blocked Gram Schmidt
This hybrid approach proves to be both numerically stable and GPU‐friendly.
With block size $m$, projections within a block are computed in parallel (classical style), while blocks are processed sequentially (modified style):

$$v^{(1)} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \cdots - \frac{\langle v, u_m \rangle}{\langle u_m, u_m \rangle} u_m$$

$$v^{(2)} = v^{(1)} - \frac{\langle v^{(1)}, u_{m+1} \rangle}{\langle u_{m+1}, u_{m+1} \rangle} u_{m+1} - \cdots - \frac{\langle v^{(1)}, u_{2m} \rangle}{\langle u_{2m}, u_{2m} \rangle} u_{2m}$$

$$\ldots$$

In general, $v^{(j)} = v^{(j-1)} - \sum_{k=(j-1)m+1}^{jm} \frac{\langle v^{(j-1)}, u_k \rangle}{\langle u_k, u_k \rangle} u_k$, and the new basis vector $u_{n+1}$ is the final $v^{(j)}$ once all $n$ existing vectors have been processed.
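A sketch of the blocked step (the block size `m` is a tunable parameter; the value here is only illustrative):

```python
def blocked_gram_schmidt_step(basis, v, m=2):
    """Blocked Gram Schmidt: within each block of m basis vectors,
    all coefficients are computed from the same intermediate vector
    (parallel, classical-style); the vector is updated between
    blocks (sequential, modified-style)."""
    w = list(v)
    for start in range(0, len(basis), m):
        block = basis[start:start + m]
        # coefficients for this block: independent, GPU-friendly
        coeffs = [sum(a * b for a, b in zip(w, u)) / sum(a * a for a in u)
                  for u in block]
        # subtract the block's projections before moving on
        for c, u in zip(coeffs, block):
            w = [a - c * b for a, b in zip(w, u)]
    return w
```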
Partitioned QUIC‐SVD
As the advantage of exploiting the GPU is only apparent for large‐scale problems, we need to test our algorithm on large matrices (larger than 10,000 × 10,000).
For dense matrices, this will soon become an out‐of‐core problem.
We modified our algorithm to use a matrix partitioning scheme, so that each partition can fit in GPU memory.
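A trivial sketch of such a partitioning over row ranges (the partition size would be chosen from the available GPU memory; the helper name is mine):

```python
def partition_rows(n_rows, rows_per_part):
    """Split row indices into contiguous partitions, each small
    enough to fit in GPU memory; partitions are then streamed to
    the device one at a time."""
    return [list(range(s, min(s + rows_per_part, n_rows)))
            for s in range(0, n_rows, rows_per_part)]
```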
Performance
We generate test data by multiplying two randomly generated left and right matrices, resulting in a low‐rank matrix.
We set the termination error in QUIC‐SVD to 10⁻⁶. This forces the algorithm to produce accurate results.
We compare 1) Matlab's svds; 2) a CPU implementation of QUIC‐SVD; 3) a GPU implementation of QUIC‐SVD.
• The CPU version is considerably optimized using Intel MKL and tested on an Intel Core i7
• The GPU version is tested on an NVIDIA GTX 480
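The test‐data construction described above can be sketched as follows (dimensions and the rank bound `r` are illustrative; the product of an n × r and an r × m matrix has rank at most r):

```python
import random

def low_rank_matrix(n, m, r, seed=0):
    """Generate an n x m test matrix of rank at most r by multiplying
    a random n x r 'left' matrix with a random r x m 'right' matrix."""
    rng = random.Random(seed)  # fixed seed for reproducible tests
    L = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(n)]
    R = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(r)]
    return [[sum(L[i][k] * R[k][j] for k in range(r)) for j in range(m)]
            for i in range(n)]
```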
Integration with Manifold Alignment Framework
Conclusion and Future Work
Our initial effort has shown that a GPU‐based implementation of QUIC‐SVD can achieve reasonable speedup.
With additional optimizations, a 10× speedup is foreseeable.
Handle sparse matrices. Components from this project can become fundamental building blocks for other algorithms: random projection trees, diffusion wavelets, etc.
Design new algorithms that are suitable for data‐parallel computation.
Programming on the GPU
General‐purpose GPU programming languages
• CUDA, OpenCL, DirectCompute, BrookGPU …
[NVIDIA]