Isaac Lyngaas (irlyngaas@gmail.com)
John Paige (paigejo@gmail.com)
Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu)
SIParCS, July 31, 2014
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software packages
Parallel eigendecomposition
Conclusions & future work
Accelerate the ‘fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (shared memory)
◦ GPU
◦ Xeon Phi
Many developers & users in the field of statistics
◦ Readily available code base
Problem: R is slow for large problems
Bottleneck is in linear algebra operations
◦ mKrig – Cholesky decomposition
◦ Krig – eigendecomposition
R uses sequential algorithms
Strategy: use C-interoperable libraries to parallelize the linear algebra
◦ C functions callable from the R environment
Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation: det(A) = (product of the diagonal entries of L)^2, so log det(A) is a cheap sum of logs
PLASMA (multicore, shared memory)
◦ http://icl.cs.utk.edu/plasma/
MAGMA (GPU & Xeon Phi)
◦ http://icl.cs.utk.edu/magma/
CULA (GPU)
◦ http://www.culatools.com/
PLASMA: multicore (shared memory)
Block scheduling
◦ Determines which operations are done on which core
Block size optimization
◦ Dependent on cache memory
[Figure: "PLASMA using 1 node (# of observations = 25,000)" — speedup vs. 1 core (y-axis, 0–15) against # of cores (x-axis: 1, 2, 4, 8, 12, 16); curves: Speedup, Optimal Speedup.]
[Figure: "PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, 16 cores)" — time (sec, y-axis, ~3–7) against block size (x-axis, 500–1500), with markers at 256 Kb and 40 Mb.]
[Figure: "PLASMA optimal block sizes (16 cores)" — optimal block size (y-axis, 100–600) against # of observations (x-axis, 0–40,000).]
MAGMA: utilizes GPUs or Xeon Phi for parallelization
◦ Multiple-GPU & multiple-Xeon-Phi implementations available
◦ 1 CPU core drives 1 GPU
Block scheduling
◦ Similar to PLASMA
Block size dependent on accelerator architecture
CULA: proprietary CUDA-based linear algebra package
Capable of LAPACK operations using 1 GPU
API written in C
Dense & sparse operations available
1 node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 Kb, L2 = 256 Kb
  Cache per socket: L3 = 20 Mb
◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 CPU core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
• Serial R: ~3 GFLOP/sec
• Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: "Accelerated Hardware has Room for Improvement" — GFLOP/sec (y-axis, 0–400) against # of observations (x-axis, 0–40,000) for Plasma (16 cores), Magma 1 GPU, Magma 2 GPUs, Magma 1 MIC, Magma 2 MICs, and CULA.]
All Parallel Cholesky Implementations are Faster than Serial R
[Figure: time (sec; log scale, 0.01–1000) against # of observations (0–40,000) for Serial R, Plasma (16 cores), CULA, Magma 1 GPU, Magma 2 GPUs, Magma 1 Xeon Phi, Magma 2 Xeon Phis.]
• >100x speedup over serial R when # of observations = 10k
Eigendecomposition also Faster on Accelerated Hardware
[Figure: time (sec, 0–300) against # of observations (0–10,000) for Serial R, CULA, Magma 1 GPU, Magma 2 GPUs.]
• ~6x speedup over serial R when # of observations = 10k
Can Run ~30 Cholesky Decompositions per Eigendecomposition
[Figure: ratio of eigendecomposition time to Cholesky time (y-axis, 0–30) against # of observations (x-axis, 0–10,000).]
• Both times taken using MAGMA w/ 2 GPUs
• If we want to do 16 Cholesky decompositions in parallel, the parallel library is guaranteed to do better than 16 concurrent serial runs only when its speedup exceeds 16
Parallel Cholesky Beats Parallel R for Moderate to Large Matrices
[Figure: speedup vs. parallel R (y-axis, 0–25) against # of observations (x-axis, 0–20,000) for Plasma and Magma 2 GPUs.]
Using Caldera
◦ Single Cholesky decomposition:
  Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
  Matrix size 20k–35k: use MAGMA w/ 2 GPUs
  Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on the computing resources available
Explored implementation on accelerated hardware
◦ GPUs
◦ Multicore (shared memory)
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access, contact Srinath Vadlamani
Future work
◦ Multicore distributed memory
◦ Single precision
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. Of VECPAR’10, Berkeley, CA, June22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.
[Backup figure: task DAG for a tiled Cholesky factorization on a 4x4 grid of blocks — xPOTRF (factor a diagonal tile), xTRSM (triangular solves on the panel below it), xSYRK (symmetric rank-k updates of trailing diagonal tiles), xGEMM (updates of trailing off-diagonal tiles), with dependencies across steps 0–3.]
• http://www.netlib.org/lapack/lawnspdf/lawn223.pdf