accelerating ‘fields’ byintroduction accelerated version of ‘fields’: fieldsmagma [paige et...
TRANSCRIPT
Accelerating ‘fields’ by
revamping the Cholesky
decomposition Vinay Ramakrishnaiah
Raghu Raj P Kumar
John Paige
Dr. Dorit Hammerling
1
07/29/2015
Introduction What is ‘fields’?
Widely used R package for spatial statistics.
Literally thousands of users.
Developed by Dr. Douglas Nychka, director of The Institute for Mathematics Applied to Geosciences (IMAGe).
Major methods – thin plate splines, Kriging and Compact covariances for large data sets.
‘fields’ - single thread version on CPU.
http://burandtfieldmethods.blogspot.com/
2
Introduction
Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop.
Make it user friendly.
MAGMA: Matrix algebra on GPU and Multicore Architectures
Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. [http://icl.cs.utk.edu/magma/]
Biggest computational problem in Kriging
Evaluation of data likelihood function given the covariance.
Requires Cholesky Decomposition
3
Motivation
Unconventional behavior
http://www.nor-tech.com/solutions/hybrid-gpu-solutions-from-nor-tech/8-gpu-
server/
http://wccftech.com/nvidia-reportedly-preparing-geforce-titan-gpu-gk110/
http://www.aarp.org/money/investing/info-2014/5-bad-money-
moves.html#slide1
http://www.kaizen-news.com/focus-performance/
4
Performance
Motivation
Opposite impacts
http://www.bogotobogo.com/cplusplus/constructor.php
5
Motivation
There is always more room!
http://funny-pictures.funmunch.com/funny-picture-545.html
6
Approach
7
Identify performance
bottlenecks/cause for
unconventional behavior
Analyze the existing
code
See how to further
accelerate the code
Performance analysis - Profiling
NVIDIA visual profiler
Why profile GPU?
Why NVVP?
8
Performance analysis – Profiling
dpotrf_m 9 Non concurrent
memory transfers
Computations
CPU involved in
device to device data
exchange
Time----
Time----
Performance analysis – Profiling
dpotrf_mgpu 10
Asynchronous
simultaneous data
transfer
Computations
Time----
Performance analysis - Timing
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
0 5000 10000 15000 20000 25000 30000 35000
Tim
e in
seco
nd
s
Square matrix dimension N
Comparison of dpotrf_m and dpotrf_mgpu called from the R wrapper
dpotrf_mgpu (s)
dpotrf_m (s)
11
Program Flow 12 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
Overheads in R 13
0
2
4
6
8
10
12
14
16
18
20
magmaChol .Call dpotrf_m magmaChol .Call dpotrf_mgpu
dpotrf_m dpotrf_mgpu
Tim
e (
s)
Sample square matrix size N=32768
Overheads in R
overhead
overhead
OPTIMIZATION APPROACHES
Reduce function call overheads in R.
Optimize R environment for
‘fieldsMAGMA’.
Accelerate the underlying C function.
14
Reducing R overheads
Intel LD_PRELOAD
Single dynamic load
Move the dynamic loads to the highest
level of R call
15 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
Reducing R overheads 16
0
5
10
15
20
25
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Matrix dimension N
Reduced overheads
Before
After
Performance of double precision Cholesky
Decomposition on Intel Compilers
0
5
10
15
20
25
30
35
40
0 5000 10000 15000 20000 25000 30000 35000 40000 45000
Tim
e (
s)
Matrix Dimension N
Comparison of accelerated code (double precision) - Intel 2012 vs Intel 2015 Compiler
Intel 12.1.5
Intel 15.0.3
17
18 R Program
magmaChol(…)
Rwrapper.r
dyn.load(…)
.Call(…)
fieldsMagma.r
C function
MAGMA wrapper
CUDA calls
Compute
Pinned memory
http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/
19
For single precision calculations we obtained an improvement up to
30% for large sized matrices.
Mysteries of Deep
Copy
Deep copy performed in R acts as a overhead.
Used ‘perf’ to unravel the mystery!
20
Deep Copy In-place
Page Faults High Low
LLC Misses High Low
dTLB Misses Constant Constant
Abbreviations:
SP – Single Precision
DP – Double Precision
DPCPY – Deep copy
IP – In Place
Results 21
0
10
20
30
40
50
60
70
80
90
0 5000 10000 15000 20000 25000 30000 35000
Sp
ee
d u
p
Square matrix dimension N
Speed up of accelerated single precision versions with respect to CPU implementation
SP_DCPY_nGPU1
SP_DCPY_nGPU2
SP_IP_nGPU1
SP_IP_nGPU2
Results 22
0
10
20
30
40
50
60
70
80
90
0 5000 10000 15000 20000 25000 30000 35000
Sp
ee
d u
p
Square matrix dimension N
Speed up of accelerated double precision versions with respect to CPU implementation
DP_DPCPY_nGPU1
DP_DPCPY_nGPU2
DP_IP_nGPU1
DP_IP_nGPU2
Conclusion
Achieved an improvement in performance of about 40% - 60% over the previous version of fieldsMAGMA for large matrices.
Ability to allocate pinned memory makes significant difference.
Reduce R overheads.
Fine-tuned the R environment for fieldsMAGMA.
Detailed technical report.
23
Future work 24
Courtesy of NVIDIA Corporation
References Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J.,
Langou, J., et al. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects. Journel of Physics: Conference series , 180 (012037).
http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/. (n.d.).
https://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat. (n.d.).
Katzfuss, M., & Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets (Vol. 23). Environmetrics.
Paige, J., Lyngaas, I., Ramakrishnaiah, V., Hammerling , D., Kumar, R., & Nychka, D. (2015). Incorporating MAGMA into the 'fields' spatial statistics package. Boulder: UCAR/NCAR.
Tomov, S., Dongarra, J., Volkov, V., & Demmel, J. MAGMA library. University of Tenessee and University of California, Knoxville, TN, and Berkeley, CA.
25
Thank you!
Questions?
26
Introduction
27
Initial timing results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0
100
200
300
400
500
600
700
800
900
1000
0 2000 4000 6000 8000 10000 12000
Tim
e (
s)
GF
lop
s
Square matrix dimension N
Comparison of testing_dpotrf & testing_dpotrf_mgpu
testing_dpotrf_mgpu -Gflops
testing_dpotrf - Gflop/s
testing_dpotrf_mgpu -time (s)
testing_dpotrf - time (s)
28
Single precision calculations
0
2
4
6
8
10
12
14
16
18
20
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Square matrix dimension N
Using pinned memory for single precision
non-pinned
pinned
29
Results
0
5
10
15
20
25
30
0 5000 10000 15000 20000 25000 30000 35000
Tim
e (
s)
Square matrix dimension N
Comparison of execution times of different implementations
DP_DPCPY_nGPU1
DP_DPCPY_nGPU2
DP_IP_nGPU1
DP_IP_nGPU2
SP_DCPY_nGPU1
SP_DCPY_nGPU2
SP_IP_nGPU1
SP_IP_nGPU2
30
Results
0
1
2
3
4
5
6
2481018
26224981
756310311
1349117369
2320126633
Sp
eed
up
Square matrix dimension N
Performance improvement of the accelerated Kriging over CPU version
mKrig_1_DP mKrig_1_SP
mKrig_2_DP mKrig_2_SP
31
Results
0
0.5
1
1.5
2
2.5
3
3.5
4
2481018
26224981
756310311
1349117369
Sp
ee
d u
p
Square matrix dimension N
Performance improvement of the accelerated Kriging workflow over the CPU version
mKrigWF_1_DP
mKrigWF_2_SP
mKrigWF_2_DP
mKrigWF_1_SP
32