accelerating ‘fields’ byintroduction accelerated version of ‘fields’: fieldsmagma [paige et...

Accelerating ‘fields’ by

revamping the Cholesky

decomposition Vinay Ramakrishnaiah

Raghu Raj P Kumar

John Paige

Dr. Dorit Hammerling

1

07/29/2015

Introduction What is ‘fields’?

Widely used R package for spatial statistics.

Literally thousands of users.

Developed by Dr. Douglas Nychka, director of The Institute for Mathematics Applied to Geosciences (IMAGe).

Major methods – thin plate splines, Kriging and Compact covariances for large data sets.

‘fields’ - single thread version on CPU.

http://burandtfieldmethods.blogspot.com/

2

Introduction

Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop.

Make it user friendly.

MAGMA: Matrix algebra on GPU and Multicore Architectures

Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. [http://icl.cs.utk.edu/magma/]

Biggest computational problem in Kriging

Evaluation of data likelihood function given the covariance.

Requires Cholesky Decomposition

3

Motivation

Unconventional behavior

http://www.nor-tech.com/solutions/hybrid-gpu-solutions-from-nor-tech/8-gpu-

server/

http://wccftech.com/nvidia-reportedly-preparing-geforce-titan-gpu-gk110/

http://www.aarp.org/money/investing/info-2014/5-bad-money-

moves.html#slide1

http://www.kaizen-news.com/focus-performance/

4

Performance

Motivation

Opposite impacts

http://www.bogotobogo.com/cplusplus/constructor.php

5

Motivation

There is always more room!

http://funny-pictures.funmunch.com/funny-picture-545.html

6

Approach

7

Identify performance

bottlenecks/cause for

unconventional behavior

Analyze the existing

code

See how to further

accelerate the code

Performance analysis - Profiling

NVIDIA visual profiler

Why profile GPU?

Why NVVP?

8

Performance analysis – Profiling

dpotrf_m 9 Non concurrent

memory transfers

Computations

CPU involved in

device to device data

exchange

Time----

Time----

Performance analysis – Profiling

dpotrf_mgpu 10

Asynchronous

simultaneous data

transfer

Computations

Time----

Performance analysis - Timing

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

0 5000 10000 15000 20000 25000 30000 35000

Tim

e in

seco

nd

s

Square matrix dimension N

Comparison of dpotrf_m and dpotrf_mgpu called from the R wrapper

dpotrf_mgpu (s)

dpotrf_m (s)

11

Program Flow 12 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Overheads in R 13

0

2

4

6

8

10

12

14

16

18

20

magmaChol .Call dpotrf_m magmaChol .Call dpotrf_mgpu

dpotrf_m dpotrf_mgpu

Tim

e (

s)

Sample square matrix size N=32768

Overheads in R

overhead

overhead

OPTIMIZATION APPROACHES

Reduce function call overheads in R.

Optimize R environment for

‘fieldsMAGMA’.

Accelerate the underlying C function.

14

Reducing R overheads

Intel LD_PRELOAD

Single dynamic load

Move the dynamic loads to the highest

level of R call

15 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Reducing R overheads 16

0

5

10

15

20

25

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)

Matrix dimension N

Reduced overheads

Before

After

Performance of double precision Cholesky

Decomposition on Intel Compilers

0

5

10

15

20

25

30

35

40

0 5000 10000 15000 20000 25000 30000 35000 40000 45000

Tim

e (

s)

Matrix Dimension N

Comparison of accelerated code (double precision) - Intel 2012 vs Intel 2015 Compiler

Intel 12.1.5

Intel 15.0.3

17

18 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Pinned memory

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

19

For single precision calculations we obtained an improvement up to

30% for large sized matrices.

Mysteries of Deep

Copy

Deep copy performed in R acts as a overhead.

Used ‘perf’ to unravel the mystery!

20

Deep Copy In-place

Page Faults High Low

LLC Misses High Low

dTLB Misses Constant Constant

Abbreviations:

SP – Single Precision

DP – Double Precision

DPCPY – Deep copy

IP – In Place

Results 21

0

10

20

30

40

50

60

70

80

90

0 5000 10000 15000 20000 25000 30000 35000

Sp

ee

d u

p


Speed up of accelerated single precision versions with respect to CPU implementation

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

Results 22

0

10

20

30

40

50

60

70

80

90

0 5000 10000 15000 20000 25000 30000 35000

Sp

ee

d u

p


Speed up of accelerated double precision versions with respect to CPU implementation

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

Conclusion

Achieved an improvement in performance of about 40% - 60% over the previous version of fieldsMAGMA for large matrices.

Ability to allocate pinned memory makes significant difference.

Reduce R overheads.

Fine-tuned the R environment for fieldsMAGMA.

Detailed technical report.

23

Future work 24

Courtesy of NVIDIA Corporation

References Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J.,

Langou, J., et al. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects. Journel of Physics: Conference series , 180 (012037).

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/. (n.d.).

https://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat. (n.d.).

Katzfuss, M., & Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets (Vol. 23). Environmetrics.

Paige, J., Lyngaas, I., Ramakrishnaiah, V., Hammerling , D., Kumar, R., & Nychka, D. (2015). Incorporating MAGMA into the 'fields' spatial statistics package. Boulder: UCAR/NCAR.

Tomov, S., Dongarra, J., Volkov, V., & Demmel, J. MAGMA library. University of Tenessee and University of California, Knoxville, TN, and Berkeley, CA.

25

Thank you!

Questions?

26

Introduction

27

Initial timing results

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0

100

200

300

400

500

600

700

800

900

1000

0 2000 4000 6000 8000 10000 12000

Tim

e (

s)

GF

lop

s


Comparison of testing_dpotrf & testing_dpotrf_mgpu

testing_dpotrf_mgpu -Gflops

testing_dpotrf - Gflop/s

testing_dpotrf_mgpu -time (s)

testing_dpotrf - time (s)

28

Single precision calculations

0

2

4

6

8

10

12

14

16

18

20

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)


Using pinned memory for single precision

non-pinned

pinned

29

Results

0

5

10

15

20

25

30

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)


Comparison of execution times of different implementations

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

30

Results

0

1

2

3

4

5

6

2481018

26224981

756310311

1349117369

2320126633

Sp

eed

up


Performance improvement of the accelerated Kriging over CPU version

mKrig_1_DP mKrig_1_SP

mKrig_2_DP mKrig_2_SP

31

Results

0

0.5

1

1.5

2

2.5

3

3.5

4

2481018

26224981

756310311

1349117369

Sp

ee

d u

p


Performance improvement of the accelerated Kriging workflow over the CPU version

mKrigWF_1_DP

mKrigWF_2_SP

mKrigWF_2_DP

mKrigWF_1_SP

32

accelerating ‘fields’ byintroduction accelerated version of ‘fields’: fieldsmagma [paige et...

Documents