accelerating ‘fields’ byintroduction accelerated version of ‘fields’: fieldsmagma [paige et...

32
Accelerating ‘fields’ by revamping the Cholesky decomposition Vinay Ramakrishnaiah Raghu Raj P Kumar John Paige Dr. Dorit Hammerling 1 07/29/2015

Upload: others

Post on 21-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Accelerating ‘fields’ by

revamping the Cholesky

decomposition Vinay Ramakrishnaiah

Raghu Raj P Kumar

John Paige

Dr. Dorit Hammerling

1

07/29/2015

Page 2: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Introduction What is ‘fields’?

Widely used R package for spatial statistics.

Literally thousands of users.

Developed by Dr. Douglas Nychka, director of The Institute for Mathematics Applied to Geosciences (IMAGe).

Major methods – thin plate splines, Kriging and Compact covariances for large data sets.

‘fields’ - single thread version on CPU.

http://burandtfieldmethods.blogspot.com/

2

Page 3: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Introduction

Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop.

Make it user friendly.

MAGMA: Matrix algebra on GPU and Multicore Architectures

Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. [http://icl.cs.utk.edu/magma/]

Biggest computational problem in Kriging

Evaluation of data likelihood function given the covariance.

Requires Cholesky Decomposition

3

Page 4: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Motivation

Unconventional behavior

http://www.nor-tech.com/solutions/hybrid-gpu-solutions-from-nor-tech/8-gpu-

server/

http://wccftech.com/nvidia-reportedly-preparing-geforce-titan-gpu-gk110/

http://www.aarp.org/money/investing/info-2014/5-bad-money-

moves.html#slide1

http://www.kaizen-news.com/focus-performance/

4

Performance

Page 5: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Motivation

Opposite impacts

http://www.bogotobogo.com/cplusplus/constructor.php

5

Page 6: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Motivation

There is always more room!

http://funny-pictures.funmunch.com/funny-picture-545.html

6

Page 7: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Approach

7

Identify performance

bottlenecks/cause for

unconventional behavior

Analyze the existing

code

See how to further

accelerate the code

Page 8: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Performance analysis - Profiling

NVIDIA visual profiler

Why profile GPU?

Why NVVP?

8

Page 9: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Performance analysis – Profiling

dpotrf_m 9 Non concurrent

memory transfers

Computations

CPU involved in

device to device data

exchange

Time----

Time----

Page 10: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Performance analysis – Profiling

dpotrf_mgpu 10

Asynchronous

simultaneous data

transfer

Computations

Time----

Page 11: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Performance analysis - Timing

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

0 5000 10000 15000 20000 25000 30000 35000

Tim

e in

seco

nd

s

Square matrix dimension N

Comparison of dpotrf_m and dpotrf_mgpu called from the R wrapper

dpotrf_mgpu (s)

dpotrf_m (s)

11

Page 12: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Program Flow 12 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Page 13: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Overheads in R 13

0

2

4

6

8

10

12

14

16

18

20

magmaChol .Call dpotrf_m magmaChol .Call dpotrf_mgpu

dpotrf_m dpotrf_mgpu

Tim

e (

s)

Sample square matrix size N=32768

Overheads in R

overhead

overhead

Page 14: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

OPTIMIZATION APPROACHES

Reduce function call overheads in R.

Optimize R environment for

‘fieldsMAGMA’.

Accelerate the underlying C function.

14

Page 15: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Reducing R overheads

Intel LD_PRELOAD

Single dynamic load

Move the dynamic loads to the highest

level of R call

15 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Page 16: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Reducing R overheads 16

0

5

10

15

20

25

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)

Matrix dimension N

Reduced overheads

Before

After

Page 17: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Performance of double precision Cholesky

Decomposition on Intel Compilers

0

5

10

15

20

25

30

35

40

0 5000 10000 15000 20000 25000 30000 35000 40000 45000

Tim

e (

s)

Matrix Dimension N

Comparison of accelerated code (double precision) - Intel 2012 vs Intel 2015 Compiler

Intel 12.1.5

Intel 15.0.3

17

Page 18: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

18 R Program

magmaChol(…)

Rwrapper.r

dyn.load(…)

.Call(…)

fieldsMagma.r

C function

MAGMA wrapper

CUDA calls

Compute

Page 19: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Pinned memory

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

19

For single precision calculations we obtained an improvement up to

30% for large sized matrices.

Page 20: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Mysteries of Deep

Copy

Deep copy performed in R acts as a overhead.

Used ‘perf’ to unravel the mystery!

20

Deep Copy In-place

Page Faults High Low

LLC Misses High Low

dTLB Misses Constant Constant

Abbreviations:

SP – Single Precision

DP – Double Precision

DPCPY – Deep copy

IP – In Place

Page 21: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Results 21

0

10

20

30

40

50

60

70

80

90

0 5000 10000 15000 20000 25000 30000 35000

Sp

ee

d u

p

Square matrix dimension N

Speed up of accelerated single precision versions with respect to CPU implementation

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

Page 22: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Results 22

0

10

20

30

40

50

60

70

80

90

0 5000 10000 15000 20000 25000 30000 35000

Sp

ee

d u

p

Square matrix dimension N

Speed up of accelerated double precision versions with respect to CPU implementation

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

Page 23: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Conclusion

Achieved an improvement in performance of about 40% - 60% over the previous version of fieldsMAGMA for large matrices.

Ability to allocate pinned memory makes significant difference.

Reduce R overheads.

Fine-tuned the R environment for fieldsMAGMA.

Detailed technical report.

23

Page 24: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Future work 24

Courtesy of NVIDIA Corporation

Page 25: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

References Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J.,

Langou, J., et al. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects. Journel of Physics: Conference series , 180 (012037).

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/. (n.d.).

https://software.intel.com/en-us/articles/optimizing-without-breaking-a-sweat. (n.d.).

Katzfuss, M., & Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets (Vol. 23). Environmetrics.

Paige, J., Lyngaas, I., Ramakrishnaiah, V., Hammerling , D., Kumar, R., & Nychka, D. (2015). Incorporating MAGMA into the 'fields' spatial statistics package. Boulder: UCAR/NCAR.

Tomov, S., Dongarra, J., Volkov, V., & Demmel, J. MAGMA library. University of Tenessee and University of California, Knoxville, TN, and Berkeley, CA.

25

Page 26: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Thank you!

Questions?

26

Page 27: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Introduction

27

Page 28: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Initial timing results

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0

100

200

300

400

500

600

700

800

900

1000

0 2000 4000 6000 8000 10000 12000

Tim

e (

s)

GF

lop

s

Square matrix dimension N

Comparison of testing_dpotrf & testing_dpotrf_mgpu

testing_dpotrf_mgpu -Gflops

testing_dpotrf - Gflop/s

testing_dpotrf_mgpu -time (s)

testing_dpotrf - time (s)

28

Page 29: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Single precision calculations

0

2

4

6

8

10

12

14

16

18

20

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)

Square matrix dimension N

Using pinned memory for single precision

non-pinned

pinned

29

Page 30: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Results

0

5

10

15

20

25

30

0 5000 10000 15000 20000 25000 30000 35000

Tim

e (

s)

Square matrix dimension N

Comparison of execution times of different implementations

DP_DPCPY_nGPU1

DP_DPCPY_nGPU2

DP_IP_nGPU1

DP_IP_nGPU2

SP_DCPY_nGPU1

SP_DCPY_nGPU2

SP_IP_nGPU1

SP_IP_nGPU2

30

Page 31: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Results

0

1

2

3

4

5

6

2481018

26224981

756310311

1349117369

2320126633

Sp

eed

up

Square matrix dimension N

Performance improvement of the accelerated Kriging over CPU version

mKrig_1_DP mKrig_1_SP

mKrig_2_DP mKrig_2_SP

31

Page 32: Accelerating ‘fields’ byIntroduction Accelerated version of ‘fields’: fieldsMAGMA [Paige et al.] SIParCS, 2014 work – accelerate on a Laptop. Make it user friendly

Results

0

0.5

1

1.5

2

2.5

3

3.5

4

2481018

26224981

756310311

1349117369

Sp

ee

d u

p

Square matrix dimension N

Performance improvement of the accelerated Kriging workflow over the CPU version

mKrigWF_1_DP

mKrigWF_2_SP

mKrigWF_2_DP

mKrigWF_1_SP

32