Isaac Lyngaas (irlyngaas@gmail.com)
John Paige (paigejo@gmail.com)
Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu)
SIParCS, July 31, 2014
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software packages
Parallel eigendecomposition
Conclusions & future work
Accelerate the ‘fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (shared memory)
◦ GPU
◦ Xeon Phi
Many developers & users in the field of statistics
◦ Readily available code base
Problem: R is slow for large problems
Bottleneck is in linear algebra operations
◦ mKrig – Cholesky decomposition
◦ Krig – eigendecomposition
R uses sequential algorithms
Strategy: use C-interoperable libraries to parallelize the linear algebra
◦ C functions callable from the R environment
Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation: det(A) = (product of the diagonal entries of L)^2, so log det(A) is a cheap sum of logs
PLASMA (multicore, shared memory)
◦ http://icl.cs.utk.edu/plasma/
MAGMA (GPU & Xeon Phi)
◦ http://icl.cs.utk.edu/magma/
CULA (GPU)
◦ http://www.culatools.com/
PLASMA: multicore (shared memory)
Block scheduling
◦ Determines which operations are done on which core
Block size optimization
◦ Dependent on cache memory
[Figure: "PLASMA using 1 node (# of observations = 25,000)" — speedup vs. 1 core (y-axis, 0–15) against # of cores (x-axis: 1, 2, 4, 8, 12, 16); curves: Speedup, Optimal Speedup.]
[Figure: "PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, 16 cores)" — time (sec, y-axis, ~3–7) against block size (x-axis, 500–1500), with markers at 256 Kb and 40 Mb.]
[Figure: "PLASMA optimal block sizes (16 cores)" — optimal block size (y-axis, 100–600) against # of observations (x-axis, 0–40,000).]
MAGMA: utilizes GPUs or Xeon Phi for parallelization
◦ Multiple-GPU & multiple-Xeon-Phi implementations available
◦ 1 CPU core drives 1 GPU
Block scheduling
◦ Similar to PLASMA
Block size dependent on accelerator architecture
CULA: proprietary CUDA-based linear algebra package
Capable of LAPACK operations using 1 GPU
API written in C
Dense & sparse operations available
1 node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 Kb, L2 = 256 Kb
  Cache per socket: L3 = 20 Mb
◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 CPU core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
• Serial R: ~3 GFLOP/sec
• Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: "Accelerated Hardware has Room for Improvement" — GFLOP/sec (y-axis, 0–400) against # of observations (x-axis, 0–40,000) for Plasma (16 cores), Magma 1 GPU, Magma 2 GPUs, Magma 1 MIC, Magma 2 MICs, and CULA.]
All Parallel Cholesky Implementations are Faster than Serial R
[Figure: time (sec; log scale, 0.01–1000) against # of observations (0–40,000) for Serial R, Plasma (16 cores), CULA, Magma 1 GPU, Magma 2 GPUs, Magma 1 Xeon Phi, Magma 2 Xeon Phis.]
• >100x speedup over serial R when # of observations = 10k
Eigendecomposition also Faster on Accelerated Hardware
[Figure: time (sec, 0–300) against # of observations (0–10,000) for Serial R, CULA, Magma 1 GPU, Magma 2 GPUs.]
• ~6x speedup over serial R when # of observations = 10k
Can Run ~30 Cholesky Decompositions per Eigendecomposition
[Figure: ratio of eigendecomposition time to Cholesky time (y-axis, 0–30) against # of observations (x-axis, 0–10,000).]
• Both times taken using MAGMA w/ 2 GPUs
• If we want to do 16 Cholesky decompositions in parallel, the parallel library is guaranteed to do better than 16 concurrent serial runs only when its speedup exceeds 16
Parallel Cholesky Beats Parallel R for Moderate to Large Matrices
[Figure: speedup vs. parallel R (y-axis, 0–25) against # of observations (x-axis, 0–20,000) for Plasma and Magma 2 GPUs.]
Using Caldera
◦ Single Cholesky decomposition:
  Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
  Matrix size 20k–35k: use MAGMA w/ 2 GPUs
  Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on the computing resources available
Explored implementation on accelerated hardware
◦ GPUs
◦ Multicore (shared memory)
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access, contact Srinath Vadlamani
Future work
◦ Multicore distributed memory
◦ Single precision
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. Of VECPAR’10, Berkeley, CA, June22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.
[Backup figure: task DAG for a tiled Cholesky factorization on a 4x4 grid of blocks — xPOTRF (factor a diagonal tile), xTRSM (triangular solves on the panel below it), xSYRK (symmetric rank-k updates of trailing diagonal tiles), xGEMM (updates of trailing off-diagonal tiles), with dependencies across steps 0–3.]
• http://www.netlib.org/lapack/lawnspdf/lawn223.pdf