programming with big data in r

Introduction pbdR Benchmarks Demos

Programming with Big Data in R

Drew Schmidt and George Ostrouchov

http://r-pbd.org/

November 19, 2013

Drew Schmidt and George Ostrouchov Programming with Big Data in R

http://r-pbd.org/


The pbdR Core Team

Wei-Chen Chen1

George Ostrouchov1,2

Pragneshkumar Patel1

Drew Schmidt1

SupportThis work used resources of National Institute for Computational Sciences at the University of Tennessee,Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation underAward No. ARRA-NSF-OCI-0906324 for NICS-RDAV center. This work used resources of the Newton HPCProgram at the University of Tennessee, Knoxville. This work also used resources of the Oak Ridge LeadershipComputing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S.Department of Energy under Contract No. DE-AC05-00OR22725.

1University of Tennessee. Supported in part by the project “NICS Remote Data Analysis and Visualization Center”funded by the Office of Cyberinfrastructure of the U.S. National Science Foundation under Award No.ARRA-NSF-OCI-0906324 for NICS-RDAV center.

2Oak Ridge National Laboratory. Supported in part by the project “Visual Data Exploration and Analysis ofUltra-large Climate Data” funded by U.S. DOE Office of Science under Contract No. DE-AC05-00OR22725.

Drew Schmidt and George Ostrouchov Programming with Big Data in R


Contents

1 Introduction

2 pbdR

3 Benchmarks

4 Demos

Drew Schmidt and George Ostrouchov Elevating R to Supercomputers


What is R?

What is R?

lingua franca for data analytics and statisticalcomputing.

Part programming language, part dataanalysis package.

Dialect of S (Bell Labs).

Syntax designed for data.

Drew Schmidt and George Ostrouchov Elevating R to Supercomputers 1 / 30


What is R?

Who uses R?



What is R?

Language Paradigms



What is R?

Why R?

1 People love it.

2 Highly extensible (CRAN ≈ 5000 actively maintainedpackages.

3 Wealth of diverse analytical methods.

4 HPC community has growing need for data analytics.



What is R?

Problems with R

1 Slow.

2 If you don’t know what you’re doing, it’s really slow.

3 Data science community has growing data size problem.

4 Chokes on big data.

5 Parallelism: “batteries not included”



What is R?

In spite of all this. . .



R and HPC

R Interfaces to Native Tools

Interconnection Network

PROC+ cache

PROC+ cache

PROC+ cache

PROC+ cache

Mem Mem Mem Mem

Distributed Memory

Memory

CORE+ cache

CORE+ cache

CORE+ cache

CORE+ cache

Network

Shared Memory Local Memory

GPUor

MIC

Co-Processor

GPU: Graphical Processing Unit

MIC: Many Integrated Core

Focus on who owns what data and what communication is needed

Focus on which tasks can be parallel

Same Task on Blocks of data

SocketsMPI

HADOOP

snowRmpi

pbdMPI

snow + multicore = parallel

CUDAOpenCL

OpenACC

OpenMPOpenACC

OpenMPThreads

fork

.C.CallRcpp

OpenCLinline

multicore(fork)



R and HPC

R Interfaces to Scalable Math Libraries

PLASMA

HiPLAR

Interconnection Network

PROC+ cache

PROC+ cache

PROC+ cache

PROC+ cache

Mem Mem Mem Mem

Distributed Memory

Memory

CORE+ cache

CORE+ cache

CORE+ cache

CORE+ cache

Network

Shared Memory Local Memory

GPUor

MIC

Co-Processor

GPU: Graphical Processing Unit

MIC: Many Integrated Core

Focus on who owns what data and what communication is needed

Focus on which tasks can be parallel

Same Task on Blocks of data

SocketsMPI

Hadoop

OpenMPThreads

fork

CUDAOpenCL

OpenACC

OpenMPOpenACC

multicore(fork) snow + multicore = parallel

ScaLAPACKPBLASBLACS

MAGMA

PETSc

Trilinos

DPLASMA

CUBLAS

HiPLARM

MKLACMLLibSci

.C.CallRcpp

OpenCLinline

pbdDMATpbdDMAT

magma

snowRmpi

pbdMPI



Contents

2 pbdRThe pbdR Project



The pbdR Project

Programming with Big Data in R (pbdR)

Productivity, Portability, Performance

Freea R packages.

Bridging high-performance C withhigh-productivity of R

Distributed data details implicitlymanaged.

Methods have syntax identical to R.

aMPL, BSD, and GPL licensed



The pbdR Project

pbdR Packages



The pbdR Project

pbdR Example Syntax

1 x <- x[-1, 2:5]

2 x <- log(abs(x) + 1)

3 xtx <- t(x) %*% x

4 ans <- svd(solve(xtx))

The above runs on 1 core with R or 10,000 cores with pbdR



The pbdR Project

Profiling with pbdPROF

1. Rebuild pbdR packages

R CMD INSTALL

pbdMPI_0.2 -1.tar.gz \

--configure -args= \

"--enable -pbdPROF"

2. Run code

mpirun -np 64 Rscript

my_script.R

3. Analyze results

1 library(pbdPROF)

2 prof <- read.prof(

"profiler_output.mpiP")

3 plot(prof)

Publication-quality graphs



The pbdR Project

pbdR on HPC Resources

University of Tennessee

Kraken (XSEDE)

Nautilus

Darter

Newton

Oak Ridge National Lab

Titan

Lens

Chester

Sith

Other Resources

Stampede, TACC (XSEDE)

tara, UMBC

Hopper, NERSC

Edison, NERSC

If you are interested in installing pbdR: [email protected]


[email protected]


The pbdR Project

Exploration of Climate Data with pbdR

Behavior of extreme precipitation across atmospheric layers of moisture, temperature, and wind

IMD

W.-C. Chen, G. Ostrouchov, D. Pugmire, Prabhat, M. Wehner. A Parallel EM Algorithm for Model-Based Clustering Applied to the Exploration of Large Spatio-Temporal Data. Technometrics (online, in print: Fall 2013).

Exploration of Climate Data with pbdR

•  2013 INCITE “AYribu.ng Changes in the Risk of Extreme Weather and Climate” Michael Wehner, PI

•  High resolu0on atmospheric model, CAM 5.1 (~25 km resolu0on) •  Resolves processes responsible for extreme precipita0on and storms

•  R and pbdR used for scalable analy.cs and graphics •  Advanced clustering capability (EM algorithm for Gaussian mixture

models)



Contents

1 Introduction

2 pbdR

3 Benchmarks

4 Demos



Benchmarks

Non-Optimal Choices Throughout

1 Only libre software used (no MKL, ACML, etc.).

2 1 core = 1 MPI process.

3 No tuning for data distribution.



Benchmarks

Benchmark Data

1 Measure wallclock time for covariance and linear regression.

2 Random normal N(100, 10000).

3 Local problem size of ≈ 43.4MiB.

4 Three sets: 500, 1000, and 2000 columns.

5 Several runs at different core sizes within each set.



Covariance

Covariance Code

cov(xn×p) =1

n − 1

n∑

i=1

(xi − µx) (xi − µx)T

1 x <- ddmatrix("rnorm", nrow=n, ncol=p, mean=mean , sd=sd)

2

3 cov.x <- cov(x)



Covariance

cov()

0.04

21.3442.68 85.35 170.7 341.41

0.04

21.3442.68 85.35 170.7 341.41

0.04

21.3442.68 85.35 170.7341.41

4

8

12

16

1 504 1008 2016 4032 80641 504 1008 2016 4032 80641 504 1008 2016 4032 8064Cores

Run

Tim

e (S

econ

ds)

Predictors 500 1000 2000

Calculating cov(x) With Fixed Local Size of ~43.4 MiB



Linear Model Fitting

Linear Model Code

Find β such that

y = Xβ + ε

When X is full rank,

β = (XTX)−1XTy

1 x <- ddmatrix("rnorm", nrow=n, ncol=p, mean=mean , sd=sd)

2 beta_true <- ddmatrix("runif", nrow=p, ncol =1)

3

4 y <- x %*% beta_true

5

6 beta_est <- lm.fit(x=x, y=y)$coefficients




lm.fit()

21.34

42.6885.35170.7 341.41

1016.09

21.3442.6885.35 170.7

341.411016.09

21.3442.68

85.35

170.7341.41

1016.09

25

50

75

100

125

504

1008

2016

4032

8064

2400

050

410

0820

1640

3280

64

2400

050

410

0820

1640

3280

64

2400

0

Cores

Run

Tim

e (S

econ

ds)

Predictors 500 1000 2000

Fitting y~x With Fixed Local Size of ~43.4 MiB




Data Generation

0

200

400

600

800

1 504 1008 2016 4032 8064 24000 48000Cores

Dat

a G

ener

atio

n (G

iB p

er S

econ

d)



Contents

4 DemospbdMPIpbdDMATRandSVD



pbdMPI

MPI Operations: The Gang’s All Here

Communicator wrangling: init(), finalize()

Rank query: comm.rank(), comm.size()

Reduction: reduce(x, op=’sum’), allreduce(x)

Gather: gather(x), allgather(x)

Broadcast: bcast(x)

Barrier: barrier()

Send/Receive: send(x), recv()



pbdMPI

MPI Operations for R Users

Printing: comm.print(x), comm.cat(x)

RNG Seeds: comm.set.seed(diff=T)

Task Subsetting: get.jid(n)

*ply:pbdApply(X, MARGIN, FUN, ...)

pbdLapply(X, FUN, ...)

pbdSapply(X, FUN, ...)



pbdMPI

Example 1: Monte Carlo Simulation

Sample N uniform observations (xi , yi ) in the unit square[0, 1]× [0, 1]. Then

π ≈ 4

(# Inside Circle

# Total

)= 4

(# Blue

# Blue + # Red

)

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0



pbdDMAT

Distributed Matrices

Most problems in data science are matrix algebra problems, so:

Distributed matrices =⇒ Bigger data



pbdDMAT

DMAT: 2-dimensional Block-Cyclic with 6 Processors

x =

x11 x12 x13 x14 x15 x16 x17 x18 x19

x21 x22 x23 x24 x25 x26 x27 x28 x29

x31 x32 x33 x34 x35 x36 x37 x38 x39

x41 x42 x43 x44 x45 x46 x47 x48 x49

x51 x52 x53 x54 x55 x56 x57 x58 x59

x61 x62 x63 x64 x65 x66 x67 x68 x69

x71 x72 x73 x74 x75 x76 x77 x78 x79

x81 x82 x83 x84 x85 x86 x87 x88 x89

x91 x92 x93 x94 x95 x96 x97 x98 x99

9×9

Processor grid =

∣∣∣∣0 1 23 4 5

∣∣∣∣ =

∣∣∣∣(0,0) (0,1) (0,2)(1,0) (1,1) (1,2)

∣∣∣∣



pbdDMAT

Understanding DMAT: Local View

x11 x12 x17 x18

x21 x22 x27 x28

x51 x52 x57 x58

x61 x62 x67 x68

x91 x92 x97 x98

5×4

x13 x14 x19

x23 x24 x29

x53 x54 x59

x63 x64 x69

x93 x94 x99

5×3

x15 x16

x25 x26

x55 x56

x65 x66

x95 x96

5×2

x31 x32 x37 x38

x41 x42 x47 x48

x71 x72 x77 x78

x81 x82 x87 x88

4×4

x33 x34 x39

x43 x44 x49

x73 x74 x79

x83 x84 x89

4×3

x35 x36

x45 x46

x75 x76

x85 x86

4×2

Processor grid =

∣∣∣∣0 1 23 4 5

∣∣∣∣ =

∣∣∣∣(0,0) (0,1) (0,2)(1,0) (1,1) (1,2)

∣∣∣∣



RandSVD

Randomized SVD1

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

PROBABILISTIC ALGORITHMS FOR MATRIX APPROXIMATION 227

Prototype for Randomized SVDGiven an m × n matrix A, a target number k of singular vectors, and anexponent q (say, q = 1 or q = 2), this procedure computes an approximaterank-2k factorization UΣV ∗, where U and V are orthonormal, and Σ isnonnegative and diagonal.Stage A:1 Generate an n× 2k Gaussian test matrix Ω.2 Form Y = (AA∗)qAΩ by multiplying alternately with A and A∗.3 Construct a matrix Q whose columns form an orthonormal basis for

the range of Y .Stage B:4 Form B = Q∗A.5 Compute an SVD of the small matrix: B = UΣV ∗.6 Set U = QU .Note: The computation of Y in step 2 is vulnerable to round-off errors.When high accuracy is required, we must incorporate an orthonormalizationstep between each application of A and A∗; see Algorithm 4.4.

The theory developed in this paper provides much more detailed informationabout the performance of the proto-algorithm.

• When the singular values of A decay slightly, the error ‖A − QQ∗A‖ doesnot depend on the dimensions of the matrix (sections 10.2–10.3).

• We can reduce the size of the bracket in the error bound (1.8) by combiningthe proto-algorithm with a power iteration (section 10.4). For an example,see section 1.6 below.

• For the structured random matrices we mentioned in section 1.4.1, relatederror bounds are in force (section 11).

• We can obtain inexpensive a posteriori error estimates to verify the qualityof the approximation (section 4.3).

1.6. Example: Randomized SVD. We conclude this introduction with a shortdiscussion of how these ideas allow us to perform an approximate SVD of a large datamatrix, which is a compelling application of randomized matrix approximation [113].

The two-stage randomized method offers a natural approach to SVD compu-tations. Unfortunately, the simplest version of this scheme is inadequate in manyapplications because the singular spectrum of the input matrix may decay slowly. Toaddress this difficulty, we incorporate q steps of a power iteration, where q = 1 orq = 2 usually suffices in practice. The complete scheme appears in the box labeledPrototype for Randomized SVD. For most applications, it is important to incorporateadditional refinements, as we discuss in sections 4 and 5.

The Randomized SVD procedure requires only 2(q + 1) passes over the matrix,so it is efficient even for matrices stored out-of-core. The flop count satisfies

TrandSVD = (2q + 2) k Tmult +O(k2(m+ n)),

where Tmult is the flop count of a matrix–vector multiply with A or A∗. We have thefollowing theorem on the performance of this method in exact arithmetic, which is aconsequence of Corollary 10.10.

Theorem 1.2. Suppose that A is a real m × n matrix. Select an exponent qand a target number k of singular vectors, where 2 ≤ k ≤ 0.5minm,n. Execute the

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

244 N. HALKO, P. G. MARTINSSON, AND J. A. TROPP

Algorithm 4.3: Randomized Power IterationGiven an m× n matrix A and integers and q, this algorithm computes anm× orthonormal matrix Q whose range approximates the range of A.1 Draw an n× Gaussian random matrix Ω.2 Form the m× matrix Y = (AA∗)qAΩ via alternating application

of A and A∗.3 Construct an m× matrix Q whose columns form an orthonormal

basis for the range of Y , e.g., via the QR factorization Y = QR.Note: This procedure is vulnerable to round-off errors; see Remark 4.3. Therecommended implementation appears as Algorithm 4.4.

Algorithm 4.4: Randomized Subspace IterationGiven an m× n matrix A and integers and q, this algorithm computes anm× orthonormal matrix Q whose range approximates the range of A.1 Draw an n× standard Gaussian matrix Ω.2 Form Y0 = AΩ and compute its QR factorization Y0 = Q0R0.3 for j = 1, 2, . . . , q

4 Form Yj = A∗Qj−1 and compute its QR factorization Yj = QjRj .

5 Form Yj = AQj and compute its QR factorization Yj = QjRj .6 end7 Q = Qq.

Algorithm 4.3 targets the fixed-rank problem. To address the fixed-precisionproblem, we can incorporate the error estimators described in section 4.3 to obtainan adaptive scheme analogous with Algorithm 4.2. In situations where it is critical toachieve near-optimal approximation errors, one can increase the oversampling beyondour standard recommendation = k + 5 all the way to = 2k without changingthe scaling of the asymptotic computational cost. A supporting analysis appears inCorollary 10.10.

Remark 4.3. Unfortunately, when Algorithm 4.3 is executed in floating-pointarithmetic, rounding errors will extinguish all information pertaining to singularmodes associated with singular values that are small compared with ‖A‖. (Roughly,if machine precision is µ, then all information associated with singular values smallerthan µ1/(2q+1) ‖A‖ is lost.) This problem can easily be remedied by orthonormalizingthe columns of the sample matrix between each application of A and A∗. The result-ing scheme, summarized as Algorithm 4.4, is algebraically equivalent to Algorithm 4.3when executed in exact arithmetic [93, 125]. We recommend Algorithm 4.4 becauseits computational costs are similar to those of Algorithm 4.3, even though the formeris substantially more accurate in floating-point arithmetic.

4.6. An Accelerated Technique for General Dense Matrices. This section de-scribes a set of techniques that allow us to compute an approximate rank- factor-ization of a general dense m× n matrix in roughly O(mn log()) flops, in contrast tothe asymptotic cost O(mn) required by earlier methods. We can tailor this schemefor the real or complex case, but we focus on the conceptually simpler complex case.These algorithms were introduced in [138]; similar techniques were proposed in [119].

The first step toward this accelerated technique is to observe that the bottleneckin Algorithm 4.1 is the computation of the matrix product AΩ. When the test matrix

Serial R1 randSVD <− f u n c t i o n (A, k , q=3)2 3 ## Stage A4 Omega <− matrix(rnorm(n*2*k),5 nrow=n, ncol=2*k)6 Y <− A %∗% Omega7 Q <− qr .Q( qr (Y) )8 At <− t (A)9 f o r ( i i n 1 : q )

10 11 Y <− At %∗% Q12 Q <− qr .Q( qr (Y) )13 Y <− A %∗% Q14 Q <− qr .Q( qr (Y) )15 1617 ## Stage B18 B <− t (Q) %∗% A19 U <− La . svd (B) $u20 U <− Q %∗% U21 U[ , 1 : k ]22

1Halko N, Martinsson P-G and Tropp J A 2011 Finding structure with randomness: probabilistic algorithms for

constructing approximate matrix decompositions SIAM Rev. 53 217–88



RandSVD

Randomized SVD

Serial R1 randSVD <− f u n c t i o n (A, k , q=3)2 3 ## Stage A4 Omega <− matrix(rnorm(n*2*k),5 nrow=n, ncol=2*k)6 Y <− A %∗% Omega7 Q <− qr .Q( qr (Y) )8 At <− t (A)9 f o r ( i i n 1 : q )


Parallel pbdR1 randSVD <− f u n c t i o n (A, k , q=3)2 3 ## Stage A4 Omega <− ddmatrix(”rnorm”,5 nrow=n, ncol=2*k)6 Y <− A %∗% Omega7 Q <− qr .Q( qr (Y) )8 At <− t (A)9 f o r ( i i n 1 : q )




RandSVD

Randomized SVD on ≈ 765 MiB

1

2

4

8

16

32

64

128

1 2 4 8 16 32 64 128Cores

Spe

edup

Algorithm full randomized

30 Singular Vectors from a 100,000 by 1,000 Matrix

5

10

15

1 2 4 8 16 32 64 128Cores

Spe

edup

30 Singular Vectors from a 100,000 by 1,000 MatrixSpeedup of Randomized vs. Full SVD



RandSVD

Thanks for coming!

Questions?

http://r-pbd.org/

Be sure to come to our R BoF: Wednesday 5:30-7:00 room 404


http://r-pbd.org/

programming with big data in r

Documents