Parallel External Memory Algorithms Applied to Generalized Linear Models


DESCRIPTION

Presentation by Lee Edlefsen, Revolution Analytics, at JSM 2012, San Diego, CA, July 30, 2012. For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives. To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA's) provide the foundation for such software. External memory algorithms (EMA's) are those that do not require all data to be in RAM, and are widely available. Parallel implementations of EMA's allow them to run on multiple cores and computers, and to process unlimited rows of data. This paper describes a general approach to efficiently parallelizing EMA's, using an R and C++ implementation of GLM as a detailed example. It examines the requirements for efficient PEMA's; the arrangement of code for automatic parallelization; efficient threading; and efficient inter-process communication. It includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.

TRANSCRIPT

Page 1: Parallel External Memory Algorithms Applied to Generalized Linear Models

Lee E. Edlefsen, Ph.D.
Chief Scientist
JSM 2012

Page 2: Introduction and overview

For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives.

To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA’s) provide a foundation for such software.

Page 3: Introduction and overview – (2)

External memory algorithms (EMA’s) are those that do not require all data to be in RAM, and are widely available.

Parallel implementations of EMA’s allow them to run on multiple cores and computers, and to process unlimited rows of data.

This paper describes a general approach to efficiently parallelizing EMA’s, using an R and C++ implementation of generalized linear models (GLM) as a detailed example.

Page 4: Introduction and overview – (3)

This paper discusses:
- the arrangement of code for “automatic” parallelization
- the efficient use of cores
- the efficient use of multiple computers (nodes)

The approach presented is independent of the distributed computing platform (MPI, Hadoop, MPP database appliances).

The paper includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.

Page 5: High Performance Computing vs High Performance Analytics

HPA is HPC + Data.

High Performance Computing is CPU centric:
- lots of processing on small amounts of data
- focus is on cores

High Performance Analytics is data centric:
- less processing per amount of data
- focus is on feeding data to the cores: on-disk I/O and data locality, efficient threading, and data management in RAM

Page 6: High Performance Analytics in RevoScaleR

Extremely high performance data management and data analysis.

Scales from small local data to huge distributed data; scales from laptop to cluster to cloud.

Based on a platform that “automatically” and efficiently parallelizes and distributes a broad class of predictive analytic algorithms. This platform implements the approach to parallel external memory algorithms I will describe.

Page 7: External memory algorithms

External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time.

Such algorithms process data a “chunk” at a time, storing intermediate results from each chunk and combining them at the end. Each chunk must produce an intermediate result that can be combined with other intermediate results to give the final result.

Such algorithms are widely available for data management and predictive analytics.

Page 8: Parallel external memory algorithms (PEMA’s)

PEMA’s are external memory algorithms that have been parallelized. Such algorithms process data a chunk at a time in parallel, storing intermediate results from each chunk and combining them at the end.

External memory algorithms that are not “inherently sequential” can be parallelized:
- results for one chunk of data cannot depend upon prior results
- data dependence (lags, leads) is OK

Page 9: Generalized Linear Models (GLM)

The generalized linear model can be thought of as a generalization of linear regression. It extends linear regression to handle dependent variables generated from distributions in the exponential family, including the Gaussian, Poisson, gamma, binomial (logistic), multinomial, and Tweedie distributions.

Generalized linear models are widely used in a variety of fields and industries.

Page 10: GLM overview

The dependent variable Y is generated from a distribution in the exponential family.

The expected value of Y is related to a linear predictor of the data X and parameters β through the inverse of a “link” function g():

E(Y) = mu = g^-1(Xβ)

The variance of Y is typically a function V() of the mean mu:

Var(Y) = varmu = V(mu)
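As a concrete illustration, R’s built-in family objects bundle the link and variance functions that appear above. A minimal sketch, assuming a logistic model; the data and parameter values are made up for illustration:

    ## Illustrating E(Y) = g^-1(X beta) and Var(Y) = V(mu) with an R family object.
    fam <- binomial(link = "logit")

    set.seed(1)
    X    <- cbind(1, rnorm(10))   # model matrix: intercept plus one covariate
    beta <- c(-0.5, 2)            # illustrative parameter values

    eta <- drop(X %*% beta)       # linear predictor X beta
    mu  <- fam$linkinv(eta)       # E(Y) = mu = g^-1(X beta)
    vmu <- fam$variance(mu)       # Var(Y) = V(mu); mu*(1 - mu) for the binomial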

Page 11: GLM estimation

The parameters of GLM models can be estimated using maximum likelihood. Iteratively reweighted least squares (IRLS) is commonly used to obtain the maximum likelihood estimates.

Each iteration of IRLS requires at least one pass through the data, generating a vector of weights and a “new” dependent variable, and then doing a weighted least squares regression.

Page 12: IRLS for GLM

Given an estimate of the parameters β and the data X, IRLS requires the computation of a “weight” variable W and a “new” dependent variable Z:

eta = Xβ
mu = linkinv(eta)
Z = (y - mu)/mu_eta, where mu_eta is the partial derivative of mu with respect to eta
W = sqrt(mu_eta*mu_eta/varmu)

The next β is then computed by regressing Z on X, weighted by W. If the estimation has not converged, the steps are repeated.
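To make the steps concrete, here is a minimal in-memory sketch of one IRLS iteration in R. The function name is illustrative, and one assumption is made explicit: since Z above is in increment form, the weighted regression of Z on X is treated as producing the step that is added to the current β.

    ## One IRLS iteration, following the formulas above (illustrative names).
    irls_step <- function(X, y, beta, fam = binomial()) {
      eta    <- drop(X %*% beta)
      mu     <- fam$linkinv(eta)
      mu_eta <- fam$mu.eta(eta)        # partial of mu with respect to eta
      varmu  <- fam$variance(mu)
      Z <- (y - mu) / mu_eta
      W <- sqrt(mu_eta * mu_eta / varmu)
      fit <- lm.wfit(X, Z, W^2)        # weighted least squares of Z on X
      beta + fit$coefficients          # next estimate of beta
    }

Iterated from a zero start, this sketch should converge to the same maximum likelihood estimates that glm() computes in memory.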

Page 13: In-memory implementations

The glm() function in R provides a beautiful and efficient in-memory implementation. However, nearly every computational line of code involves processing all rows of data.

There is no easy way to directly convert an implementation like this into one that can handle data too big to fit into memory and that can use multiple cores and multiple computers. However, it can be accomplished by arranging the same computations into separate functions that accomplish separate tasks.

Page 14: Example external memory algorithm for the mean of a variable

Initialization function: total = 0, count = 0
ProcessData function: for each chunk of x: total = sum(x), count = length(x)
UpdateResults function: total12 = total1 + total2, count12 = count1 + count2
ProcessResults function: mean = combined total / combined count
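This arrangement is small enough to sketch in full. A runnable R version, in which the chunk size is an arbitrary illustration:

    ## The four functions for the mean, named as on the slide.
    Initialize     <- function() list(total = 0, count = 0)
    ProcessData    <- function(x) list(total = sum(x), count = length(x))
    UpdateResults  <- function(ir1, ir2) list(total = ir1$total + ir2$total,
                                              count = ir1$count + ir2$count)
    ProcessResults <- function(ir) ir$total / ir$count

    ## Process one chunk at a time, then combine the intermediate results:
    x      <- rnorm(1e6)
    chunks <- split(x, ceiling(seq_along(x) / 1e5))   # ten chunks of 100,000
    ir     <- Reduce(UpdateResults, lapply(chunks, ProcessData), Initialize())
    ProcessResults(ir)    # equals mean(x), up to floating-point rounding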

Page 15: A formalization of PEMA’s

Arrange the code into 4 functions:

1. Initialize(): does any necessary initialization
2. ProcessData(): takes a chunk of data and produces an intermediate result (IR); this is the only function run in parallel; it must assume it does not have all the data; it must produce no side effects
3. UpdateResults(): takes two IR’s and produces another IR that is equivalent to the IR that would have been produced by combining the two corresponding chunks of data and calling ProcessData()
4. ProcessResults(): takes any given IR and converts it into a “final results” (FR) form

Page 16: An external memory algorithm for GLM

Initialization function: set intermediate values to 0.

ProcessData function: for a given β and chunk of data X, compute Z, W, and M, the weighted cross-products matrix of X and Z for this chunk:

eta = Xβ, mu = linkinv(eta), Z = (y - mu)/mu_eta, W = sqrt(mu_eta*mu_eta/varmu)
M = [X*W Z*W]’[X*W Z*W]

UpdateResults function: M12 = M1 + M2

ProcessResults function: β = Solve(M) (solves a set of linear equations). Check for convergence and repeat if necessary.
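A minimal R sketch of the chunk-level computation and of extracting the next β from the combined M. The function names are illustrative, and as before I assume that, because Z is in increment form, the solution of the linear equations is the step added to the current β:

    ## ProcessData for one chunk: the weighted cross-products matrix M.
    glm_process_chunk <- function(X, y, beta, fam = binomial()) {
      eta    <- drop(X %*% beta)
      mu     <- fam$linkinv(eta)
      mu_eta <- fam$mu.eta(eta)
      W      <- sqrt(mu_eta * mu_eta / fam$variance(mu))
      Z      <- (y - mu) / mu_eta
      A <- cbind(X * W, Z * W)    # [X*W  Z*W]
      crossprod(A)                # M = [X*W Z*W]'[X*W Z*W]
    }

    ## UpdateResults is just matrix addition: M12 <- M1 + M2.

    ## ProcessResults: solve the normal equations held in M for the step
    ## delta = (X'W^2 X)^-1 X'W^2 Z, and add it to the current beta.
    glm_process_results <- function(M, beta) {
      p     <- length(beta)
      delta <- solve(M[1:p, 1:p], M[1:p, p + 1])
      beta + delta
    }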

Page 17: A C++ and R implementation of GLM

C++ “analysis” objects:
- have 4 virtual PEMA methods, among others
- have member variables for intermediate results and for maintaining local state
- know how to copy themselves (including the ability to not copy some members, for efficiency)
- have the ability to call into R during ProcessData()

R “family” objects for glm:
- contain methods for computing Z and W (eta, mu, etc.)

Page 18: GLM in C++ and R: Multiple Cores

On each computer, a master analysis object makes a copy of itself for all usable threads (cores) except one. The remaining thread is assigned to handle all I/O.

In a master loop over the data, the I/O object reads a chunk of data. In parallel (after the first read), portions of the previously read chunk are (virtually) passed to the ProcessData() methods of the other objects.

Page 19: GLM in C++ and R: Multiple Cores – (2)

For each chunk of data, Z and W are computed (in R or C++; if in R, only one thread at a time is allowed); Xβ and M are computed in C++.

After all data has been consumed, the master analysis object loops over all of the thread-specific objects and updates itself (using UpdateResults()), resulting in the intermediate results object that corresponds to all of the data processed on this computer.

If other computers are being used, this computer sends its intermediate results to the “master” node.
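On a single machine this pattern can be sketched in R with the parallel package, reusing glm_process_chunk() and glm_process_results() from the sketch above. This is a model, not the implementation: forked R worker processes stand in for the threads of the C++ code, so it needs a platform where forking is available (not Windows).

    ## One PEMA iteration across chunks on one machine (illustrative names).
    library(parallel)

    pema_glm_iteration <- function(chunks, y_chunks, beta, fam = binomial(),
                                   cores = max(1L, detectCores() - 1L)) {
      Ms <- mcmapply(glm_process_chunk, chunks, y_chunks,
                     MoreArgs = list(beta = beta, fam = fam),
                     SIMPLIFY = FALSE, mc.cores = cores)  # ProcessData in parallel
      M <- Reduce(`+`, Ms)              # UpdateResults: M12 = M1 + M2
      glm_process_results(M, beta)      # ProcessResults: the next beta
    }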

Page 20: GLM in C++ and R: Multiple MPI Nodes

A “master node” sends a copy of the analysis object, or instructions on how to create one, to each computer (node) on a cluster/grid, and the steps described above are carried out. Each node reads and processes its portion of the data (the more local the data, the better).

Worker nodes do not communicate with each other, and do not communicate with the master node except for sending their results.

Page 21: GLM in C++ and R: Multiple MPI Nodes – (2)

When each node has its final IR object, it sends it to the master node. The master node gathers and combines all intermediate results using UpdateResults(). When it has the final intermediate results, it calls ProcessResults() to get the next estimate of β.

The master node checks for convergence, and repeats all of the steps if necessary.
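The master node’s outer loop can then be sketched as iteration to convergence, reusing pema_glm_iteration() from above; the tolerance, iteration cap, and zero start are illustrative choices:

    ## Outer IRLS loop run by the master (illustrative convergence test).
    pema_glm <- function(chunks, y_chunks, p, fam = binomial(),
                         tol = 1e-8, maxit = 25L) {
      beta <- rep(0, p)
      for (it in seq_len(maxit)) {
        beta_new <- pema_glm_iteration(chunks, y_chunks, beta, fam)
        if (max(abs(beta_new - beta)) < tol) break   # converged
        beta <- beta_new
      }
      beta_new
    }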

Page 22: Implementation in RevoScaleR

The package RevoScaleR, which is part of Revolution R Enterprise, contains an implementation of GLM and other algorithms based on this approach.

The algorithms are internally threaded. They can currently use MPI or RPC for inter-process communication, and support the Platform LSF and HPC Server schedulers. We are currently working on supporting Hadoop.

Page 23: Some features of this implementation

- Handles an arbitrarily large number of rows in a fixed amount of memory
- Scales linearly with the number of rows
- Scales (approximately) linearly with the number of nodes
- Scales well with the number of cores per node
- Scales well with the number of parameters
- Works on commodity hardware
- Extremely high performance

Page 24: Scalability of linear regression with rows

[Chart: linear regression timing, 1 million to 1 billion rows, 443 betas, 4 cores. Time (secs) grows linearly with the number of rows, at roughly 1.1 million rows/second.]

Page 25: Scalability of glm (logit) with rows

[Chart: glm (logit) timing, 1 million to 1 billion rows, 443 betas, 4 cores. Time (secs) grows linearly with the number of rows.]

Page 26: Scalability with nodes: glm (logit)

[Chart: elapsed time versus number of nodes, 4 cores per node, 5 iterations per model, with a linear scaling reference line. Four curves: Big Data (1B rows), Big Model (443 params), showing super scaling; Small Data (124M rows), Small Model (7 params); Big Data, Small Model; and Small Data, Big Model.]

Page 27: Timing comparisons

glm() in CRAN R vs rxGlm in RevoScaleR
SAS’s new HPA functionality vs rxGlm


Page 29: HPA Benchmarking comparison* – Logistic Regression

                   SAS HPA        Revolution R
Rows of data       1 billion      1 billion
Parameters         “just a few”   7
Time               80 seconds     44 seconds
Data location      In memory      On disk
Nodes              32             5
Cores              384            20
RAM                1,536 GB       80 GB

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

*As published by SAS in HPC Wire, April 21, 2011

Page 30: Conclusion

PEMA’s provide a systematic approach to scalable analytic algorithms.

Algorithms implemented in this way can handle unlimited numbers of rows on a single core in a fixed amount of RAM. They scale well with rows and nodes, and scale well with cores up to a point. They work on commodity hardware and on different distributed computing platforms.

Extremely high performance is possible.

Page 31: Thank you!

R-Core Team
R Package Developers
R Community
Revolution R Enterprise Customers and Beta Testers
Colleagues at Revolution Analytics

Contact: [email protected]