Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
DESCRIPTION
Slides from Joseph Rickert's presentation at Strata NYC 2013, "Using R and Hadoop for Statistical Computation at Scale": http://strataconf.com/stratany2013/public/schedule/detail/30632
TRANSCRIPT
Model Building with RevoScaleR
Using R and Hadoop for Statistical Computation
Joseph Rickert, Revolution Analytics
Strata and Hadoop World 2013
Model Building with RevoScaleR
Agenda:
• The three realms of data
• What is RevoScaleR?
• RevoScaleR working beside Hadoop
• RevoScaleR running within Hadoop
• Run some code
The 3 Realms of Data
Bridging the gaps between architectures
The 3 Realms of Data
[Figure: the three realms of data, plotted by number of rows against architectural complexity]
• Data in memory: up to roughly 10^6 rows
• Data in a file (the realm of "chunking"): up to roughly 10^11 rows
• Data in multiple files (the realm of massive data): more than 10^12 rows
RevoScaleR (Revolution R Enterprise) spans all three realms.
RevoScaleR
An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
  – import, clean, explore, and transform data
  – perform statistical analysis and predictive analytics
  – enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster, and Hadoop
[Diagram: the Revolution R Enterprise stack: R+CRAN, RevoR, DistributedR, RevoScaleR, ConnectR, DeployR, DevelopR]
Parallel External Memory Algorithms (PEMAs)
Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms
Process data a chunk at a time, in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
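The four steps above can be sketched in base R. This is a minimal illustration of the PEMA pattern, not the RevoScaleR implementation; the data and the chunking scheme are invented for the example (a global mean over chunks that never need to be in memory all at once).

```r
# Illustrative data split into 10 chunks; in a real PEMA the chunks
# would be read from disk or HDFS, not held in a list.
chunks <- split(1:1e5, rep(1:10, each = 1e4))

# 1. Initialize: an empty intermediate-result object
init <- list(sum = 0, n = 0)

# 2. Process Chunk: each chunk independently yields an intermediate result
partials <- lapply(chunks, function(x) list(sum = sum(x), n = length(x)))

# 3. Aggregate: combine intermediate results (order-independent)
agg <- Reduce(function(a, b) list(sum = a$sum + b$sum, n = a$n + b$n),
              partials, init)

# 4. Finalize: turn the aggregate into the answer
final <- agg$sum / agg$n
final  # identical to mean(1:1e5)
```

Because step 3 is associative, the per-chunk work in step 2 can run on any core or node in any order, which is what lets the same algorithm move from a workstation to a cluster unchanged.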
RevoScaleR PEMAs

Statistical Modeling
• Covariance, correlation, sum of squares
• Multiple linear regression
• Generalized Linear Models: all exponential family distributions, Tweedie distribution; standard link functions; user-defined distributions & link functions
• Classification & regression trees
• Decision forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram, line plot, Lorenz curve, ROC curves

Cluster Analysis
• K-means

Machine Learning / Classification
• Decision trees, decision forests

Simulation
• Parallel random number generators for Monte Carlo

Variable Selection
• Stepwise regression, PCA
GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
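The comparison on this slide can be illustrated with base R alone. The sketch below fits a small in-memory logistic model with glm(); the dataset (mtcars) is just an illustrative stand-in, and the commented rxGlm() line reflects the slide's point that RevoScaleR's function takes the same formula/family interface.

```r
# In-memory logistic regression with base R's glm()
fit <- glm(am ~ hp + wt, family = binomial, data = mtcars)
coef(fit)

# With RevoScaleR loaded, the analogous call would look like:
# rxGlm(am ~ hp + wt, family = binomial, data = mtcars)
```

The practical difference is not the call syntax but the engine underneath: glm() requires the data in memory, while rxGlm() is a PEMA and can process the same model formula a chunk at a time.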
PEMAs: Optimized for Performance
• Arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows and with the number of nodes
• Scales well with the number of cores per node and with the number of parameters
• Efficient computational algorithms
• Memory management: minimize copying
• File format: fast access by row and column
• Heavy use of C++
• Models pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
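The first bullet, arbitrarily many rows in fixed memory, can be made concrete with a base-R sketch (not the RevoScaleR internals): a linear model needs only the p x p cross-product X'X and the p-vector X'y, both of which can be accumulated one chunk at a time. The simulated data and the 10-chunk split here are purely illustrative.

```r
set.seed(1)
n <- 1e4; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))      # design matrix with intercept
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))    # simulated response

# Accumulators: memory use depends only on p, never on n
xtx <- matrix(0, p, p)
xty <- numeric(p)
for (idx in split(1:n, rep(1:10, each = n / 10))) {
  Xi  <- X[idx, , drop = FALSE]
  xtx <- xtx + crossprod(Xi)           # accumulate X'X
  xty <- xty + crossprod(Xi, y[idx])   # accumulate X'y
}

beta <- solve(xtx, xty)  # finalize: same coefficients as a full lm fit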
Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
• In the Cloud: Microsoft Azure Burst, Amazon AWS
• Workstations & Servers: Desktop, Server, Linux
• Clustered Systems: Platform LSF, Microsoft HPC
• EDW: IBM, Teradata
• Hadoop: Hortonworks, Cloudera
RRE in Hadoop: beside or inside
Revolution R Enterprise BESIDE Architecture
• Use Hadoop for data storage and data preparation
• Use RevoScaleR on a connected server for predictive modeling
• Use Hadoop for model deployment
A Simple Goal: Hadoop as an R Engine
• Run Revolution R Enterprise code in Hadoop without change
• Provide RevoScaleR's pre-parallelized algorithms
• Eliminate the need to "think in MapReduce"
• Eliminate data movement
Revolution R Enterprise INSIDE Architecture
Use RevoScaleR inside Hadoop for:
• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: Big-data predictive models with ScaleR
[Diagram: a Hadoop cluster: a Name Node with the Job Tracker, and Data Nodes each running a Task Tracker; MapReduce on top of HDFS]
[Diagram: the same Hadoop cluster, now labeled "RRE in Hadoop"]
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation: map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization: map tasks produce "intermediate result objects" that are aggregated by a reduce task; the master process decides whether another pass through the data is required
• Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
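The map/reduce/master flow above can be sketched in plain R. This is an illustration of the control flow only, with lapply standing in for map tasks and Reduce for the reduce task; the data and the four-chunk split are invented. Computing a variance the two-pass way takes exactly two such "jobs", with the master scheduling the second pass because it needs the mean from the first.

```r
set.seed(42)
chunks <- split(rnorm(1e4), rep(1:4, each = 2500))  # stand-in for HDFS blocks

# Pass 1 (one MapReduce job): map per-chunk sums, reduce to the global mean
m1 <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))  # map
r1 <- Reduce(`+`, m1)                                             # reduce
mu <- r1[["sum"]] / r1[["n"]]

# Master: the variance needs the mean, so a second pass is required
m2 <- lapply(chunks, function(x) sum((x - mu)^2))  # map, using pass-1 result
ss <- Reduce(`+`, m2)                              # reduce
v  <- ss / (r1[["n"]] - 1)                         # finalize: sample variance
```

The per-chunk results in m1 and m2 play the role of the "intermediate result objects" on the slide: small, combinable summaries that are cheap to ship to a single reduce task.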
Let’s run some code.
Backup slides
Sample code: logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
Sample code: logit on Hadoop
# Change the "compute context"
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
Demo rxLinMod in Hadoop - Launching
Demo rxLinMod in Hadoop - In Progress
Demo rxLinMod in Hadoop - Completed