Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation

Joseph Rickert, Revolution Analytics
Strata and Hadoop World 2013

Upload: revolution-analytics

Post on 26-Jan-2015


DESCRIPTION

Slides from Joseph Rickert's presentation at Strata NYC 2013 "Using R and Hadoop for Statistical Computation at Scale" http://strataconf.com/stratany2013/public/schedule/detail/30632

TRANSCRIPT

Page 1: Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation

Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation

Joseph Rickert, Revolution Analytics

Strata and Hadoop World 2013

Page 2

Model Building with RevoScaleR

Agenda:

• The three realms of data
• What is RevoScaleR?
• RevoScaleR working beside Hadoop
• RevoScaleR running within Hadoop
• Run some code

Page 3

The 3 Realms of Data

Bridging the gaps between architectures

Page 4

The 3 Realms of Data

[Figure: architectural complexity plotted against number of rows, with axis marks at 10^6, 10^11 and >10^12. Three realms appear in order of increasing size: data in memory; data in a file (the realm of “chunking”); data in multiple files (the realm of massive data).]

Page 5

RevoScaleR

Revolution R Enterprise

Page 6

RevoScaleR

• An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
  – Import, clean, explore and transform data
  – Perform statistical analysis and predictive analytics
  – Enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster, Hadoop

[Diagram: Revolution R Enterprise components: R + CRAN, RevoR, DistributedR, RevoScaleR, ConnectR, DeployR, DevelopR]

Page 7

Parallel External Memory Algorithms (PEMAs)

• Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
• Process data a chunk at a time in parallel across cores and nodes:
  1. Initialize
  2. Process Chunk
  3. Aggregate
  4. Finalize
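
The four steps above can be sketched in base R for ordinary least squares. This is an illustrative sketch only, not the RevoScaleR implementation: the helper names (init_state, process_chunk, aggregate_state, finalize) and the simulated data are assumptions made for the example. Each chunk contributes its own X'X and X'y, the partial results are summed, and the coefficients are solved for once at the end.

```r
# Illustrative PEMA-style least squares in base R (hypothetical helper names).
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# 1. Initialize: empty accumulators for X'X and X'y
init_state <- function(p) list(xtx = matrix(0, p, p), xty = numeric(p))

# 2. Process chunk: cross-products computed from one chunk only
process_chunk <- function(chunk) {
  X <- cbind(1, chunk$x1, chunk$x2)
  list(xtx = crossprod(X), xty = drop(crossprod(X, chunk$y)))
}

# 3. Aggregate: intermediate results simply add
aggregate_state <- function(a, b) {
  list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)
}

# 4. Finalize: solve the normal equations once
finalize <- function(state) drop(solve(state$xtx, state$xty))

# Ten chunks stand in for blocks of a file too large for memory
chunks <- split(dat, rep(1:10, length.out = n))
partials <- lapply(chunks, process_chunk)
state <- Reduce(aggregate_state, partials, init_state(3))
beta_chunked <- finalize(state)

# Reference fit on the full data for comparison
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = dat)))
print(all.equal(beta_chunked, beta_lm))
```

Because step 2 touches only one chunk at a time and step 3 is a plain sum, the lapply call could be swapped for a parallel map without changing the result.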


Page 8

RevoScaleR PEMAs

Statistical Modeling: Predictive Models

• Covariance, Correlation, Sum of Squares
• Multiple Linear Regression
• Generalized Linear Models:
  – All exponential family distributions, Tweedie distribution
  – Standard link functions
  – User-defined distributions and link functions
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization

• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves

Machine Learning

• Cluster Analysis: K-Means
• Classification: Decision Trees, Decision Forests

Simulation

• Parallel random number generators for Monte Carlo

Variable Selection

• Stepwise Regression
• PCA

Page 9

GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()

Page 10

PEMAs: Optimized for Performance

• Arbitrarily large number of rows in a fixed amount of memory
• Scales linearly
  – with the number of rows
  – with the number of nodes
• Scales well
  – with the number of cores per node
  – with the number of parameters
• Efficient
  – Computational algorithms
  – Memory management: minimize copying
  – File format: fast access by row and column
• Heavy use of C++
• Models pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
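
The first point, many rows in a fixed amount of memory, can be illustrated with a small base R sketch. It is not RevoScaleR code: the temporary CSV file, the chunk size and the variable names are assumptions for the example. The file is read through an open connection a fixed number of lines at a time, and only three running totals are kept, so memory use depends on the chunk size rather than on the file length.

```r
# Illustrative fixed-memory mean and variance over a file, base R only.
set.seed(42)
path <- tempfile(fileext = ".csv")
full <- data.frame(x = rnorm(5000, mean = 10, sd = 3))
write.csv(full, path, row.names = FALSE)

chunk_rows <- 1000L                  # assumed chunk size
con <- file(path, open = "r")
invisible(readLines(con, n = 1L))    # skip the header line

n <- 0
total <- 0
total_sq <- 0
repeat {
  lines <- readLines(con, n = chunk_rows)
  if (length(lines) == 0L) break     # end of file
  x <- as.numeric(lines)
  # Only these three running totals survive from chunk to chunk
  n <- n + length(x)
  total <- total + sum(x)
  total_sq <- total_sq + sum(x^2)
}
close(con)

chunk_mean <- total / n
# Simple sum-of-squares formula; adequate here, though not the most
# numerically stable choice for data with a very large mean
chunk_var <- (total_sq - n * chunk_mean^2) / (n - 1)

print(c(mean = chunk_mean, var = chunk_var))
```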

Page 11

Write Once. Deploy Anywhere.

DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE

• In the Cloud: Microsoft Azure Burst, Amazon AWS
• Workstations & Servers: Desktop, Server, Linux
• Clustered Systems: Platform LSF, Microsoft HPC
• EDW: IBM, Teradata
• Hadoop: Hortonworks, Cloudera

Page 12

RRE in Hadoop: beside or inside

Page 13

Revolution R Enterprise BESIDE Architecture

Use Hadoop for data storage and data preparation

Use RevoScaleR on a connected server for predictive modeling

Use Hadoop for model deployment

Page 14

A Simple Goal: Hadoop As An R Engine

• Run Revolution R Enterprise code in Hadoop without change
• Provide RevoScaleR pre-parallelized algorithms
• Eliminate:
  – The need to “think in MapReduce”
  – Data movement

Page 15

Revolution R Enterprise INSIDE Architecture

Use RevoScaleR inside Hadoop for:

• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: Big-data predictive models with ScaleR

[Diagram: Hadoop cluster with a Name Node and Data Nodes (HDFS) and a Job Tracker and Task Trackers (MapReduce)]

Page 16

[Diagram: Hadoop cluster (Name Node, Data Nodes, Job Tracker, Task Trackers; MapReduce over HDFS), labeled “RRE in Hadoop”]

Page 17

[Diagram: Hadoop cluster (Name Node, Data Nodes, Job Tracker, Task Trackers; MapReduce over HDFS), labeled “RRE in Hadoop”]

Page 18

RevoScaleR on Hadoop

• Each pass through the data is one MapReduce job
• Prediction (Scoring), Transformation, Simulation:
  – Map tasks store results in HDFS or return to client
• Statistics, Model Building, Visualization:
  – Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
  – Master process decides if another pass through the data is required
• Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
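
The pass structure described above can be mimicked on a single machine in base R. The sketch below is an assumption-laden illustration, not how RevoScaleR or Hadoop is implemented: map_task, reduce_task, the simulated data and the eight in-memory blocks are all hypothetical. It fits a logistic regression by Newton's method, where each iteration is one "pass": every block produces an intermediate result (gradient and Hessian contributions), a reduce step sums them, and a master loop decides whether another pass is needed.

```r
# Illustrative one-pass-per-iteration logistic regression, base R only.
set.seed(7)
n <- 4000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)
blocks <- split(dat, rep(1:8, length.out = n))  # stand-ins for data blocks

# "Map": intermediate result object for one block at the current beta
map_task <- function(block, beta) {
  X <- cbind(1, block$x)
  p <- drop(plogis(X %*% beta))
  list(grad = drop(crossprod(X, block$y - p)),
       hess = crossprod(X * (p * (1 - p)), X))
}

# "Reduce": intermediate results combine by addition
reduce_task <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

# "Master": run passes until the Newton step is negligible
beta <- c(0, 0)
passes <- 0
repeat {
  passes <- passes + 1
  total <- Reduce(reduce_task, lapply(blocks, map_task, beta = beta))
  step <- drop(solve(total$hess, total$grad))
  beta <- beta + step
  if (max(abs(step)) < 1e-8 || passes >= 25) break
}

# Reference fit on the full data for comparison
beta_glm <- unname(coef(glm(y ~ x, family = binomial, data = dat)))
print(passes)
print(all.equal(beta, beta_glm, tolerance = 1e-4))
```

An iterative fit like this rereads the same data on every pass, which is the situation in which caching the data or storing it in a fast binary format pays off.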

Page 19

Let’s run some code.

Page 20

Backup slides

Page 21

Sample code: logit on workstation

# Specify local data source
airData <- myLocalDataSource

# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
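
For readers without Revolution R Enterprise, a rough base R analogue of this model can be run with glm(). Everything in the sketch is an assumption made for illustration: the data are simulated rather than the airline data, only a subset of the predictors is included, and factor() plays the role that F() plays in the RevoScaleR formula, namely treating a numeric variable as categorical.

```r
# Illustrative base R analogue of the logit model, using simulated data.
set.seed(123)
n <- 2000
flights <- data.frame(
  Origin = factor(sample(c("JFK", "ORD", "SFO"), n, replace = TRUE)),
  DayOfWeek = factor(sample(1:7, n, replace = TRUE)),
  CRSDepTime = sample(6:22, n, replace = TRUE)  # scheduled departure hour
)
# Simulated arrival delay in minutes; later departures tend to run later
flights$ArrDelay <- rnorm(n, mean = flights$CRSDepTime - 10, sd = 15)

# Binary response: was the flight more than 15 minutes late?
fit <- glm(I(ArrDelay > 15) ~ Origin + DayOfWeek + factor(CRSDepTime),
           family = binomial, data = flights)

print(head(coef(fit)))
```

Unlike rxLogit, glm() needs the whole data set in memory, which is the limitation the chunked algorithms in this deck are designed to remove.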


Page 22

Sample code for logit on Hadoop

# Change the “compute context”
rxSetComputeContext(myHadoopCluster)

# Change the data source if necessary
airData <- myHadoopDataSource

# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)

Page 23

Demo rxLinMod in Hadoop - Launching

Page 24

Demo rxLinMod in Hadoop - In Progress

Page 25

Demo rxLinMod in Hadoop - Completed