Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
DESCRIPTION
Slides from Joseph Rickert's presentation at Strata NYC 2013, "Using R and Hadoop for Statistical Computation at Scale": http://strataconf.com/stratany2013/public/schedule/detail/30632
TRANSCRIPT
Model Building with RevoScaleR
Using R and Hadoop for Statistical Computation
Joseph Rickert, Revolution Analytics
Strata and Hadoop World 2013
Model Building with RevoScaleR
Agenda:
• The three realms of data
• What is RevoScaleR?
• RevoScaleR working beside Hadoop
• RevoScaleR running within Hadoop
• Run some code
The 3 Realms of Data
Bridging the gaps between architectures
The 3 Realms of Data
[Figure: the three realms of data, plotted by number of rows against architectural complexity]
• Data in memory: up to roughly 10^6 rows
• Data in a file (the realm of "chunking"): up to roughly 10^11 rows
• Data in multiple files (the realm of massive data): more than 10^12 rows
RevoScaleR (Revolution R Enterprise) spans all three realms.
RevoScaleR
An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
  – import, clean, explore, and transform data
  – perform statistical analysis and predictive analytics
  – enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster, and Hadoop
[Diagram: the Revolution R Enterprise stack: R+CRAN, RevoR, DistributedR, RevoScaleR, ConnectR, DeployR, DevelopR]
Parallel External Memory Algorithms (PEMAs)
Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining, and machine learning algorithms
Process data a chunk at a time, in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
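The four steps above can be sketched in base R. This is a minimal illustration of the PEMA pattern, not the RevoScaleR implementation; the data and the chunking scheme are invented for the example (a global mean over chunks that never need to be in memory all at once).

```r
# Illustrative data split into 10 chunks; in a real PEMA the chunks
# would be read from disk or HDFS, not held in a list.
chunks <- split(1:1e5, rep(1:10, each = 1e4))

# 1. Initialize: an empty intermediate-result object
init <- list(sum = 0, n = 0)

# 2. Process Chunk: each chunk independently yields an intermediate result
partials <- lapply(chunks, function(x) list(sum = sum(x), n = length(x)))

# 3. Aggregate: combine intermediate results (order-independent)
agg <- Reduce(function(a, b) list(sum = a$sum + b$sum, n = a$n + b$n),
              partials, init)

# 4. Finalize: turn the aggregate into the answer
final <- agg$sum / agg$n
final  # identical to mean(1:1e5)
```

Because step 3 is associative, the per-chunk work in step 2 can run on any core or node in any order, which is what lets the same algorithm move from a workstation to a cluster unchanged.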
RevoScaleR PEMAs

Statistical Modeling
• Covariance, correlation, sum of squares
• Multiple linear regression
• Generalized Linear Models: all exponential family distributions, Tweedie distribution; standard link functions; user-defined distributions & link functions
• Classification & regression trees
• Decision forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram, line plot, Lorenz curve, ROC curves

Cluster Analysis
• K-means

Machine Learning / Classification
• Decision trees, decision forests

Simulation
• Parallel random number generators for Monte Carlo

Variable Selection
• Stepwise regression, PCA
GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
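The comparison on this slide can be illustrated with base R alone. The sketch below fits a small in-memory logistic model with glm(); the dataset (mtcars) is just an illustrative stand-in, and the commented rxGlm() line reflects the slide's point that RevoScaleR's function takes the same formula/family interface.

```r
# In-memory logistic regression with base R's glm()
fit <- glm(am ~ hp + wt, family = binomial, data = mtcars)
coef(fit)

# With RevoScaleR loaded, the analogous call would look like:
# rxGlm(am ~ hp + wt, family = binomial, data = mtcars)
```

The practical difference is not the call syntax but the engine underneath: glm() requires the data in memory, while rxGlm() is a PEMA and can process the same model formula a chunk at a time.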
PEMAs: Optimized for Performance
• Arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows and with the number of nodes
• Scales well with the number of cores per node and with the number of parameters
• Efficient computational algorithms
• Memory management: minimize copying
• File format: fast access by row and column
• Heavy use of C++
• Models pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
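The first bullet, arbitrarily many rows in fixed memory, can be made concrete with a base-R sketch (not the RevoScaleR internals): a linear model needs only the p x p cross-product X'X and the p-vector X'y, both of which can be accumulated one chunk at a time. The simulated data and the 10-chunk split here are purely illustrative.

```r
set.seed(1)
n <- 1e4; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))      # design matrix with intercept
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))    # simulated response

# Accumulators: memory use depends only on p, never on n
xtx <- matrix(0, p, p)
xty <- numeric(p)
for (idx in split(1:n, rep(1:10, each = n / 10))) {
  Xi  <- X[idx, , drop = FALSE]
  xtx <- xtx + crossprod(Xi)           # accumulate X'X
  xty <- xty + crossprod(Xi, y[idx])   # accumulate X'y
}

beta <- solve(xtx, xty)  # finalize: same coefficients as a full lm fit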
Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
• In the Cloud: Microsoft Azure Burst, Amazon AWS
• Workstations & Servers: Desktop, Server, Linux
• Clustered Systems: Platform LSF, Microsoft HPC
• EDW: IBM, Teradata
• Hadoop: Hortonworks, Cloudera
RRE in Hadoop: beside or inside
Revolution R Enterprise BESIDE Architecture
• Use Hadoop for data storage and data preparation
• Use RevoScaleR on a connected server for predictive modeling
• Use Hadoop for model deployment
A Simple Goal: Hadoop as an R Engine
• Run Revolution R Enterprise code in Hadoop without change
• Provide RevoScaleR's pre-parallelized algorithms
• Eliminate the need to "think in MapReduce"
• Eliminate data movement
Revolution R Enterprise INSIDE Architecture
Use RevoScaleR inside Hadoop for:
• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: Big-data predictive models with ScaleR
[Diagram: a Hadoop cluster: a Name Node with the Job Tracker, and Data Nodes each running a Task Tracker; MapReduce on top of HDFS]
[Diagram: the same Hadoop cluster, now labeled "RRE in Hadoop"]
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation: map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization: map tasks produce "intermediate result objects" that are aggregated by a reduce task; the master process decides whether another pass through the data is required
• Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
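The map/reduce/master flow above can be sketched in plain R. This is an illustration of the control flow only, with lapply standing in for map tasks and Reduce for the reduce task; the data and the four-chunk split are invented. Computing a variance the two-pass way takes exactly two such "jobs", with the master scheduling the second pass because it needs the mean from the first.

```r
set.seed(42)
chunks <- split(rnorm(1e4), rep(1:4, each = 2500))  # stand-in for HDFS blocks

# Pass 1 (one MapReduce job): map per-chunk sums, reduce to the global mean
m1 <- lapply(chunks, function(x) c(sum = sum(x), n = length(x)))  # map
r1 <- Reduce(`+`, m1)                                             # reduce
mu <- r1[["sum"]] / r1[["n"]]

# Master: the variance needs the mean, so a second pass is required
m2 <- lapply(chunks, function(x) sum((x - mu)^2))  # map, using pass-1 result
ss <- Reduce(`+`, m2)                              # reduce
v  <- ss / (r1[["n"]] - 1)                         # finalize: sample variance
```

The per-chunk results in m1 and m2 play the role of the "intermediate result objects" on the slide: small, combinable summaries that are cheap to ship to a single reduce task.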
Let’s run some code.
Backup slides
Sample code: logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
Sample code: logit on Hadoop
# Change the "compute context"
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
Demo rxLinMod in Hadoop - Launching
Demo rxLinMod in Hadoop - In Progress
Demo rxLinMod in Hadoop - Completed