R with Distributed Systems

• RHIPE - R and Hadoop Integrated Processing Environment
  – http://www.stat.purdue.edu/~sguha/rhipe/
• Ricardo: Integrating R and Hadoop, SIGMOD 2010
• Segue
  – http://code.google.com/p/segue/
• Hadoop InteractiVE
  – https://r-forge.r-project.org/projects/rhadoop/
• Big Data Analysis with Revolution R Enterprise
  – Revolution R Enterprise
  – http://www.revolutionanalytics.com/
  – The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets.
• Elastic-R
  – https://www.elastic-r.org
• Biopara
  – http://hedwig.mgh.harvard.edu/biostatistics/node/20
  – http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html
• RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009
  – R adopted a relational database as a backend, not Hadoop

Ricardo: Integrating R and Hadoop

Sudipto Das*, Yannis Sismanis**, Kevin S Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson**

* UC Santa Barbara
** IBM Almaden Research Center

SIGMOD 2010

Deep Analytics on Big Data

• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and Transaction logs
• Deep analysis critical for competitive edge
  – Understanding/Modeling data
  – Recommendations to users
  – Ad placement
• Challenge: Enable Deep Analysis and Understanding over massive data volumes
  – Exploiting data to its full potential

Motivating Examples

• Data Exploration/Model Evaluation/Outlier Detection
• Personalized Recommendations
  – For each individual customer/product
  – Many applications to Netflix, Amazon, eBay, iTunes, …
• Difficulty: Discern particular customer preferences
  – Sampling loses Competitive advantage
• Application Scenario: Movie Recommendations, Netflix
  – Millions of Customers
  – Hundreds of thousands of Movies
  – Billions of Movie Ratings

Big Data and Deep Analytics – The Gap

• R, SPSS, SAS – A Statistician’s toolbox
  – Rich statistical, modeling, visualization functionality
  – Operate on small data amounts entirely in memory
  – Extensions for data handling cumbersome
• Hadoop – Scalable Data Management Systems
  – Scalable, Fault-Tolerant, Elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches

• Reducing Data size by Sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community to build parallel and distributed variants [SNOW, Rmpi]
  – Main-memory based in most cases
  – Re-implementing DBMS and distributed processing functionality
• Deep Analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not Sustainable: misses out on R’s community development and rich libraries

Ricardo: Bridging the Gap

• David Ricardo, famous economist from the 19th century
  – “Comparative Advantage”
• Deep Analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender Systems/Latent Factorization [in the paper]
• Large part includes joins, group bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability to large-scale data management
• Small part includes matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo: Establishes “trade” between R and Hadoop/Jaql
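
The "large part" / "small part" split can be illustrated with simple linear regression. The sketch below is in Python rather than R or Jaql, purely so that it is self-contained; the names `Stats`, `large_part`, `small_part` and `fit` are hypothetical and do not come from Ricardo. The large part is a distributive aggregation (sums that can be computed per partition and added together); the small part is a tiny 2x2 solve on the combined sums.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for fitting y = a + b*x by least squares."""
    n: float = 0.0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Sums are distributive, so partial results combine by addition.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def large_part(partition):
    """Data-intensive step: one pass over a partition, producing sums only."""
    stats = Stats()
    for x, y in partition:
        stats = stats + Stats(1.0, x, y, x * x, x * y)
    return stats


def small_part(s):
    """Numerical step: solve the 2x2 normal equations from the combined sums."""
    det = s.n * s.sxx - s.sx * s.sx
    slope = (s.n * s.sxy - s.sx * s.sy) / det
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def fit(partitions):
    total = Stats()
    for partition in partitions:  # each call could run on a different node
        total = total + large_part(partition)
    return small_part(total)


# Two hypothetical partitions of points lying on y = 1 + 2x.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
intercept, slope = fit(partitions)
```

Only five numbers per partition cross the boundary between the two parts, which is the kind of exchange the "trade" refers to.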

R in a Nutshell

• R supports Rich statistical functionality

Jaql in a Nutshell

• Scalable Descriptive Analysis using Hadoop
• Jaql: a representative declarative interface

• JSON View of the data:

• Jaql Example:

Ricardo: The Trading Architecture

• Complexity of Trade between R and Hadoop
  – Simple Trading: Data Exploration
  – Complex Trading: Data Modeling

Simple Trading: Exploratory Analytics

• Gain insights about data
• Example - top-k outliers for a model
  – Identify data items on which the model performed most poorly
• Helpful for improving accuracy of model
• The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over entire data using Hadoop/Jaql
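
A minimal sketch of the top-k outlier idea, written in Python for self-containment (the deck itself uses R and Jaql). It assumes a hypothetical `predict` function standing in for the statistical model and rows shaped as dictionaries with a `rating` field; neither is taken from the talk. Each partition finds its own k worst-fit rows, and the small candidate lists are merged at the end.

```python
import heapq


def local_top_k(partition, predict, k):
    """Per-partition step: keep the k rows with the largest absolute error."""
    scored = ((abs(predict(row) - row["rating"]), row) for row in partition)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def top_k_outliers(partitions, predict, k):
    """Merge step: combine per-partition candidates into a global top k."""
    candidates = []
    for partition in partitions:  # each partition could be scanned in parallel
        candidates.extend(local_top_k(partition, predict, k))
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [row for _, row in best]


# Hypothetical model that always predicts a rating of 3.0.
def constant_model(row):
    return 3.0


partitions = [
    [{"id": 1, "rating": 3.0}, {"id": 2, "rating": 5.0}],
    [{"id": 3, "rating": 1.5}, {"id": 4, "rating": 0.0}],
]
outliers = top_k_outliers(partitions, constant_model, k=2)
```

The merge is correct because any row in the global top k must also be in the top k of its own partition.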

Complex Trading: Latent Factors

• SVD-like matrix factorization
• Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
• The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql


Latent Factor Models with Ricardo

• Goal
  – Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with initial guess of parameters $p_i$ and $q_j$
  2. Compute error & gradient
     • e.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
     • (Data intensive, but parallelizable)
  3. Update parameters
     • R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.
• R code
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )
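
The four pseudocode steps can be sketched as follows. This is a Python illustration, not the R code from the talk: `fe` and `fde` reuse the names passed to `optim` above, but their bodies are assumptions, and a plain fixed-step gradient descent stands in for the L-BFGS-B method that `optim` would apply. The step size, iteration count and starting values are arbitrary choices for a tiny example.

```python
def fe(p, q, ratings):
    """Squared error e = sum over observed (i, j) of (p_i * q_j - r_ij)^2."""
    return sum((p[i] * q[j] - r) ** 2 for i, j, r in ratings)


def fde(p, q, ratings):
    """Gradient of e with respect to every p_i and q_j."""
    dp = [0.0] * len(p)
    dq = [0.0] * len(q)
    for i, j, r in ratings:
        diff = p[i] * q[j] - r
        dp[i] += 2.0 * q[j] * diff  # de/dp_i = sum_j 2 q_j (p_i q_j - r_ij)
        dq[j] += 2.0 * p[i] * diff  # de/dq_j = sum_i 2 p_i (p_i q_j - r_ij)
    return dp, dq


def fit(ratings, n_customers, n_movies, step=0.01, iters=5000):
    # Step 1: initial guess.
    p = [1.0] * n_customers
    q = [1.0] * n_movies
    for _ in range(iters):
        # Step 2: compute the gradient (the data-intensive part).
        dp, dq = fde(p, q, ratings)
        # Step 3: update the parameters (fixed-step descent as a stand-in).
        p = [pi - step * g for pi, g in zip(p, dp)]
        q = [qj - step * g for qj, g in zip(q, dq)]
    # Step 4 is approximated here by a fixed number of iterations.
    return p, q


# Hypothetical ratings generated from p = [1, 2] and q = [1, 3].
ratings = [(0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.0), (1, 1, 6.0)]
p, q = fit(ratings, n_customers=2, n_movies=2)
```

In the architecture described here, only the work inside `fe` and `fde` would be pushed to Hadoop/Jaql; the update rule stays in R.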

Computing the Model

• Inputs
  – Movie Ratings: (i, j, $r_{ij}$)
  – Customer Parameters: (i, $p_i$)
  – Movie Parameters: (j, $q_j$)
• 3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
• Similarly compute the gradients

Aggregation In Jaql/Hadoop

res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j } into { g.*, gradient: sum($[*].value) }
")

Result in R:

i    j    gradient
---- ---- --------
null null 325235
1    null 21
2    null 357
…
null 1    9
null 2    64
…
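
To make the shape of the result concrete, here is a small in-memory Python sketch of the same join, expand and group-by pipeline. It is an illustration under assumptions, not the Jaql implementation: the function name `aggregate` and the use of `None` for the null key components are hypothetical, and the gradient terms follow the formula de/dp_i = Σ_j 2 q_j (p_i q_j − r_ij) given earlier.

```python
from collections import defaultdict


def aggregate(ratings, cust_pars, movie_pars):
    """Join ratings with parameters, then sum error and gradient terms.

    Keys mirror the (i, j) grouping of the result table:
    (None, None) holds the squared error, (i, None) the gradient for
    customer i, and (None, j) the gradient for movie j.
    """
    sums = defaultdict(float)
    for i, j, rating in ratings:
        p = cust_pars[i]                     # join on customer id i
        q = movie_pars[j]                    # join on movie id j
        diff = rating - p * q                # transform
        sums[(None, None)] += diff ** 2      # expand: squared-error term
        sums[(i, None)] += -2.0 * diff * q   # expand: term of de/dp_i
        sums[(None, j)] += -2.0 * diff * p   # expand: term of de/dq_j
    return dict(sums)                        # group by (i, j), summing values


# Hypothetical data: one customer, two movies.
result = aggregate(
    ratings=[(1, 1, 3.0), (1, 2, 1.0)],
    cust_pars={1: 1.0},
    movie_pars={1: 2.0, 2: 1.0},
)
```

The single pass produces the error and every gradient component at once, which is why one query suffices per optimizer iteration.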

Experimental Evaluation

• 50 nodes at EC2
• Each node: 8 cores, 7GB Memory, 320GB Disk
• Total: 400 cores, 320GB Memory, 70TB Disk Space

Number of Rating Tuples    Data Size in GB
500 Million                104.33
1 Billion                  208.68
3 Billion                  625.99
5 Billion                  1043.23

Result

• Leveraging Hadoop’s Scalability

• Leveraging R’s Rich Functionality
  – optim( c(p,q), fe, fde, method="CG" )
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )

Extending the Trade: R – Jaql – R

• Invoking R through Jaql – distributed statistical computation
• Example: Augment model with customer preferences that change over time
• Time series model for each customer incorporated into global model

Conclusion

• Scaled Latent Factor Models to Terabytes of data
• Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
  – Many Algorithms have Summation Form
  – Decompose into “large part” and “small part”
  – [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning

RHIPE - R and Hadoop Integrated Processing Environment

Saptarshi Guha

RHIPE

• R package
• INSTALL
  – Set an environment variable $HADOOP that points to the Hadoop installation directory.
  – It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop.
• RHIPE needs to be installed on all the computers: the one on which you run your R environment and all the task computers.
• Using RHIPE is much easier if your filesystem layout (i.e. the location of R, Hadoop, libraries, etc.) is identical across all computers.

Tests

• In R

– should work successfully

– should successfully write the list to the HDFS

– should return a list of length 3, each element a list of 2 objects.

Tests (cont’d)

• A quick run of this should also work

R and Hadoop Integrated Programming Environment

• The R and Hadoop Integrated Programming Environment is an R package to
  – compute across massive data sets
  – create subsets
  – apply routines to subsets
  – produce displays on subsets across a cluster of computers
  – using the Hadoop DFS and Hadoop MapReduce framework.
• Use Hadoop Streaming
  – Users can write MapReduce programs in other languages, e.g. Python, Ruby, or Perl, which are then deployed over the cluster.
  – Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa.
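
As an illustration of the Streaming model described above, here is a minimal word-count sketch in Python. Streaming programs read records as text lines on standard input and write tab-separated key/value lines on standard output; the function names below (`map_lines`, `reduce_lines`, `run_mapper`) are hypothetical, and the word-count task is an assumption chosen only because it is small.

```python
import sys


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts of consecutive records with the same key.

    The reducer relies on its input arriving sorted by key, which is what
    the shuffle phase provides between the map and reduce steps.
    """
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a mapper script would call; not invoked in this sketch."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")


# Local simulation of map -> sort -> reduce on a hypothetical input line.
counts = list(reduce_lines(sorted(map_lines(["a b a"]))))
```

A real job would put the mapper and reducer in separate scripts and let Hadoop handle the sort between them.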

R and Hadoop Integrated Programming Environment

• RHIPE is just that.
  – RHIPE consists of several functions to interact with the HDFS
    • e.g. save data sets, read data created by RHIPE MapReduce, delete files.
  – Commands in R
    • Compose and launch MapReduce jobs from R using the commands rhmr and rhex.
    • Monitor the status using rhstatus, which returns an R object.
    • Stop jobs using rhkill.
  – Compute side-effect files.
    • The output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc.
    • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
  – Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C.
    • The serialization format used by RHIPE (converting R objects to binary data) is Google’s Protocol Buffers, which is very fast and creates compact representations of R objects. Ideal for massive data sets.
  – Data sets created using RHIPE are key-value pairs.
    • A key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job has unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
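
The key-value view can be sketched in a few lines of Python. This is an in-memory illustration only, not RHIPE's API: `map_reduce`, `carrier_delay` and `mean` are hypothetical names, and the flight records are made up. It shows why a job whose output keys are unique can be read back like a dictionary.

```python
from collections import defaultdict


def map_reduce(pairs, mapper, reducer):
    """Apply mapper to every (key, value) pair, group by output key, reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Each output key appears exactly once, so the result acts as a dictionary.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: flight id -> (carrier code, delay in minutes).
flights = [(1, ("WN", 10)), (2, ("DL", 0)), (3, ("WN", 20))]


def carrier_delay(flight_id, record):
    """Mapper: re-key each record by carrier."""
    carrier, delay = record
    yield carrier, delay


def mean(carrier, delays):
    """Reducer: average delay per carrier."""
    return sum(delays) / len(delays)


mean_delay = map_reduce(flights, carrier_delay, mean)
```

Looking up `mean_delay["WN"]` afterwards is the in-memory analogue of using the job output as a disk-based dictionary.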

Example: Airline Dataset

• Copying the Data to the HDFS

Example: Airline Dataset (cont’d)

• rhstatus


• Job


• Demonstration of using Hadoop as a Queryable Database
  – Top 20 cities by total volume of flights.
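
A minimal Python sketch of the "top cities" query, under stated assumptions: each flight is reduced to a hypothetical (origin, destination) pair, and "total volume" is taken to mean departures plus arrivals. The RHIPE job from the demonstration is not reproduced here.

```python
from collections import Counter


def top_cities(flights, n=20):
    """Return the n cities with the most flights, counting both endpoints."""
    volume = Counter()
    for origin, destination in flights:
        volume[origin] += 1        # one departure
        volume[destination] += 1   # one arrival
    return volume.most_common(n)


# Hypothetical flights between three airports.
flights = [("ORD", "ATL"), ("ATL", "LAX"), ("ORD", "ATL")]
busiest = top_cities(flights, n=2)
```

Counting is a distributive aggregation, so per-partition counters could be added together before the final ranking.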

Example: Transforming Text Data

• Text data

  – The carrier name is column 9.
  – Southwest carrier code is WN, Delta is DL.
  – Only those rows with column 9 equal to WN or DL will be saved.
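
The filtering rule above can be sketched in Python as a map-only step. The sample rows are invented for illustration and comma-separated input is an assumption; only the rule itself (keep rows whose ninth column is WN or DL) comes from the slide.

```python
import csv

KEEP = {"WN", "DL"}   # Southwest and Delta carrier codes
CARRIER_COLUMN = 9    # 1-based column number, as stated on the slide


def keep_row(fields):
    """True when the row has a ninth column and it is one of the kept codes."""
    return len(fields) >= CARRIER_COLUMN and fields[CARRIER_COLUMN - 1] in KEEP


def filter_lines(lines):
    """Parse comma-separated lines and yield only the rows to be saved."""
    for fields in csv.reader(lines):
        if keep_row(fields):
            yield fields


# Hypothetical input rows; only the ninth field matters here.
rows = [
    "1987,10,14,3,741,730,912,849,WN,1451",
    "1987,10,15,4,729,730,903,849,PS,1451",
    "1987,10,17,6,741,730,918,849,DL,1451",
]
saved = list(filter_lines(rows))
```

No reduce step is needed because each row is kept or dropped independently of the others.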

Example: Transforming Text Data (cont’d)

• The output
  – 1
  – 2
