R with Distributed Systems

Upload: manjit

Post on 11-Jan-2016


TRANSCRIPT

Page 1: R with Distributed Systems

R with Distributed Systems

Page 2: R with Distributed Systems

R with Distributed Systems

• RHIPE - R and Hadoop Integrated Processing Environment
  – http://www.stat.purdue.edu/~sguha/rhipe/
• Ricardo: Integrating R and Hadoop, SIGMOD 2010
• Segue
  – http://code.google.com/p/segue/
• Hadoop InteractiVE
  – https://r-forge.r-project.org/projects/rhadoop/
• Big Data Analysis with Revolution R Enterprise
  – Revolution R Enterprise
  – http://www.revolutionanalytics.com/
  – The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets.
• Elastic-R
  – https://www.elastic-r.org
• Biopara
  – http://hedwig.mgh.harvard.edu/biostatistics/node/20
  – http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html
• RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009
  – Uses a relational database, not Hadoop, as the backend for R

Page 3: R with Distributed Systems

Ricardo: Integrating R and Hadoop

Sudipto Das*, Yannis Sismanis**, Kevin S. Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson**

* UC Santa Barbara
** IBM Almaden Research Center

SIGMOD 2010

Page 4: R with Distributed Systems

Deep Analytics on Big Data

• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and Transaction logs
• Deep analysis critical for competitive edge
  – Understanding/Modeling data
  – Recommendations to users
  – Ad placement
• Challenge: Enable Deep Analysis and Understanding over massive data volumes
  – Exploiting data to its full potential

Page 5: R with Distributed Systems

Motivating Examples

• Data Exploration/Model Evaluation/Outlier Detection
• Personalized Recommendations
  – For each individual customer/product
  – Many applications to Netflix, Amazon, eBay, iTunes, …
• Difficulty: Discern particular customer preferences
  – Sampling loses Competitive advantage
• Application Scenario: Movie Recommendations, Netflix
  – Millions of Customers
  – Hundreds of thousands of Movies
  – Billions of Movie Ratings

Page 6: R with Distributed Systems

Big Data and Deep Analytics – The Gap

• R, SPSS, SAS – A Statistician’s toolbox
  – Rich statistical, modeling, visualization functionality
  – Operate on small data amounts entirely in memory
  – Extensions for data handling cumbersome
• Hadoop – Scalable Data Management Systems
  – Scalable, Fault-Tolerant, Elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics

Page 7: R with Distributed Systems

Filling the Gap: Existing Approaches

• Reducing Data size by Sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main memory based in most cases
  – Re-implementing DBMS and distributed processing functionality
• Deep Analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable: misses out on R’s community development and rich libraries

Page 8: R with Distributed Systems

Ricardo: Bridging the Gap

• David Ricardo, famous economist from the 19th century
  – “Comparative Advantage”
• Deep Analytics decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender Systems/Latent Factorization [in the paper]
• Large part includes joins, group bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability to large-scale data management
• Small part includes matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo: Establishes “trade” between R and Hadoop/Jaql
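The large-part/small-part split can be illustrated with ordinary least squares, one of the summation-form algorithms. The sketch below is plain R with no Hadoop or Jaql involved: the chunked `lapply` only stands in for the distributed aggregation, and the simulated data and the four-way chunking are assumptions made for illustration.

```r
# Simulated regression data (illustrative only)
set.seed(42)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))
beta_true <- c(2, -1, 0.5)
y <- as.numeric(X %*% beta_true + rnorm(n, sd = 0.1))

# Pretend the rows live on 4 workers
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# Large part: each chunk contributes partial sums X'X and X'y,
# which are distributive aggregates and can be added in any order
partials <- lapply(chunks, function(idx) {
  Xc <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xc), xty = crossprod(Xc, y[idx]))
})
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Small part: a 3x3 linear solve, done in memory by R
beta_hat <- solve(xtx, xty)
print(beta_hat)
```

Only the 3x3 matrix and the length-3 vector ever need to reach R, regardless of how many rows the data has.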

Page 9: R with Distributed Systems

R in a Nutshell

• R supports Rich statistical functionality

Page 10: R with Distributed Systems

Jaql in a Nutshell

• Scalable Descriptive Analysis using Hadoop
• Jaql: a representative declarative interface

• JSON View of the data:

• Jaql Example:

Page 11: R with Distributed Systems

Ricardo: The Trading Architecture

• Complexity of Trade between R and Hadoop
  – Simple Trading: Data Exploration
  – Complex Trading: Data Modeling

Page 12: R with Distributed Systems

Simple Trading: Exploratory Analytics

• Gain insights about data
• Example - top-k outliers for a model
  – Identify data items on which the model performed most poorly
• Helpful for improving accuracy of model
• The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over entire data using Hadoop/Jaql
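A minimal in-memory sketch of the top-k outlier idea, assuming a simple linear model and made-up data with two planted outliers. In Ricardo the scoring step would run over the full data set in Hadoop/Jaql; here everything stays in one R session.

```r
# Made-up data: a linear trend with two planted outliers (rows 10 and 30)
set.seed(7)
d <- data.frame(x = 1:50)
d$y <- 3 * d$x + rnorm(50)
d$y[c(10, 30)] <- d$y[c(10, 30)] + c(40, -60)

# Fit the model in R
model <- lm(y ~ x, data = d)

# Score every data item by its squared prediction error
err <- (d$y - predict(model, d))^2

# Keep the k items on which the model performed most poorly
k <- 2
top_k <- d[order(err, decreasing = TRUE)[seq_len(k)], ]
print(top_k)
```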

Page 13: R with Distributed Systems

Complex Trading: Latent Factors

• SVD-like matrix factorization
• Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
• The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql

[Figure: ratings matrix factorized into customer factors p and movie factors q]

Page 14: R with Distributed Systems

Latent Factor Models with Ricardo

• Goal
  – Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with initial guess of parameters $p_i$ and $q_j$
  2. Compute error & gradient
     • e.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
     • (Data intensive, but parallelizable)
  3. Update parameters
     • R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.
• R code
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )
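The `optim` call can be tried end to end on a toy problem. This is a sketch under assumptions: the five ratings are invented, and `fe` and `fde` are computed in memory here, whereas in Ricardo they would be evaluated by Hadoop/Jaql over the full ratings data. The helper `split_pars` is hypothetical and exists only to unpack the flat vector `c(p, q)`.

```r
# Toy ratings: customer i, movie j, rating r_ij (made-up values)
ratings <- data.frame(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 1, 3, 2),
  r = c(5, 3, 4, 1, 2)
)
n_cust  <- max(ratings$i)
n_movie <- max(ratings$j)

# optim() works on one flat vector, so unpack c(p, q) on each call
split_pars <- function(par) {
  list(p = par[seq_len(n_cust)], q = par[n_cust + seq_len(n_movie)])
}

# Error: e = sum over observed (i, j) of (p_i q_j - r_ij)^2
fe <- function(par) {
  s <- split_pars(par)
  sum((s$p[ratings$i] * s$q[ratings$j] - ratings$r)^2)
}

# Gradient: de/dp_i = sum_j 2 q_j (p_i q_j - r_ij), and symmetrically for q_j
fde <- function(par) {
  s <- split_pars(par)
  d  <- s$p[ratings$i] * s$q[ratings$j] - ratings$r
  tp <- 2 * d * s$q[ratings$j]   # per-rating terms of de/dp_i
  tq <- 2 * d * s$p[ratings$i]   # per-rating terms of de/dq_j
  gp <- vapply(seq_len(n_cust),  function(i) sum(tp[ratings$i == i]), numeric(1))
  gq <- vapply(seq_len(n_movie), function(j) sum(tq[ratings$j == j]), numeric(1))
  c(gp, gq)
}

set.seed(1)
start <- runif(n_cust + n_movie)   # step 1: initial guess
fit <- optim(start, fe, fde, method = "L-BFGS-B")   # steps 2-4
print(fit$value)
```

Switching the optimizer is a one-argument change (for example `method = "CG"`), which is the point of leaving the update step to R.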

Page 15: R with Distributed Systems

Computing the Model

[Figure: three tables, Movie Ratings (i, j, r_ij), Customer Parameters (i, p_i), and Movie Parameters (j, q_j)]

• 3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
• Similarly compute the gradients

Page 16: R with Distributed Systems

Aggregation In Jaql/Hadoop

res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j }
     into { g.*, gradient: sum($[*].value) }
")

Result in R:

i     j     gradient
----  ----  --------
null  null  325235
1     null  21
2     null  357
…
null  1     9
null  2     64
…
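The same join-then-aggregate pipeline can be mimicked on small in-memory data frames to see what shape the result takes. This is a sketch in base R, not Jaql: the three tiny tables are invented, `NA` plays the role of Jaql's null, and the gradient terms follow the pseudocode formula de/dp_i = Σj 2qj (piqj - rij), so the customer rows use q and the movie rows use p.

```r
# Invented inputs standing in for ratings, custPars and moviePars
ratings   <- data.frame(i = c(1, 1, 2), j = c(1, 2, 1), rating = c(5, 3, 4))
custPars  <- data.frame(i = c(1, 2), p = c(1.0, 0.5))
moviePars <- data.frame(j = c(1, 2), q = c(2.0, 1.0))

# Two joins: attach q_j by movie, then p_i by customer
joined <- merge(merge(ratings, moviePars, by = "j"), custPars, by = "i")
joined$diff <- joined$rating - joined$p * joined$q

# "expand": three records per rating (error term, de/dp_i term, de/dq_j term)
expanded <- rbind(
  data.frame(i = NA_real_, j = NA_real_, value = joined$diff^2),
  data.frame(i = joined$i, j = NA_real_, value = -2 * joined$diff * joined$q),
  data.frame(i = NA_real_, j = joined$j, value = -2 * joined$diff * joined$p)
)

# "group by (i, j)" and sum the values within each group
key   <- paste(expanded$i, expanded$j)
sums  <- vapply(split(expanded$value, key), sum, numeric(1))
first <- !duplicated(key)
res <- data.frame(
  i = expanded$i[first],
  j = expanded$j[first],
  gradient = as.numeric(sums[key[first]])
)
print(res)
```

The row with both keys missing carries the total squared error; rows keyed only by `i` or only by `j` carry the gradient components for the customer and movie parameters.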

Page 17: R with Distributed Systems

Experimental Evaluation

• 50 nodes at EC2
• Each node: 8 cores, 7GB Memory, 320GB Disk
• Total: 400 cores, 320GB Memory, 70TB Disk Space

Number of Rating Tuples   Data Size in GB
-----------------------   ---------------
500 Million               104.33
1 Billion                 208.68
3 Billion                 625.99
5 Billion                 1043.23

Page 18: R with Distributed Systems

Result

• Leveraging Hadoop’s Scalability
• Leveraging R’s Rich Functionality
  – optim( c(p,q), fe, fde, method="CG" )
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )

Page 19: R with Distributed Systems

Extending the Trade: R – Jaql – R

• Invoking R through Jaql – distributed statistical computation
• Example: Augment the model with customer preferences that change over time
• Time series model for each customer incorporated into global model
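The per-customer step can be pictured as "one small model per group". The sketch below is an assumption-laden stand-in: it uses a plain linear trend in place of a real time series model, runs entirely in one R session, and uses invented ratings. In the R – Jaql – R setting, Jaql would do the grouping and call R once per customer.

```r
# Invented ratings over time for two customers
ratings <- data.frame(
  customer = rep(c("a", "b"), each = 5),
  t        = rep(1:5, times = 2),
  rating   = c(1, 2, 3, 4, 5,  5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Group by customer, then fit a small model to each group independently.
# The slope on t is a crude summary of how a customer's preferences drift.
trend <- vapply(
  split(ratings, ratings$customer),
  function(d) unname(coef(lm(rating ~ t, data = d))["t"]),
  numeric(1)
)
print(trend)
```

Each fit touches only one customer's rows, so the groups can be processed in parallel without any coordination.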

Page 20: R with Distributed Systems

Conclusion

• Scaled Latent Factor Models to Terabytes of data
• Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
  – Many Algorithms have Summation Form
  – Decompose into “large part” and “small part”
  – [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning

Page 21: R with Distributed Systems

RHIPE - R and Hadoop Integrated Processing Environment

Saptarshi Guha

Page 22: R with Distributed Systems

RHIPE

• R package
• INSTALL
  – Set an environment variable $HADOOP that points to the Hadoop installation directory.
  – It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop.
• The package needs to be installed on all the computers: the one where you run your R environment and all the task computers.
• Using RHIPE is much easier if your filesystem layout (i.e. location of R, Hadoop, libraries, etc.) is identical across all computers.
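Before installing, the $HADOOP requirement can be checked from within R. The helper below is hypothetical (it is not part of RHIPE) and uses only base R; it tests that the variable is set and that a file named hadoop exists under its bin directory.

```r
# Hypothetical helper, not a RHIPE function: checks the layout described above
check_hadoop_home <- function(hadoop_home = Sys.getenv("HADOOP")) {
  # An unset variable comes back as the empty string
  if (!nzchar(hadoop_home)) {
    return(FALSE)
  }
  # The hadoop shell executable is expected under $HADOOP/bin
  file.exists(file.path(hadoop_home, "bin", "hadoop"))
}

# TRUE only if $HADOOP is set and $HADOOP/bin/hadoop exists on this machine
print(check_hadoop_home())
```

Running the same check on every node is a quick way to confirm that the filesystem layout really is identical across the cluster.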

Page 23: R with Distributed Systems

Tests

• In R

– should work successfully

– should successfully write the list to the HDFS

– should return a list of length 3 each element a list of 2 objects.

Page 24: R with Distributed Systems

Tests (cont’d)

• A quick run of this should also work

Page 25: R with Distributed Systems

R and Hadoop Integrated Programming Environment

• The R and Hadoop Integrated Programming Environment is an R package to
  – compute across massive data sets
  – create subsets
  – apply routines to subsets
  – produce displays on subsets
  across a cluster of computers, using the Hadoop DFS and Hadoop MapReduce framework.
• Use Hadoop Streaming
  – Users can write MapReduce programs in other languages, e.g. Python, Ruby, Perl, which are then deployed over the cluster.
  – Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa.

Page 26: R with Distributed Systems

R and Hadoop Integrated Programming Environment

• RHIPE is just that.
  – RHIPE consists of several functions to interact with the HDFS
    • e.g. save data sets, read data created by RHIPE MapReduce, delete files.
  – Commands in R
    • Compose and launch MapReduce jobs from R using the commands rhmr and rhex.
    • Monitor the status using rhstatus, which returns an R object.
    • Stop jobs using rhkill.
  – Compute side effect files.
    • The output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc.
    • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
  – Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C.
    • The serialization format used by RHIPE (converting R objects to binary data) uses Google’s Protocol Buffers, which is very fast and creates compact representations for R objects. Ideal for massive data sets.
  – Data sets created using RHIPE are key-value pairs.
    • A key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job creates unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.

Page 27: R with Distributed Systems

Example: Airline Dataset

• Copying the Data to the HDFS

Page 28: R with Distributed Systems

Example: Airline Dataset (cont’d)

• rhstatus

Page 29: R with Distributed Systems

Example: Airline Dataset (cont’d)

• Job

Page 30: R with Distributed Systems

Example: Airline Dataset (cont’d)

• Demonstration of using Hadoop as a Queryable Database

Page 31: R with Distributed Systems

• Demonstration of using Hadoop as a Queryable Database
  – Top 20 cities by total volume of flights.
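The top-20 query has a natural map/reduce shape, which can be shown in plain R without RHIPE or Hadoop. The sketch assumes that "total volume" counts a flight once at its origin and once at its destination, and it uses a handful of invented airport codes in place of the airline data.

```r
# Invented flights; the real data set has one row per flight
flights <- data.frame(
  origin = c("ORD", "ATL", "ORD", "LAX", "ATL", "ORD"),
  dest   = c("ATL", "ORD", "LAX", "ORD", "LAX", "DFW"),
  stringsAsFactors = FALSE
)

# Map: emit a (key, value) pair of (airport, 1) for each end of each flight
emitted <- data.frame(
  key   = c(flights$origin, flights$dest),
  value = 1,
  stringsAsFactors = FALSE
)

# Reduce: sum the values that share a key
totals <- vapply(split(emitted$value, emitted$key), sum, numeric(1))

# Keep the 20 busiest keys (fewer if there are fewer keys)
top <- head(sort(totals, decreasing = TRUE), 20)
print(top)
```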

Page 32: R with Distributed Systems

Example: Transforming Text Data

• Text data

  – The carrier name is column 9.
  – Southwest carrier code is WN, Delta is DL.
  – Only those rows with column 9 equal to WN or DL will be saved.
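The filter itself is a few lines of base R. The sketch below runs on an in-memory character vector rather than on HDFS, the sample lines are invented, and `keep_carriers` is a hypothetical helper name; in a RHIPE job the same logic would sit inside the map step.

```r
# Hypothetical helper: keep comma-separated lines whose column `col`
# holds one of the wanted carrier codes
keep_carriers <- function(lines, carriers = c("WN", "DL"), col = 9) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  carrier <- vapply(
    fields,
    function(f) if (length(f) >= col) f[col] else NA_character_,
    character(1)
  )
  # Lines that are too short to have column `col` are dropped
  lines[!is.na(carrier) & carrier %in% carriers]
}

# Invented sample rows with the carrier code in column 9
lines <- c(
  "2008,1,3,4,2003,1955,2211,2225,WN,335",
  "2008,1,3,4,754,735,1002,1000,AA,3231",
  "2008,1,3,4,628,620,804,750,DL,448",
  "short,line"
)
kept <- keep_carriers(lines)
print(kept)
```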

Page 33: R with Distributed Systems

Example: Transforming Text Data (cont’d)

• The output– 1

– 2