R with Distributed Systems
• RHIPE - R and Hadoop Integrated Processing Environment
  – http://www.stat.purdue.edu/~sguha/rhipe/
• Ricardo: Integrating R and Hadoop, SIGMOD 2010
• Segue
  – http://code.google.com/p/segue/
• Hadoop InteractiVE
  – https://r-forge.r-project.org/projects/rhadoop/
• Big Data Analysis with Revolution R Enterprise
  – Revolution R Enterprise
  – http://www.revolutionanalytics.com/
  – The RevoScaleR package provides a mechanism for scaling the R language to handle very large data sets.
• Elastic-R
  – https://www.elastic-r.org
• Biopara
  – http://hedwig.mgh.harvard.edu/biostatistics/node/20
  – http://hedwig.mgh.harvard.edu/biostatistics/files/biopara/biopara.html
• RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009
  – Uses a relational database, not Hadoop, as the backend for R

Ricardo: Integrating R and Hadoop
Sudipto Das*, Yannis Sismanis**, Kevin S. Beyer**, Rainer Gemulla**, Peter J. Haas**, John McPherson**
* UC Santa Barbara
** IBM Almaden Research Center
SIGMOD 2010

Deep Analytics on Big Data
• Enterprises collect huge amounts of data
  – Amazon, eBay, Netflix, iTunes, Yahoo, Google, VISA, …
  – User interaction data and history
  – Click and Transaction logs
• Deep analysis critical for competitive edge
  – Understanding/Modeling data
  – Recommendations to users
  – Ad placement
• Challenge: Enable Deep Analysis and Understanding over massive data volumes
  – Exploiting data to its full potential

Motivating Examples
• Data Exploration/Model Evaluation/Outlier Detection
• Personalized Recommendations
  – For each individual customer/product
  – Many applications to Netflix, Amazon, eBay, iTunes, …
• Difficulty: Discern particular customer preferences
  – Sampling loses Competitive advantage
• Application Scenario: Movie Recommendations, Netflix
  – Millions of Customers
  – Hundreds of thousands of Movies
  – Billions of Movie Ratings

Big Data and Deep Analytics – The Gap
• R, SPSS, SAS – A Statistician’s toolbox
  – Rich statistical, modeling, visualization functionality
  – Operate on small amounts of data entirely in memory
  – Extensions for data handling cumbersome
• Hadoop – Scalable Data Management Systems
  – Scalable, Fault-Tolerant, Elastic, …
  – “Magnetic”: easy to store data
  – Limited deep analytics: mostly descriptive analytics

Filling the Gap: Existing Approaches
• Reducing Data size by Sampling
  – Approximations might result in losing competitive advantage
  – Loses important features of the long tail of data distributions [Cohen et al., VLDB 2009]
• Scaling out R
  – Efforts from the statistics community toward parallel and distributed variants [SNOW, Rmpi]
  – Main memory based in most cases
  – Re-implementing DBMS and distributed processing functionality
• Deep Analysis within a DBMS
  – Port statistical functionality into a DBMS [Cohen et al., VLDB 2009], [Apache Mahout]
  – Not sustainable: misses out on R’s community development and rich libraries

Ricardo: Bridging the Gap
• David Ricardo, famous economist from the 19th century
  – “Comparative Advantage”
• Deep analytics is decomposable into a “large part” and a “small part” [Chu et al., NIPS ‘06]
  – Linear/logistic regression, k-means clustering, Naïve Bayes, SVMs, PCA
  – Recommender Systems/Latent Factorization [in the paper]
• Large part includes joins, group bys, distributive aggregations
  – Hadoop + Jaql: excellent scalability to large-scale data management
• Small part includes matrix/vector operations
  – R: excellent support for numerically stable matrix inversions, factorizations, optimizations, eigenvector decompositions, etc.
• Ricardo: Establishes “trade” between R and Hadoop/Jaql

R in a Nutshell
• R supports rich statistical functionality

Jaql in a Nutshell
• Scalable Descriptive Analysis using Hadoop
• Jaql: a representative declarative interface
• JSON View of the data:
• Jaql Example:

Ricardo: The Trading Architecture
• Complexity of Trade between R and Hadoop
  – Simple Trading: Data Exploration
  – Complex Trading: Data Modeling

Simple Trading: Exploratory Analytics
• Gain insights about data
• Example - top-k outliers for a model
  – Identify data items on which the model performed most poorly
• Helpful for improving accuracy of model
• The trade:
  – Use complex statistical models using rich R functionality
  – Parallelize processing over entire data using Hadoop/Jaql

Complex Trading: Latent Factors
• SVD-like matrix factorization
• Minimize Square Error: $\sum_{i,j} (p_i q_j - r_{ij})^2$
• The trade:
  – Use complex statistical models in R
  – Parallelize aggregate computations using Hadoop/Jaql

Latent Factor Models with Ricardo
• Goal
  – Minimize Square Error: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
  – Numerical methods needed (large, sparse matrix)
• Pseudocode
  1. Start with initial guess of parameters $p_i$ and $q_j$
  2. Compute error & gradient
     • e.g., $\partial e / \partial p_i = \sum_j 2 q_j (p_i q_j - r_{ij})$
     • (Data intensive, but parallelizable)
  3. Update parameters
     • R implements many different optimization algorithms
  4. Repeat steps 2 and 3 until convergence.
• R code
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )
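
The pseudocode above can be sketched end to end in base R on a toy problem. This is only an illustration under stated assumptions: the ratings are made up, each customer and movie gets a single scalar factor, and the error and gradient are computed in memory, whereas Ricardo delegates that part to Hadoop/Jaql. The names fe and fde follow the slide; everything else is hypothetical.

```r
# Toy ratings in long form: customer i, movie j, rating r (made-up values).
n_cust <- 4
n_movie <- 3
ratings <- data.frame(
  i = c(1, 1, 2, 2, 3, 3, 4, 4),
  j = c(1, 2, 2, 3, 1, 3, 1, 2),
  r = c(4, 2, 1, 3, 5, 4, 2, 1)
)

# Residuals p_i * q_j - r_ij for every observed rating.
residuals_for <- function(par) {
  p <- par[seq_len(n_cust)]
  q <- par[n_cust + seq_len(n_movie)]
  list(p = p, q = q, diff = p[ratings$i] * q[ratings$j] - ratings$r)
}

# Error: e = sum over observed (i, j) of (p_i * q_j - r_ij)^2
fe <- function(par) sum(residuals_for(par)$diff^2)

# Gradient: de/dp_i = sum_j 2 q_j (p_i q_j - r_ij), and symmetrically for q_j.
fde <- function(par) {
  s <- residuals_for(par)
  gp_terms <- 2 * s$q[ratings$j] * s$diff
  gq_terms <- 2 * s$p[ratings$i] * s$diff
  gp <- vapply(seq_len(n_cust), function(k) sum(gp_terms[ratings$i == k]), numeric(1))
  gq <- vapply(seq_len(n_movie), function(k) sum(gq_terms[ratings$j == k]), numeric(1))
  c(gp, gq)
}

# Steps 1 to 4: start from an initial guess and let optim iterate.
start <- rep(1, n_cust + n_movie)
fit <- optim(start, fe, fde, method = "L-BFGS-B")
fit$value  # squared error at the fitted parameters
```

Only fe and fde touch the ratings, which is why those two functions are the part handed to the cluster, while the optimizer itself stays in R.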

Computing the Model
• Movie Ratings: (i, j, $r_{ij}$)
• Customer Parameters: (i, $p_i$)
• Movie Parameters: (j, $q_j$)
• 3-way join to match $r_{ij}$, $p_i$, and $q_j$, then aggregate: $e = \sum_{i,j} (p_i q_j - r_{ij})^2$
• Similarly compute the gradients
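
As a small in-memory analogue of the 3-way join and aggregation, the following base-R sketch uses merge() and aggregate() on made-up tables. The table and column names (ratings, custPars, moviePars, rating, p, q) are chosen to mirror the Jaql query in this deck; the values are hypothetical, and on real data this step would run in Hadoop/Jaql, not in R.

```r
# Made-up inputs: ratings (i, j, rating), customer factors (i, p), movie factors (j, q).
ratings   <- data.frame(i = c(1, 1, 2), j = c(1, 2, 2), rating = c(4, 2, 1))
custPars  <- data.frame(i = c(1, 2), p = c(1.5, 0.5))
moviePars <- data.frame(j = c(1, 2), q = c(2.0, 1.0))

# 3-way join: attach q_j and p_i to every rating r_ij.
joined <- merge(merge(ratings, moviePars, by = "j"), custPars, by = "i")

# Residual per rating, then the aggregate error e = sum (p_i q_j - r_ij)^2.
joined$diff <- joined$p * joined$q - joined$rating
e <- sum(joined$diff^2)

# Gradients: group the per-rating terms by customer i and by movie j.
grad_p <- aggregate(g ~ i, data = transform(joined, g = 2 * q * diff), FUN = sum)
grad_q <- aggregate(g ~ j, data = transform(joined, g = 2 * p * diff), FUN = sum)
```

The join and the grouped sums are the data-intensive operations; the results (one number for e, one gradient entry per customer and per movie) are small enough to return to R.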

Aggregation in Jaql/Hadoop
res = jaqlTable(channel, "
  ratings
  -> hashJoin( fn(r) r.j, moviePars, fn(m) m.j, fn(r, m) { r.*, m.q } )
  -> hashJoin( fn(r) r.i, custPars, fn(c) c.i, fn(r, c) { r.*, c.p } )
  -> transform { $.*, diff: $.rating - $.p*$.q }
  -> expand [ { value: pow($.diff, 2.0) },
              { $.i, value: -2.0 * $.diff * $.q },
              { $.j, value: -2.0 * $.diff * $.p } ]
  -> group by g={ $.i, $.j } into { g.*, gradient: sum($[*].value) }
")
Result in R
i     j     gradient
----  ----  --------
null  null  325235
1     null  21
2     null  357
…
null  1     9
null  2     64
…

Experimental Evaluation
• 50 nodes at EC2
• Each node: 8 cores, 7GB Memory, 320GB Disk
• Total: 400 cores, 320GB Memory, 70TB Disk Space

Number of Rating Tuples   Data Size in GB
500 Million               104.33
1 Billion                 208.68
3 Billion                 625.99
5 Billion                 1043.23

Result
• Leveraging Hadoop’s Scalability
• Leveraging R’s Rich Functionality
  – optim( c(p,q), fe, fde, method="CG" )
  – optim( c(p,q), fe, fde, method="L-BFGS-B" )

Extending the Trade: R – Jaql – R
• Invoking R through Jaql – distributed statistical computation
• Example: Augment the model with customer preferences that change over time
• Time series model for each customer incorporated into global model

Conclusion
• Scaled Latent Factor Models to Terabytes of data
• Provided a bridge through which other algorithms with Summation Form can be mapped and scaled
  – Many Algorithms have Summation Form
  – Decompose into “large part” and “small part”
  – [Chu et al. NIPS ‘06]: LWLR, Naïve Bayes, GDA, k-means, logistic regression, neural network, PCA, ICA, EM, SVM
• Future & Current Work
  – Tighter language integration
  – More algorithms
  – Performance tuning
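
To make the “large part” / “small part” split concrete, here is a base-R sketch for ordinary least squares, used here as a simple stand-in for the summation-form algorithms listed above. The data are simulated, and the four chunks merely imitate partitions that Hadoop would process in parallel; none of this is Ricardo's actual code.

```r
set.seed(42)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept
beta_true <- c(2, -1, 0.5)                 # made-up coefficients
y <- drop(X %*% beta_true) + rnorm(n, sd = 0.1)

# Large part: each partition contributes partial sums X'X and X'y.
chunks <- split(seq_len(n), rep(1:4, length.out = n))
partials <- lapply(chunks, function(idx) {
  Xc <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xc), xty = crossprod(Xc, y[idx]))
})

# Combine the partial sums (a distributive aggregation).
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Small part: a 3 x 3 linear solve, done in R.
beta_hat <- drop(solve(xtx, xty))
```

However many rows there are, the quantities that reach R are a 3 x 3 matrix and a length-3 vector, so the final solve is cheap.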

RHIPE - R and Hadoop Integrated Processing Environment
Saptarshi Guha

RHIPE
• R package
• INSTALL
  – Set an environment variable $HADOOP that points to the Hadoop installation directory.
  – It is expected that $HADOOP/bin contains the Hadoop shell executable hadoop.
• This needs to be installed on all the computers: the one where you run your R environment and all the task computers.
• Using RHIPE is much easier if your filesystem layout (i.e. location of R, Hadoop, libraries etc.) is identical across all computers.
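
Before installing, the two expectations above can be checked from R on each machine. The helper below is hypothetical and not part of RHIPE; it only uses base R to confirm that HADOOP is set and that bin/hadoop exists beneath it.

```r
# Hypothetical helper (not a RHIPE function): TRUE when <hadoop_home>/bin/hadoop exists.
hadoop_ready <- function(hadoop_home = Sys.getenv("HADOOP")) {
  nzchar(hadoop_home) && file.exists(file.path(hadoop_home, "bin", "hadoop"))
}

if (!hadoop_ready()) {
  message("HADOOP is unset or $HADOOP/bin/hadoop is missing on this machine")
}
```

Running the same check on the R machine and on every task computer is a quick way to spot a layout that differs between nodes.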

Tests
• In R
  – should work successfully
  – should successfully write the list to the HDFS
  – should return a list of length 3, each element a list of 2 objects.

Tests (cont’d)
• A quick run of this should also work

R and Hadoop Integrated Programming Environment
• The R and Hadoop Integrated Programming Environment is an R package to
  – compute across massive data sets
  – create subsets
  – apply routines to subsets
  – produce displays on subsets across a cluster of computers
  – using the Hadoop DFS and Hadoop MapReduce framework.
• Use Hadoop Streaming
  – Users can write MapReduce programs in other languages, e.g. Python, Ruby, Perl, which are then deployed over the cluster.
  – Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa.

R and Hadoop Integrated Programming Environment (cont’d)
• RHIPE is just that.
  – RHIPE consists of several functions to interact with the HDFS
    • e.g. save data sets, read data created by RHIPE MapReduce, delete files.
  – Commands in R
    • Compose and launch MapReduce jobs from R using the commands rhmr and rhex.
    • Monitor the status using rhstatus, which returns an R object.
    • Stop jobs using rhkill.
  – Compute side effect files.
    • The output of parallel computations may include the creation of PDF files, R data sets, CSV files etc.
    • These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
  – Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python and C.
    • The serialization format used by RHIPE (converting R objects to binary data) uses Google's Protocol Buffers, which is very fast and creates compact representations for R objects. Ideal for massive data sets.
  – Data sets created using RHIPE are key-value pairs.
    • A key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job creates unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
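
The key-value model described above can be imitated in plain R without Hadoop. The sketch below is not RHIPE code and calls none of its functions; the carrier codes and values are made up. It shows a map step, a grouping by key, a reduce step, and a lookup by key on the result.

```r
# Input: a list of (key, value) pairs (made-up carrier codes and counts).
pairs <- list(
  list(key = "WN", value = 12),
  list(key = "DL", value = 5),
  list(key = "WN", value = 3),
  list(key = "AA", value = 7)
)

# Map: emit one (key, value) pair per input pair (identity in this sketch).
mapped <- lapply(pairs, function(kv) list(key = kv$key, value = kv$value))

# Shuffle: group the emitted values by key.
keys <- vapply(mapped, function(kv) kv$key, character(1))
values <- vapply(mapped, function(kv) kv$value, numeric(1))
grouped <- split(values, keys)

# Reduce: one output value per unique key.
totals <- lapply(grouped, sum)

# Unique keys make the output behave like a dictionary: look up by key.
totals[["WN"]]
```

Because each key appears once in the reduced output, retrieving a value is a direct lookup, which is the property that lets job output serve as an associative dictionary.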

Example: Airline Dataset
• Copying the Data to the HDFS

Example: Airline Dataset (cont’d)
• rhstatus

Example: Airline Dataset (cont’d)
• Job

Example: Airline Dataset (cont’d)
• Demonstration of using Hadoop as a Queryable Database
  – Top 20 cities by total volume of flights.

Example: Transforming Text Data
• Text data
  – The carrier name is column 9.
  – Southwest carrier code is WN, Delta is DL.
  – Only those rows with column 9 equal to WN or DL will be saved.
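
The filtering rule above can be sketched in base R on a few made-up comma-separated rows. The rows are hypothetical and only their ninth field matters; in RHIPE this logic would sit inside the map step of a job, which is not shown here.

```r
# Made-up comma-separated rows; field 9 holds the carrier code.
lines <- c(
  "2000,1,1,6,1500,1510,1600,1615,WN,101",
  "2000,1,1,6,0900,0905,1100,1110,AA,202",
  "2000,1,2,7,1200,1200,1400,1350,DL,303"
)

keep_carriers <- c("WN", "DL")

# Split each line into fields, then keep rows whose 9th field is WN or DL.
fields <- strsplit(lines, ",", fixed = TRUE)
carrier <- vapply(fields, function(f) f[9], character(1))
kept <- lines[carrier %in% keep_carriers]
```

Each line is tested independently of the others, so the same filter can be applied to every input split in parallel.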

Example: Transforming Text Data (cont’d)
• The output