r hadoop integration

R INTEGRATION WITH HADOOP NGUYEN PHAN DZUNG MARCH 2016

Upload: dzung-nguyen

Post on 15-Apr-2017

Page 1: R Hadoop integration

R INTEGRATION WITH HADOOP

NGUYEN PHAN DZUNG

MARCH 2016

Page 2: R Hadoop integration

AGENDA

- Objectives
- Contents:
  • Introduction of R
  • Implementation of R integration with Hadoop
  • When to use R in combination with Hadoop
  • Examples using Hadoop
- Q&A
- References

Page 3: R Hadoop integration

Security Classification: Internal

Objectives

• Understand R
• Understand when to use R in combination with Hadoop
• Understand the implementation of integration

Page 4: R Hadoop integration

Introduction of R

Page 5: R Hadoop integration

Introduction of R – What is R?

• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software

Page 6: R Hadoop integration

Introduction of R – Why R?

• Free and Open Source
• Strong User Community
• Highly extensible, flexible
• Implementation of high-end statistical methods
• Flexible graphics and intelligent defaults

But ...
• Steep learning curve
• Slow for large datasets

Page 7: R Hadoop integration

Introduction of R – A little bit of demo

Command to demo.txt
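
The commands in demo.txt are not reproduced in this transcript. As a stand-in, here is a minimal sketch of the kind of short session such a demo might cover (data storage, analysis, graphing). It uses only base R; the choice of the built-in `mtcars` dataset and of a simple linear model is an assumption for illustration, not taken from the slides.

```r
# Built-in example dataset, chosen here only for illustration.
data(mtcars)

# Data storage: a data frame holds one column per variable.
str(mtcars[, c("mpg", "wt", "hp")])

# Analysis: summary statistics and a simple linear model.
print(summary(mtcars$mpg))
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Graphing: scatter plot with the fitted regression line.
# (In a non-interactive run, R writes the plot to a file instead of a window.)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
```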

Page 8: R Hadoop integration

R integration with Hadoop

Page 9: R Hadoop integration

R integration with Hadoop – Integration purposes

• Use Hadoop to execute R code
• Use R to access data stored in Hadoop

Page 10: R Hadoop integration

R integration with Hadoop – When to use?

No | Factor | Mantra | Guideline
1 | R's natural strength | Use R for statistical computing | Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages
2 | Hadoop's natural strength | Use Hadoop for distributed storage & batch computing | Consider integrating when your problem requires lots of storage or when it could benefit from parallelization
3 | Coding effort | Work smart, not hard | R and Hadoop are tools, not "cure-all" panaceas. Consider not integrating if it is easier to solve your problem with other tools
4 | Processing time | Work smart, not hard | Although some problems can benefit from parallelization, consider not integrating if the gains are negligible, since this can help you reduce the complexity of your project
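
One rough way to judge whether the gains in row 4 are "negligible" is Amdahl's law, which bounds the speedup when only part of a job can run in parallel. The slides do not mention it; the sketch below, including the function name `amdahl_speedup` and the 60% figure, is an illustrative assumption, and it ignores job start-up and data-transfer overheads that a real Hadoop job would add.

```r
# Amdahl's law: upper bound on speedup when a fraction `p` of the work
# can be spread across `n` workers and the rest stays serial.
amdahl_speedup <- function(p, n) {
  stopifnot(p >= 0, p <= 1, n >= 1)
  1 / ((1 - p) + p / n)
}

# Hypothetical job in which 60% of the runtime is parallelizable.
workers <- c(1, 2, 4, 8, 16, 64)
print(data.frame(workers = workers,
                 speedup = round(amdahl_speedup(0.6, workers), 2)))
```

With p = 0.6 the speedup can never exceed 1 / (1 - 0.6) = 2.5, no matter how many workers are added, which is the kind of ceiling that may make the extra complexity not worth it.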

Page 11: R Hadoop integration

R integration with Hadoop – Example applications

No | Scenario | Use R/Hadoop? | Why? | Example
1 | Analyzing small data stored in Hadoop | Y | R can quickly download the data and analyze it locally | Want to analyze summary datasets derived from MapReduce jobs done in Hadoop
2 | Extracting complex features from large data stored in Hadoop | Y | R has more built-in and contributed functions that analyze data than many standard programming languages | R is a natural language to use to write an algorithm or classifier that extracts information about objects contained in images
3 | Applying prediction and classification models to datasets | Y | R is better at modeling than many standard programming languages | Using a logistic regression model to generate predictions in a large dataset
4 | Implementing an "iteration-based" machine learning algorithm | Maybe | 1) Other languages may be faster than R for your analysis. 2) Hadoop reads and writes a lot of data to disk; other "big data" tools, like Spark (and SparkR), are designed for speed in these scenarios by working in memory | Training a k-means clustering algorithm or logistic regression on a large dataset
5 | Simple preprocessing of large data stored in Hadoop | N | Standard programming languages are much faster than R at executing many basic text and image processing tasks | Pre-processing Twitter tweets for use in a natural language processing project
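
Scenario 3 (applying a logistic regression model to a large dataset) can be sketched in base R without a cluster. The example below is an illustration under stated assumptions: the data are simulated, the column names `x1`, `x2`, `y` are hypothetical, and each chunk stands in for the input split that one Hadoop map task would receive. It is not the deck's own code.

```r
set.seed(42)

# Simulated stand-in for a large dataset (hypothetical columns x1, x2, y).
n <- 10000
big <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
big$y <- rbinom(n, 1, plogis(0.5 + 1.2 * big$x1 - 0.8 * big$x2))

# Step 1: fit the model locally on a small random sample.
train <- big[sample(n, 1000), ]
model <- glm(y ~ x1 + x2, data = train, family = binomial)

# Step 2: score the full data chunk by chunk. Each chunk plays the role
# of the split handed to one map task; scoring needs no cross-chunk state.
chunks <- split(big, rep(1:10, length.out = n))
scored <- lapply(chunks, function(chunk) {
  chunk$p_hat <- predict(model, newdata = chunk, type = "response")
  chunk
})

# Step 3: combine the per-chunk results (row order follows the chunks).
result <- do.call(rbind, scored)
print(head(result))
```

Because each row is scored independently, this step parallelizes cleanly, which is why the table marks it "Y". Iterative training (scenario 4) would need repeated passes over the data and is a poorer fit.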

Page 12: R Hadoop integration

R integration with Hadoop – How? – RHadoop (1)

Page 13: R Hadoop integration

R integration with Hadoop – How? – RHadoop (2)

rhdfs:
• Manipulate HDFS directly from R
• Mimic as much of the HDFS Java API as possible
• Examples:
  – Read an HDFS text file into a data frame
  – Serialize/deserialize a model to HDFS
  – Write an HDFS file to local storage
• rhdfs/pkg/inst/unitTests
• rhdfs/pkg/inst/examples
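
The "serialize/deserialize a model" example might look roughly like the sketch below. It assumes the rhdfs package is installed and that the `HADOOP_CMD` environment variable points at the hadoop executable. The helper names `save_to_hdfs` and `load_from_hdfs` and the HDFS paths are hypothetical; the `hdfs.*` calls are rhdfs functions as I understand them, so check them against the installed version. Only the function definitions run here; the cluster-dependent usage is left in comments.

```r
# Serialize any R object (for example a fitted model) to an HDFS file.
save_to_hdfs <- function(object, hdfs_path) {
  con <- hdfs.file(hdfs_path, "w")
  on.exit(hdfs.close(con))
  hdfs.write(object, con)
}

# Read the raw bytes back and rebuild the R object.
# Note: a single hdfs.read() call may not return a very large object in full.
load_from_hdfs <- function(hdfs_path) {
  con <- hdfs.file(hdfs_path, "r")
  on.exit(hdfs.close(con))
  unserialize(hdfs.read(con))
}

# Usage on a machine with access to a running cluster (paths hypothetical):
# library(rhdfs)
# hdfs.init()
# model <- lm(mpg ~ wt, data = mtcars)
# save_to_hdfs(model, "/tmp/mtcars_model")
# restored <- load_from_hdfs("/tmp/mtcars_model")
# hdfs.get("/tmp/mtcars_model", "mtcars_model.bin")  # copy HDFS file to local disk
```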

Page 14: R Hadoop integration

R integration with Hadoop – How? – RHadoop (3)

rhbase:
• Manipulate HBase tables and their content
• Uses the Thrift C++ API as the mechanism to communicate with HBase
• Examples:
  – Create a data frame from a collection of rows and columns in an HBase table
  – Update an HBase table with values from a data frame

Page 15: R Hadoop integration

R integration with Hadoop – How? – RHadoop (4)

rmr:
• Designed to be the simplest and most elegant way to write MapReduce programs
• Gives the R programmer the tools necessary to perform data analysis in an R-like way
• Provides an abstraction layer to hide the implementation details
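
To give a feel for what "R-like" MapReduce means, here is a minimal word-count sketch. It assumes the rmr2 package (the successor to rmr) and its local backend, which runs without a cluster. If rmr2 is not installed, the fallback branch computes the same counts in base R using the same three phases: emit pairs, group by key, sum per key. The input words are made up for illustration.

```r
words <- c("r", "hadoop", "r", "hdfs", "hadoop", "r")

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")  # run in-process, no cluster needed

  result <- mapreduce(
    input  = to.dfs(words),
    # map: emit one (word, 1) pair per input word
    map    = function(k, v) keyval(v, 1),
    # reduce: called once per distinct word with all of its 1s
    reduce = function(word, ones) keyval(word, sum(ones))
  )

  out <- from.dfs(result)
  counts <- setNames(values(out), keys(out))
} else {
  # Fallback without rmr2: map (emit 1s), shuffle (split by key), reduce (sum).
  ones <- rep(1, length(words))
  counts <- vapply(split(ones, words), sum, numeric(1))
}

print(counts[order(names(counts))])
```

The map and reduce steps are ordinary R functions over vectors; job submission, shuffling, and serialization stay hidden behind `mapreduce()`, which is the abstraction layer the slide refers to.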

Page 16: R Hadoop integration

R integration with Hadoop – How? – RHive

Page 17: R Hadoop integration

R integration with Hadoop – How? – BigR

Page 18: R Hadoop integration

R integration with Hadoop – How? – Ricardo

Page 19: R Hadoop integration

R integration with Hadoop – How? – SparkR

Page 20: R Hadoop integration

R integration with Hadoop – How? – RevoR ScaleR

Page 21: R Hadoop integration

R integration with Hadoop – How? – ORCH

Page 22: R Hadoop integration

R integration with Hadoop – How? – MS HDInsight

Page 23: R Hadoop integration

Q & A

Page 24: R Hadoop integration

References

- http://cran-rproject.org
- http://revolutionanalytics.com
- Hadoop for Dummies
- R – a brief introduction, Gilberto Câmara

Page 25: R Hadoop integration

Thank you for your attention!