Big Data Analysis With RHadoop
DESCRIPTION
Speaking of big data analysis, what usually comes to mind is Hadoop with HDFS and MapReduce. But to write a MapReduce program, one must first learn to write native Java (or go through a Thrift interface). Is it possible to use R, one of the most popular languages among data scientists, to implement MapReduce programs instead? And by integrating R and Hadoop, can one truly unleash the power of parallel computing for big data analysis? This slide deck shows how to install RHadoop step by step and how to write a MapReduce program in R. More importantly, it discusses whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs. Please mail me if you find any problem with the slides. EMAIL: [email protected]
TRANSCRIPT
![Page 1: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/1.jpg)
Big Data Analysis With RHadoop
David Chiu (Yu-Wei, Chiu)
@ML/DM Monday
2014/03/17
![Page 2: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/2.jpg)
About Me
Co-Founder of NumerInfo
Ex-Trend Micro Engineer
ywchiu-tw.appspot.com
![Page 3: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/3.jpg)
R + Hadoop
http://cdn.overdope.com/wp-content/uploads/2014/03/dragon-ball-z-x-hmn-alns-fusion-1.jpg
![Page 4: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/4.jpg)
Scaling: RHadoop enables R to do parallel computing
No need to learn a new language: learning Java takes time
Why Use RHadoop?
![Page 5: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/5.jpg)
RHadoop Architecture
rhdfs
rhbase
rmr2
HDFS
MapReduce
HBase Thrift Gateway
Streaming API
HBase
R
![Page 6: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/6.jpg)
Enables developers to write the mapper/reducer in any scripting language (R, Python, Perl)
Mapper, reducer, and optional combiner processes read from standard input and write to standard output
A streaming job has the additional overhead of starting a scripting VM
Streaming vs. Native Java
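The per-line logic such a streaming mapper runs can be sketched in plain R, with no Hadoop required. The function name `emit_counts` below is a hypothetical helper for illustration; in a real streaming job these tab-separated records would be written to standard output.

```r
# Sketch of a Hadoop Streaming wordcount mapper's core logic:
# split a line of text on whitespace and emit one "word<TAB>1"
# record per token.
emit_counts <- function(line) {
  words <- unlist(strsplit(line, "\\s+"))
  words <- words[nchar(words) > 0]   # drop empty tokens
  paste(words, 1, sep = "\t")
}
emit_counts("the quick brown fox")
```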
![Page 7: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/7.jpg)
Writing MapReduce Using R
The mapreduce function: mapreduce(input, output, map, reduce, ...)
Changelog:
rmr 3.0.0 (2014/02/10): 10x faster than rmr 2.3.0
rmr 2.3.0 (2013/10/07): supports plyrmr
rmr2
![Page 8: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/8.jpg)
Access HDFS From R
Exchange data between R data frames and HDFS
rhdfs
![Page 9: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/9.jpg)
Exchange data between R and HBase
Uses the Thrift API
rhbase
![Page 10: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/10.jpg)
Performs common data manipulation operations, as found in plyr and reshape2
Provides a familiar plyr-like interface while hiding many of the MapReduce details
plyr: tools for splitting, applying, and combining data
NEW! plyrmr
![Page 11: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/11.jpg)
RHadoop Installation
![Page 12: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/12.jpg)
R and related packages should be installed on each task node of the cluster
A Hadoop cluster: CDH3 or higher, or Apache Hadoop 1.0.2 or higher (limited to MRv1, not MRv2); MRv2 compatibility starts with Apache Hadoop 2.2.0 or HDP 2
Prerequisites
![Page 13: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/13.jpg)
Getting Ready (Cloudera VM)
Download: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
This VM runs CentOS 6.2, CDH 4.4, R 3.0.1, Java 1.6.0_32
![Page 14: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/14.jpg)
CDH 4.4
![Page 15: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/15.jpg)
Get RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
![Page 16: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/16.jpg)
Installing rmr2 dependencies
Make sure the packages are installed system-wide
$ sudo R
> install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"))
![Page 17: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/17.jpg)
Install rmr2
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rmr2/3.0.0/build/rmr2_3.0.0.tar.gz
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
![Page 18: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/18.jpg)
Installing…
![Page 19: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/19.jpg)
http://cran.r-project.org/src/contrib/Archive/Rcpp/
Downgrade Rcpp
![Page 20: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/20.jpg)
$ wget --no-check-certificate http://cran.r-project.org/src/contrib/Archive/Rcpp/Rcpp_0.11.0.tar.gz
$ sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
Install Rcpp_0.11.0
![Page 21: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/21.jpg)
$ sudo R CMD INSTALL rmr2_3.0.0.tar.gz
Install rmr2 again
![Page 22: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/22.jpg)
Install rhdfs
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz
![Page 23: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/23.jpg)
Enable HDFS
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
> library(rmr2)
> library(rhdfs)
> hdfs.init()
![Page 24: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/24.jpg)
javareconf error
$ sudo R CMD javareconf
![Page 25: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/25.jpg)
javareconf with correct JAVA_HOME
$ echo $JAVA_HOME
$ sudo JAVA_HOME=/usr/java/jdk1.6.0_32 R CMD javareconf
![Page 26: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/26.jpg)
MapReduce With RHadoop
![Page 27: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/27.jpg)
MapReduce
mapreduce(input, output, map, reduce)
Like sapply, lapply, tapply within R
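The analogy to sapply can be made concrete locally, without Hadoop. This is a plain-R sketch of the "map" idea the slide refers to, not an rmr2 call:

```r
# sapply applies a function over every element of a vector,
# much as mapreduce(input, map = ...) applies map over every
# key/value chunk of the input.
small.ints <- 1:10
result <- sapply(small.ints, function(x) x^2)
result[10]  # 100
```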
![Page 28: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/28.jpg)
Hello World – For Hadoop
http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
![Page 29: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/29.jpg)
Move File Into HDFS
# Put data into HDFS
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar")
library(rmr2)
library(rhdfs)
hdfs.init()
hdfs.mkdir("/user/cloudera/wordcount/data")
hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data")

$ hadoop fs -mkdir /user/cloudera/wordcount/data
$ hadoop fs -put wc_input.txt /user/cloudera/wordcount/data
![Page 30: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/30.jpg)
Wordcount Mapper
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
#Mapper
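The splitting logic of the R mapper can be checked locally before running it on the cluster; this sketch uses only base R (no rmr2, no keyval):

```r
# Local check of the wordcount mapper's core logic:
# strsplit on whitespace, then flatten with unlist.
lines <- c("hello hadoop", "hello r")
words <- unlist(strsplit(lines, "\\s"))
table(words)  # hadoop: 1, hello: 2, r: 1
```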
![Page 31: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/31.jpg)
Wordcount Reducer
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
#Reducer
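What the reducer does per key can be simulated locally with tapply, which groups values by key and sums them, mirroring sum(counts) in the R reducer:

```r
# Simulating the reduce phase without Hadoop: the mapper emitted
# (word, 1) pairs; group the counts by word and sum each group.
words  <- c("hello", "hadoop", "hello")
counts <- c(1, 1, 1)
sums <- tapply(counts, words, sum)
sums  # hadoop: 1, hello: 2
```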
![Page 32: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/32.jpg)
Call Wordcount
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}
out <- wordcount(hdfs.data, hdfs.out)
![Page 33: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/33.jpg)
Read data from HDFS
results <- from.dfs(out)
results$key[order(results$val, decreasing = TRUE)][1:10]
$ hadoop fs -cat /user/cloudera/wordcount/out/part-00000 | sort -k 2 -nr | head -n 10
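The ranking step that from.dfs feeds into can be shown on a local stand-in; the key and val vectors below are toy data standing in for the real results object:

```r
# Order the values decreasing and index the keys by that order,
# the same idiom as results$key[order(results$val, decreasing = TRUE)].
key <- c("the", "of", "and")
val <- c(5, 9, 2)
top <- key[order(val, decreasing = TRUE)][1:2]
top  # "of" "the"
```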
![Page 34: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/34.jpg)
MapReduce Benchmark
> a.time <- proc.time()
> small.ints2 <- 1:100000
> result.normal <- sapply(small.ints2, function(x) x^2)
> proc.time() - a.time

> b.time <- proc.time()
> small.ints <- to.dfs(1:100000)
> result <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> proc.time() - b.time
V.S.
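The local half of the benchmark runs anywhere; only the mapreduce half needs a cluster. A self-contained sketch of the sapply side with its proc.time() bookkeeping:

```r
# Time the pure-R computation: square 100,000 integers with sapply.
# proc.time() returns user/system/elapsed; subtracting gives the delta.
a.time <- proc.time()
result.normal <- sapply(1:100000, function(x) x^2)
elapsed <- (proc.time() - a.time)["elapsed"]
result.normal[100]  # 10000
```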
![Page 35: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/35.jpg)
sapply
Elapsed: 0.982 seconds
![Page 36: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/36.jpg)
mapreduce
Elapsed: 102.755 seconds
![Page 37: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/37.jpg)
HDFS stores your files as data chunks distributed across multiple datanodes
MapReduce runs multiple programs called mappers on each of the data chunks or blocks; the (key, value) output of these mappers is combined into results by reducers
It takes time for mappers and reducers to be spawned on such a distributed system
Hadoop Latency
![Page 38: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/38.jpg)
It is not possible to directly apply R's built-in machine learning functions in a MapReduce program
kcluster <- kmeans(mydata, 4, iter.max = 10)
Kmeans Clustering
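For contrast, here is the built-in kmeans running on local data; the toy dataset and seed below are illustrative. This is exactly the kind of single-machine call that cannot run inside a MapReduce job, which motivates the rewrite on the following slides:

```r
# Built-in kmeans on a small synthetic dataset: two well-separated
# point clouds, clustered into 2 groups.
set.seed(42)
mydata <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 5), ncol = 2))
kcluster <- kmeans(mydata, 2, iter.max = 10)
length(kcluster$size)  # 2 clusters
```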
![Page 39: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/39.jpg)
kmeans = function(points, ncenters, iterations = 10, distfun = NULL) {
  if (is.null(distfun))
    distfun = function(a, b) norm(as.matrix(a - b), type = 'F')
  newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
  # iteratively choosing new centers
  for (i in 1:iterations) {
    newCenters = kmeans.iter(points, distfun, centers = newCenters)
  }
  newCenters
}
Kmeans in MapReduce Style
![Page 40: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/40.jpg)
kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
  from.dfs(
    mapreduce(
      input = points,
      map = if (is.null(centers)) {
        # assign each point to a random center as a starting sample
        function(k, v) keyval(sample(1:ncenters, 1), v)
      } else {
        # find the center of minimum distance
        function(k, v) {
          distances = apply(centers, 1, function(c) distfun(c, v))
          keyval(centers[which.min(distances), ], v)
        }
      },
      reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
    to.data.frame = T)
}
Kmeans in MapReduce Style
![Page 41: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/41.jpg)
One More Thing… plyrmr
![Page 42: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/42.jpg)
Performs common data manipulation operations, as found in plyr and reshape2
Provides a familiar plyr-like interface while hiding many of the MapReduce details
plyr: tools for splitting, applying, and combining data
NEW! plyrmr
![Page 43: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/43.jpg)
Installing plyrmr dependencies
$ sudo yum install libxml2-devel
$ sudo yum install curl-devel
$ sudo R
> install.packages(c("RCurl", "httr"), dependencies = TRUE)
> install.packages("devtools", dependencies = TRUE)
> library(devtools)
> install_github("pryr", "hadley")
> install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE)
![Page 44: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/44.jpg)
$ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.1.0.tar.gz
$ sudo R CMD INSTALL plyrmr_0.1.0.tar.gz
Installing plyrmr
![Page 45: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/45.jpg)
> data(mtcars)
> head(mtcars)
> transform(mtcars, carb.per.cyl = carb/cyl)

> library(plyrmr)
> output(input(mtcars), "/tmp/mtcars")
> as.data.frame(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl))
> output(transform(input("/tmp/mtcars"), carb.per.cyl = carb/cyl), "/tmp/mtcars.out")
Transform in plyrmr
![Page 46: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/46.jpg)
where(
  select(
    mtcars,
    carb.per.cyl = carb/cyl,
    .replace = FALSE),
  carb.per.cyl >= 1)
select and where
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
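The same filter can be expressed in base R on the local mtcars data frame, which is a useful way to sanity-check the plyrmr pipeline's intent before running it on HDFS:

```r
# Base-R equivalent of the select/where pipeline above:
# add carb.per.cyl, then keep rows where it is at least 1.
data(mtcars)
mt  <- transform(mtcars, carb.per.cyl = carb / cyl)
res <- mt[mt$carb.per.cyl >= 1, ]
all(res$carb.per.cyl >= 1)  # TRUE
```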
![Page 47: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/47.jpg)
as.data.frame(
  select(
    group(
      input("/tmp/mtcars"),
      cyl),
    mean.mpg = mean(mpg)))
Group by
https://github.com/RevolutionAnalytics/plyrmr/blob/master/docs/tutorial.md
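The group-then-aggregate pattern has a direct base-R counterpart on local data, handy for verifying what the plyrmr version should return:

```r
# Base-R equivalent of the plyrmr group/select above:
# mean mpg per cylinder count, via a formula to aggregate().
data(mtcars)
mean.mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mean.mpg  # one row per distinct cyl value (4, 6, 8)
```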
![Page 48: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/48.jpg)
Reference
https://github.com/RevolutionAnalytics/RHadoop/wiki
http://www.slideshare.net/RevolutionAnalytics/rhadoop-r-meets-hadoop
http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
![Page 50: Big Data Analysis With RHadoop](https://reader038.vdocument.in/reader038/viewer/2022103013/540dda638d7f72927e8b4a2d/html5/thumbnails/50.jpg)