R Integration with Hadoop on Ubuntu
R-Hadoop Integration on Ubuntu: This manual is a guide to integrating R with Hadoop on Ubuntu 12.04.
Prerequisites:
We assume that the user has the following up and running before starting the R and Hadoop integration:
Ubuntu 12.04
Hadoop 1.x +
If Hadoop is not already installed on your Ubuntu machine, follow the Single-node-cluster-(pseudo-distributed-mode-cluster.pdf guide in your LMS under Module-7 to set up the environment for R integration with Hadoop.
Once Hadoop installation is done, make sure that all the processes are running:
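One way to do this check is with the JDK's jps tool, which lists the running Java processes. The sketch below assumes a single-node Hadoop 1.x setup with the five standard daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker). The helper name missing_daemons is made up for illustration and is not part of Hadoop.

```shell
# Hypothetical helper: reads `jps` output on stdin and prints each
# Hadoop 1.x daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Exact match on the process-name column (second field of jps output).
        if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Usage (requires a JDK on the PATH):
#   jps | missing_daemons
```

If the command prints nothing, all five daemons were found in the jps listing.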
Note: R integration with Hadoop has issues with OpenJDK (java-openjdk). To resolve them, oracle-java6 must be installed on the machine.
To install oracle-java6, follow these steps:
Give the command:
sudo apt-get update
Select Yes to accept the license agreement.
Edit the .bashrc file:
# Set Hadoop-related environment variables
export CONF=/home/user/hadoop-1.2.0/conf
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:/home/user/hadoop-1.2.0/bin
Note: Replace the paths above with the actual locations of these directories on your system.
Make sure JAVA_HOME is set to the correct java location.
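A quick way to confirm this is to test whether the directory contains an executable bin/java. The following is a minimal sketch; the function name check_java_home is hypothetical, and it assumes only that a valid Java home has a bin/java launcher.

```shell
# Hypothetical helper: succeeds only if the given directory looks like a
# Java home, i.e. it contains an executable bin/java.
check_java_home() {
    java_home=$1
    if [ -z "$java_home" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$java_home/bin/java" ]; then
        echo "no executable found at $java_home/bin/java"
        return 1
    fi
    echo "JAVA_HOME looks valid: $java_home"
}

# Usage, after reloading the shell configuration with `. ~/.bashrc`:
#   check_java_home "$JAVA_HOME"
```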
Installing RHadoop
RHadoop consists mainly of the following three R packages:
rmr2
rhdfs
rhbase
The rmr2 package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file operations in R, and rhbase provides HBase connectivity from R.
Step #1: Update the sources.list.
sudo gedit /etc/apt/sources.list
Add the line:
deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/
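As an alternative to editing the file by hand in gedit, the line can be appended from the shell. The sketch below is one possible approach, not part of the original procedure. The function name add_cran_source is hypothetical, and the check for an existing line is there so that running it twice does not add a duplicate entry.

```shell
# Hypothetical helper: appends the CRAN repository line to a sources file
# only if that exact line is not already present.
add_cran_source() {
    sources_file=$1
    cran_line='deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/'
    if grep -qxF "$cran_line" "$sources_file" 2>/dev/null; then
        echo "already present"
    else
        printf '%s\n' "$cran_line" >> "$sources_file"
        echo "added"
    fi
}

# Usage: the real file is owned by root, so run this from a root shell
# (for example after `sudo -s`):
#   add_cran_source /etc/apt/sources.list
```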
Step #2: Update the package index.
sudo apt-get update
Step #3: Install r-base package.
sudo apt-get install r-base
Checking the version of R:
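The standard command for this is R --version. The sketch below wraps it in a small function (the name r_version_status is hypothetical) so that it prints a hint instead of failing when R is not yet on the PATH.

```shell
# Hypothetical helper: prints the first line of `R --version`, or a hint
# when R is not on the PATH.
r_version_status() {
    if command -v R >/dev/null 2>&1; then
        R --version | head -n 1
    else
        echo "R not found on PATH"
    fi
}

r_version_status
```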
Download the following packages from: http://cran.cnr.berkeley.edu/ (rmr2 and rhdfs are not hosted on CRAN; they are distributed by the RHadoop project on GitHub)
bitops
rhdfs
digest
rJava
functional
RJSONIO
plyr
rmr2
Rcpp
stringr
reshape2
The installation requires the corresponding tar.gz archives to be downloaded.
If the downloaded files are in Downloads, change to that directory and untar each zipped archive.
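The extraction step can be sketched as a loop over the archives. This is a minimal illustration, assuming the archives sit directly in one directory such as ~/Downloads. The function name untar_all is hypothetical.

```shell
# Hypothetical helper: extracts every .tar.gz archive found in a directory
# into that same directory.
untar_all() {
    dir=$1
    for archive in "$dir"/*.tar.gz; do
        # Skip the unexpanded pattern when the directory has no archives.
        [ -e "$archive" ] || continue
        tar -xzf "$archive" -C "$dir"
        echo "extracted $(basename "$archive")"
    done
}

# Usage:
#   untar_all "$HOME/Downloads"
```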
Then run the R CMD INSTALL command with sudo privileges for each package, in this order:
Rcpp package
RJSONIO package
digest package
functional package
stringr package
plyr package
bitops package
reshape2 package
rmr2 package
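The install sequence can be sketched as a dry run that prints one R CMD INSTALL command per archive, in the order the packages are listed in this guide. The function name print_install_commands is hypothetical. The sketch assumes the archives follow the usual name_version.tar.gz pattern and that the paths contain no spaces. No version numbers are assumed; whatever archive is present is used.

```shell
# Hypothetical helper: prints, in the order used by this guide, the install
# command for each package archive found in a directory. Nothing is
# installed; the output is only a list of commands to review.
print_install_commands() {
    dir=$1
    for pkg in Rcpp RJSONIO digest functional stringr plyr bitops reshape2 rmr2; do
        found=no
        for archive in "$dir/${pkg}"_*.tar.gz; do
            # Skip the unexpanded pattern when no archive matches.
            [ -e "$archive" ] || continue
            echo "sudo R CMD INSTALL $archive"
            found=yes
        done
        [ "$found" = yes ] || echo "# missing archive for $pkg"
    done
}

# Usage:
#   print_install_commands "$HOME/Downloads"
```

Printing the commands first makes it easy to spot a missing archive before anything is installed. Once the list looks right, the lines can be run one at a time.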
Before installing the rJava package, reconfigure R's Java settings so that R uses the Oracle JDK:
sudo JAVA_HOME=/usr/lib/jvm/java-6-oracle/jre R CMD javareconf
rJava package
sudo R CMD INSTALL rJava rJava_0.9-3.tar.gz
rhdfs package
sudo HADOOP_CMD=/home/istvan/hadoop/bin/hadoop R CMD INSTALL rhdfs rhdfs_1.0.5.tar.gz
Make sure that the following packages are installed:
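One way to check this from the shell is to ask R which packages are missing from its library. The sketch below is an illustration only. The function name report_missing_r_packages is hypothetical, and the package list is taken from the download list in this guide.

```shell
# Hypothetical helper: asks R which of the packages from this guide are not
# yet installed. Prints a hint instead when Rscript is unavailable.
report_missing_r_packages() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH"
        return 1
    fi
    Rscript -e 'wanted <- c("Rcpp", "RJSONIO", "digest", "functional", "stringr", "plyr", "bitops", "reshape2", "rmr2", "rJava", "rhdfs")' \
            -e 'missing <- setdiff(wanted, rownames(installed.packages()))' \
            -e 'if (length(missing) == 0) { cat("all packages installed\n") } else { cat("missing:", missing, "\n") }'
}

# Usage:
#   report_missing_r_packages
```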
Getting started with RHadoop
In principle, RHadoop MapReduce is similar to R's lapply function, which applies a function over a list or vector.
Without the mapreduce function, we could write simple R code to double all the numbers from 1 to 100:
> ints = 1:100
> doubleInts = sapply(ints, function(x) 2*x)
> head(doubleInts)
[1] 2 4 6 8 10 12
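With the RHadoop rmr2 package, we could use the mapreduce function to implement the same calculation; see the doubleInts.R script:
# Point R at the Hadoop installation (adjust the paths to your system)
Sys.setenv(HADOOP_HOME="/home/vikas/hadoop")
Sys.setenv(HADOOP_CMD="/home/vikas/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
# Write the input vector to HDFS
ints = to.dfs(1:100)
# Map step: pair each value with its double
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
# Read the result back from HDFS
from.dfs(calc)
$val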