R-Hadoop Integration on Ubuntu


  • R-Hadoop Integration on Ubuntu: This manual covers R and Hadoop integration on Ubuntu 12.04.

    Pre-requisites:

    We assume that the user has the following up and running before starting the R and Hadoop integration:

    Ubuntu 12.04

    Hadoop 1.x +

    If you do not have Hadoop preinstalled on your Ubuntu machine, please follow the Single-node-cluster-(pseudo-distributed-mode-cluster.pdf guide available in your LMS under Module-7 to set up the environment for R integration with Hadoop.

    Once Hadoop installation is done, make sure that all the processes are running:
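    For example, the jps command (shipped with the JDK) lists the running Java processes; on a single-node Hadoop 1.x setup you would expect to see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:

    # list running Hadoop daemons (JDK's jps tool)
    jps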

    Note: R integration with Hadoop has issues with java-openjdk. To resolve this, we need Oracle Java 6 (oracle-java6) installed on the machine.

    To install oracle-java6, follow these steps:

    Give the command:

    sudo apt-get update
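    The remaining commands depend on how Oracle Java 6 is obtained; one possible route on Ubuntu 12.04 is the oracle-java6-installer package from the webupd8team/java PPA (this particular PPA and package are assumptions, not part of the original steps):

    # the webupd8team/java PPA is assumed; adjust if Oracle Java 6 is installed another way
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java6-installer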

  • Click Yes to accept the agreement.

  • Edit the .bashrc file:

    # Set Hadoop-related environment variables

    export CONF=/home/user/hadoop-1.2.0/conf

    # Set JAVA_HOME

    export JAVA_HOME=/usr/lib/jvm/java-6-oracle

    # Add Hadoop bin/ directory to PATH

    export PATH=$PATH:/home/user/hadoop-1.2.0/bin

    Note: Use the exact locations of these directories on your system.

    Make sure JAVA_HOME is set to the correct Java location.
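    To verify the new settings, reload .bashrc and check the Java installation that JAVA_HOME points to, for example:

    source ~/.bashrc
    echo $JAVA_HOME
    $JAVA_HOME/bin/java -version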

  • Installing RHadoop: RHadoop consists mainly of the following three R packages:

    rmr2

    rhdfs

    rhbase

    The rmr2 package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file operations in R, and rhbase provides HBase connectivity from R.

    Step #1: Update the sources.list.

    sudo gedit /etc/apt/sources.list

    Add the following line:

    deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/

  • Step #2: sudo apt-get update

    Step #3: Install r-base package.

    sudo apt-get install r-base

  • Checking the version of R:
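    For example, from the terminal:

    R --version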

  • Download the following packages from http://cran.cnr.berkeley.edu/ (note: rmr2 and rhdfs are not CRAN packages; they are distributed by the RHadoop project via the Revolution Analytics GitHub repository):

    bitops

    rhdfs

    digest

    rJava

    functional

    RJSONIO

    plyr

    rmr2

    Rcpp

    stringr

    reshape2

    The installation requires downloading the corresponding tar.gz archives.

    If the downloaded files are in the Downloads directory, change into that directory and untar each archive.
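    A minimal sketch, assuming the archives were saved under ~/Downloads (file names and versions are placeholders for whatever was actually downloaded):

    cd ~/Downloads
    # repeat for each downloaded archive (Rcpp, RJSONIO, digest, ...)
    tar -xzf Rcpp_<version>.tar.gz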

  • Then we can run the R CMD INSTALL command with sudo privileges for each of the following packages, in the order listed (a generic example is sketched after this list):

    Rcpp Package

    RJSONIO Package

    digest Package

  • functional package

    stringr package

    plyr package

  • bitops package

    reshape2 package

    rmr2 package
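    A generic sketch of the install command (the file name and version are placeholders; use the tar.gz actually downloaded for each package):

    sudo R CMD INSTALL Rcpp_<version>.tar.gz

    The same pattern applies to RJSONIO, digest, functional, stringr, plyr, bitops, reshape2 and rmr2, installing each package's own tar.gz in turn.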

  • Before installing the rJava package, we need to run the following command:

    sudo JAVA_HOME=/usr/lib/jvm/java-6-oracle/jre R CMD javareconf

  • rJava package

    sudo R CMD INSTALL rJava_0.9-3.tar.gz

    rhdfs package

    sudo HADOOP_CMD=/home/user/hadoop-1.2.0/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

  • Make sure that all of the packages listed above are installed.
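    One way to verify from the terminal is to list the installed R packages, for example:

    Rscript -e 'rownames(installed.packages())'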

    Getting started with RHadoop

    In principle, an RHadoop MapReduce job is similar to the R lapply function, which applies a function over a list or vector.

    Without the mapreduce function, we could write simple R code to double all the numbers from 1 to 100:

    > ints = 1:100
    > doubleInts = sapply(ints, function(x) 2*x)
    > head(doubleInts)
    [1]  2  4  6  8 10 12

    With the RHadoop rmr2 package, we can use the mapreduce function to implement the same calculation (see the doubleInts.R script):

    Sys.setenv(HADOOP_HOME="/home/vikas/hadoop")
    Sys.setenv(HADOOP_CMD="/home/vikas/hadoop/bin/hadoop")
    library(rmr2)
    library(rhdfs)
    ints = to.dfs(1:100)
    calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
    from.dfs(calc)
    $val
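    Assuming the code above is saved as doubleInts.R and the Hadoop paths match your installation, it could be run from the terminal with:

    Rscript doubleInts.R

    from.dfs(calc) returns the key/value pairs produced by the job; $val would hold each input number alongside its doubled value.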