data science bootcamp day2

Post on 12-Feb-2017

Data Science Bootcamp, Day 2

Presented By: Chetan Khatri, Volunteer Teaching Assistant, Data Science Lab.

Guidance By: Prof. Devji D. Chhanga, University of Kachchh.

Agenda

Understanding Git.

Understanding Apache Maven.

Hello World Java Program with Apache Maven.

Understanding of Hadoop Administrative Commands.

WordCount Hadoop Program on Hadoop Cluster with Maven.

Git with GitHub

● GitHub: A hosting service for Git repositories where you can store your source code and collaborate on it with team members.

● Installation: sudo apt-get install git

● Steps to do:

1. Create Repository

2. Clone - Copy an existing repository (your own or someone else's) to your machine.

3. Commit - Record your staged changes in the local repository.

4. Push - Upload your local commits to the remote repository.

5. https://github.com - Create an account and create a repository.

Let’s Have a Demo with Git

● Create a repository on GitHub named hadoopdemo

● Clone the repository: git clone https://github.com/dskskv/hadoopdemo.git

● Configure Git with your name and email (recorded in each commit):

git config --global user.email "you@example.com"

git config --global user.name "Your Name"

Add an individual file: git add README.md

Add all files: git add .

Let’s Have a Demo with Git (Contd.)

Commit command: git commit -m "Comment anything"

Push the committed changes to the GitHub repository:

git push origin master

Pull: get the latest code from the repository.

Example: git pull https://github.com/dskskv/hadoopdemo.git

Git Branches:

Branches are separate lines of work within a repository, for example one each for the development, test, and production phases of software development.

The master branch always holds the up-to-date code.

For example, if internal and final are two exams, you can create a branch for each of them on GitHub.

Understanding Apache Maven

Apache Maven is a build tool for Java. It lets you use other artifacts (JAR files written by someone else) as dependencies and build your own JAR file, which can bundle the artifacts you added.

Maven Life Cycle:

Create Maven Project

Update Maven Project

Write Java Code

Maven Clean

Maven Build (For building your Jar file)
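
The agenda lists a Hello World Java program built with Maven. A minimal sketch of such a class might look like the following; the class name HelloWorld and the greeting helper are hypothetical, chosen only for illustration and not taken from the bootcamp repository. In Maven's standard directory layout a class like this would sit under src/main/java, and the build step above would package it into the JAR file.

```java
// Hypothetical Hello World class; the name and the greeting() helper are
// illustrative assumptions, not code from the bootcamp repository.
class HelloWorld {

    // Builds the message separately so it can be reused or checked without printing.
    static String greeting(String name) {
        return "Hello, " + name + "!";
    }

    public static void main(String[] args) {
        // Use the first command-line argument if present, otherwise greet the world.
        String name = args.length > 0 ? args[0] : "World";
        System.out.println(greeting(name));
    }
}
```

Run without arguments, the class prints Hello, World!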

Understanding Hadoop Administrative Commands

1. Cloning GitHub CCCS936 repository
git clone https://github.com/dskskv/CCCS936.git

2. Start Hadoop Cluster
sbin/start-dfs.sh
sbin/start-yarn.sh

3. Check Hadoop Version
hadoop version

4. Check all the options under hadoop command
hadoop

5. Create Directory as "dskskv" at HDFS
hadoop fs -mkdir /dskskv

Understanding Hadoop Administrative Commands

6. List the contents of the dskskv directory in HDFS
hadoop fs -ls /dskskv

7. Create a text file
sudo gedit inputfile.txt

8. Copy the text file into HDFS
hadoop fs -put inputfile.txt /dskskv

9. Read the content of the text file stored in HDFS
hadoop fs -cat /dskskv/inputfile.txt

Understanding Hadoop Administrative Commands

10. The hdfs dfs command performs the same operations; the older hadoop dfs form is deprecated in its favor.
hdfs dfs -mkdir /chetan
hdfs dfs -put inputfile.txt /chetan
hdfs dfs -cat /chetan/inputfile.txt

11. Deleting a file from HDFS
hadoop fs -rm /dskskv/inputfile.txt

12. Deleting a directory from HDFS
hadoop fs -rm -r /dskskv

WordCount Hadoop Program on Hadoop Cluster with Maven

1) Log in as the Hadoop user:
su hduser

2) Start the Hadoop daemon services
sbin/start-dfs.sh
sbin/start-yarn.sh

3) Check whether all daemon services are up
jps

4) Create a directory in HDFS. Note: wherever you are in the console, make sure the Hadoop user has privileges to access it.
hadoop fs -mkdir /input

5) Transfer the text file to HDFS
hadoop fs -put inputfile.txt /input

WordCount Hadoop Program on Hadoop Cluster with Maven

6) Check whether the file was transferred successfully
hadoop fs -ls /input

7) Execute the Hadoop job by providing the executable JAR file of the Hadoop program, the input directory path that contains the text file, and the output directory path where the processed data should be stored.

hadoop jar WordCountDSKSKV-0.0.1-SNAPSHOT.jar /input /output

8) Check that the output directory contains the processed files
hadoop fs -ls /output

9) Read your desired output from the Hadoop job.
hadoop fs -cat /output/part-r-00000
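
The slides run the WordCount JAR but do not show its source. To illustrate what such a job computes, here is a plain-Java sketch of the word-count idea. It does not use the Hadoop API and is not the code inside WordCountDSKSKV-0.0.1-SNAPSHOT.jar; the class name WordCountSketch, its method names, and the sample lines are all hypothetical. The sketch assumes words are separated by whitespace and keeps a "map" step (split a line into words) apart from a "reduce" step (sum the occurrences per word).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word-count idea; it does not use the Hadoop API
// and is not the code inside WordCountDSKSKV-0.0.1-SNAPSHOT.jar.
class WordCountSketch {

    // "Map" step: split one line on whitespace and emit each non-empty word.
    static List<String> mapLine(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // "Reduce" step: sum the occurrences of each word.
    // A TreeMap keeps the words sorted, similar to the sorted keys in a reducer's output file.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : mapLine(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the contents of inputfile.txt.
        List<String> lines = Arrays.asList("hello hadoop", "hello maven");
        // Print word<TAB>count, one pair per line.
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

For the sample lines, the sketch prints hadoop 1, hello 2, and maven 1, one tab-separated pair per line, which is the same word-and-count layout you would expect to read from /output/part-r-00000.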
