data science bootcamp day2

Data Science Bootcamp, Day 2 Presented By: Chetan Khatri, Volunteer Teaching assistant, Data Science Lab. Guidance By: Prof. Devji D. Chhanga, University of Kachchh.

Upload: chetan-khatri

Post on 12-Feb-2017


TRANSCRIPT

Page 1: Data science bootcamp day2

Data Science Bootcamp, Day 2

Presented By: Chetan Khatri, Volunteer Teaching assistant, Data Science Lab.

Guidance By: Prof. Devji D. Chhanga, University of Kachchh.

Page 2: Data science bootcamp day2

Agenda

Understanding Git.

Understanding Apache Maven.

Hello World Java Program with Apache Maven.

Understanding of Hadoop Administrative Commands.

WordCount Hadoop Program on Hadoop Cluster with Maven.

Page 3: Data science bootcamp day2

Git with GitHub

● GitHub: a hosting service for Git repositories, where you can store your source code, share it with team members, and work on it collaboratively.

● Installation: sudo apt-get install git

● Steps to do:

1. Create Repository

2. Clone - copy a repository (your own or someone else's) to your machine.

3. Commit - record your changes in your local repository.

4. Push - send your local commits to the remote repository.

5. https://github.com - create an account and create a repository.

Page 4: Data science bootcamp day2

Let’s have a Demo with Git

● Create a repository at GitHub named hadoopdemo

●Cloning Repository: git clone https://github.com/dskskv/hadoopdemo.git

● Configure Git with your identity:

git config --global user.email "you@example.com"

git config --global user.name "Your Name"

Add an individual file: git add README.md

Add every file: git add .

Page 5: Data science bootcamp day2

Let’s have a Demo with Git (Contd.)

Commit command: git commit -m "Comment anything"

Push whatever has been committed to the GitHub repository:

git push origin master

Pull - gets the latest code from the repository.

Example: git pull https://github.com/dskskv/hadoopdemo.git
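The clone, commit and push cycle from this demo can also be rehearsed without a GitHub account. The following is a minimal sketch, not part of the original slides: a bare repository in a temporary directory stands in for the GitHub remote, and the identity, file content and commit message are placeholders.

```shell
#!/bin/sh
# Rehearse create / clone / commit / push using only local directories.
set -e
work=$(mktemp -d)                        # scratch area (hypothetical layout)

# 1. Create a repository: a bare repo plays the role of the GitHub remote.
git init --quiet --bare "$work/remote.git"

# 2. Clone it (Git warns that the repository is empty; that is expected).
git clone --quiet "$work/remote.git" "$work/clone"
cd "$work/clone"
git config user.email "you@example.com"  # placeholder identity, this repo only
git config user.name "Your Name"

# 3. Add a file and commit it to the local repository.
echo "hadoopdemo" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M master                     # make sure the branch is called master

# 4. Push the local commit to the remote.
git push --quiet origin master

# Show what the remote now contains.
git --git-dir="$work/remote.git" log --oneline master
```

Because the remote is just a directory, the whole exercise can be thrown away by deleting the temporary folder.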

Git Branches:

Branches are separate lines of development within a repository, for example for the development, test, and production phases of software development.

The master branch should always hold the up-to-date code.

For example, if "internal" and "final" are exams, you can create a branch for each of them on GitHub.
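A minimal sketch of the branch idea, assuming the two exam branches from the example above. The file names, contents and identity are made up for illustration, and everything happens in a temporary local repository.

```shell
#!/bin/sh
# Create "internal" and "final" branches next to master and commit on one.
set -e
cd "$(mktemp -d)"
git init --quiet
git config user.email "you@example.com"  # placeholder identity
git config user.name "Your Name"

# A first commit on master that both branches will start from.
echo "shared notes" > notes.txt
git add notes.txt
git commit --quiet -m "Initial commit"
git branch -M master

# One branch per exam, both starting at the current master commit.
git branch internal
git branch final

# Work that is committed on "internal" stays on that branch.
git checkout --quiet internal
echo "internal exam answers" > internal.txt
git add internal.txt
git commit --quiet -m "Add internal exam work"

# Back on master the new file is absent from the working directory.
git checkout --quiet master
git branch --list
ls
```

Switching back to master hides internal.txt until the internal branch is merged, which is what keeps unfinished work separate from the main code.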

Page 6: Data science bootcamp day2

Understanding Apache Maven

Apache Maven is a build tool for Java. It lets you use other artifacts (jar files written by someone else) as dependencies and build your own jar file, which can also bundle the artifacts you added.

Maven Life Cycle:

Create Maven Project

Update Maven Project

Write Java Code

Maven Clean

Maven Build (For building your Jar file)
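On the command line, the steps above could look roughly like the sketch below. It is not taken from the slides: it assumes Apache Maven, a JDK compatible with the archetype's compiler settings, and network access to Maven Central, and the names com.example.bootcamp and hello-maven are placeholders. "Update Maven Project" is presumably the IDE action (for example in Eclipse), so it has no separate command here.

```shell
#!/bin/sh
# Create, clean and build a small Maven project from the quickstart archetype.
set -e
if ! command -v mvn >/dev/null 2>&1; then
    echo "Apache Maven (mvn) is not installed; nothing to do." >&2
else
    cd "$(mktemp -d)"

    # Create Maven Project: generate a skeleton with a pom.xml and App.java.
    mvn -q archetype:generate \
        -DgroupId=com.example.bootcamp \
        -DartifactId=hello-maven \
        -DarchetypeArtifactId=maven-archetype-quickstart \
        -DinteractiveMode=false
    cd hello-maven

    # Write Java Code: the archetype already generated
    # src/main/java/com/example/bootcamp/App.java, which prints a greeting.

    # Maven Clean: delete the target/ directory left by earlier builds.
    mvn -q clean

    # Maven Build: compile, run the unit tests and package the jar.
    mvn -q package

    # Run the class from the jar that was just built.
    java -cp target/hello-maven-1.0-SNAPSHOT.jar com.example.bootcamp.App
fi
```

The jar name follows Maven's artifactId-version pattern, which is also why the WordCount jar later in the deck ends in 0.0.1-SNAPSHOT.jar.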

Page 7: Data science bootcamp day2

Understanding Hadoop Administrative Commands

1. Clone the CCCS936 repository from GitHub:
git clone https://github.com/dskskv/CCCS936.git

2. Start the Hadoop cluster:
sbin/start-dfs.sh
sbin/start-yarn.sh

3. Check the Hadoop version:
hadoop version

4. Check all the options under the hadoop command:
hadoop

5. Create a directory named "dskskv" in HDFS:
hadoop fs -mkdir /dskskv

Page 8: Data science bootcamp day2

Understanding Hadoop Administrative Commands

6. List the contents of the /dskskv directory in HDFS:
hadoop fs -ls /dskskv

7. Create a text file:
sudo gedit inputfile.txt

8. Copy the text file into the HDFS directory:
hadoop fs -put inputfile.txt /dskskv

9. Read the content of the text file stored in HDFS:
hadoop fs -cat /dskskv/inputfile.txt

Page 9: Data science bootcamp day2

Understanding Hadoop Administrative Commands

10. The hdfs command can be used for the same operations (the older hadoop dfs form is deprecated in favour of hdfs dfs):
hdfs dfs -mkdir /chetan
hdfs dfs -put inputfile.txt /chetan
hdfs dfs -cat /chetan/inputfile.txt

11. Delete a file from HDFS:
hadoop fs -rm /dskskv/inputfile.txt

12. Delete a directory from HDFS:
hadoop fs -rm -r /dskskv
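Steps 5 to 12 form one round trip through HDFS, which can be collected into a single script. The sketch below is an illustration rather than part of the slides. Because the last two commands delete data, it only prints each command by default and executes them when RUN=1 is set, which requires the hadoop client and a running cluster. The run helper and the RUN variable are hypothetical names introduced here.

```shell
#!/bin/sh
# HDFS round trip: create a directory, upload, list, read, then clean up.
hdfs_dir=/dskskv            # directory name used in the slides
local_file=inputfile.txt    # local text file created in step 7

# Print the command; execute it only when RUN=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

run hadoop fs -mkdir "$hdfs_dir"                  # step 5
run hadoop fs -put "$local_file" "$hdfs_dir"      # step 8
run hadoop fs -ls "$hdfs_dir"                     # step 6
run hadoop fs -cat "$hdfs_dir/$local_file"        # step 9
run hadoop fs -rm "$hdfs_dir/$local_file"         # step 11
run hadoop fs -rm -r "$hdfs_dir"                  # step 12
```

Saved under a name of your choice, the script previews the commands when run with sh and executes them when RUN=1 is placed in front of the sh command.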

Page 10: Data science bootcamp day2

WordCount Hadoop Program on Hadoop Cluster with Maven

1) Log in as the Hadoop user:
su hduser

2) Start the Hadoop daemon services:
sbin/start-dfs.sh
sbin/start-yarn.sh

3) Check whether all daemon services are up:
jps

4) Create a directory in HDFS. Note: whichever directory you run the command from, make sure the Hadoop user has the privileges to access it.
hadoop fs -mkdir /input

5) Transfer the text file to HDFS:
hadoop fs -put inputfile.txt /input

Page 11: Data science bootcamp day2

WordCount Hadoop Program on Hadoop Cluster with Maven

6) Check whether the file was transferred successfully:
hadoop fs -ls /input

7) Execute the Hadoop job by providing the program's executable jar file, the input directory path that holds the text file, and the output directory path where the processed data should be stored:

hadoop jar WordCountDSKSKV-0.0.1-SNAPSHOT.jar /input /output

8) Check that the output directory contains the processed files:
hadoop fs -ls /output

9) Read the output of the Hadoop job:
hadoop fs -cat /output/part-r-00000
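Step 9 prints one word and its count per line. For a small input, that result can be cross-checked locally with standard Unix tools. This is a sketch under an assumption the slides do not confirm: that the jar behaves like the standard Hadoop WordCount example, splitting on whitespace and treating words as case-sensitive. The sample text is made up.

```shell
#!/bin/sh
# Local word count to compare against /output/part-r-00000.
set -e
cd "$(mktemp -d)"
printf 'hello hadoop\nhello maven\n' > inputfile.txt   # made-up sample input

# Put one word on each line, sort so equal words are adjacent (byte order),
# count each run of equal words, then print "word<TAB>count".
tr -s '[:space:]' '\n' < inputfile.txt \
    | LC_ALL=C sort \
    | uniq -c \
    | awk 'NF == 2 { print $2 "\t" $1 }' > expected.txt

cat expected.txt
```

For this sample the file lists hadoop with 1, hello with 2 and maven with 1; if the job's output differs for the same input, the jar probably tokenizes differently from the assumption above.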