Hands-on Hadoop: An intro for Web developers
Erik Eldridge
Engineer/Evangelist
Yahoo! Developer Network
Photo credit: http://www.flickr.com/photos/exfordy/429414926/sizes/l/
Goals
• Gain familiarity with Hadoop
• Approach Hadoop from a web dev's perspective
Prerequisites
• VMware
  – Hadoop will be demonstrated using a VMware virtual machine
  – I've found a virtual machine to be the easiest way to get started with Hadoop
• curl installed
Setup VM
• Download the VM from YDN
  – http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
• Note the credentials
  – user name: hadoop-user
  – password: hadoop
• Launch the VM
• Log in
• Note the IP of the machine
Start Hadoop
• Run the utility to launch Hadoop: $ ~/start-hadoop
• If it's already running, we'll get an error like:
  "172.16.83.132: datanode running as process 6752. Stop it first.
  172.16.83.132: secondarynamenode running as process 6845. Stop it first.
  ..."
Saying “hi” to Hadoop
• Call the hadoop command-line utility to see its usage: $ hadoop
• Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /
Saying “hi” to Hadoop
• If Hadoop has not been started, you'll see something like:
  "09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s).
  09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s).
  ..."
• If Hadoop has been launched, the dfs -ls command should show the contents of HDFS
• Before continuing, view all the Hadoop utilities and sample files: $ ls
Install Apache
• Why? In the interest of creating a relevant example, I'm going to work on Apache access logs
• Update apt-get so it can find apache2: $ sudo apt-get update
• Install apache2 so we can generate access log data: $ sudo apt-get install apache2
Generate data
• Jump into the directory containing the apache logs: $ cd /var/log/apache2
• Follow the last 10 lines of the access log as it grows: $ tail -f -n 10 access.log
Generate data
• Put this script, or something similar, in an executable file on your local machine
• Edit the IP address to that of your VM
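The script itself was on the slide image and isn't captured in this transcription. Below is a hypothetical sketch of the kind of traffic generator the slide describes; the IP address, request count, and query string are all placeholders. Edit VM_IP to your VM's address and uncomment the curl line to actually send the requests.

```shell
#!/bin/sh
# generate.sh -- print (and optionally fetch) a series of URLs against the
# VM's Apache server so that entries accumulate in access.log.
# VM_IP and COUNT are placeholders; override them or edit them in place.
VM_IP=${VM_IP:-172.16.83.132}
COUNT=${COUNT:-100}

i=1
while [ "$i" -le "$COUNT" ]; do
    url="http://$VM_IP/?req=$i"
    echo "$url"
    # Uncomment to actually send the request once VM_IP is set:
    # curl -s "$url" > /dev/null
    i=$((i + 1))
done
```

With the curl line uncommented, each request produces one entry in /var/log/apache2/access.log, which you'll see scroll by in the tail -f output on the VM.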
Generate data
• Set executable permissions on the file: $ chmod +x generate.sh
• Run the file: $ ./generate.sh
• Note the new log entries appearing in the tail output on the VM
Exploring HDFS
• Show the home directory structure: $ hadoop dfs -ls /user
• Create a directory: $ hadoop dfs -mkdir /user/foo
Exploring HDFS
• Attempt to re-create the new dir and note the error: $ hadoop dfs -mkdir /user/foo
• Create a directory using an implicit path: $ hadoop dfs -mkdir bar
• Auto-create nested directories: $ hadoop dfs -mkdir dir1/dir2/dir3
• Remove a dir: $ hadoop dfs -rmr /user/foo
• Remove multiple dirs at once: $ hadoop dfs -rmr bar dir1
• Try to re-remove a dir and note the error: $ hadoop dfs -rmr bar
Browse HDFS using web UI
• Open http://{VM IP address}:50070 in a browser to reach the NameNode's HDFS web UI (port 50030 serves the JobTracker UI)
Import access log data
• Load the access log into HDFS: $ hadoop dfs -put /var/log/apache2/access.log input/access.log
• Verify it's in there: $ hadoop dfs -ls input/access.log
• View the contents: $ hadoop dfs -cat input/access.log
Count words in data using Hadoop Streaming
• Hadoop Streaming refers to the ability to use an arbitrary language to define a job’s map and reduce processes
Python wordcount mapper
Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py
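The mapper code was on the slide image and isn't reproduced in this transcription. A minimal Python 3 sketch in the spirit of Noll's tutorial (the `map_line` helper name is my own):

```python
#!/usr/bin/env python3
"""mapper.py: emit a tab-separated (word, 1) pair for every word on stdin."""
import sys

def map_line(line):
    # Hadoop Streaming hands the mapper raw input lines; we emit
    # "word<TAB>1" for each whitespace-separated token.
    return ["%s\t1" % word for word in line.split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```

Streaming treats everything before the first tab as the key, so each word becomes a key with the count 1 as its value.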
Python wordcount reducer
Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python#Map:_mapper.py
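The reducer code was likewise only on the slide image. A minimal Python 3 sketch in the spirit of Noll's tutorial (the `reduce_lines` helper name is my own); it relies on the input being sorted by key, which Hadoop's shuffle/sort phase guarantees:

```python
#!/usr/bin/env python3
"""reducer.py: sum the counts for each word; input must be sorted by key."""
import sys

def reduce_lines(lines):
    # Consecutive lines with the same word are collapsed into one
    # (word, total) pair; a change of word flushes the running count.
    counts = []
    current_word, current_count = None, 0
    for line in lines:
        word, _, count = line.strip().partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                counts.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        counts.append((current_word, current_count))
    return counts

if __name__ == "__main__":
    for word, count in reduce_lines(sys.stdin):
        print("%s\t%s" % (word, count))
```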
Test run mapper and reducer
• $ cat data | ./mapper.py | sort | ./reducer.py
  – The sort stands in for Hadoop's shuffle/sort phase between map and reduce
Run Hadoop
• Stream the data through these two scripts, saving the output back to HDFS:
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input input/access.log \
    -output output/mapReduceOut \
    -mapper /home/{username}/mapper.py \
    -reducer /home/{username}/reducer.py
View output
• View the output files: $ hadoop dfs -ls output/mapReduceOut
• Note the multiple output files ("part-00000", "part-00001", etc.)
• View an output file's contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
Pig
• Pig is a higher-level interface for Hadoop
  – Interactive shell, Grunt
  – Declarative, SQL-like language, Pig Latin
  – The Pig engine compiles Pig Latin into MapReduce jobs
  – Extensible via user-defined functions written in Java
• "Writing MapReduce routines is like coding in assembly"
• Higher-level tools like Pig, Hive, etc. address this
Exploring Pig
• Pig is already on the VM
• Launch Pig with a connection to the cluster:
  $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
• View the contents of HDFS from Grunt: grunt> ls
Pig word count
Save this script in a file, e.g., wordcount.pig:

  myinput = LOAD 'input/access.log' USING TextLoader();
  words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
  grouped = GROUP words BY $0;
  counts = FOREACH grouped GENERATE group, COUNT(words);
  ordered = ORDER counts BY $0;
  STORE ordered INTO 'output/pigOut' USING PigStorage();
Perform word count w/ Pig
• Run the script:
  $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig
• View the output:
  $ hadoop dfs -cat output/pigOut/part-00000
Resources
• Apache Hadoop site
  – hadoop.apache.org
• Apache Pig site
  – hadoop.apache.org/pig/
• YDN Hadoop tutorial
  – developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup
• Michael G. Noll's tutorial
  – www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Thank you
• Follow me on Twitter
  – http://twitter.com/erikeldridge
• Find these slides on SlideShare
  – http://slideshare.net/erikeldridge
• Feedback? Suggestions?
  – http://speakerrate.com/erikeldridge