TRANSCRIPT
Jian Wang
Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das
Yahoo! Inc. Bangalore & Apache Software Foundation
Need to process 10TB datasets
On 1 node:
◦ scanning @ 50MB/s = 2.3 days
On 1000 node cluster:
◦ scanning @ 50MB/s = 3.3 min
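The arithmetic behind those two numbers can be sketched in a few lines of Python (the helper name `scan_time_seconds` is illustrative, assuming a sequential scan rate of 50 MB/s per node and decimal units, 1 TB = 1,000,000 MB):

```python
# Back-of-the-envelope scan times from the slide, assuming 50 MB/s per node
# and a 10 TB dataset read in parallel across the nodes.
def scan_time_seconds(dataset_mb, nodes, mb_per_sec=50.0):
    """Time to scan the whole dataset with `nodes` machines reading in parallel."""
    return dataset_mb / (nodes * mb_per_sec)

one_node = scan_time_seconds(10_000_000, 1)     # 200,000 s, about 2.3 days
cluster = scan_time_seconds(10_000_000, 1000)   # 200 s, about 3.3 minutes
```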
Need an efficient, reliable and usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ Current block replication is 3 (configurable)
◦ It cannot be directly mounted by an existing operating system
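The block layout above implies some easy arithmetic. A small sketch under the slide's defaults (64MB blocks, replication factor 3); `block_count` is an illustrative helper, not a Hadoop API:

```python
import math

# How a file maps onto HDFS blocks under the slide's defaults:
# 64 MB blocks, each block replicated 3 times across the cluster.
def block_count(file_mb, block_mb=64):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_mb / block_mb)

blocks = block_count(1000)   # a 1000 MB file -> 16 blocks
copies = blocks * 3          # with 3-way replication: 48 block copies stored
```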
Once you use the DFS (put something in it), relative paths resolve from /user/{your user id}. E.g., if your id is jwang30, your “home dir” is /user/jwang30
Master-Slave Architecture
MapReduce Master “Jobtracker” (irkm-1)
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and tasktracker status, re-executes tasks upon failure
MapReduce Slaves “Tasktrackers” (irkm-1 to irkm-6)
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
Hadoop is locally “installed” on each machine
◦ Version 0.19.2
◦ Installed location is /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
If it is the first time that you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
Most commands follow the same pattern:
◦ bin/hadoop “some command” options
◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)
hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
bin/start-all.sh – starts all slave nodes and master node
bin/stop-all.sh – stops all slave nodes and master node
Run jps to check which Hadoop daemons are running
Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that:
bin/hadoop dfs -ls
Mapper.py
Reducer.py
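The transcript names mapper.py and reducer.py but does not reproduce them. A minimal sketch of a streaming wordcount pair might look like the following; the logic is factored into two testable functions (`map_line` and `reduce_sorted` are illustrative names), and in a real job the map and reduce halves would live in two separate scripts, each reading stdin and writing tab-separated key/value lines to stdout:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming wordcount mapper/reducer pair.
import sys

def map_line(line):
    """Mapper logic: emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_sorted(pairs):
    """Reducer logic: sum counts per word.

    Assumes pairs arrive grouped by key, which Hadoop guarantees by
    sorting mapper output before it reaches the reducer.
    """
    totals = []
    for word, count in pairs:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals

if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for line in sys.stdin:
            for word, count in map_line(line):
                sys.stdout.write("%s\t%d\n" % (word, count))
    elif sys.argv[1] == "reduce":
        pairs = [line.rsplit("\t", 1) for line in sys.stdin]
        for word, total in reduce_sorted([(w, int(c)) for w, c in pairs]):
            sys.stdout.write("%s\t%d\n" % (word, total))
```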
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
bin/hadoop dfs -cat java-output/part-00000
bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
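What the streaming framework does between the mapper and the reducer can be mimicked locally: map each input line, sort the pairs by key (the shuffle), then reduce. A self-contained sketch of that map-sort-reduce flow (`wordcount_pipeline` is an illustrative name, not a Hadoop API); its output roughly corresponds to what ends up in java-output/part-00000:

```python
def wordcount_pipeline(lines):
    """Approximate the streaming wordcount job on a list of input lines."""
    # Map phase: one (word, 1) pair per token, as mapper.py would print.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle/sort phase: Hadoop sorts mapper output by key before reducing.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum consecutive counts per word, as reducer.py would.
    totals = []
    for word, count in pairs:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals
```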
Hadoop job tracker◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs checker◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp