TRANSCRIPT
Jian Wang
Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das
Yahoo! Inc. Bangalore & Apache Software Foundation
Need to process 10TB datasets
On 1 node:
◦ scanning @ 50MB/s = 2.3 days
On 1000 node cluster:
◦ scanning @ 50MB/s = 3.3 min
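The arithmetic behind those two numbers can be sketched in a few lines of Python (the helper name `scan_time_seconds` is illustrative, assuming a sequential scan rate of 50 MB/s per node and decimal units, 1 TB = 1,000,000 MB):

```python
# Back-of-the-envelope scan times from the slide, assuming 50 MB/s per node
# and a 10 TB dataset read in parallel across the nodes.
def scan_time_seconds(dataset_mb, nodes, mb_per_sec=50.0):
    """Time to scan the whole dataset with `nodes` machines reading in parallel."""
    return dataset_mb / (nodes * mb_per_sec)

one_node = scan_time_seconds(10_000_000, 1)     # 200,000 s, about 2.3 days
cluster = scan_time_seconds(10_000_000, 1000)   # 200 s, about 3.3 minutes
```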
Need an efficient, reliable and usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ Current block replication is 3 (configurable)
◦ It cannot be directly mounted by an existing operating system
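The block layout above implies some easy arithmetic. A small sketch under the slide's defaults (64MB blocks, replication factor 3); `block_count` is an illustrative helper, not a Hadoop API:

```python
import math

# How a file maps onto HDFS blocks under the slide's defaults:
# 64 MB blocks, each block replicated 3 times across the cluster.
def block_count(file_mb, block_mb=64):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_mb / block_mb)

blocks = block_count(1000)   # a 1000 MB file -> 16 blocks
copies = blocks * 3          # with 3-way replication: 48 block copies stored
```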
Once you use the DFS (put something in it), relative paths resolve from /user/{your user id}. E.g., if your id is jwang30, your “home dir” is /user/jwang30
Master-Slave Architecture
MapReduce Master “Jobtracker” (irkm-1)
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and tasktracker status, re-executes tasks upon failure
MapReduce Slaves “Tasktrackers” (irkm-1 to irkm-6)
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
Hadoop is locally “installed” on each machine
◦ Version 0.19.2
◦ Installed location is /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
If it is the first time that you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
Most commands follow the same pattern:
◦ bin/hadoop “some command” options
◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)
hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
bin/start-all.sh – starts all slave nodes and master node
bin/stop-all.sh – stops all slave nodes and master node
Run jps to check which Hadoop daemons are running
Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that:
bin/hadoop dfs -ls
Mapper.py
Reducer.py
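The transcript names mapper.py and reducer.py but does not reproduce them. A minimal sketch of a streaming wordcount pair might look like the following; the logic is factored into two testable functions (`map_line` and `reduce_sorted` are illustrative names), and in a real job the map and reduce halves would live in two separate scripts, each reading stdin and writing tab-separated key/value lines to stdout:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming wordcount mapper/reducer pair.
import sys

def map_line(line):
    """Mapper logic: emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def reduce_sorted(pairs):
    """Reducer logic: sum counts per word.

    Assumes pairs arrive grouped by key, which Hadoop guarantees by
    sorting mapper output before it reaches the reducer.
    """
    totals = []
    for word, count in pairs:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals

if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for line in sys.stdin:
            for word, count in map_line(line):
                sys.stdout.write("%s\t%d\n" % (word, count))
    elif sys.argv[1] == "reduce":
        pairs = [line.rsplit("\t", 1) for line in sys.stdin]
        for word, total in reduce_sorted([(w, int(c)) for w, c in pairs]):
            sys.stdout.write("%s\t%d\n" % (word, total))
```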
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
bin/hadoop dfs -cat java-output/part-00000
bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
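What the streaming framework does between the mapper and the reducer can be mimicked locally: map each input line, sort the pairs by key (the shuffle), then reduce. A self-contained sketch of that map-sort-reduce flow (`wordcount_pipeline` is an illustrative name, not a Hadoop API); its output roughly corresponds to what ends up in java-output/part-00000:

```python
def wordcount_pipeline(lines):
    """Approximate the streaming wordcount job on a list of input lines."""
    # Map phase: one (word, 1) pair per token, as mapper.py would print.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle/sort phase: Hadoop sorts mapper output by key before reducing.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: sum consecutive counts per word, as reducer.py would.
    totals = []
    for word, count in pairs:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals
```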
Hadoop job tracker◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs checker◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp