Ruby on Hadoop

DESCRIPTION
Introduction to Hadoop, as well as a brief overview of the Wukong and wukong-hadoop gems

TRANSCRIPT
Ruby on Hadoop
Tuesday, January 8, 13
Introduction
Hi. I’m Ted O’Meara
...and I just quit my job last week.
@tomeara
tedomeara.com
MapReduce
History of MapReduce
•First implemented by Google
•Used in CouchDB, Hadoop, etc.
•Helps to “distill” data into a concentrated result set
What is MapReduce?
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: turn each word into a [word, 1] pair.
input.map! { |x| [x, 1] }

# Reduce: sum the counts.
sum = 0
input.each do |x|
  sum += x[1]
end
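The two snippets above can be carried through the full MapReduce shape: the map phase emits a [word, 1] pair per token, a shuffle groups the pairs by word, and the reduce phase sums each group. A minimal local sketch in plain Ruby (no Hadoop involved):

```ruby
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: emit a [word, 1] pair for every word.
pairs = input.map { |word| [word, 1] }

# Shuffle: group the pairs by their key (the word).
grouped = pairs.group_by { |word, _count| word }

# Reduce: sum the counts inside each group.
counts = grouped.map { |word, group| [word, group.sum { |_w, c| c }] }.to_h

counts # => {"deer"=>2, "bear"=>2, "river"=>2, "car"=>3}
```

The per-key grouping is what distinguishes real MapReduce output from the single running sum above: each word gets its own reduce.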
Hadoop Breakdown
History of Hadoop
•Doug Cutting @ Yahoo!
•It is a Toy Elephant (the name comes from a toy elephant belonging to Cutting’s son)
•It is also a framework for distributed computing
•It is a distributed filesystem
Network Topology
Hadoop Cluster
[Diagram: a Hadoop cluster of many TaskTracker/DataNode workers spread across racks 555.555.1.*, 555.555.2.*, and 444.444.1.*, coordinated by a JobTracker and a NameNode]

Cluster
•Commodity hardware
•Partition tolerant
•Network-aware (rack-aware)
Hadoop Cluster
[Diagram: the same cluster; the NameNode sends heartbeat checks (♥) to the DataNodes]

NameNode
•Keeps track of the DataNodes
•Uses a “heartbeat” to determine a node’s health
•The most resources should be spent here
Hadoop Cluster
[Diagram: the same cluster, highlighting the DataNodes]

DataNode
•Stores filesystem blocks
•Can be scaled up and down (spun up/down)
•Replicates blocks based on a set replication factor
Hadoop Cluster
[Diagram: the same cluster; the JobTracker consults the NameNode before assigning work]

JobTracker
•Delegates which TaskTrackers should handle a MapReduce job
•Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source data lives
Hadoop Cluster
[Diagram: the same cluster, highlighting the TaskTrackers]

TaskTracker
•Worker for MapReduce jobs
•The closer it is to the DataNode holding the data, the better
HDFS
[Diagram: the same cluster; a file put into HDFS is split into blocks and replicated across DataNodes]

hadoop fs -put localfile /user/hadoop/hadoopfile
Hadoop Streaming
[Diagram: the same cluster running a streaming job]
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input "/user/me/samples/cachefile/input.txt" \
  -mapper "xargs cat" \
  -reducer "cat" \
  -output "/user/me/samples/cachefile/out" \
  -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
  -jobconf mapred.map.tasks=3 \
  -jobconf mapred.reduce.tasks=3 \
  -jobconf mapred.job.name="Experiment"
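With streaming, the mapper and reducer can be any executables that read lines on STDIN and write tab-separated key/value lines on STDOUT. A hypothetical word-count pair in Ruby (the file name and the map/reduce calling convention are illustrative, not from the talk):

```ruby
#!/usr/bin/env ruby
# wc_streaming.rb -- hypothetical mapper/reducer logic for Hadoop Streaming.
# The mapper emits one "word\t1" line per token; the reducer receives the
# mapper output sorted by key and sums the counts per word.

def map_line(line)
  line.split(/\s+/).reject(&:empty?).map { |word| "#{word}\t1" }
end

def reduce_lines(sorted_lines)
  counts = Hash.new(0)
  sorted_lines.each do |line|
    word, count = line.chomp.split("\t")
    counts[word] += count.to_i
  end
  counts.map { |word, count| "#{word}\t#{count}" }
end

# Invoked by Hadoop as e.g. -mapper "ruby wc_streaming.rb map" and
# -reducer "ruby wc_streaming.rb reduce" (an assumed convention).
if $PROGRAM_NAME == __FILE__
  case ARGV.first
  when "map"    then STDIN.each_line { |l| puts map_line(l) }
  when "reduce" then puts reduce_lines(STDIN.readlines)
  end
end
```

Between the two phases, the framework itself sorts the mapper output by key, which is why the reducer can assume sorted input.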
Hadoop Ecosystem
Pig: Pig Latin
Hive: SQL-ish
Wukong: Ruby!
Wukong
•From Infochimps
•Currently going through heavy development
•Use the 3.0.0.pre3 gem
  https://github.com/infochimps-labs/wukong/tree/3.0.0
•Model your jobs with wukong-hadoop
  https://github.com/infochimps-labs/wukong-hadoop
Wukong
•Write mappers and reducers using Ruby
•As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks

wukong-hadoop
•A CLI to use with Hadoop
•Created around building tasks with Wukong
•Better than piping in the shell (you can see this with --dry_run)
Wukong Processors
•Fields are accessible through switches in the shell
•Local hand-off is made from STDOUT to STDIN

Wukong.processor(:mapper) do
  field :min_length, Integer,  :default => 1
  field :max_length, Integer,  :default => 256
  field :split_on,   Regexp,   :default => /\s+/
  field :remove,     Regexp,   :default => /[^a-zA-Z0-9\']+/
  field :fold_case,  :boolean, :default => false

  def process string
    tokenize(string).each do |token|
      yield token if acceptable?(token)
    end
  end

  private

  def tokenize string
    string.split(split_on).map do |token|
      stripped = token.gsub(remove, '')
      fold_case ? stripped.downcase : stripped
    end
  end

  def acceptable? token
    (min_length..max_length).include?(token.length)
  end
end
Wukong Processors
Wukong.processor(:reducer, Wukong::Processor::Accumulator) do
  attr_accessor :count

  def start record
    self.count = 0
  end

  def accumulate record
    self.count += 1
  end

  def finalize
    yield [key, count].join("\t")
  end
end
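The start/accumulate/finalize life cycle can be simulated in plain Ruby, without Wukong: sort the mapper output, then walk each run of identical keys. This is a sketch of what the local-mode STDOUT-to-STDIN hand-off amounts to, not Wukong's actual implementation:

```ruby
# Plain-Ruby sketch of the accumulator life cycle: sorted mapper output is
# chunked by key; each chunk starts a count at 0, accumulates one per record,
# and finalizes by emitting "key\tcount".
def local_word_count(lines)
  mapped = lines.flat_map { |line| line.split(/\s+/) }
  mapped.sort.chunk_while { |a, b| a == b }.map do |group|
    count = 0                        # start
    group.each { count += 1 }        # accumulate
    [group.first, count].join("\t")  # finalize
  end
end

local_word_count(["deer bear", "deer car"]) # => ["bear\t1", "car\t1", "deer\t2"]
```

Sorting before chunking is what guarantees each key's records arrive at a single accumulator, just as the shuffle does on a real cluster.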
Wukong Processors
Simpsons - Ep 8
do 7
Doctor 1
Does 2
doesn't 1
dog 2
D'oh 1
doif 1
doing 2
done 1
doneYou 1
don't 10
Don't 1

wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
  --mode=local \
  --input=/home/hduser/simpsons/simpsonssubs/Simpsons\ [1.08].sub