Ruby on Hadoop

DESCRIPTION
Introduction to Hadoop, as well as a brief overview of the Wukong and wukong-hadoop gems

TRANSCRIPT
Ruby on Hadoop
Tuesday, January 8, 13
Introduction
Hi. I’m Ted O’Meara
...and I just quit my job last week.
@tomeara
tedomeara.com
MapReduce
History of MapReduce
•First implemented by Google
•Used in CouchDB, Hadoop, etc.
•Helps to “distill” data into a concentrated result set
What is MapReduce?
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: turn each word into a [word, 1] pair.
input.map! { |x| [x, 1] }

# Reduce: sum the counts.
sum = 0
input.each do |x|
  sum += x[1]
end
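The two snippets above can be carried through the full MapReduce shape: the map phase emits a [word, 1] pair per token, a shuffle groups the pairs by word, and the reduce phase sums each group. A minimal local sketch in plain Ruby (no Hadoop involved):

```ruby
input = ["deer", "bear", "river", "car", "car", "river", "deer", "car", "bear"]

# Map: emit a [word, 1] pair for every word.
pairs = input.map { |word| [word, 1] }

# Shuffle: group the pairs by their key (the word).
grouped = pairs.group_by { |word, _count| word }

# Reduce: sum the counts inside each group.
counts = grouped.map { |word, group| [word, group.sum { |_w, c| c }] }.to_h

counts # => {"deer"=>2, "bear"=>2, "river"=>2, "car"=>3}
```

The per-key grouping is what distinguishes real MapReduce output from the single running sum above: each word gets its own reduce.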
Hadoop Breakdown
History of Hadoop
•Doug Cutting @ Yahoo!
•It is a Toy Elephant (the name comes from a toy elephant belonging to Cutting’s son)
•It is also a framework for distributed computing
•It is a distributed filesystem
Network Topology
Hadoop Cluster
[Diagram: a Hadoop cluster of many TaskTracker/DataNode workers spread across racks 555.555.1.*, 555.555.2.*, and 444.444.1.*, coordinated by a JobTracker and a NameNode]

Cluster
•Commodity hardware
•Partition tolerant
•Network-aware (rack-aware)
Hadoop Cluster
[Diagram: the same cluster; the NameNode sends heartbeat checks (♥) to the DataNodes]

NameNode
•Keeps track of the DataNodes
•Uses a “heartbeat” to determine a node’s health
•The most resources should be spent here
Hadoop Cluster
[Diagram: the same cluster, highlighting the DataNodes]

DataNode
•Stores filesystem blocks
•Can be scaled up and down (spun up/down)
•Replicates blocks based on a set replication factor
Hadoop Cluster
[Diagram: the same cluster; the JobTracker consults the NameNode before assigning work]

JobTracker
•Delegates which TaskTrackers should handle a MapReduce job
•Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source data lives
Hadoop Cluster
[Diagram: the same cluster, highlighting the TaskTrackers]

TaskTracker
•Worker for MapReduce jobs
•The closer it is to the DataNode holding the data, the better
HDFS
[Diagram: the same cluster; a file put into HDFS is split into blocks and replicated across DataNodes]

hadoop fs -put localfile /user/hadoop/hadoopfile
Hadoop Streaming
[Diagram: the same cluster running a streaming job]
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input "/user/me/samples/cachefile/input.txt" \
  -mapper "xargs cat" \
  -reducer "cat" \
  -output "/user/me/samples/cachefile/out" \
  -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
  -jobconf mapred.map.tasks=3 \
  -jobconf mapred.reduce.tasks=3 \
  -jobconf mapred.job.name="Experiment"
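With streaming, the mapper and reducer can be any executables that read lines on STDIN and write tab-separated key/value lines on STDOUT. A hypothetical word-count pair in Ruby (the file name and the map/reduce calling convention are illustrative, not from the talk):

```ruby
#!/usr/bin/env ruby
# wc_streaming.rb -- hypothetical mapper/reducer logic for Hadoop Streaming.
# The mapper emits one "word\t1" line per token; the reducer receives the
# mapper output sorted by key and sums the counts per word.

def map_line(line)
  line.split(/\s+/).reject(&:empty?).map { |word| "#{word}\t1" }
end

def reduce_lines(sorted_lines)
  counts = Hash.new(0)
  sorted_lines.each do |line|
    word, count = line.chomp.split("\t")
    counts[word] += count.to_i
  end
  counts.map { |word, count| "#{word}\t#{count}" }
end

# Invoked by Hadoop as e.g. -mapper "ruby wc_streaming.rb map" and
# -reducer "ruby wc_streaming.rb reduce" (an assumed convention).
if $PROGRAM_NAME == __FILE__
  case ARGV.first
  when "map"    then STDIN.each_line { |l| puts map_line(l) }
  when "reduce" then puts reduce_lines(STDIN.readlines)
  end
end
```

Between the two phases, the framework itself sorts the mapper output by key, which is why the reducer can assume sorted input.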
Hadoop Ecosystem
Pig: Pig Latin
Hive: SQL-ish
Wukong: Ruby!
Wukong
•From Infochimps
•Currently going through heavy development
•Use the 3.0.0.pre3 gem
  https://github.com/infochimps-labs/wukong/tree/3.0.0
•Model your jobs with wukong-hadoop
  https://github.com/infochimps-labs/wukong-hadoop
Wukong
•Write mappers and reducers using Ruby
•As of 3.0.0, Wukong uses “Processors”, which are Ruby classes that define map, reduce, and other tasks

wukong-hadoop
•A CLI to use with Hadoop
•Created around building tasks with Wukong
•Better than piping in the shell (you can see this with --dry_run)
Wukong Processors
•Fields are accessible through switches in the shell
•Local hand-off is made from STDOUT to STDIN

Wukong.processor(:mapper) do
  field :min_length, Integer,  :default => 1
  field :max_length, Integer,  :default => 256
  field :split_on,   Regexp,   :default => /\s+/
  field :remove,     Regexp,   :default => /[^a-zA-Z0-9\']+/
  field :fold_case,  :boolean, :default => false

  def process string
    tokenize(string).each do |token|
      yield token if acceptable?(token)
    end
  end

  private

  def tokenize string
    string.split(split_on).map do |token|
      stripped = token.gsub(remove, '')
      fold_case ? stripped.downcase : stripped
    end
  end

  def acceptable? token
    (min_length..max_length).include?(token.length)
  end
end
Wukong Processors
Wukong.processor(:reducer, Wukong::Processor::Accumulator) do
  attr_accessor :count

  def start record
    self.count = 0
  end

  def accumulate record
    self.count += 1
  end

  def finalize
    yield [key, count].join("\t")
  end
end
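The start/accumulate/finalize life cycle can be simulated in plain Ruby, without Wukong: sort the mapper output, then walk each run of identical keys. This is a sketch of what the local-mode STDOUT-to-STDIN hand-off amounts to, not Wukong's actual implementation:

```ruby
# Plain-Ruby sketch of the accumulator life cycle: sorted mapper output is
# chunked by key; each chunk starts a count at 0, accumulates one per record,
# and finalizes by emitting "key\tcount".
def local_word_count(lines)
  mapped = lines.flat_map { |line| line.split(/\s+/) }
  mapped.sort.chunk_while { |a, b| a == b }.map do |group|
    count = 0                        # start
    group.each { count += 1 }        # accumulate
    [group.first, count].join("\t")  # finalize
  end
end

local_word_count(["deer bear", "deer car"]) # => ["bear\t1", "car\t1", "deer\t2"]
```

Sorting before chunking is what guarantees each key's records arrive at a single accumulator, just as the shuffle does on a real cluster.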
Wukong Processors
Simpsons - Ep 8
do 7
Doctor 1
Does 2
doesn't 1
dog 2
D'oh 1
doif 1
doing 2
done 1
doneYou 1
don't 10
Don't 1

wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb \
  --mode=local \
  --input=/home/hduser/simpsons/simpsonssubs/Simpsons\ [1.08].sub