Streaming API, Spark & Ruby
AGENDA
BIG DATA
Ruby in BIG DATA
Why should we know?
Insights from SAS
Data
Information
Knowledge
Data Science Field
Computer Science
Maths & Statistics
Subject Matter
Expertise
Where it all began?
PROBLEM
How to store? How to process?
Solution
HADOOP
HADOOP Ecosystem
STORAGE & PROCESS
• Distributed Storage: HDFS — a distributed file system where clusters of commodity hardware store huge data in a distributed fashion.
• Distributed Processing: the MapReduce paradigm
It can easily scale to thousands of nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
Case Study
Day-wise Analysis of RubyConfIndia Tweets
Twitter API
Sample Data
--- !ruby/object:Twitter::Tweet
attrs:
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  :id: 707159160945811457
  :id_str: '707159160945811457'
  :text: 'Once in a life time to meet Matz at the awesome #kochi https://t.co/6oCIagsHCg #ruby #india https://t.co/YRlpABApkP'
Tweet Count
CLOUDERA QUICKSTART
STORAGE LAYER
HDFS
• It is a distributed file system
• Streaming data access: write once, read many times
• Able to run on commodity hardware
• Fault tolerance
• Replication: 3 nodes by default, configurable
• Block based: 64–256 MB, configurable
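The interplay of block size and replication can be sketched with a small back-of-the-envelope calculation; the helper below is hypothetical (not part of any HDFS API) and just illustrates how the configurable defaults above determine block count and raw storage cost:

```ruby
# Estimate how many HDFS blocks a file occupies and the raw storage it
# consumes, given a block size and replication factor (both configurable,
# as noted above; 128 MB and 3x are common defaults).
def hdfs_footprint(file_size_mb, block_size_mb: 128, replication: 3)
  blocks = (file_size_mb / block_size_mb.to_f).ceil
  { blocks: blocks, total_storage_mb: file_size_mb * replication }
end

# A 1 GB file with 128 MB blocks and 3x replication:
puts hdfs_footprint(1024).inspect
# {:blocks=>8, :total_storage_mb=>3072}
```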
Replication Factor 2

Name Node: Stores Meta Data
Meta Data:
/data/pristine/catalina.log → blocks 1, 2, 4
/data/pristine/myfile → blocks 3, 5

Data Node 1: blocks 1, 2, 4, 5
Data Node 2: blocks 5, 2, 3, 4
Data Node 3: blocks 1, 3
HDFS Command Line
Copying File to HDFS
HDFS NameNode UI
PROCESS LAYER
YARN / MapReduce 2.0
• YARN: A framework for job scheduling and cluster resource management.
• MapReduce: Distributed processing paradigm
Map Function
Input: (input_key, value)
Output: bunch of (intermediate_key, value)
System applies the map function in parallel to all inputs
Reduce Function
Input: (intermediate_key, value)
Output: bunch of (values)
System will group all pairs with the same intermediate key and apply the reduce function
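The map → shuffle → reduce flow described above can be simulated in plain Ruby on a tiny in-memory sample (the input records are made up; `group_by` stands in for the shuffle stage that groups pairs by intermediate key):

```ruby
# Minimal local simulation of the MapReduce paradigm.
# Map: each input record emits an (intermediate_key, value) pair.
# Shuffle: pairs are grouped by intermediate key.
# Reduce: each group is collapsed into a final value.
input = ["Mar 08", "Mar 08", "Mar 09"]

# Map phase: emit [key, 1] for every record
pairs = input.map { |record| [record, 1] }

# Shuffle stage: group the pairs by their intermediate key
grouped = pairs.group_by { |key, _| key }

# Reduce phase: sum the values within each group
result = grouped.map { |key, vals| [key, vals.map(&:last).sum] }.to_h

puts result.inspect
# {"Mar 08"=>2, "Mar 09"=>1}
```

In a real cluster the map and reduce functions run in parallel across nodes; only the per-key grouping contract is what matters.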
OUTPUT RESULT
SHUFFLE STAGE
FILE CHUNKS
Leveraging Ruby
Mapper.rb
#!/usr/bin/env ruby
STDIN.read.split("--- !ruby/object:Twitter::Tweet").each do |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  puts "#{date[1]}" if date[1]
end
Reducer.rb
#!/usr/bin/env ruby
STDIN.readlines
     .group_by { |i| i.strip }
     .map { |i, j| "#{i} #{j.count}" }
     .each { |i| puts i }
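Before submitting to Hadoop, the two scripts can be smoke-tested locally by running the same logic in-process, simulating `cat tweets.txt | mapper.rb | reducer.rb` (the sample tweets below are made up, in the same YAML shape as the Twitter data shown earlier):

```ruby
# Simulate the mapper/reducer pipeline over a small in-memory sample.
sample = <<~DATA
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 12:30:00 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Wed Mar 09 09:15:21 +0000 2016
DATA

# Mapper logic: extract the weekday token from each created_at field
mapper_out = sample.split("--- !ruby/object:Twitter::Tweet").map { |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  date[1] if date[1]
}.compact

# Reducer logic: count occurrences of each day
reducer_out = mapper_out.group_by { |d| d }.map { |d, v| "#{d} #{v.count}" }

puts reducer_out
# Tue 2
# Wed 1
```

Note the mapper emits `date[1]`, the weekday token of the matched timestamp, which is what gets counted day-wise.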
HADOOP STREAMING API
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/rubyconf/tweets.txt \
  -output /user/rubyconf/daywise \
  -mapper mapper.rb -reducer reducer.rb \
  -file mapper.rb -file reducer.rb
YARN Resource Manager
Job Details
YARN Resource Manager
DAYWISE TWEETS
TREND
SPARK
SPARK Benefits
• Speed
• Ease of Use
• Runs Everywhere
• Source: http://spark.apache.org/
gem ruby-spark
• gem install ruby-spark
• ruby-spark build
• ruby-spark shell
Ruby with SPARK
require 'ruby-spark'
# Configuration
Spark.config do
  set_app_name "RubySpark"
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 100
end
# Start Apache Spark
Spark.start
# Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt")
# Collect all created days from dates
days = rdd.map(lambda { |t| date = t.match(/\:created_at\: .{30}/).to_s.split; date[1] if date[1] })
# Creating key-value pairs
pairrdd = days.map(lambda { |x| [x, 1] })
# Final output by using reducer
daywise = pairrdd.reduce_by_key(lambda { |x, y| x + y }).collect_as_hash
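For intuition, the same map → pair → reduce_by_key pipeline can be reproduced with plain Enumerable over a small in-memory array instead of an HDFS-backed RDD (the sample lines below are made up, and `each_with_object` stands in for Spark's per-key reduction):

```ruby
# Plain-Ruby equivalent of the ruby-spark pipeline above.
lines = [
  ":created_at: Tue Mar 08 11:00:57 +0000 2016",
  ":created_at: Tue Mar 08 12:30:00 +0000 2016",
  ":created_at: Wed Mar 09 09:15:21 +0000 2016"
]

# map: extract the weekday token from each timestamp
days = lines.map { |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  date[1] if date[1]
}.compact

# map: build [key, 1] pairs
pairs = days.map { |x| [x, 1] }

# reduce_by_key: sum the counts that share a key
daywise = pairs.each_with_object(Hash.new(0)) { |(k, v), h| h[k] += v }

puts daywise.inspect
# {"Tue"=>2, "Wed"=>1}
```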
Expertise & Learnings
Remember
We can use Ruby with HADOOP
Streaming API & SPARK
SPARK is a more generalized distributed computing model
Thank You
@_manoharaa