Streaming API, Spark & Ruby
AGENDA
BIG DATA
Ruby in BIG DATA
Why should we know?
Insights from SAS
Data
Information
Knowledge
Data Science Field
Computer Science
Maths & Statistics
Subject Matter
Expertise
Where it all began?
PROBLEM
How to store? How to process?
Solution
HADOOP
HADOOP Ecosystem
STORAGE & PROCESS
• Distributed Storage: HDFS — a distributed file system where clusters of commodity hardware store huge data in a distributed fashion.
• Distributed Processing: the MapReduce paradigm
It can easily scale to thousands of nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
Case Study
Day-wise Analysis of RubyConfIndia Tweets
Twitter API
Sample Data
--- !ruby/object:Twitter::Tweet
attrs:
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  :id: 707159160945811457
  :id_str: '707159160945811457'
  :text: 'Once in a life time to meet Matz at the awesome #kochi https://t.co/6oCIagsHCg #ruby #india https://t.co/YRlpABApkP'
Tweet Count
CLOUDERA QUICKSTART
STORAGE LAYER
HDFS
• It is a distributed file system
• Streaming data access: write once, read many times
• Able to run on commodity hardware
• Fault tolerance
• Replication: 3 nodes by default, configurable
• Block based: 64–256 MB, configurable
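The interplay of block size and replication can be sketched with a small back-of-the-envelope calculation; the helper below is hypothetical (not part of any HDFS API) and just illustrates how the configurable defaults above determine block count and raw storage cost:

```ruby
# Estimate how many HDFS blocks a file occupies and the raw storage it
# consumes, given a block size and replication factor (both configurable,
# as noted above; 128 MB and 3x are common defaults).
def hdfs_footprint(file_size_mb, block_size_mb: 128, replication: 3)
  blocks = (file_size_mb / block_size_mb.to_f).ceil
  { blocks: blocks, total_storage_mb: file_size_mb * replication }
end

# A 1 GB file with 128 MB blocks and 3x replication:
puts hdfs_footprint(1024).inspect
# {:blocks=>8, :total_storage_mb=>3072}
```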
Replication Factor 2

Name Node: Stores Meta Data
Meta Data:
/data/pristine/catalina.log → blocks 1, 2, 4
/data/pristine/myfile → blocks 3, 5

Data Node 1: blocks 1, 2, 4, 5
Data Node 2: blocks 5, 2, 3, 4
Data Node 3: blocks 1, 3
HDFS Command Line
Copying File to HDFS
HDFS NameNode UI
PROCESS LAYER
YARN / MapReduce 2.0
• YARN: A framework for job scheduling and cluster resource management.
• MapReduce: Distributed processing paradigm
Map Function
Input: (input_key, value)
Output: bunch of (intermediate_key, value)
System applies the map function in parallel to all inputs
Reduce Function
Input: (intermediate_key, value)
Output: bunch of (values)
System will group all pairs with the same intermediate key and apply the reduce function
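The map → shuffle → reduce flow described above can be simulated in plain Ruby on a tiny in-memory sample (the input records are made up; `group_by` stands in for the shuffle stage that groups pairs by intermediate key):

```ruby
# Minimal local simulation of the MapReduce paradigm.
# Map: each input record emits an (intermediate_key, value) pair.
# Shuffle: pairs are grouped by intermediate key.
# Reduce: each group is collapsed into a final value.
input = ["Mar 08", "Mar 08", "Mar 09"]

# Map phase: emit [key, 1] for every record
pairs = input.map { |record| [record, 1] }

# Shuffle stage: group the pairs by their intermediate key
grouped = pairs.group_by { |key, _| key }

# Reduce phase: sum the values within each group
result = grouped.map { |key, vals| [key, vals.map(&:last).sum] }.to_h

puts result.inspect
# {"Mar 08"=>2, "Mar 09"=>1}
```

In a real cluster the map and reduce functions run in parallel across nodes; only the per-key grouping contract is what matters.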
OUTPUT RESULT
SHUFFLE STAGE
FILE CHUNKS
Leveraging Ruby
Mapper.rb
#!/usr/bin/env ruby
STDIN.read.split("--- !ruby/object:Twitter::Tweet").each do |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  puts "#{date[1]}" if date[1]
end
Reducer.rb
#!/usr/bin/env ruby
STDIN.readlines
     .group_by { |i| i.strip }
     .map { |i, j| "#{i} #{j.count}" }
     .each { |i| puts i }
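Before submitting to Hadoop, the two scripts can be smoke-tested locally by running the same logic in-process, simulating `cat tweets.txt | mapper.rb | reducer.rb` (the sample tweets below are made up, in the same YAML shape as the Twitter data shown earlier):

```ruby
# Simulate the mapper/reducer pipeline over a small in-memory sample.
sample = <<~DATA
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 11:00:57 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Tue Mar 08 12:30:00 +0000 2016
  --- !ruby/object:Twitter::Tweet
  :created_at: Wed Mar 09 09:15:21 +0000 2016
DATA

# Mapper logic: extract the weekday token from each created_at field
mapper_out = sample.split("--- !ruby/object:Twitter::Tweet").map { |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  date[1] if date[1]
}.compact

# Reducer logic: count occurrences of each day
reducer_out = mapper_out.group_by { |d| d }.map { |d, v| "#{d} #{v.count}" }

puts reducer_out
# Tue 2
# Wed 1
```

Note the mapper emits `date[1]`, the weekday token of the matched timestamp, which is what gets counted day-wise.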
HADOOP STREAMING API
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/rubyconf/tweets.txt \
  -output /user/rubyconf/daywise \
  -mapper mapper.rb -reducer reducer.rb \
  -file mapper.rb -file reducer.rb
YARN Resource Manager
Job Details
YARN Resource Manager
DAYWISE TWEETS
TREND
SPARK
SPARK Benefits
• Speed
• Ease of Use
• Runs Everywhere
• Source: http://spark.apache.org/
gem ruby-spark
• gem install ruby-spark
• ruby-spark build
• ruby-spark shell
Ruby with SPARK
require 'ruby-spark'
# Configuration
Spark.config do
  set_app_name "RubySpark"
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 100
end
# Start Apache Spark
Spark.start
# Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt")
# Collect all created days from dates
days = rdd.map(lambda { |t| date = t.match(/\:created_at\: .{30}/).to_s.split; date[1] if date[1] })
# Creating key-value pairs
pairrdd = days.map(lambda { |x| [x, 1] })
# Final output by using reducer
daywise = pairrdd.reduce_by_key(lambda { |x, y| x + y }).collect_as_hash
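For intuition, the same map → pair → reduce_by_key pipeline can be reproduced with plain Enumerable over a small in-memory array instead of an HDFS-backed RDD (the sample lines below are made up, and `each_with_object` stands in for Spark's per-key reduction):

```ruby
# Plain-Ruby equivalent of the ruby-spark pipeline above.
lines = [
  ":created_at: Tue Mar 08 11:00:57 +0000 2016",
  ":created_at: Tue Mar 08 12:30:00 +0000 2016",
  ":created_at: Wed Mar 09 09:15:21 +0000 2016"
]

# map: extract the weekday token from each timestamp
days = lines.map { |t|
  date = t.match(/\:created_at\: .{30}/).to_s.split
  date[1] if date[1]
}.compact

# map: build [key, 1] pairs
pairs = days.map { |x| [x, 1] }

# reduce_by_key: sum the counts that share a key
daywise = pairs.each_with_object(Hash.new(0)) { |(k, v), h| h[k] += v }

puts daywise.inspect
# {"Tue"=>2, "Wed"=>1}
```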
Expertise & Learnings
Remember
We can use Ruby with HADOOP
Streaming API & SPARK
SPARK is a more generalized distributed computing model
Thank You
@_manoharaa