distributed and fault tolerant realtime computation with apache storm, apache kafka and apache...

Distributed and Fault-Tolerant Realtime Computation

www.folio3.com @folio_3

Folio3 – Overview

Who We Are

We are a Development Partner for our customers

Design software solutions, not just implement them

Focus on the solution – Platform and technology agnostic

Expertise in building applications that are:

Mobile Social Cloud-based Gamified

What We Do: Areas of Focus

Enterprise

Custom enterprise applications

Product development targeting the enterprise

Mobile

Custom mobile apps for iOS, Android, Windows Phone, BB OS

Mobile platform (server-to-server) development

Social Media

CMS-based websites for consumers and enterprise (corporate, consumer, community & social networking)

Social media platform development (enterprise & consumer)

Folio3 At a Glance

Founded in 2005

Over 200 full time employees

Offices in the US, Canada, Bulgaria & Pakistan

Palo Alto, CA

Sofia, Bulgaria

Karachi, Pakistan

Toronto, Canada

Areas of Focus: Enterprise

Automating workflows

Cloud-based solutions

Application integration

Platform development

Healthcare

Mobile Enterprise

Digital Media

Supply Chain

Some of Our Enterprise Clients

http://www.mysql.com/

http://www.twinstrata.com/node

Areas of Focus: Mobile

Serious enterprise applications for banks and businesses

Fun consumer apps for app discovery, interaction, exercise gamification and play

Educational apps

Augmented Reality apps

Mobile Platforms

Some of Our Mobile Clients

Areas of Focus: Web & Social Media

Community sites based on Content Management Systems

Enterprise Social Networking

Social Games for Facebook & Mobile

Companion Apps for games

Some of Our Web Clients

http://www.artchowk.com/index.php

Distributed and Fault-Tolerant Realtime Computation

Agenda

Big Data

Hadoop Vs Storm

Lambda Architecture

Storm Architecture And Concepts

Big Data

“Big Data” can be understood along four dimensions:

Volume: Scale of data (terabytes, petabytes, exabytes)

Velocity: Data needs to be analyzed quickly (milliseconds to seconds to respond)

Variety: Different forms of data (and data sources)

Veracity: Uncertainty of data (due to inconsistency, ambiguity, latency and incompleteness)

Example Query

Total number of page views to a website URL over a range of time

Example Query

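// Naive query: scans every record and counts the views of `url` in [startTime, endTime].
// PageView is an assumed record type with a String `url` and a long `timestamp`.
static long pageViewsOverTime(Iterable<PageView> bigData, String url, long startTime, long endTime) {
    long count = 0;
    for (PageView data : bigData) {
        if (data.url.equals(url) &&
            data.timestamp >= startTime &&
            data.timestamp <= endTime) {
            count++;
        }
    }
    return count;
}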

Example Query

TOO SLOW: Big Data is in petabytes (Volume)

Hadoop Data Processing Architecture

Data flow (batch run): Data Store (HDFS) -> Hadoop (MapReduce) -> Batch View (processed data) -> Query

Views generated in batch may be out of date

Batch workflow is too slow

Lambda Architecture

Immutable Master Dataset (stored in HDFS)
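One way to read the Lambda Architecture against the page-view query: a batch layer periodically precomputes counts from the immutable master dataset, a realtime layer counts only the events that arrived since the last batch run, and a query sums the two. The sketch below is an in-memory illustration under assumptions that are not on the slides: hourly buckets and the hypothetical names `LambdaPageViews`, `loadBatchCount` and `recordRealtimeView`. It uses neither the Hadoop nor the Storm API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are assumptions, not Hadoop or Storm APIs.
class LambdaPageViews {
    // Batch view: views per "url|hour" bucket, recomputed from the master dataset on each batch run.
    private final Map<String, Long> batchView = new HashMap<>();
    // Realtime view: views for events that no batch run has absorbed yet.
    private final Map<String, Long> realtimeView = new HashMap<>();

    private static String key(String url, long hour) {
        return url + "|" + hour;
    }

    // Stands in for the output of a batch (MapReduce) run.
    void loadBatchCount(String url, long hour, long count) {
        batchView.put(key(url, hour), count);
    }

    // Stands in for the realtime layer updating its view per incoming event.
    void recordRealtimeView(String url, long hour) {
        realtimeView.merge(key(url, hour), 1L, Long::sum);
    }

    // Query = batch view + realtime view, summed over the inclusive hour range.
    long pageViewsOverTime(String url, long startHour, long endHour) {
        long total = 0;
        for (long hour = startHour; hour <= endHour; hour++) {
            String k = key(url, hour);
            total += batchView.getOrDefault(k, 0L) + realtimeView.getOrDefault(k, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        LambdaPageViews views = new LambdaPageViews();
        views.loadBatchCount("/home", 0, 100);
        views.loadBatchCount("/home", 1, 50);
        views.recordRealtimeView("/home", 2);
        views.recordRealtimeView("/home", 2);
        System.out.println(views.pageViewsOverTime("/home", 0, 2)); // prints 152
    }
}
```

The query cost now depends on the number of buckets in the range rather than on the size of the raw dataset, which is the point of precomputing views.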

What is Apache Storm?

Storm is a real-time distributed computing framework for reliably processing large volumes of high-velocity, unbounded data streams.

It was created by Nathan Marz and his team at BackType, and released as open source in 2011 (after BackType was acquired by Twitter).

Five characteristics make Storm ideal for real-time data processing workloads:

Fast – benchmarked at processing over one million 100-byte messages per second per node

Scalable – with parallel calculations that run across a cluster of machines

Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the work will be restarted on another node.

Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.

Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.

Tweet from Nathan Marz (31 May 2012)

Storm Topology

The input stream of a Storm cluster is handled by a component called a Spout.

The spout passes the data to a component called a Bolt, which transforms it in some way.

A Bolt either persists the data in storage or passes it to another bolt.

Functional Programming

h(g(f(data)))

λ-calculus

Sample Problem

File: Bible.txt

… Thus the heavens and the earth were finished, and all the host of them. And on the seventh day God ended his work which he had made and he rested on the seventh day from all his work which he had made…

f (file to lines): {“Thus the heavens and the earth were finished, and all the host of them.”} {“And on the seventh day God ended his work which he had made”}

g (lines to words): (“thus”, “the”, “heavens”, “and”, “the”, “earth”, “were”, “finished”, “and”, “all”, “the”, “host”, “of”, “them”)

h (words to counts): ( (“testaments”, 10), (“holy”, 12), (“faith”, 34) )

Relationship of Storm Topology with Functional Programming

Data -> Spout: Line-reader (f) -> Bolt: Word-Splitter (g) -> Bolt: Word-Counter (h)
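The h(g(f(data))) view of the word-count topology can be sketched with plain Java functions. This is only an illustration of the composition: `WordCountPipeline` and `countWords` are hypothetical names, the three functions stand in for the Line-reader, Word-Splitter and Word-Counter roles, and no Storm API is involved. Unlike a real topology, everything here runs in one process over a finite input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative sketch: plain functions standing in for the spout and bolts; not Storm code.
class WordCountPipeline {
    // f: text -> lines (the "Line-reader" role)
    static final Function<String, List<String>> f =
            text -> Arrays.asList(text.split("\\R"));

    // g: lines -> lower-cased words (the "Word-Splitter" role)
    static final Function<List<String>, List<String>> g = lines -> {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    };

    // h: words -> (word, count) pairs (the "Word-Counter" role)
    static final Function<List<String>, Map<String, Integer>> h = words -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    };

    // Applies f, then g, then h: h(g(f(data)))
    static Map<String, Integer> countWords(String data) {
        return f.andThen(g).andThen(h).apply(data);
    }

    public static void main(String[] args) {
        String data = "Thus the heavens and the earth were finished,\nand all the host of them.";
        System.out.println(countWords(data));
    }
}
```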

Data Source Reliability

A data source is considered “unreliable” if there is no means to replay a message.

A data source is considered “reliable” if it can somehow replay a message when processing fails at any point.

A data source is considered “durable” if it can replay any message or set of messages given the necessary selection criteria.

Reliability Limitations: Integrating Kafka with Apache Storm

Exactly-once processing requires a “durable” data source.

At-least-once processing requires a “reliable” data source.

An “unreliable” data source can be wrapped to provide additional guarantees.

For the Apache Storm demo, I backed the unreliable data source with Apache Kafka (a minor latency overhead to ensure 100% durability).
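To make "wrapping an unreliable source" concrete, here is a minimal sketch of the idea: store each message before handing it on, keep it until processing is acknowledged, and make it available again on failure. `ReplayableSource` and its methods are hypothetical names for illustration; this is not the Storm spout interface or the Kafka client, and the in-memory map would need to be replaced by persistent storage to survive a crash.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch: hypothetical wrapper, not the Storm or Kafka API.
class ReplayableSource {
    private final Queue<Long> ready = new ArrayDeque<>();     // ids waiting to be handed out
    private final Map<Long, String> stored = new HashMap<>(); // messages kept until acked
    private long nextId = 0;

    // Called when the unreliable source pushes a message: store it first, then queue it.
    void onMessage(String message) {
        long id = nextId++;
        stored.put(id, message);
        ready.add(id);
    }

    // Id of the next message to process, or null when nothing is ready.
    Long poll() {
        return ready.poll();
    }

    String get(long id) {
        return stored.get(id);
    }

    // Processing succeeded: the message no longer needs to be kept.
    void ack(long id) {
        stored.remove(id);
    }

    // Processing failed: queue the same message again (at-least-once delivery).
    void fail(long id) {
        if (stored.containsKey(id)) {
            ready.add(id);
        }
    }

    int pending() {
        return stored.size();
    }

    public static void main(String[] args) {
        ReplayableSource source = new ReplayableSource();
        source.onMessage("line 1");
        Long id = source.poll();
        source.fail(id);                      // failure: the message becomes ready again
        Long retry = source.poll();           // the same id is delivered a second time
        source.ack(retry);                    // success: the message can be dropped
        System.out.println(source.pending()); // prints 0
    }
}
```

Because a failed message is delivered again, downstream processing may see it more than once, which is why this gives at-least-once rather than exactly-once behaviour.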

Relationship of Storm Topology with Functional Programming

Data -> Kafka topic: bible (…5|4|3|2|1) -> Spout: Line-reader (f) -> Bolt: Word-Splitter (g) -> Bolt: Word-Counter (h)

The Storm spout is subscribed to the topic bible of the Kafka messaging queue.
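The numbered queue in the diagram suggests why a Kafka topic works as a durable source: messages are appended to a log and addressed by offset, so a consumer can re-read from any earlier position. The sketch below models only that idea in memory. `TopicLog`, `append` and `read` are hypothetical names and do not correspond to Kafka's client API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of an offset-addressed log; it mimics the idea of a topic, not Kafka's API.
class TopicLog {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns its offset.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns up to maxMessages starting at offset. Reading never removes anything,
    // so the same range can be read again after a failure.
    List<String> read(long offset, int maxMessages) {
        int from = (int) Math.min(Math.max(offset, 0), messages.size());
        int to = (int) Math.min((long) from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        TopicLog bible = new TopicLog();
        bible.append("Thus the heavens and the earth were finished,");
        bible.append("and all the host of them.");
        // A consumer that failed after reading offset 0 can simply read it again.
        System.out.println(bible.read(0, 1));
        System.out.println(bible.read(0, 1));
    }
}
```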

Scenarios / Use cases where Storm can be effectively used

Predictive Analysis

Social Graph Analysis

Network Monitoring

Recommendation Engine

Realtime Analytics

Online Machine Learning

Continuous Computation

Distributed Remote Procedure Call

Website Activity Tracking

Log Aggregation

Storm Components

A Storm cluster has 3 sets of nodes

Nimbus Nodes

Zookeeper Nodes

Supervisor Nodes

Storm Components

Nimbus Nodes

Master node daemon

Distributes code across the cluster

Launches workers across the cluster

Monitors computation and reallocates workers as needed

Storm Components

Zookeeper Nodes

Manage all coordination between Nimbus and the Supervisors.

Storm Components

Supervisor Nodes

Execute a subset of a topology (spouts and/or bolts).

Listen for jobs assigned to the machine and start and stop worker processes as necessary.

Known Limitations: Nimbus is a single point of failure

When Nimbus is down:

Topologies continue to work

Tasks from failing nodes (spouts/bolts) aren’t reassigned

Can’t upload a new topology or rebalance an old one

It is recommended to run Nimbus under a supervision tool such as daemontools or monit so that it is restarted automatically when it goes down.

(In contrast, in Hadoop, if the JobTracker dies, all running jobs are lost.)

Contact

For more details about our services, please get in touch

with us.

[email protected]

US Office: (408) 365-4638

www.folio3.com
