Distributed and Fault-Tolerant Realtime Computation with Apache Storm, Apache Kafka and Apache...
Distributed and Fault-Tolerant Realtime Computation
www.folio3.com @folio_3
Folio3 – Overview
Who We Are
We are a Development Partner for our customers
Design software solutions, not just implement them
Focus on the solution – Platform and technology agnostic
Expertise in building applications that are:
Mobile, Social, Cloud-based, Gamified
What We Do
Areas of Focus
Enterprise
Custom enterprise applications
Product development targeting the enterprise
Mobile
Custom mobile apps for iOS, Android, Windows Phone, BB OS
Mobile platform (server-to-server) development
Social Media
CMS-based websites for consumers and enterprise (corporate, consumer, community & social networking)
Social media platform development (enterprise & consumer)
Folio3 At a Glance
Founded in 2005
Over 200 full-time employees
Offices in the US, Canada, Bulgaria & Pakistan
Palo Alto, CA
Sofia, Bulgaria
Karachi, Pakistan
Toronto, Canada
Areas of Focus: Enterprise
Automating workflows
Cloud based solutions
Application integration
Platform development
Healthcare
Mobile Enterprise
Digital Media
Supply Chain
Areas of Focus: Mobile
Serious enterprise applications for banks and businesses
Fun consumer apps for app discovery, interaction, exercise gamification and play
Educational apps
Augmented Reality apps
Mobile Platforms
Some of Our Mobile Clients
Areas of Focus: Web & Social Media
Community sites based on content management systems
Enterprise social networking
Social games for Facebook & mobile
Companion apps for games
Some of Our Web Clients
Distributed and Fault-Tolerant Realtime Computation
Agenda
Big Data
Hadoop Vs Storm
Lambda Architecture
Storm Architecture And Concepts
Big Data
“Big Data” is commonly described along four dimensions:
Volume: scale of data (terabytes, petabytes, exabytes)
Velocity: data needs to be analyzed quickly (milliseconds to seconds to respond)
Variety: different forms of data (and data sources)
Veracity: uncertainty of data (due to inconsistency, ambiguity, latency and incompleteness)
Example Query
Total number of page views to a website URL over a range of time
Example Query
def page_views_over_time(big_data, url, start_time, end_time):
    # Naive approach: scan every record and count the matches
    count = 0
    for data in big_data:
        if (data.url == url
                and data.timestamp >= start_time
                and data.timestamp <= end_time):
            count += 1
    return count
Example Query
Too slow: Big Data is in petabytes (Volume), and this query scans every record
Hadoop Data Processing Architecture
Data flow: Data Store (HDFS) → Hadoop (MapReduce) batch run → Batch View (processed data) → Query
Views generated in batch may be out of date
Batch workflow is too slow
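One common way to avoid a full scan per query is to precompute a batch view, for example page-view counts keyed by URL and hour, so that a query only sums a few buckets. The following is a minimal sketch in Python, assuming each record is a (url, timestamp in seconds) pair and hourly granularity. The function names and sample data are illustrative and not part of Hadoop.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600

def build_batch_view(page_views):
    """Batch run: count page views per (url, hour bucket)."""
    view = Counter()
    for url, timestamp in page_views:
        view[(url, timestamp // SECONDS_PER_HOUR)] += 1
    return view

def page_views_over_time(batch_view, url, start_hour, end_hour):
    """Query: sum the precomputed hourly buckets in [start_hour, end_hour]."""
    return sum(batch_view[(url, hour)] for hour in range(start_hour, end_hour + 1))

# Hypothetical sample data: (url, unix timestamp in seconds)
page_views = [
    ("foo.com/blog", 10),
    ("foo.com/blog", 3700),
    ("foo.com/blog", 7300),
    ("foo.com/about", 20),
]
batch_view = build_batch_view(page_views)
print(page_views_over_time(batch_view, "foo.com/blog", 0, 1))  # 2
```

The query cost now depends on the number of hours in the range rather than on the size of the dataset, but the answer is only as fresh as the last batch run.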
Lambda Architecture
Immutable Master Dataset (stored in HDFS)
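As the Lambda Architecture is usually described, a query combines a batch view (complete but stale) with a realtime view that covers only the data that arrived after the last batch run. The sketch below is a rough illustration with in-memory dictionaries standing in for the real stores. All names and numbers are made up for the example.

```python
from collections import Counter

# Batch view: recomputed periodically from the immutable master dataset.
# In this made-up example it covers hours 0..2.
batch_view = Counter({
    ("foo.com/blog", 0): 120,
    ("foo.com/blog", 1): 95,
    ("foo.com/blog", 2): 80,
})

# Realtime view: updated incrementally for data that the last batch run
# has not absorbed yet (hour 3 here).
realtime_view = Counter()

def on_page_view(url, hour):
    """Speed layer: update the realtime view as each event arrives."""
    realtime_view[(url, hour)] += 1

def query(url, start_hour, end_hour):
    """Serving side: merge both views to answer over the full range."""
    hours = range(start_hour, end_hour + 1)
    return sum(batch_view[(url, h)] + realtime_view[(url, h)] for h in hours)

for _ in range(7):
    on_page_view("foo.com/blog", 3)

print(query("foo.com/blog", 2, 3))  # 87
```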
What is Apache Storm?
Storm is a real-time distributed computing framework for reliably processing large volumes of high-velocity, unbounded data streams.
It was created by Nathan Marz and his team at BackType, and released as open source in 2011 (after BackType was acquired by Twitter).
Five characteristics make Storm ideal for real-time data processing workloads:
Fast – benchmarked at processing over one million 100-byte messages per second per node
Scalable – parallel calculations run across a cluster of machines
Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the work will be restarted on another node.
Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate.
Tweet from Nathan Marz (31 May 2012)
Storm Topology
The input stream of a Storm cluster is handled by a component called a Spout.
The Spout passes the data to a component called a Bolt, which transforms it in some way.
A Bolt either persists the data in storage or passes it to some other Bolt.
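The spout-to-bolt flow can be pictured with plain Python generators. This is only a conceptual sketch and not Storm's API (Storm topologies are defined through Storm's own API). The function names and sample sentences are made up.

```python
def sentence_spout():
    """Spout: the source of the stream (a fixed list instead of a live feed)."""
    for sentence in ["storm is fast", "storm is scalable"]:
        yield sentence

def uppercase_bolt(stream):
    """Bolt: transforms each tuple and passes it on to the next bolt."""
    for sentence in stream:
        yield sentence.upper()

def persist_bolt(stream, storage):
    """Bolt: terminal step that persists each tuple instead of emitting it."""
    for sentence in stream:
        storage.append(sentence)

storage = []  # stand-in for a database or file
persist_bolt(uppercase_bolt(sentence_spout()), storage)
print(storage)  # ['STORM IS FAST', 'STORM IS SCALABLE']
```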
Functional Programming
h(g(f(data)))
λ-calculus
Sample Problem
File: Bible.txt
… Thus the heavens and the earth were finished, and all the host of them. And on the seventh day God ended his work which he had made and he rested on the seventh day from all his work which he had made…
f: {“Thus the heavens and the earth were finished, and all the host of them.”} {“And on the seventh day God ended his work which he had made”}
g: (“thus”, “the”, “heavens”, “and”, “the”, “earth”, “were”, “finished”, “and”, “all”, “the”, “host”, “of”, “them”)
h: ( (“testaments”, 10), (“holy”, 12), (“faith”, 34) )
Relationship of Storm Topology with Functional Programming
Data → Spout (f: Line-reader) → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)
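The h(g(f(data))) view of the word-count topology can be written out directly as function composition. This is a minimal sketch in plain Python, not Storm code. The helper `compose` and the three stage functions are illustrative names, and the punctuation handling is deliberately simplistic.

```python
from collections import Counter
from functools import reduce

def compose(*functions):
    """compose(h, g, f)(x) == h(g(f(x)))"""
    return reduce(lambda outer, inner: lambda x: outer(inner(x)), functions)

def read_lines(text):      # f: line-reader
    return [line.strip() for line in text.splitlines() if line.strip()]

def split_words(lines):    # g: word-splitter
    return [word.strip(".,;:").lower() for line in lines for word in line.split()]

def count_words(words):    # h: word-counter
    return Counter(words)

word_count = compose(count_words, split_words, read_lines)

text = """Thus the heavens and the earth were finished, and all the host of them.
And on the seventh day God ended his work which he had made"""
counts = word_count(text)
print(counts["the"], counts["and"])  # 4 3
```

In a topology each stage runs continuously on a stream and can be parallelized across machines, whereas here each function simply runs once over the whole input.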
Data Source Reliability
A data source is considered “unreliable” if there is no means to replay a message.
A data source is considered “reliable” if it can somehow replay a message if processing fails at any point.
A data source is considered “durable” if it can replay any message or set of messages given the necessary selection criteria.
Reliability Limitations: Integrating Kafka with Apache Storm
Exactly-once processing requires a “durable” data source.
At-least-once processing requires a “reliable” data source.
An “unreliable” data source can be wrapped to provide additional guarantees.
For the Apache Storm demo, I’ve backed the unreliable data source with Apache Kafka (minor latency overhead to ensure 100% durability).
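The “reliable” case can be sketched with a source that keeps emitted messages pending until they are acknowledged and re-queues them on failure. The class and method names below are hypothetical and are not Storm's or Kafka's API. The failure is simulated.

```python
from collections import deque

class ReplayableSource:
    """Hypothetical 'reliable' source: un-acked messages can be replayed."""
    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}                        # emitted but not yet acked

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        del self.pending[msg_id]                 # fully processed, forget it

    def fail(self, msg_id):
        # Put the message back on the queue so it is delivered again later
        self.queue.append((msg_id, self.pending.pop(msg_id)))

attempts = []          # every processing attempt, including failed ones
failed_once = set()

def process(msg_id, payload):
    attempts.append(payload)
    if payload == "b" and msg_id not in failed_once:
        failed_once.add(msg_id)
        raise RuntimeError("simulated worker failure")

source = ReplayableSource(["a", "b", "c"])
while True:
    message = source.next_message()
    if message is None:
        break
    msg_id, payload = message
    try:
        process(msg_id, payload)
        source.ack(msg_id)
    except RuntimeError:
        source.fail(msg_id)

print(attempts)  # ['a', 'b', 'c', 'b']
```

Message "b" is attempted twice, which is why at-least-once delivery needs idempotent processing or deduplication on top of it before results can be treated as exactly-once.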
Relationship of Storm Topology with Functional Programming
Data → Line-reader (f) → Kafka topic: bible (…5|4|3|2|1) → Spout → Bolt (g: Word-Splitter) → Bolt (h: Word-Counter)
Storm Spout subscribed to topic “bible” of the Kafka messaging queue
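What makes a log-backed source “durable” is that messages stay addressable by offset after they are read, so a consumer can re-read any range. The toy class below only illustrates that idea. It is not the Kafka client API, and the sample lines are placeholders.

```python
class DurableLog:
    """Toy append-only log; a stand-in for a single Kafka topic partition."""
    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return (offset, message) pairs starting at offset; reads delete nothing."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]

bible = DurableLog()  # plays the role of the topic "bible"
for line in ["In the beginning", "And the earth was without form", "And God said"]:
    bible.append(line)

first_pass = bible.read(0)
replayed = bible.read(1, max_messages=2)  # re-read from offset 1 after a failure
print(replayed)  # [(1, 'And the earth was without form'), (2, 'And God said')]
```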
Scenarios / Use cases where Storm can be effectively used
Predictive Analysis
Social Graph Analysis
Network Monitoring
Recommendation Engine
Realtime Analytics
Online Machine Learning
Continuous Computation
Distributed Remote Procedure Call
Website Activity Tracking
Log Aggregation
Storm Components
A Storm cluster has 3 sets of nodes:
Nimbus Nodes
Master node daemon
Distributes code across the cluster
Launches workers across the cluster
Monitors computation and reallocates workers as needed
ZooKeeper Nodes
Manage all the coordination between Nimbus and the supervisors.
Supervisor Nodes
Execute a subset of a topology (spouts and/or bolts).
Listen for jobs assigned to the machine and start and stop worker processes as necessary.
Known Limitations: Nimbus is a single point of failure
When Nimbus is down:
Topologies continue to work
Tasks from failing nodes (spouts/bolts) aren’t reassigned
Can’t upload a new topology or rebalance an old one
It is recommended to run Nimbus under daemontools or monit so that it is restarted automatically when it goes down.
(In contrast, in Hadoop, if the JobTracker dies, all the running jobs are lost.)
Contact
For more details about our services, please get in touch with us.
US Office: (408) 365-4638
www.folio3.com