©2012 Networked Insights. Proprietary and confidential.
Scalable stream processing with Storm
Brian Johnson
Luke Forehand
Our Ambition: Marketing Decision Platform
Diagram sections: Brand, Advertising, Consumer, Pricing, Channel
• Competitive Positioning: Position, Loyalty, Core Benefit & Differentiation
• Product Development: Features & Functions, Design, Cost, Structure, Packaging, Quality
• Advertising Content: Message, Naming & Taglines
• Relationship Marketing: CRM, Engagement, Owned Social Engagement
• Consumer Promotion: Direct, Coupon
• Price/Value Perception: Price Justification, Price Change Response
• Price Management: Competitive Pricing, Price Optimization
• E-Commerce: Online
• Sales Management: Demand Planning, Sales Analysis
• Global & Local Market Management
• Channel Management
• Public Relations: Buzz Generation, Damage Control
• Own Stores: Owned Stores
• Retailer Management: Distribution, Loyalty Program, Price & Costs, Feature Promotion, In-Store Promotion
• Digital Marketing & Advertising: Owned Media, Search, Email, Social Ad/Display Ad, Mobile, Tracking & Attribution
• Influencing: Influence & Advocacy, Endorsers & Spokespeople, Partnerships, Sponsorships
• Budgets: Marketing & Media Mix, Budget Planning
• Segmentation & Targeting: Brand & Category, Demos & Geos, Lifestyles, Behavioral & Attitudinal, Lifestages, Trends
• Traditional Advertising: TV, Radio, Out of Home, Print
• Brand & Category Environment: Substitutes, Complements, Category Trends, Unmet Needs, Product Lifecycle, Roles & Portfolio, Value Chain, Laws & Regulations, External Forces (i.e. economy)
• Category Management: Assortment, Price, Promotion & Co-marketing
• Brand Health: Equity, Image & Personality, Perceptions & Associations
• Choice & Experience: Choice, Engagement, Experience & Usage, Purchase Funnel
Big Data Analytics
• What is “Big Data” to Networked Insights?
  • Almost exclusively social media posts and metadata
  • Twitter (~67%), Forums, Blogs, Facebook, etc.
• Total index ~60 Billion documents, ~500 TB in production
  • New documents of 2 Billion/month, increasing
  • Historical data going back to 2009
Diagram labels: Data, Thematic Clustering (Doppler), Information
Utilizing Social Media Data
We do two things: 1) Filter data; 2) Analyze data
Our filtering technology must accommodate two scenarios:
• I don’t know what I am looking for: Discovery Technologies
  • Doppler
  • Word/Phrase Clouds
• I know what I want to find: Search Technologies
  • Elastic Search
  • Named Entity Recognition
  • Supervised Machine Learning
  • Computational Linguistics
We analyze 2 types of information: Implicit & Explicit
Explicit Information                  | Implicit Information
Post Content (words and phrases used) | Language
Author                                | Topical Themes / Categories (sometimes)
Day/Time                              | Tone and Sentiment (sometimes)
URL                                   | Gender (sometimes)*
Followers/Following (sometimes)       | Location (sometimes)*
Likes (sometimes)                     | Relative age (sometimes)*
Implicit Information Mining Example
Gender Classification – List of methods and features
1. Author name / author ID analysis: compare both fields against a list of first names from the US Census
2. Twitter summary field analysis
3. Post content features: analyze the content for certain clues or common characteristics that one gender has over another
   1. Text formality – males tend to have more formality than females
   2. Suffix preferences – many suffixes show up more in female posts than male
   3. Word classes – 23 different groups of words that reflect certain topics or emotions that skew towards one gender more than another
   4. Lexical words & phrases – certain words/phrases that are giveaways like “my husband”
   5. POS sequences – certain part of speech patterns for unigram, bigram, trigram, and quadgram phrases
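As an illustration of the POS sequences feature, here is a minimal sketch of turning a tagged post into unigram through quadgram patterns. The class name `PosNgrams`, the `_` separator, and the example tags are assumptions made for this sketch; the tagger and the classifier that would consume the features are out of scope and are not described in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any Networked Insights code.
final class PosNgrams {
    /** Returns every contiguous run of n tags, joined with '_'. */
    static List<String> ngrams(List<String> tags, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= tags.size(); i++) {
            out.add(String.join("_", tags.subList(i, i + n)));
        }
        return out;
    }

    /** Unigram, bigram, trigram, and quadgram patterns in one feature list. */
    static List<String> features(List<String> tags) {
        List<String> out = new ArrayList<String>();
        for (int n = 1; n <= 4; n++) {
            out.addAll(ngrams(tags, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tag sequence for a three-word post; any tag set would work.
        List<String> tags = Arrays.asList("PRP$", "NN", "VBZ");
        System.out.println(features(tags));
    }
}
```

Each pattern would then become one feature whose frequency is compared between posts by male and female authors.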
Lots of data, lots of routing
Diagram labels:
• Original Documents
• Spam Classifiers
• Gender Analysis
• Sentiment
• Age Classification
• Topical Categorization
  • iPhone? Samsung? BlackBerry? Etc.
  • Taco Bell? McDonald’s? Subway?
  • World War Z? Monsters U? White House Down?
  • Timberlake? Bieber? Jay Z?
• Meta Data
• Reporting Layer
• SocialSense Application Layer
Legend: marked components = Storm
Storm Agenda
• Overview
• Architecture
• Working Example
• Spout API / Reliability
• Bolt API / Scalability
• Topology Demo
• Monitoring
Overview
• Storm is a realtime distributed processing system
  • Think of Hadoop but in realtime
  • Data can be transformed and grouped in complex ways using simple constructs
• Storm is reliable and fault tolerant
  • Message delivery is guaranteed
• Storm is easy to configure and scale
  • Each component can be scaled independently
• Components can be written in any language
• Storm itself is written in Clojure (a functional language) and driven by ZeroMQ
Architecture
• Components
Architecture
• Nimbus
  • “Master”
  • Uses ZooKeeper to communicate with Supervisors
  • Responsible for assigning work to Supervisors
• Supervisor
  • Manages a set of workers (JVMs) on each Storm node
  • Receives work assignments from Nimbus
• Worker
  • Managed by Supervisor
  • Responsible for receiving, executing, and emitting data inside a Storm topology
Working Example
• Topology
  • Defines the logical components of a data flow
  • Composed of Spouts, Bolts, Streams
  • Spout is a special component that emits data tuples into a topology
  • Bolt processes tuples emitted from upstream components and produces zero or many output tuples
  • Stream is a flow of tuples from one component to another; a topology can have many streams
  • Tuple is a single record containing a named list of values
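To make the tuple and bolt definitions concrete, here is a small plain-Java model. It deliberately does not use the Storm API: `TupleFlowSketch`, `NamedTuple`, and `splitWords` are hypothetical names, and the field names `text` and `word` are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the concepts above; not Storm's own classes.
final class TupleFlowSketch {
    /** A single record: field names plus values in the same order. */
    static final class NamedTuple {
        final List<String> fields;
        final List<Object> values;

        NamedTuple(List<String> fields, List<Object> values) {
            if (fields.size() != values.size()) {
                throw new IllegalArgumentException("fields and values must align");
            }
            this.fields = fields;
            this.values = values;
        }

        Object getValueByField(String field) {
            int i = fields.indexOf(field);
            if (i < 0) {
                throw new IllegalArgumentException("unknown field: " + field);
            }
            return values.get(i);
        }
    }

    /** Bolt-like step: one input tuple in, zero or many output tuples out. */
    static List<NamedTuple> splitWords(NamedTuple post) {
        List<NamedTuple> out = new ArrayList<NamedTuple>();
        String text = ((String) post.getValueByField("text")).trim();
        if (text.isEmpty()) {
            return out; // zero output tuples for an empty post
        }
        for (String word : text.split("\\s+")) {
            out.add(new NamedTuple(Arrays.asList("word"), Arrays.<Object>asList(word)));
        }
        return out;
    }

    public static void main(String[] args) {
        NamedTuple post = new NamedTuple(
                Arrays.asList("text"), Arrays.<Object>asList("storm is fast"));
        for (NamedTuple t : splitWords(post)) {
            System.out.println(t.getValueByField("word"));
        }
    }
}
```

In a real topology the spout would emit the `text` tuples and the stream between the two components would carry them to the bolt.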
Spout API
ISpout (declareOutputFields comes from IComponent; IRichSpout combines both)
void declareOutputFields(OutputFieldsDeclarer declarer)
void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
void nextTuple()
void close()

ISpoutOutputCollector
List<Integer> emit(String streamId, List<Object> tuple, Object messageId)
Reliability
• Each Storm component acknowledges that a tuple has been processed
  • An ACK is sent to the upstream component, eventually propagating back to the emitting spout
  • The emitting spout will replay the tuple if ACK is not received within a configured timeout
• Spouts can control the number of “pending” tuples that are in memory in the topology
• Spouts need to transact properly with an upstream data source when a tuple is fully acknowledged
Reliability
ISpout
void ack(Object msgId)
void fail(Object msgId)
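The bookkeeping behind ack, replay on timeout, and a cap on pending tuples can be sketched in plain Java. This is a toy model under stated assumptions, not how Storm implements it internally: `PendingTracker` and its methods are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the pending-tuple bookkeeping; not Storm's own code.
final class PendingTracker {
    private final int maxPending;
    private final long timeoutMs;
    // msgId -> emit time; insertion order matches emit order
    private final Map<Object, Long> pending = new LinkedHashMap<Object, Long>();

    PendingTracker(int maxPending, long timeoutMs) {
        this.maxPending = maxPending;
        this.timeoutMs = timeoutMs;
    }

    /** True if another tuple may be emitted without exceeding the cap. */
    boolean canEmit() {
        return pending.size() < maxPending;
    }

    void emitted(Object msgId, long nowMs) {
        if (!canEmit()) {
            throw new IllegalStateException("too many pending tuples");
        }
        pending.put(msgId, nowMs);
    }

    /** Tuple fully processed: this is the point to commit upstream. */
    void ack(Object msgId) {
        pending.remove(msgId);
    }

    /** Returns the ids whose ACK did not arrive in time; the caller replays them. */
    List<Object> expire(long nowMs) {
        List<Object> timedOut = new ArrayList<Object>();
        for (Map.Entry<Object, Long> e : pending.entrySet()) {
            if (nowMs - e.getValue() >= timeoutMs) {
                timedOut.add(e.getKey());
            }
        }
        pending.keySet().removeAll(timedOut);
        return timedOut;
    }

    public static void main(String[] args) {
        PendingTracker tracker = new PendingTracker(2, 1000);
        tracker.emitted("a", 0);
        tracker.emitted("b", 100);
        tracker.ack("a");
        System.out.println(tracker.expire(1100)); // ids to replay
    }
}
```

A spout built this way would consult `canEmit()` in `nextTuple()`, call `ack` from `ack(Object msgId)`, and re-emit from `fail(Object msgId)`.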
Reliability
• MAX_SPOUT_PENDING (the topology.max.spout.pending setting) is the parameter to control how many pending tuples a spout can emit into a topology
  • Be careful not to artificially decrease throughput!
  • Batching operations with reliability turned on can also create issues
Reliability
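• If max_spout_pending is smaller than the batch size, the topology will collapse: the batch can never fill, so its tuples are never acked and eventually time out
• If there is an interruption in tuple flow, a batch may never fill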
Reliability
• Solution: time based batching with TickTuple
  • TickTuple exercises the component to prompt a batch commit on a specified interval
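A minimal sketch of time based batching, written in plain Java rather than against the Storm API. `TickBatcher`, `onTuple`, and `onTick` are hypothetical names; in a real bolt, `execute(Tuple input)` would decide which of the two to call depending on whether the input is a tick tuple, and the commit would write to a data store and ack the batched tuples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching helper; the "commit" here just records the batch.
final class TickBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<String>();
    private final List<List<String>> committed = new ArrayList<List<String>>();

    TickBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Regular tuple: buffer it and commit once the batch is full. */
    void onTuple(String value) {
        buffer.add(value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Tick: commit whatever is buffered, even if the batch is not full. */
    void onTick() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return; // nothing to commit on an idle tick
        }
        committed.add(new ArrayList<String>(buffer));
        buffer.clear();
    }

    List<List<String>> committed() {
        return committed;
    }

    public static void main(String[] args) {
        TickBatcher batcher = new TickBatcher(3);
        batcher.onTuple("a");
        batcher.onTuple("b");
        batcher.onTick(); // flow paused: the partial batch is still committed
        System.out.println(batcher.committed());
    }
}
```

Because a tick always flushes, a partial batch is never held longer than one tick interval, which avoids both failure cases from the previous slide.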
Reliability
• Questions?
Bolt API
• Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types.
  • Shuffle grouping – tuples are randomly distributed across the instances of a bolt
  • Fields grouping – stream is partitioned by fields specified in the grouping, so tuples with a particular named value will always flow to the same bolt instance
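The difference between the two groupings can be sketched as two ways of picking a target bolt instance. This is an illustration only: `GroupingSketch` is a hypothetical class, hashing the grouping values modulo the instance count is an assumed simplification of what Storm does, and the random pick ignores any load balancing a real shuffle grouping may add.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of instance selection; not Storm's implementation.
final class GroupingSketch {
    /** Fields grouping: equal grouping values always map to the same instance. */
    static int fieldsGrouping(List<Object> groupingValues, int numInstances) {
        return Math.floorMod(groupingValues.hashCode(), numInstances);
    }

    /** Shuffle grouping: any instance may receive the tuple. */
    static int shuffleGrouping(Random rng, int numInstances) {
        return rng.nextInt(numInstances);
    }

    public static void main(String[] args) {
        int instances = 4;
        List<Object> brand = Arrays.<Object>asList("iphone");
        // Same value, same instance, no matter how often it is seen.
        System.out.println(fieldsGrouping(brand, instances)
                == fieldsGrouping(Arrays.<Object>asList("iphone"), instances));
        // Shuffle picks an instance independently of the tuple's content.
        System.out.println(shuffleGrouping(new Random(), instances));
    }
}
```

Fields grouping is what makes per-key state (for example a running count per brand) safe to keep inside one bolt instance.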
Bolt API
IBolt (declareOutputFields comes from IComponent; IRichBolt combines both)
void declareOutputFields(OutputFieldsDeclarer declarer)
void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
void execute(Tuple input)
void cleanup()

IOutputCollector
List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple)
void ack(Tuple input)
void fail(Tuple input)
Bolt API
• You can also build the components of your topology in other languages

public class MyPythonBolt extends ShellBolt {
    public MyPythonBolt() {
        // Command used to launch the Python implementation of this bolt
        super("python", "mybolt.py");
    }
    ...
}
Scalability
• The goal should be to scale each component so that it keeps up with realtime data flow
• Scalability is easy and can happen in several ways
  • Increase the number of executors (threads) that work within a component (bolt or spout)
  • Increase the number of workers assigned to a topology
  • Increase total workers available in cluster
Scalability
Example Topology increasing number of executors per component
Scalability
Example Topology increasing number of workers in the topology
• 2 workers, MySpout with 2 executors, MyBolt with 4 executors
• 4 workers, MySpout with 2 executors, MyBolt with 4 executors
• Work will always be spread evenly across the workers when possible
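The two configurations above can be worked through with a round-robin assignment of executors to workers. This is only a sketch of "spread evenly when possible": `EvenAssignment` is a hypothetical class, the executor labels are made up for the example, and Storm's actual scheduler is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical round-robin placement; not Storm's scheduler.
final class EvenAssignment {
    /** Deals executors out to workers one at a time, like cards. */
    static List<List<String>> assign(List<String> executors, int numWorkers) {
        List<List<String>> workers = new ArrayList<List<String>>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(new ArrayList<String>());
        }
        for (int i = 0; i < executors.size(); i++) {
            workers.get(i % numWorkers).add(executors.get(i));
        }
        return workers;
    }

    public static void main(String[] args) {
        // MySpout with 2 executors, MyBolt with 4 executors (labels are illustrative).
        List<String> executors = Arrays.asList(
                "MySpout-1", "MySpout-2", "MyBolt-1", "MyBolt-2", "MyBolt-3", "MyBolt-4");
        System.out.println(assign(executors, 2)); // 3 executors per worker
        System.out.println(assign(executors, 4)); // 2, 2, 1, 1 executors
    }
}
```

With 6 executors, 2 workers can be balanced exactly, while 4 workers cannot, which is why the slide says "when possible".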
Scalability
• Questions?
Topology Demo
• Demonstrate Topology
Monitoring
• Monitoring is important to verify data throughput is keeping up with realtime data flow
• Storm provides excellent monitoring via a UI
  • UI per topology component will indicate
    • Tuples transferred
    • Tuples ACKd, tuples failed (timeout)
    • Execute Latency ms (self time)
    • Process Latency ms (total time)
  • Nimbus also provides this interface via Thrift service so one can flexibly collect and aggregate stats (Graphite?)
Monitoring
• Another key indicator of problems is the capacity of a component: if it is at 1.0 or greater, the component is a bottleneck
• If you trend the standard deviation of the throughput of your components (using either average execute or process latency), you can quickly respond to changes in typical data flow
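Both indicators can be computed from collected stats with a few lines of arithmetic. The sketch below assumes capacity is the fraction of a measurement window a component spent executing (tuples executed times average execute latency, divided by the window length), and it flags a latency sample that lies more than k standard deviations from the historical mean. `MonitoringSketch`, the threshold `k`, and the sample numbers are assumptions made for illustration.

```java
// Hypothetical helpers for stats collected from the UI or the Thrift service.
final class MonitoringSketch {
    /** Fraction of the window spent executing; 1.0 or more suggests a bottleneck. */
    static double capacity(long executed, double avgExecuteLatencyMs, long windowMs) {
        return executed * avgExecuteLatencyMs / windowMs;
    }

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Population standard deviation of the samples. */
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / xs.length);
    }

    /** True if the latest sample is more than k standard deviations from the mean. */
    static boolean isAnomalous(double[] history, double latest, double k) {
        double sd = stdDev(history);
        if (sd == 0.0) {
            return latest != mean(history);
        }
        return Math.abs(latest - mean(history)) > k * sd;
    }

    public static void main(String[] args) {
        // 60,000 tuples at 10 ms each in a 10 minute window: busy the whole time.
        System.out.println(capacity(60000, 10.0, 600000));
        double[] latencyMs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(isAnomalous(latencyMs, 12.0, 3.0));
    }
}
```

Feeding these values into a time-series store makes it possible to alert on capacity approaching 1.0 or on latency drifting away from its usual range.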
Monitoring
• Questions?
THANK YOU