Scalable stream processing with Storm

©2012 Networked Insights Proprietary and confidential Scalable stream processing with Storm Brian Johnson Luke Forehand

Post on 25-Feb-2016

TRANSCRIPT

Page 1: Scalable stream processing with Storm

Scalable stream processing with Storm

Brian Johnson, Luke Forehand

Page 2: Scalable stream processing with Storm

Our Ambition: Marketing Decision Platform

[Figure: map of marketing decision areas grouped under Brand, Advertising, Consumer, Pricing, and Channel. Panel titles: Competitive Positioning, Product Development, Advertising Content, Relationship Marketing, Consumer Promotion, Price/Value Perception, Price Management, E-Commerce, Sales Management, Market Management, Public Relations, Own Stores, Retailer Management, Digital Marketing & Advertising, Influencing, Budgets, Segmentation & Targeting, Traditional Advertising, Brand & Category Environment, Category Management, Brand Health, Choice & Experience.]

Page 3: Scalable stream processing with Storm

Big Data Analytics

• What is “Big Data” to Networked Insights?
  • Almost exclusively social media posts and metadata
  • Twitter (~67%), forums, blogs, Facebook, etc.
• Total index of ~60 billion documents, ~500 TB in production
• ~2 billion new documents per month, and increasing
• Historical data going back to 2009

[Figure: Thematic Clustering (Doppler); labels: Data, Information]

Page 4: Scalable stream processing with Storm

Utilizing Social Media Data

We do two things: 1) filter data; 2) analyze data.

Our filtering technology must accommodate two scenarios:

“I don’t know what I am looking for”: Discovery Technologies
• Doppler
• Word/Phrase Clouds

“I know what I want to find”: Search Technologies
• Elasticsearch
• Named Entity Recognition
• Supervised Machine Learning
• Computational Linguistics

We analyze two types of information: explicit and implicit.

Explicit Information                   | Implicit Information
Post content (words and phrases used)  | Language
Author                                 | Topical themes / categories (sometimes)
Day/Time                               | Tone and sentiment (sometimes)
URL                                    | Gender (sometimes)*
Followers/Following (sometimes)        | Location (sometimes)*
Likes (sometimes)                      | Relative age (sometimes)*

Page 5: Scalable stream processing with Storm

Implicit Information Mining Example

Gender Classification – List of methods and features

1. Author name / author ID analysis: compare both fields against a list of first names from the US Census
2. Twitter summary field analysis
3. Post content features: analyze the content for clues or common characteristics that one gender shows more than another
   1. Text formality – males tend to write more formally than females
   2. Suffix preferences – many suffixes show up more in female posts than in male posts
   3. Word classes – 23 different groups of words that reflect topics or emotions that skew towards one gender more than another
   4. Lexical words & phrases – certain words/phrases that are giveaways, like “my husband”
   5. POS sequences – certain part-of-speech patterns for unigram, bigram, trigram, and quadgram phrases
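
Method 1 in the list above (comparing the author fields against first-name lists) could be sketched roughly as follows. This is only an illustration: the class name `NameGenderSketch`, the tokenization rule, and the caller-supplied name sets are assumptions, not the classifier described in the talk.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a first-name lookup; the name sets are supplied by the caller.
final class NameGenderSketch {
    enum Guess { FEMALE, MALE, UNKNOWN }

    /** Splits an author name or ID into alphabetic tokens and looks each one up. */
    static Guess guess(String authorField, Set<String> femaleNames, Set<String> maleNames) {
        if (authorField == null) {
            return Guess.UNKNOWN;
        }
        // Lower-case the field and split on anything that is not a letter a-z.
        for (String token : authorField.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            boolean female = femaleNames.contains(token);
            boolean male = maleNames.contains(token);
            // Names that appear in both lists are ambiguous and are skipped.
            if (female && !male) {
                return Guess.FEMALE;
            }
            if (male && !female) {
                return Guess.MALE;
            }
        }
        return Guess.UNKNOWN;
    }
}
```

A lookup like this would only be one weak signal; the slide lists it alongside the summary-field and post-content features.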

Page 6: Scalable stream processing with Storm

Lots of data, lots of routing

[Figure: data routing diagram. Labels: Original Documents, Meta Data, Spam Classifiers, Gender Analysis, Sentiment, Age Classification, Topical Categorization, Reporting Layer, SocialSense Application Layer. Example topical categories: iPhone? Samsung? BlackBerry? etc.; Taco Bell? McDonald’s? Subway?; World War Z? Monsters U? White House Down?; Timberlake? Bieber? Jay Z? The legend marks which components run on Storm.]

Page 7: Scalable stream processing with Storm

Storm Agenda

• Overview
• Architecture
• Working Example
• Spout API / Reliability
• Bolt API / Scalability
• Topology Demo
• Monitoring

Page 12: Scalable stream processing with Storm

Overview

• Storm is a realtime distributed processing system
  • Think of Hadoop, but in realtime
  • Data can be transformed and grouped in complex ways using simple constructs
• Storm is reliable and fault tolerant
  • Message delivery is guaranteed
• Storm is easy to configure and scale
  • Each component can be scaled independently
• Components can be written in any language
• Written in Clojure (a functional language), driven by ZeroMQ

Page 13: Scalable stream processing with Storm

Architecture

• Components

Page 16: Scalable stream processing with Storm

Architecture

• Nimbus
  • “Master”
  • Uses ZooKeeper to communicate with Supervisors
  • Responsible for assigning work to Supervisors
• Supervisor
  • Manages a set of workers (JVMs) on each Storm node
  • Receives work assignments from Nimbus
• Worker
  • Managed by a Supervisor
  • Responsible for receiving, executing, and emitting data inside a Storm topology

Page 17: Scalable stream processing with Storm

Working Example

Page 24: Scalable stream processing with Storm

Working Example

• Topology
  • Defines the logical components of a data flow
  • Composed of Spouts, Bolts, and Streams
• A Spout is a special component that emits data tuples into a topology
• A Bolt processes tuples emitted from upstream components and produces zero or more output tuples
• A Stream is a flow of tuples from one component to another; a topology can contain many streams
• A Tuple is a single record containing a named list of values
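
To make the vocabulary above concrete, here is a toy, single-threaded model of a spout feeding a bolt. None of these types are Storm classes: `MiniTopology`, `MiniTuple`, `MiniSpout`, and `MiniBolt` are hypothetical names used only to illustrate the roles, and a real topology runs distributed rather than in one loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the terms above. None of these types are Storm classes.
final class MiniTopology {

    /** A tuple: one record whose values are addressed by field name. */
    static final class MiniTuple {
        private final Map<String, Object> values = new LinkedHashMap<>();

        MiniTuple with(String field, Object value) {
            values.put(field, value);
            return this;
        }

        Object get(String field) {
            return values.get(field);
        }
    }

    /** A spout emits tuples into the topology; null means nothing left to emit. */
    interface MiniSpout {
        MiniTuple nextTuple();
    }

    /** A bolt consumes one tuple and produces zero or more output tuples. */
    interface MiniBolt {
        List<MiniTuple> execute(MiniTuple input);
    }

    /** Drains the spout, then passes the stream through each bolt in order. */
    static List<MiniTuple> run(MiniSpout spout, List<MiniBolt> bolts) {
        List<MiniTuple> stream = new ArrayList<>();
        for (MiniTuple t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            stream.add(t);
        }
        for (MiniBolt bolt : bolts) {
            List<MiniTuple> next = new ArrayList<>();
            for (MiniTuple t : stream) {
                next.addAll(bolt.execute(t));
            }
            stream = next;
        }
        return stream;
    }

    /** Example wiring: a sentence spout followed by a word-splitting bolt. */
    static List<String> splitWords(List<String> sentences) {
        Iterator<String> it = sentences.iterator();
        MiniSpout spout = () -> it.hasNext() ? new MiniTuple().with("sentence", it.next()) : null;
        MiniBolt splitter = input -> {
            List<MiniTuple> out = new ArrayList<>();
            for (String word : ((String) input.get("sentence")).split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new MiniTuple().with("word", word));
                }
            }
            return out;
        };
        List<String> words = new ArrayList<>();
        for (MiniTuple t : run(spout, Arrays.asList(splitter))) {
            words.add((String) t.get("word"));
        }
        return words;
    }
}
```

Note that the splitter emits no tuples at all for an empty sentence, which is the "zero or more" case from the bolt definition.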

Page 25: Scalable stream processing with Storm

Working Example

Page 26: Scalable stream processing with Storm

Spout API

ISpout
    void declareOutputFields(OutputFieldsDeclarer declarer)
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
    void nextTuple()
    void close()

ISpoutOutputCollector
    List<Integer> emit(String streamId, List<Object> tuple, Object messageId)

Page 31: Scalable stream processing with Storm

Reliability

• Each Storm component acknowledges that a tuple has been processed
• An ACK is sent to the upstream component, eventually propagating back to the emitting spout
• The emitting spout will replay the tuple if an ACK is not received within a configured timeout
• Spouts can control the number of “pending” tuples that are in memory in the topology
• Spouts need to transact properly with an upstream data source when a tuple is fully acknowledged
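
The bookkeeping implied by these points (a cap on pending tuples, ACKs that release them, and replay after a timeout) can be sketched in plain Java. This is not Storm's internal acking mechanism; `PendingTracker` and its methods are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of spout-side bookkeeping; not Storm's internal acker.
final class PendingTracker {
    private final int maxPending;
    private final long timeoutMs;
    // Message id -> emit time in ms; iteration order is emit order.
    private final Map<Object, Long> pending = new LinkedHashMap<>();

    PendingTracker(int maxPending, long timeoutMs) {
        if (maxPending <= 0 || timeoutMs <= 0) {
            throw new IllegalArgumentException("maxPending and timeoutMs must be positive");
        }
        this.maxPending = maxPending;
        this.timeoutMs = timeoutMs;
    }

    /** Records an emitted tuple; returns false when the pending limit is reached. */
    boolean emit(Object msgId, long nowMs) {
        if (pending.size() >= maxPending) {
            return false;
        }
        pending.put(msgId, nowMs);
        return true;
    }

    /** Fully acknowledged: the message can now be committed at the data source. */
    boolean ack(Object msgId) {
        return pending.remove(msgId) != null;
    }

    /** Removes and returns the ids whose ACK is overdue; the caller replays them. */
    List<Object> expire(long nowMs) {
        List<Object> toReplay = new ArrayList<>();
        Iterator<Map.Entry<Object, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Object, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                toReplay.add(entry.getKey());
                it.remove();
            }
        }
        return toReplay;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

In this sketch the spout's source-side commit would happen inside `ack`, which matches the last point above: the upstream source is only told a message is done once the tuple is fully acknowledged.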

Page 32: Scalable stream processing with Storm

Reliability

ISpout
    void ack(Object msgId)
    void fail(Object msgId)

Page 34: Scalable stream processing with Storm

Reliability

• MAX_SPOUT_PENDING is the parameter that controls how many pending tuples a spout can emit into a topology
  • Be careful not to artificially decrease throughput!
  • Batching operations with reliability turned on can also create issues

Page 35: Scalable stream processing with Storm

Reliability

• If MAX_SPOUT_PENDING is smaller than the batch size, the topology will collapse: the spout stops emitting before a batch can fill
• If the tuple flow is interrupted, a batch may never fill

Page 36: Scalable stream processing with Storm

Reliability

• Solution: time-based batching with TickTuple
  • A TickTuple exercises the component to prompt a batch commit on a specified interval
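
The size-or-time batching idea can be sketched without any Storm dependency. In the hypothetical `TickBatcher` below, the tick is modelled as a plain `onTick()` call; in a real bolt the tick would arrive as a tuple on a schedule configured in the topology, and the flush would also ACK the batched tuples.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of size-or-time batching; the tick is a plain method call here.
final class TickBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    TickBatcher(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Buffers one value and commits as soon as the batch is full. */
    void onTuple(String value) {
        buffer.add(value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Called on every tick: commits whatever is buffered, even a partial batch. */
    void onTick() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        // A real bolt would write the batch out and acknowledge its tuples here.
        committed.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> committed() {
        return committed;
    }

    int buffered() {
        return buffer.size();
    }
}
```

Because the tick commits partial batches, tuples no longer wait indefinitely when the flow is interrupted or when fewer tuples than the batch size are allowed to be pending.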

Page 37: Scalable stream processing with Storm

Reliability

• Questions?

Page 38: Scalable stream processing with Storm

Bolt API

Page 41: Scalable stream processing with Storm

Bolt API

• Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types.
  • Shuffle grouping – tuples are randomly distributed across the instances of a bolt
  • Fields grouping – the stream is partitioned by the fields specified in the grouping, so tuples with the same values for those fields will always flow to the same bolt instance
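
The difference between the two groupings comes down to how the target bolt instance is chosen. The sketch below is an assumption-laden simplification, not Storm's implementation: `GroupingSketch` is a hypothetical class, and hashing the grouping values modulo the number of tasks is just one straightforward way to get the "same values, same instance" property.

```java
import java.util.List;
import java.util.Random;

// Simplified illustration of task selection; not Storm's grouping code.
final class GroupingSketch {

    /** Shuffle grouping: any task may receive the tuple. */
    static int shuffleTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: the task depends only on the values of the grouping fields. */
    static int fieldsTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }
}
```

With a fields grouping on, say, an author field, every tuple for the same author lands on the same instance, which is what makes per-key counting or aggregation inside a bolt possible.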

Page 42: Scalable stream processing with Storm

Bolt API

Page 43: Scalable stream processing with Storm

Bolt API

IBolt
    void declareOutputFields(OutputFieldsDeclarer declarer)
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector)
    void execute(Tuple input)
    void cleanup()

IOutputCollector
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple)
    void ack(Tuple input)
    void fail(Tuple input)

Page 44: Scalable stream processing with Storm

Bolt API

• You can also build the components of your topology in other languages

public class MyPythonBolt extends ShellBolt {
    public MyPythonBolt() {
        super("python", "mybolt.py");
    }
    ...
}

Page 48: Scalable stream processing with Storm

Scalability

• The goal should be to scale each component so that it keeps up with the realtime data flow
• Scalability is easy and can happen in several ways
  • Increase the number of executors (threads) that work within a component (bolt or spout)
  • Increase the number of workers assigned to a topology
  • Increase the total number of workers available in the cluster

Page 49: Scalable stream processing with Storm

Scalability

[Figure: example topology, increasing the number of executors per component]

Page 50: Scalable stream processing with Storm

Scalability

[Figure: example topology, increasing the number of workers in the topology]
• 2 workers, MySpout with 2 executors, MyBolt with 4 executors
• 4 workers, MySpout with 2 executors, MyBolt with 4 executors

• Work will always be spread evenly across the workers when possible
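
As a rough illustration of "spread evenly when possible", the two configurations above both have 6 executors in total (2 for MySpout plus 4 for MyBolt). The hypothetical `WorkerSpread` helper below assumes a simple round-robin assignment, which is an assumption for illustration and not a description of Storm's scheduler.

```java
// Hypothetical round-robin model of executor placement; not Storm's scheduler.
final class WorkerSpread {

    /** Returns how many executors land on each worker under round-robin assignment. */
    static int[] spread(int executors, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        int[] perWorker = new int[workers];
        for (int e = 0; e < executors; e++) {
            perWorker[e % workers]++;
        }
        return perWorker;
    }
}
```

Under this model, `spread(6, 2)` gives 3 executors per worker, while `spread(6, 4)` gives two workers with 2 executors and two workers with 1, since 6 does not divide evenly by 4.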

Page 51: Scalable stream processing with Storm

Scalability

• Questions?

Page 52: Scalable stream processing with Storm

Topology Demo

• Demonstrate Topology

Page 55: Scalable stream processing with Storm

Monitoring

• Monitoring is important to verify that data throughput is keeping up with the realtime data flow
• Storm provides excellent monitoring via a UI
  • For each topology component, the UI will indicate
    • Tuples transferred
    • Tuples ACKed, tuples failed (timeout)
    • Execute latency in ms (self time)
    • Process latency in ms (total time)
• Nimbus also provides this information via a Thrift service, so one can flexibly collect and aggregate stats (Graphite?)

Page 57: Scalable stream processing with Storm

Monitoring

• Another key indicator of problems is the capacity of a component: if it is at 1.0 or greater, the component is a bottleneck
• If you trend the standard deviation of the throughput of your components (using either average execute or process latency), you can quickly respond to changes in typical data flow
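
Both indicators are simple to compute once the stats have been collected. The sketch below assumes capacity is the fraction of a measurement window that a component spent executing tuples (tuples executed times average execute latency, divided by the window length), which is how the Storm UI's capacity figure is commonly described; `CapacitySketch` and its method names are hypothetical.

```java
// Hypothetical helpers for the two indicators above; inputs come from collected stats.
final class CapacitySketch {

    /** Fraction of the window spent executing tuples; values near 1.0 mean saturation. */
    static double capacity(long executedTuples, double avgExecuteLatencyMs, long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        return executedTuples * avgExecuteLatencyMs / windowMs;
    }

    /** Population standard deviation of a series of latency samples (0 if empty). */
    static double stdDev(double[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / samples.length;
        double squared = 0.0;
        for (double s : samples) {
            squared += (s - mean) * (s - mean);
        }
        return Math.sqrt(squared / samples.length);
    }
}
```

For example, 300,000 tuples at an average of 2 ms each fill an entire 10-minute (600,000 ms) window, so the capacity is 1.0 and the component has no headroom left.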

Page 58: Scalable stream processing with Storm

Monitoring

• Questions?

Page 59: Scalable stream processing with Storm

THANK YOU