storm
stream processing @twitter
Krishna Gade
Twitter
@krishnagade
Sunday, June 16, 13
what is storm?
storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it
happens.
storm v hadoop
storm & hadoop are complementary!
hadoop => big batch processing
storm => fast, reactive, real time processing
origins
• originated at backtype, which twitter acquired in 2011.
• built to vastly simplify dealing with queues & workers.
queue-worker model
queues workers
a a a a a
typical workflow
queues queues
workers workers
datastore
problems
• scaling is painful - queue partitioning & worker deployment.
• operational overhead - worker failures & queue backups.
• no guarantees on data processing.
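The "no guarantees" point can be illustrated with a minimal sketch of a hand-rolled queue-worker loop. This is plain Java, not Storm or any real queue client; the class name `QueueWorkerSketch`, the `drain` method, and the simulated crash are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class QueueWorkerSketch {
    /**
     * Drains the queue with a single worker loop. The worker "crashes" on the
     * message equal to crashOn (a hypothetical failure used for illustration).
     * Returns the messages that were actually processed.
     */
    static List<String> drain(Queue<String> queue, String crashOn) {
        List<String> processed = new ArrayList<>();
        while (!queue.isEmpty()) {
            String msg = queue.poll(); // the message leaves the queue here
            try {
                if (msg.equals(crashOn)) {
                    throw new IllegalStateException("worker crashed on " + msg);
                }
                processed.add(msg);
            } catch (IllegalStateException e) {
                // No ack/replay mechanism: nothing puts msg back on the queue,
                // so it is silently lost.
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(Arrays.asList("a", "b", "c"));
        System.out.println(drain(queue, "b")); // prints [a, c]
    }
}
```

Once a message has been dequeued, a worker failure loses it unless the application adds its own tracking and retry logic.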
storm
what does storm provide?
• at-least-once message processing.
• horizontal scalability.
• no intermediate queues.
• less operational overhead.
• “just works”.
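At-least-once processing means a message is replayed until it is acknowledged, so it may be processed more than once. The sketch below is a plain-Java simulation of that idea, not Storm's implementation; the class `AtLeastOnceSketch`, the `run` method, and the `failOnce` set are hypothetical names, and messages are assumed to be distinct strings.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class AtLeastOnceSketch {
    /**
     * Delivers every message until it is acknowledged. Messages listed in
     * failOnce fail on their first attempt and are replayed once.
     * Returns how many times each message was delivered.
     */
    static Map<String, Integer> run(Iterable<String> messages, Set<String> failOnce) {
        Deque<String> pending = new ArrayDeque<>();
        for (String m : messages) {
            pending.add(m);
        }
        Map<String, Integer> deliveries = new HashMap<>();
        Set<String> alreadyFailed = new HashSet<>();
        while (!pending.isEmpty()) {
            String msg = pending.poll();
            deliveries.merge(msg, 1, Integer::sum);
            // add() returns true only the first time, so each message fails at most once.
            boolean failed = failOnce.contains(msg) && alreadyFailed.add(msg);
            if (failed) {
                pending.add(msg); // "fail": queue the message again for replay
            }
            // otherwise "ack": the message is done and stays out of pending
        }
        return deliveries;
    }

    public static void main(String[] args) {
        Map<String, Integer> d = run(Arrays.asList("a", "b", "c"),
                                     new HashSet<>(Arrays.asList("b")));
        System.out.println(d.get("a") + " " + d.get("b") + " " + d.get("c")); // prints 1 2 1
    }
}
```

Because "b" is delivered twice, downstream logic has to tolerate duplicates, for example by making updates idempotent.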
storm primitives
• streams
• spouts
• bolts
• topologies
streams
unbounded sequence of tuples
T T T T T T T T T T T T T T T
spouts
source of streams
A A A A A A A A A A A A
B B B B B B B B B B B B
typical spouts
• read from a kestrel/kafka queue. {tuples = events}
• read from a http server log. {tuples = http requests}
• read from twitter streaming api. {tuples = tweets}
bolts
process input stream - A
produce output stream - B
A A A A A A A A B B B B B B B B
bolts
• filtering tuples in a stream.
• aggregation of tuples.
• joining multiple streams.
• arbitrary functions on streams.
• communication with external caches/dbs.
topology
directed-acyclic-graph of spouts and bolts.
s1
s2
b1
b2
b3
b4
b5
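A topology of this shape can be modeled as an adjacency list and checked for cycles. The sketch below is plain Java rather than the Storm API. The edges between s1, s2 and b1..b5 are a hypothetical wiring, since the diagram's arrows are not recoverable from the text, and `TopologyDagSketch` is an illustrative name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopologyDagSketch {
    /** Hypothetical wiring: the spouts (s1, s2) have no inputs, the bolts consume streams. */
    static Map<String, List<String>> exampleWiring() {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("s1", Arrays.asList("b1", "b2"));
        edges.put("s2", Arrays.asList("b2", "b3"));
        edges.put("b1", Arrays.asList("b4"));
        edges.put("b2", Arrays.asList("b4"));
        edges.put("b3", Arrays.asList("b5"));
        edges.put("b4", Arrays.asList("b5"));
        edges.put("b5", Collections.<String>emptyList());
        return edges;
    }

    /** Kahn's algorithm: returns a topological order or throws if the graph has a cycle. */
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String to : e.getValue()) {
                inDegree.merge(to, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey()); // components with no inputs, i.e. the spouts
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String to : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (inDegree.merge(to, -1, Integer::sum) == 0) {
                    ready.add(to);
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("cycle detected");
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(topologicalOrder(exampleWiring())); // prints [s1, s2, b1, b2, b3, b4, b5]
    }
}
```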
storm cluster
nimbus
supervisor
w1 w2 w3 w4
supervisor
w1 w2 w3 w4
ZK
topology map
sync code
topology submission
master node
slave nodes
nimbus
• master node.
• manages the topologies.
• analogous to the job tracker in hadoop.
$ storm jar myapp.jar com.twitter.MyTopology demo
supervisor
• runs on slave nodes.
• co-ordinates with zookeeper.
• manages workers.
worker
jvm process
executor
task task
task
task
executor executor
recap
• worker - process that executes a subset of a topology.
• executor - a thread spawned by a worker.
• task - performs the actual data processing.
stream grouping
• shuffle grouping - random distribution of tuples.
• fields grouping - partitions tuples by the value of a field, so equal values go to the same task.
• all grouping - replicates each tuple to all tasks.
• global grouping - sends the entire stream to one task.
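The four groupings differ only in how they pick target tasks for a tuple. The sketch below models that choice in plain Java under simplifying assumptions: tasks are numbered 0..numTasks-1, fields grouping uses hash-mod, and global grouping picks task 0 (standing in for the lowest task id). `GroupingSketch` and its methods are hypothetical names, not Storm classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class GroupingSketch {
    /** shuffle grouping: one randomly chosen task. */
    static List<Integer> shuffle(int numTasks, Random rand) {
        return Collections.singletonList(rand.nextInt(numTasks));
    }

    /** fields grouping: the same field value always maps to the same task (hash-mod assumption). */
    static List<Integer> fields(Object fieldValue, int numTasks) {
        return Collections.singletonList(Math.floorMod(fieldValue.hashCode(), numTasks));
    }

    /** all grouping: every task receives a copy of the tuple. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    /** global grouping: the whole stream goes to a single task (task 0 in this sketch). */
    static List<Integer> global() {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("shuffle -> " + shuffle(numTasks, new Random()));
        System.out.println("fields  -> " + fields("storm", numTasks));
        System.out.println("all     -> " + all(numTasks)); // prints all     -> [0, 1, 2, 3]
        System.out.println("global  -> " + global());      // prints global  -> [0]
    }
}
```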
streaming word-count
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);
builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
       .shuffleGrouping("tweet_spout")
       .setNumTasks(2);
builder.setBolt("count_bolt", new WordCountBolt(), 12)
       .fieldsGrouping("parse_bolt", new Fields("word"));

Config config = new Config();
config.setNumWorkers(3);
StormSubmitter.submitTopology("demo", config, builder.createTopology());
tweet spout
class RandomTweetSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random rand;
    String[] tweets = new String[] {
        "@jkrums: There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
        "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
        ...
    };
    ....
    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String tweet = tweets[rand.nextInt(tweets.length)];
        collector.emit(new Values(tweet));
    }
}
parse bolt
class ParseTweetBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String tweet = tuple.getString(0);
        for (String word : tweet.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
word count bolt
class WordCountBolt extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
word-count topology
RandomTweetSpout --(shuffle grouping)--> ParseTweetBolt --(fields grouping)--> WordCountBolt
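The count bolt keeps its counts in a per-task in-memory map, which is why it needs a fields grouping on "word": every occurrence of a word must reach the same task. The sketch below simulates that dataflow in plain Java without a Storm cluster. The hash-mod routing, the sample tweets, and the name `WordCountSketch` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordCountSketch {
    /** Splits tweets into words and routes each word to one of numCountTasks per-task maps. */
    static List<Map<String, Integer>> run(String[] tweets, int numCountTasks) {
        List<Map<String, Integer>> taskCounts = new ArrayList<>();
        for (int i = 0; i < numCountTasks; i++) {
            taskCounts.add(new HashMap<String, Integer>());
        }
        for (String tweet : tweets) {
            for (String word : tweet.split(" ")) {                        // parse step
                int task = Math.floorMod(word.hashCode(), numCountTasks); // fields grouping on "word"
                taskCounts.get(task).merge(word, 1, Integer::sum);        // count step
            }
        }
        return taskCounts;
    }

    /** Number of tasks whose local map contains the word. */
    static int tasksHolding(List<Map<String, Integer>> taskCounts, String word) {
        int holders = 0;
        for (Map<String, Integer> counts : taskCounts) {
            if (counts.containsKey(word)) {
                holders++;
            }
        }
        return holders;
    }

    /** Sum of the word's count over all tasks. */
    static int total(List<Map<String, Integer>> taskCounts, String word) {
        int sum = 0;
        for (Map<String, Integer> counts : taskCounts) {
            sum += counts.getOrDefault(word, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] tweets = { "storm is fast", "storm is reactive" };
        List<Map<String, Integer>> taskCounts = run(tweets, 12);
        System.out.println(total(taskCounts, "storm"));        // prints 2
        System.out.println(tasksHolding(taskCounts, "storm")); // prints 1
    }
}
```

With a shuffle grouping instead, the occurrences of one word could be spread over several tasks, and no single task would hold the full count.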
how do we run storm @twitter?
storm on mesos
node node node node
mesos
storm (production)
storm (dev)
we run multiple instances of storm on the same cluster via mesos.
mesos provides efficient resource isolation and sharing across distributed frameworks such as storm.
topology isolation
the isolation scheduler solves the problem of multi-tenancy (resource contention between topologies sharing a cluster) by providing full isolation between topologies.
topology isolation
• shared pool - multiple topologies can run on the same host.
• isolated pool - dedicated set of hosts to run a single topology.
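The split between the two pools can be pictured with a small model: topologies that request isolation get dedicated hosts, and whatever is left forms the shared pool. This is a plain-Java illustration only, not the isolation scheduler's actual algorithm. The host names, the requested host counts, and the class `IsolationPoolSketch` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IsolationPoolSketch {
    /** topology name -> hosts dedicated to it */
    final Map<String, List<String>> isolated = new LinkedHashMap<>();
    /** hosts that any non-isolated topology may share */
    final List<String> shared = new ArrayList<>();

    IsolationPoolSketch(List<String> hosts, Map<String, Integer> isolatedRequests) {
        List<String> free = new ArrayList<>(hosts);
        for (Map.Entry<String, Integer> req : isolatedRequests.entrySet()) {
            int wanted = req.getValue();
            if (wanted > free.size()) {
                throw new IllegalArgumentException("not enough hosts for " + req.getKey());
            }
            // Carve the dedicated hosts out of the free list.
            isolated.put(req.getKey(), new ArrayList<>(free.subList(0, wanted)));
            free.subList(0, wanted).clear();
        }
        shared.addAll(free); // everything left over is the shared pool
    }

    public static void main(String[] args) {
        Map<String, Integer> requests = new LinkedHashMap<>();
        requests.put("joe", 2);
        requests.put("jane", 1);
        IsolationPoolSketch pools = new IsolationPoolSketch(
            Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6"), requests);
        System.out.println(pools.isolated); // prints {joe=[h1, h2], jane=[h3]}
        System.out.println(pools.shared);   // prints [h4, h5, h6]
    }
}
```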
topology isolation
shared pool
storm cluster
topology isolation
shared pool
storm cluster
joe’s topology
isolated pools
topology isolation
shared pool
storm cluster
joe’s topology
isolated pools
jane’s topology
topology isolation
shared pool
storm cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
topology isolation
X
shared pool
storm cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
host failure
topology isolation
shared pool
storm cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
repair host
add host
topology isolation
shared pool
storm cluster
joe’s topology
isolated pools
jane’s topology
dave’s topology
add to shared pool
numbers
• benchmarked at a million tuples processed per second per node.
• running 30 topologies in a 200 node cluster.
• processing 50 billion messages a day with an average complete latency under 50 ms.
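As a rough sanity check on these figures, the daily volume can be converted into an average rate. The sketch assumes messages are spread evenly over the day and over all 200 nodes, which real traffic will not be, so these are averages and not peak rates. `ThroughputSketch` is a hypothetical name.

```java
final class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average messages per second for a given daily volume. */
    static double perSecond(double messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double clusterRate = perSecond(50e9);  // 50 billion messages a day
        double perNodeRate = clusterRate / 200; // spread over 200 nodes
        System.out.printf("%.0f msgs/s cluster-wide%n", clusterRate); // prints 578704 msgs/s cluster-wide
        System.out.printf("%.0f msgs/s per node%n", perNodeRate);     // prints 2894 msgs/s per node
    }
}
```

Under those assumptions the average per-node load is well below the benchmarked million tuples per second per node.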
storm use-cases @twitter
stream processing applications
tweets
favorites, retweets
impressions
twitter streams
storm
spout
bolt
bolt
$$$$
realtime dashboards
new features
current use-cases
• discovery of emerging topics/stories.
• online learning of tweet features for search result ranking.
• realtime analytics for ads.
• internal log processing.
tweet scoring pipeline
tweets
data streams
impressions
interactions
storm topology
graph store
metadata store
join: tweets, impressions
join: tweets, interactions
last 7 days of: tweet -> feature_val, feature_type, timestamp
persistent store: tweet -> feature_val, feature_type, timestamp
thrift service
cassandra
twemcache
input: tweet id
output: score
write tweet features
road ahead
• auto scaling.
• persistent bolts.
• better grouping schemes.
• replicated computation.
• higher-level abstractions.
companies using storm
questions?
project: https://storm-project.net
mailing-list: http://groups.google.com/group/storm-user