Twitter Storm
Slides for an internal tech talk about Twitter Storm
Twitter Storm: realtime distributed computations
Sergey Lukjanov, Dmitry Mescheryakov
Wednesday, October 3, 12

Real-time data processing
before Twitter Storm:
    a network of queues and workers;
    message routing can be complex!

Real-time data processing
queue replication is needed for reliability;
queues are hard to maintain;
each new computation branch requires routing reconfiguration.

Twitter Storm
distributed;
fault-tolerant;
real-time;
computation;
fail-fast components.

(Very) basic info
created by Nathan Marz from BackType/Twitter;
Eclipse Public License 1.0;
open-sourced on September 19th, 2011;
about 16k Java and 7k Clojure LoC;
most watched Java repo on GitHub (> 4k watchers);
active user group.

Current status
current stable release: 0.8.1;
0.8.2 with small bug fixes is already on the way;
0.9.0 with major core improvements is planned;
contributions are not very active, so we could try to get involved;
used by over 30 companies (such as Twitter, Groupon, Alibaba, GumGum, etc.).

Key properties
extremely broad set of use cases:
    stream processing;
    database updating;
    distributed RPC;
scalable and extremely robust;
guarantees no data loss;
fault-tolerant;
programming language agnostic.

Key concepts
Tuples (ordered lists of elements)
(“Saratov”, “slukjanov”, “event1”, “10/3/12 16:20”)

Key concepts
Streams (unbounded sequences of tuples)
[diagram: a stream drawn as a sequence of tuples]

Key concepts
Spouts (sources of streams)
[diagram: a spout emitting a stream of tuples]
Spouts can talk with:
    queues;
    logs;
    API calls;
    event data.

Key concepts
Bolts (process tuples and create new streams)
[diagram: a bolt consuming a stream of tuples and emitting new streams]

Key concepts
You can do the following things in Bolts:
    apply functions / transformations;
    filter;
    aggregate;
    do streaming joins;
    access DBs, APIs, etc.

Key concepts
Topologies (directed graphs of Spouts and Bolts)
[diagram: spouts and bolts connected by streams of tuples]

Key concepts
Tasks (instances of spouts and bolts)
[diagram: Task 1, Task 2, Task 3, Task 4]

Key concepts
Cluster
Nimbus (similar to Hadoop's JobTracker);
Zookeeper nodes;
Supervisor nodes (similar to Hadoop's TaskTracker);
UI.

Based on
Apache Zookeeper (maintaining configs);
ØMQ (transport layer);
Apache Thrift (cross-language bridge, RPC);
LMAX Disruptor (bounded producer-consumer queue);
Kryo (serialization framework).

Grouping
shuffle (tuples are randomly and evenly distributed);
local or shuffle (tasks in the local worker are preferred);
fields (the stream is partitioned by the specified fields);
all (the stream is replicated across all the bolt's tasks);
global (the entire stream goes to a single task of the bolt);
direct (the producer decides which consumer task receives each tuple);
custom (implement the CustomStreamGrouping interface).

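How a fields grouping keeps equal keys on the same task can be sketched in plain Java. This is only an illustration under an assumed hash-modulo scheme, not Storm's routing code; the class `GroupingSketch` and the method `chooseTask` are hypothetical names introduced for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Storm.
class GroupingSketch {
    // Maps the values of the grouping fields to a task index in [0, numTasks).
    // Assumption: hash of the field values modulo the number of tasks.
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 12; // arbitrary number of consumer tasks
        for (String word : Arrays.asList("the", "cow", "the", "moon")) {
            List<Object> key = Arrays.<Object>asList(word);
            System.out.println(word + " -> task " + chooseTask(key, numTasks));
        }
    }
}
```

Both occurrences of "the" print the same task index, which is the property a per-word counter relies on. A shuffle grouping would instead pick the index at random.
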
WordCount sample
random sentence generator;
sentence splitter;
word counter;
ping spout (metronome).

WordCount sample
[diagram: SENTENCE GENERATOR -> SENTENCE SPLITTER -> (group by word) -> WORD COUNTER -> DB; PING GENERATOR -> WORD COUNTER]

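Sentence generator
import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class RandSentenceGenerator extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private String[] sentences;

    @Override
    public void open(Map map, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
        this.sentences = <sentences array>;
    }

    // Called by Storm in a loop: emits one random sentence about every 10 ms.
    @Override
    public void nextTuple() {
        Utils.sleep(10);
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
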
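Sentence splitter
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitSentence extends BaseBasicBolt {
    // Emits one "word" tuple per whitespace-separated token of the sentence.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
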
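Word count
import java.util.Map;

import org.apache.log4j.Logger; // assumption: the log4j Logger
import com.google.common.collect.HashMultiset;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class WordCount extends BaseBasicBolt {
    private HashMultiset<String> words = HashMultiset.create();
    // Set in prepare(); the logger is transient because bolts are serialized.
    private transient Logger logger;
    private String name;
    private int task;

    @Override
    public void prepare(Map conf, TopologyContext ctx) {
        super.prepare(conf, ctx);
        this.logger = Logger.getLogger(this.getClass());
        this.name = ctx.getThisComponentId();
        this.task = ctx.getThisTaskIndex();
    }

    // Counts words coming from "split"; dumps the counts on every "ping" tuple.
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String source = tuple.getSourceComponent();
        if ("split".equals(source)) {
            words.add(tuple.getString(0));
        } else if ("ping".equals(source)) {
            logger.warn("RESULT " + name + ":" + task + " :: " + words);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
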
Topology builder
public class WordCounter {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Second argument of setSpout/setBolt is the parallelism hint.
        builder.setSpout("source", new RandSentenceGenerator(), 3);
        builder.setSpout("ping", new PingSpout());
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("source");
        // Equal words always reach the same task; pings reach every task.
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word"))
               .allGrouping("ping");
        <topology submitting>
    }
}

Topology submitter
public class WordCounter {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        <building topology>
        Config conf = new Config();
        conf.setDebug(true);
        conf.setNumWorkers(3);
        StormSubmitter.submitTopology("tplg-name", conf, builder.createTopology());
    }
}

Multilang support
DSLs for Scala, JRuby and Clojure;
ShellSpout, ShellBolt;
JSON-based protocol:
    receive/emit tuples;
    ack/fail tuples;
    write to logs.

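The kind of messages a non-JVM component writes to its standard output can be sketched as follows. The message shapes (a `command` field, an `id` or `msg` field, and a line containing `end` after each JSON document) follow the Storm multilang protocol as commonly documented, but treat them as an assumption and check the protocol documentation of the Storm version in use. The class `MultilangMessages` and its methods are hypothetical, and the sketch does no JSON escaping.

```java
// Hypothetical helper, not part of Storm: builds multilang-style messages.
class MultilangMessages {
    // Assumption: every JSON message is terminated by a line containing "end".
    static String frame(String json) {
        return json + "\nend\n";
    }

    // Tells the parent process that the tuple with this ID was processed.
    static String ack(String tupleId) {
        return frame("{\"command\": \"ack\", \"id\": \"" + tupleId + "\"}");
    }

    // Tells the parent process that processing of the tuple failed.
    static String fail(String tupleId) {
        return frame("{\"command\": \"fail\", \"id\": \"" + tupleId + "\"}");
    }

    // Asks the parent process to write a line to the worker log.
    // No escaping is done, so msg must not contain quotes or backslashes.
    static String log(String msg) {
        return frame("{\"command\": \"log\", \"msg\": \"" + msg + "\"}");
    }

    public static void main(String[] args) {
        System.out.print(ack("123"));
        System.out.print(log("hello from a shell bolt"));
    }
}
```
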
Online logs processing
[diagram: Applications -> QUEUE (RabbitMQ) -> Twitter Storm Cluster -> STORAGE (Cassandra)]

Online logs processing
[diagram: RabbitMQ -(log messages)-> QUEUE CONSUMER -(messages, shuffle grouping)-> MESSAGE PARSER -(events, fields grouping)-> EVENT AGGREGATOR -(realtime info & stats)-> Cassandra]

Storm fault-tolerance
Parts of a Storm cluster:
    Zookeeper nodes;
    Nimbus (master) node;
    Supervisor nodes.

Nimbus as a point of failure
when Nimbus is down:
    topologies continue to work;
    tasks from failing nodes aren't respawned;
    you can't upload a new topology or rebalance an existing one;
it is impossible to run Nimbus on another node:
    either fix the failed node;
    or create a new one and resubmit all topologies.

Tuple types
spout tuple - emitted from Spouts;
child tuple - emitted from Bolts, based on parent tuple(s) (child or spout ones).
[“the cow jumped over the moon”]
[“the”]
[“cow”]
[“jumped”]
[“over”]
[“the”]
[“moon”]
[“the”, 1]
[“cow”, 1]
[“jumped”, 1]
[“over”, 1]
[“the”, 2]
[“moon”, 1]

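The tuple tree above can be reproduced without Storm by a small plain-Java sketch: one spout tuple (the sentence) is split into word tuples, and each word tuple yields a [word, running count] tuple. `TupleTreeSketch` and `countTuples` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Storm code.
class TupleTreeSketch {
    // Splits the sentence into words and returns one [word, count] tuple per
    // word, where count is the number of occurrences seen so far.
    static List<String> countTuples(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> tuples = new ArrayList<>();
        for (String word : sentence.split("\\s+")) {
            int count = counts.merge(word, 1, Integer::sum);
            tuples.add("[\"" + word + "\", " + count + "]");
        }
        return tuples;
    }

    public static void main(String[] args) {
        for (String tuple : countTuples("the cow jumped over the moon")) {
            System.out.println(tuple);
        }
    }
}
```

The second occurrence of "the" produces ["the", 2], matching the last level of the tree.
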
Reliability API Guarantees
public class QueueConsumer extends BaseRichSpout {
    ...
    // Emit the payload with a message ID so that Storm tracks its tuple tree.
    @Override
    public void nextTuple() {
        Message msg = queueClient.popMessage();
        collector.emit(msg.getPayload(), msg.getId());
    }

    // Called when the whole tuple tree of the message has been processed.
    @Override
    public void ack(Object msgId) {
        queueClient.ack(msgId);
    }

    // Called when the tuple tree has failed or timed out.
    @Override
    public void fail(Object msgId) {
        queueClient.fail(msgId);
    }
    ...
}

Tuple tree tracking
spout tuple creation:
    collector.emit(values, msgId);
child tuple creation:
    collector.emit(parentTuples, values);
tuple end of processing:
    collector.ack(tuple);
tuple failed to process:
    collector.fail(tuple);

Disabling reliability API
globally:
    conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 0);
on topology level (emit spout tuples without a message ID):
    collector.emit(values);
for a single tuple (emit it unanchored, without parent tuples):
    collector.emit(values);

Acker system impl
every tuple is assigned a random 64-bit ID
[diagram: Spout -[1]-> Bolt A -[2]-> Bolt B -[3]-> Bolt C]
messages received by the acker, in order:
    [1] emit; [2] emit; [1] ack; [3] emit; [2] ack; [3] ack.

Acker - simplified algo
tuple tree:
    { Spout tuple ID, set };
message processing:
    [tuple_id] emit: set.add(tuple_id);
    [tuple_id] ack: set.remove(tuple_id);
    if (set.size == 0): send ack to parent spout.

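The simplified algorithm can be sketched in a few lines of plain Java. This is an illustration of the set-based bookkeeping only, not Storm's acker; `SetAcker` is a hypothetical name, and the sketch tracks a single tuple tree.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the simplified algorithm for one tuple tree.
class SetAcker {
    private final Set<Long> pending = new HashSet<>();

    // A tuple with this ID was emitted into the tree.
    void emit(long tupleId) {
        pending.add(tupleId);
    }

    // The tuple was processed. Returns true when no tuples are pending,
    // i.e. when the acker would send an ack to the parent spout.
    boolean ack(long tupleId) {
        pending.remove(tupleId);
        return pending.isEmpty();
    }

    public static void main(String[] args) {
        SetAcker acker = new SetAcker();
        acker.emit(1);
        acker.emit(2);
        System.out.println(acker.ack(1)); // false: tuple 2 is still pending
        acker.emit(3);
        System.out.println(acker.ack(2)); // false: tuple 3 is still pending
        System.out.println(acker.ack(3)); // true: the tree is fully processed
    }
}
```

The drawback is that the set grows with the size of the tuple tree.
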
Acker - real algo
tuple tree:
    { Spout tuple ID, ackVal: int64 };
message processing:
    [tuple_id] emit: ackVal ^= tuple_id;
    [tuple_id] ack: ackVal ^= tuple_id;
    if (ackVal == 0): send ack to parent spout.

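The XOR trick can be sketched the same way. Every ID is XORed into `ackVal` once on emit and once on ack, so the value returns to zero exactly when each emitted tuple has also been acked (barring an unlucky collision of random IDs). `XorAcker` is a hypothetical name and the sketch tracks a single tuple tree; it is not Storm's acker code.

```java
import java.util.Random;

// Hypothetical illustration of the XOR-based algorithm for one tuple tree.
class XorAcker {
    private long ackVal = 0L;

    // A tuple with this ID was emitted into the tree.
    void emit(long tupleId) {
        ackVal ^= tupleId;
    }

    // The tuple was processed. Returns true when ackVal is back to zero,
    // i.e. when the acker would send an ack to the parent spout.
    boolean ack(long tupleId) {
        ackVal ^= tupleId;
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        Random random = new Random();
        long id1 = random.nextLong();
        long id2 = random.nextLong();
        long id3 = random.nextLong();

        XorAcker acker = new XorAcker();
        acker.emit(id1);
        acker.emit(id2);
        boolean afterFirst = acker.ack(id1);
        acker.emit(id3);
        boolean afterSecond = acker.ack(id2);
        boolean afterThird = acker.ack(id3);
        // afterThird is always true; the first two are false unless a random
        // ID happens to be 0.
        System.out.println(afterFirst + " " + afterSecond + " " + afterThird);
    }
}
```

Compared with the set-based version, the state per tuple tree is a single 64-bit value regardless of how many tuples the tree contains.
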
Correctness of the tracking
a bolt fails before sending an ack for a tuple:
    no ack arrives before the timeout, so the spout tuple fails;
the acker fails before acking the tuple tree:
    the same as above;
a spout fails before acking a message:
    the message source should handle the client's death.

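The timeout rule in the first two cases can be sketched as a table of pending spout tuples that is swept periodically. This is a simplified illustration, not Storm's implementation; `PendingTuples` and its methods are hypothetical, time is passed in explicitly to keep the sketch deterministic, and the 30-second timeout is only an example value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of failing spout tuples on timeout.
class PendingTuples {
    private final Map<Object, Long> emittedAt = new LinkedHashMap<>();
    private final long timeoutMs;

    PendingTuples(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Remember when the spout tuple with this message ID was emitted.
    void emitted(Object msgId, long nowMs) {
        emittedAt.put(msgId, nowMs);
    }

    // The tuple tree was fully acked in time; stop tracking it.
    void acked(Object msgId) {
        emittedAt.remove(msgId);
    }

    // Removes and returns the message IDs that were not acked within the
    // timeout; the spout's fail(msgId) would be called for each of them.
    List<Object> expire(long nowMs) {
        List<Object> failed = new ArrayList<>();
        Iterator<Map.Entry<Object, Long>> it = emittedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Object, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                failed.add(entry.getKey());
                it.remove();
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        PendingTuples pending = new PendingTuples(30_000);
        pending.emitted("msg-a", 0);
        pending.emitted("msg-b", 0);
        pending.acked("msg-a");                    // processed in time
        System.out.println(pending.expire(10_000)); // []
        System.out.println(pending.expire(30_000)); // [msg-b]
    }
}
```
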
Reliability API - Conclusion
easy to disable:
    one message - at most one processing;
if used, little overhead and high durability:
    one message - at least one processing;
with some further work (transactions, Trident API):
    one message - exactly one processing.

Transactional approach: design #1
the input provides messages in strict order;
each message is assigned a transaction ID;
if (curr_tx_id > prev_tx_id) commit(result, curr_tx_id).
[diagram: MESSAGE -> Spout -(TUPLE)-> Bolt A -> COMMIT]

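The commit rule can be sketched as a store that remembers the ID of the last committed transaction next to the value. A replayed transaction then has no effect, which turns at-least-once delivery into exactly-once updates. `TxCounterStore` is a hypothetical in-memory stand-in; a real store would have to persist the value and the transaction ID atomically.

```java
// Hypothetical illustration of the "commit only newer transactions" rule.
class TxCounterStore {
    private long count = 0;
    private long lastTxId = 0; // assumption: transaction IDs start at 1

    // Applies the update only if this transaction was not committed before.
    // Returns true if the update was applied, false if it was a replay.
    boolean commit(long delta, long txId) {
        if (txId > lastTxId) {
            count += delta;
            lastTxId = txId;
            return true;
        }
        return false;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TxCounterStore store = new TxCounterStore();
        store.commit(5, 1);
        store.commit(5, 1); // replay of transaction 1 is ignored
        store.commit(3, 2);
        System.out.println(store.count()); // 8
    }
}
```
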
Transactional approach: design #2
the input provides messages in strict order;
each batch of messages is assigned a transaction ID;
if (curr_tx_id > prev_tx_id) commit(result, curr_tx_id).
[diagram: BATCH OF MESSAGES -> Spout -(BATCH OF TUPLES)-> Bolt A -> COMMIT]

Transactional approach: design #3
the same as #2, but each transaction is split into:
    a processing phase;
    a commit phase;
processing phases of different transactions may overlap;
commit phases go in strict order.

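The split into an unordered processing phase and an ordered commit phase can be sketched as a buffer that holds finished batches until all earlier transactions have been committed. This is a single-threaded illustration under the assumption that transaction IDs start at 1 and have no gaps; `OrderedCommitter` is a hypothetical name, not a Storm class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration: processing may finish in any order,
// commits are applied strictly in transaction-ID order.
class OrderedCommitter {
    private final TreeMap<Long, Long> processed = new TreeMap<>();
    private final List<Long> commitOrder = new ArrayList<>();
    private long nextTxId = 1; // assumption: IDs start at 1 without gaps
    private long total = 0;

    // Called when the processing phase of a transaction has finished.
    void finishedProcessing(long txId, long partialResult) {
        processed.put(txId, partialResult);
        // Commit every buffered transaction whose predecessors are committed.
        while (!processed.isEmpty() && processed.firstKey() == nextTxId) {
            total += processed.pollFirstEntry().getValue();
            commitOrder.add(nextTxId);
            nextTxId++;
        }
    }

    long total() {
        return total;
    }

    List<Long> commitOrder() {
        return commitOrder;
    }

    public static void main(String[] args) {
        OrderedCommitter committer = new OrderedCommitter();
        committer.finishedProcessing(2, 10); // buffered, waits for tx 1
        committer.finishedProcessing(1, 5);  // commits tx 1, then tx 2
        System.out.println(committer.commitOrder()); // [1, 2]
        System.out.println(committer.total());       // 15
    }
}
```
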
Trident API: Intro
high-level abstraction for doing realtime computations;
high throughput (millions of messages per second);
stateful stream processing;
low-latency distributed querying;
different processing semantics (including exactly-once);
something like Pig or Cascading.

Trident API: Operations
partition-local operations (w/o network transfer):
    function, filter, partitionAggregate, stateQuery, etc.;
repartitioning operations (grouping);
aggregation operations:
    aggregate, persistentAggregate;
operations on grouped streams;
merges and joins.

Trident API: Demo
TridentTopology topology = new TridentTopology();

// Streaming part: keep a persistent count per word.
TridentState wordCounts = topology
    .newStream("spout1", new FixedBatchSpout())
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);

// DRPC part: sum the stored counts of the words passed in the request.
topology.newDRPCStream("words")
    .each(new Fields("args"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

Config config = new Config();
config.setMaxSpoutPending(100);
cluster.submitTopology("word-count-tplg", config, topology.build());

DRPCClient client = new DRPCClient("drpc.server.host", 3772);
System.out.println(client.execute("words", "cat dog the man"));
System.out.println(client.execute("words", "cat"));
// prints the JSON-encoded result, e.g.: "[[5078]]"

Q & A