![Page 1: Streaming Big Data Analysis - Meetupfiles.meetup.com/16951782/20140122 - Streaming Big Data Analysis.pdf · 1/22/2014 · Scalable: Runs on your laptop, runs in a datacenter](https://reader034.vdocument.in/reader034/viewer/2022052000/601209043cb9520f042f1602/html5/thumbnails/1.jpg)
Streaming Big Data Analysis: Apache Storm and SensorStorm
Wilco Wijbrandi
About me
Studied Computing Science BSc and MSc here in Groningen
Working at TNO since 2012, department Service Enabling & Management
Working in the field of Smart Grids and Cloud Computing
Tools of preference: Java and OSGi
Batch processing
Pros
Easy to implement
Redo processing
Cons
Results are delayed
Requires some serious disk space
MapReduce
Stream Processing
Useful when
There’s just too much data to store
Near real-time results
Downsides
Data is gone after processing
Harder to implement
Apache Storm
Process unbounded streams of data
Scalable
Guarantees no data loss (at least, no tuples are lost)
Robust
Fault-tolerant
Programming language agnostic (sort of)
Apache Storm
What I like about Storm
Storm is hot! That means an active community!
Robustness and reliability are taken care of
Nice WebUI: metrics and logging
Scalable: Runs on your laptop, runs in a datacenter
Later on I’ll tell you what I don’t like…
Robustness
If Nimbus fails, the topology can continue, but nothing can change (that also means no redistributing work if a supervisor fails)
Topology
Data is transmitted as Tuples
Untyped key-value pairs
Predefined set of keys
Spouts pull in data and produce tuples
Usually there is a message queue in front of a spout
Bolts process tuples and produce new tuples (or do something else)
Topology cannot be changed once started
A Storm cluster can run multiple topologies at the same time
Parallelism
Spouts and bolts have multiple instances (Tasks)
Tasks are automatically distributed over workers
Stream Groupings
Define which tuples go to which bolt instance
Most common:
shuffleGrouping: Spread tuples randomly and evenly over instances
fieldsGrouping: Tuples with the same field value always end up at the same instance
allGrouping: Broadcast every tuple to all instances
[Diagram: Spout emitting to Bolt A, instances 0 and 1]
instance(tuple, field) = hash(tuple[field]) % NrOfInstances
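The formula above can be tried out in plain Java. This is only a sketch of the idea: `FieldsGroupingSketch` and `instanceFor` are made-up names, and Storm's own hashing may differ in detail.

```java
import java.util.Objects;

// Hypothetical helper for illustration; not part of Storm's API.
final class FieldsGroupingSketch {

    // Maps a field value to an instance index in [0, nrOfInstances).
    static int instanceFor(Object fieldValue, int nrOfInstances) {
        // floorMod keeps the index non-negative, even for negative hash codes.
        return Math.floorMod(Objects.hashCode(fieldValue), nrOfInstances);
    }

    public static void main(String[] args) {
        // The same word is always routed to the same instance.
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.println(word + " -> instance " + instanceFor(word, 4));
        }
    }
}
```

A plain `%` would return a negative index for values with a negative hash code, which is why the sketch uses `Math.floorMod`.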
Reliability
Fail-fast
When a supervisor/worker crashes, tasks are reassigned
Tuples not acknowledged are resubmitted
At-least-once semantics
Spout is responsible for resubmitting
State of bolt instances gets lost when a node crashes or tasks are moved
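To make "at-least-once" concrete, here is a small simulation of a spout that keeps a message until it is acknowledged and emits it again on failure. It is a sketch under simplifying assumptions: `ReplayingSpoutSketch` is a made-up class that does not use Storm's API, and a real spout would typically replay from a message queue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a reliable spout; not Storm's API.
final class ReplayingSpoutSketch {
    private final Map<Integer, String> pending = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();
    private int nextId = 0;

    // Emits a message and remembers it until it is acked.
    int emit(String message) {
        int id = nextId++;
        pending.put(id, message);
        emitted.add(message);
        return id;
    }

    // The tuple was fully processed: forget it.
    void ack(int id) {
        pending.remove(id);
    }

    // Processing failed: emit the same message again under a new id.
    int fail(int id) {
        String message = pending.remove(id);
        if (message == null) {
            throw new IllegalArgumentException("unknown or already acked id: " + id);
        }
        return emit(message);
    }

    List<String> emitted() {
        return emitted;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

After `emit("A")`, a `fail` and then an `ack`, the message "A" has been emitted twice, so downstream bolts have to tolerate duplicates.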
Chaining: Success!
[Diagram: spout S and bolts B1, B2, B3; tuples A, B, C; Ack A!, Ack B!, Ack C!]
Chaining: Fail
[Diagram: spout S and bolts B1, B2, B3; tuples A, B, C; Fail A!, Ack B!, then Resubmit A]
B1 receives A twice
B2 receives B twice
Throttling
[Diagram: spout S and bolts B1, B2, B3 with tuples A to O in flight]
In practice this happens in parallel
Spouts pull in data; this is the only place where Storm can throttle
Number of ‘pending spout-tuples’ is limited
Configuration parameter TOPOLOGY_MAX_SPOUT_PENDING (topology.max.spout.pending)
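The effect of limiting pending spout-tuples can be shown with a tiny counter. This is an illustrative sketch only: `ThrottledSpoutSketch` is a made-up class, and in a real topology the limit is set through the Storm configuration (for example with `conf.setMaxSpoutPending(...)`), not in your own code.

```java
// Hypothetical model of the pending-tuple limit; not Storm's implementation.
final class ThrottledSpoutSketch {
    private final int maxPending;
    private int pending = 0;

    ThrottledSpoutSketch(int maxPending) {
        if (maxPending <= 0) {
            throw new IllegalArgumentException("maxPending must be positive");
        }
        this.maxPending = maxPending;
    }

    // Returns true if a new tuple may be emitted; false means the spout has to wait.
    boolean tryEmit() {
        if (pending >= maxPending) {
            return false;
        }
        pending++;
        return true;
    }

    // Called when a pending tuple is acked or failed, which frees one slot.
    void completed() {
        if (pending > 0) {
            pending--;
        }
    }

    int pending() {
        return pending;
    }
}
```

With a limit of 2, a third `tryEmit()` is refused until one of the first two tuples is acked or failed.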
public class RandomSentenceSpout extends BaseRichSpout {
    // Used by nextTuple(); expected to be assigned in open()
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        ...
    }

    // Called repeatedly by Storm to pull in new data
    @Override
    public void nextTuple() {
        String sentence = ...
        this.collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) {
    }

    @Override
    public void fail(Object id) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
class WordCount extends BaseBasicBolt {
    // In-memory state; it is lost when the node crashes or the task is moved
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // The last argument of setSpout/setBolt is the parallelism hint
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    // args[0] is the name of the topology
    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
}
Spouts and bolts are copied to worker nodes by serializing them
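One practical consequence of this, shown as a plain Java sketch: fields that should not (or cannot) travel with the serialized copy are best marked `transient` and created again on the worker, in the same way a bolt would do it in its prepare step. `SerializedComponentSketch` and its methods are made up for this illustration and do not use Storm's API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical component that is copied by Java serialization.
final class SerializedComponentSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;                      // travels with the copy
    private transient Map<String, Integer> counts;  // skipped; rebuilt in prepare()

    SerializedComponentSketch(String name) {
        this.name = name;
    }

    // Plays the role of a prepare step that runs on the worker after copying.
    void prepare() {
        counts = new HashMap<>();
    }

    int count(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    String name() {
        return name;
    }

    boolean isPrepared() {
        return counts != null;
    }

    // Serializes and deserializes the component, like shipping it to a worker.
    static SerializedComponentSketch copyViaSerialization(SerializedComponentSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializedComponentSketch) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The copy keeps its `name`, but its `counts` map is `null` until `prepare()` runs.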
Apache Storm
What I don’t like about Storm
Very static: Topologies cannot be altered while running
No real topology management, no run-time configuration
Use tuples because strong typing is annoying, but I do have to declare output fields first?
State of bolt is not managed (unless you use Trident)
No dependency management
Other interesting design choices
Ugly Java Clojure hybrid (but that won’t bother most users)
SensorStorm
(suggestions for a logo are welcome)
Why SensorStorm?
Dealing with sensor data is a bit different from what vanilla Storm or Trident are built for
When dealing with sensor data…
We want measurements to be in the right order
Reliability is nice, but we can live with losing a measurement once in a while
A lot of operations are reusable
What is SensorStorm?
SensorStorm is a library on top of Apache Storm
SensorStorm is an open source project with an Apache 2 license
Other research topics related to Storm
Elastic Storm
Automatically scale Storm clusters up and down -> paper
Make it easier to manage, configure and deploy topologies
Run multiple instances of the same topology
Central configuration (ConfigAdmin for Storm)
Particles are the new Tuples!
Particles are a special type of tuple
Particles always have a timestamp
Particles are strongly typed
Particles are automatically mapped to Storm Tuples, so we don’t break compatibility
You can also customize the mapping
We take care of declareOutputFields
Time travel is complicated (ever seen Primer?)
For debugging and demo purposes…
We want to be able to process live data
We want to be able to replay a historic dataset
We want to run a historic dataset as fast as possible
So we need to be able to use the real clock, but also a fake, controlled clock…
And that’s difficult in a distributed system
Analysis of the measurement
We also want to analyse the frequency at which we receive measurements
Processing of measurements does not always take the same amount of time
What if there is no measurement at all?
We don’t want to use the time of processing, but the time when the measurement was taken
Time injection
[Diagram: spout S and bolts B1, B2, B3; the stream contains Measurement Particles and Time Particles]
Every particle carries a timestamp
Time Particles are injected at a fixed (configurable) interval
Time Particles trigger different processing logic than measurement Particles
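A rough sketch of the idea, assuming the clock is driven by the measurement timestamps themselves (which also works when replaying a historic dataset as fast as possible). `TimeInjectionSketch` is a made-up class and not SensorStorm's implementation; measurements are printed as `M@<timestamp>` and time particles as `T@<timestamp>`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of time injection; not SensorStorm's actual code.
final class TimeInjectionSketch {

    // Interleaves time particles at a fixed interval with ordered measurement timestamps.
    static List<String> inject(long[] measurementTimestamps, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        List<String> stream = new ArrayList<>();
        if (measurementTimestamps.length == 0) {
            return stream;
        }
        // Assumption: the first tick is one interval after the first measurement.
        long nextTick = measurementTimestamps[0] + intervalMs;
        for (long timestamp : measurementTimestamps) {
            // Emit every tick that is due before this measurement is passed on.
            while (nextTick <= timestamp) {
                stream.add("T@" + nextTick);
                nextTick += intervalMs;
            }
            stream.add("M@" + timestamp);
        }
        return stream;
    }
}
```

For measurements at 0, 400 and 2500 ms and an interval of 1000 ms, the stream becomes `M@0, M@400, T@1000, T@2000, M@2500`: bolts still see time pass during the gap without measurements.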
DataParticles and MetaParticles
[Diagram: spout S and bolts B1, B2, B3; the stream contains DataParticles and MetaParticles]
MetaParticles don’t carry measurements, but are injected into the stream to trigger certain behaviour
TimerMetaParticle carries time and triggers scheduled tasks
ShutdownMetaParticle signals bolts to store their state (e.g. before reassigning tasks)
What does a Particle look like?
public class SensorParticle extends AbstractDataParticle {

    public SensorParticle() {
    }

    @TupleField
    private String sensorId;

    @TupleField
    private double measurement;

    public String getSensorId() {
        return sensorId;
    }

    public double getMeasurement() {
        return measurement;
    }
    ...
}
Parallelism
[Diagram: Spout (instances 0 and 1), Bolt A (instances 0 and 1), Bolt B (instances 0 and 1)]
Specialized stream groupings
MetaParticles are broadcast between instances
Duplicate MetaParticles are filtered out
DataParticles have their own grouping strategy (e.g. fields grouping)
Before each bolt, particles are put in order in the SyncBuffer
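What such a buffer does can be sketched with a priority queue. This is not the real SyncBuffer: `SyncBufferSketch` and its `Particle` class are made up, and the sketch assumes that two MetaParticles with the same timestamp are copies of the same broadcast.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical illustration of ordering and de-duplication; not SensorStorm's SyncBuffer.
final class SyncBufferSketch {

    static final class Particle {
        final long timestamp;
        final boolean meta;
        final String label;

        Particle(long timestamp, boolean meta, String label) {
            this.timestamp = timestamp;
            this.meta = meta;
            this.label = label;
        }
    }

    // Particles are released in timestamp order, whatever order they arrived in.
    private final PriorityQueue<Particle> queue =
            new PriorityQueue<>(Comparator.comparingLong((Particle p) -> p.timestamp));

    // Assumption: at most one MetaParticle per timestamp. A real buffer would prune this set.
    private final Set<Long> seenMetaTimestamps = new HashSet<>();

    void offer(Particle particle) {
        if (particle.meta && !seenMetaTimestamps.add(particle.timestamp)) {
            return; // duplicate copy of a broadcast MetaParticle
        }
        queue.add(particle);
    }

    List<Particle> drainInOrder() {
        List<Particle> ordered = new ArrayList<>();
        while (!queue.isEmpty()) {
            ordered.add(queue.poll());
        }
        return ordered;
    }
}
```

A real buffer also has to decide how long to wait for late particles before releasing anything; the sketch leaves that out.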
SensorStormSpout and SensorStormBolt
Generic Spout and Bolt in which you can run your processing logic
[Diagram: a SensorStormSpout containing a Fetcher, followed by SensorStormBolts that each contain an Operation (one of them also a Batcher); label: Injects Time Particles]
Fetcher
@FetcherDeclaration(outputs = SensorParticle.class)
public class BlockFetcher implements Fetcher {

    public void prepare(Map stormConfig,
            ExternalStormConfiguration externalConfig,
            TopologyContext context) {
    }

    public void activate() {
        ...
    }

    public void deactivate() {
        ...
    }

    @Override
    public DataParticle fetchParticle() {
        return new SensorParticle();
    }
}
SingleParticleOperation
Process one particle at a time
Only DataParticles are offered
Processing a DataParticle can result in zero, one or many new Particles
@OperationDeclaration(inputs = SensorParticle.class, outputs = SensorParticle.class)
public class ExampleOperation implements SingleParticleOperation {

    public void init(...) throws OperationException {
    }

    public List<? extends DataParticle> execute(DataParticle inputParticle)
            throws OperationException {
        System.out.println("Received Particle: " + inputParticle);
        // The return type is a List, so wrap the particle (needs java.util.Collections)
        return Collections.singletonList(inputParticle);
    }
}
Batcher
Separate logic for creating Batches of particles
Particle can be part of multiple Batches
For example, creating windows
[Diagram: the particle stream 1 to 8 shown three times, with Batch 1, Batch 2 and Batch 3 each covering a different window]
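Overlapping windows like these can be sketched as follows. The window size and step are assumptions chosen for illustration, and `WindowBatcherSketch` is a made-up class rather than SensorStorm's Batcher interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sliding-window batcher; not SensorStorm's Batcher API.
final class WindowBatcherSketch {

    // Cuts the stream into windows of `size` particles, starting a new window every `step` particles.
    static <T> List<List<T>> slidingWindows(List<T> particles, int size, int step) {
        if (size <= 0 || step <= 0) {
            throw new IllegalArgumentException("size and step must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start + size <= particles.size(); start += step) {
            // Copy the window so each batch is independent of the input list.
            batches.add(new ArrayList<>(particles.subList(start, start + size)));
        }
        return batches;
    }
}
```

With particles 1 to 8, a size of 4 and a step of 2, this gives three batches: 1 to 4, 3 to 6 and 5 to 8. Particles 3 to 6 each end up in two batches.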
ParticleBatchOperation
Same as the SingleParticleOperation, but now we process a DataParticleBatch
For example:
Average
Min
Max
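As a sketch of what such a batch operation computes, here are the three aggregates over the measurements of one batch, using only the Java standard library. `BatchStatsSketch` is a made-up name, and the batch is represented as a plain `double[]` instead of a DataParticleBatch.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

// Hypothetical batch aggregation; the real operation would receive a DataParticleBatch.
final class BatchStatsSketch {

    // Computes count, min, max and average of the measurements in one batch.
    static DoubleSummaryStatistics summarize(double[] measurements) {
        return DoubleStream.of(measurements).summaryStatistics();
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics stats = summarize(new double[] {1.0, 2.0, 6.0});
        System.out.println("avg=" + stats.getAverage()
                + " min=" + stats.getMin()
                + " max=" + stats.getMax());
    }
}
```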
DIY MetaParticles
Extend the SensorStormSpout so it sends your MetaParticle
Create a MetaParticleHandler
Create an interface so Operations can interact with your MetaParticleHandler
[Diagram: a SensorStormBolt containing an Operation and a MetaParticleHandler; labels: DataParticles, MetaParticles, Register]
FieldOperations
Create an Operation for every value of a field
For example: process every sensor separately
You probably want to use a fieldsGrouping before this Bolt
[Diagram: a SensorStormBolt containing one Operation each for sensors S1, S2, S3 and S4]
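The idea of one Operation per field value can be sketched with a map from sensor id to operation instance. `FieldOperationsSketch` and `MaxOperation` are made-up names for this illustration; the actual bolt in SensorStorm may work differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of per-field operations; not SensorStorm's API.
final class FieldOperationsSketch {

    // Example operation with its own state: the running maximum of one sensor.
    static final class MaxOperation {
        private double max = Double.NEGATIVE_INFINITY;

        double execute(double measurement) {
            max = Math.max(max, measurement);
            return max;
        }
    }

    private final Map<String, MaxOperation> operations = new HashMap<>();

    // Routes the measurement to the operation of its sensor, creating it on first use.
    double process(String sensorId, double measurement) {
        return operations
                .computeIfAbsent(sensorId, id -> new MaxOperation())
                .execute(measurement);
    }

    int operationCount() {
        return operations.size();
    }
}
```

This only gives correct per-sensor results if all measurements of a sensor arrive at the same bolt instance, which is why a fields grouping on the sensor id fits in front of it.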
SensorStorm SingleOperationBolt and FieldOperationBolt
[Diagram: both bolts receive Particles through the SensorStorm Stream Grouping and a Sync Buffer]
SensorStorm Stream Grouping: broadcasts MetaParticles
Sync Buffer: synchronizes particles from different sources and filters out duplicate MetaParticles
SingleOperationBolt: a single container with a Batcher and an Operation
FieldOperationBolt: multiple field containers, each with a Batcher and an Operation
And we’re open source now!
So, if SensorStorm is for you…
Give it a try
Let us know what you think
Help us improve it
Together we can do more!
https://github.com/sensorstorm