HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive
TRANSCRIPT
![Page 1: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/1.jpg)
Building a Stream Processing System for Playable Ads Data at VMFive
Gordon Tai, Data Engineer
@ HadoopCon 2015
![Page 2: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/2.jpg)
TechCrunch Beijing, Aug. 2014 Champion
![Page 3: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/3.jpg)
![Page 4: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/4.jpg)
Gordon Tai 戴資力
Interests: Cluster computing, open-source
Research:
• FedRDD: Federated RDDs for Multicluster Computing
• Criteria-Based Cluster Scheduling on Hadoop YARN
• FedLoop: Looping on Federated MapReduce
• FedMR: Federated MapReduce to Transparently Run Applications
Hobbies: Cooking and Basketball
![Page 5: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/5.jpg)
How did we start from here …
![Page 6: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/6.jpg)
Try Apps Before You Install
![Page 7: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/7.jpg)
Playable Ads Data
• Allows for interactive ads
• Play without install / download
• Preference “intensity” info
• Ordered event stream
• Real-time data
![Page 8: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/8.jpg)
Playable Ads Data
[Figure: discrete events (E) plotted along a time axis (t), in five rows]
![Page 9: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/9.jpg)
Playable Ads Data
[Figure: discrete events (E) plotted along a time axis (t), in five rows; the rows are labeled "Sessions"]
![Page 10: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/10.jpg)
Playable Ads Data
[Figure: discrete events (E) plotted along a time axis (t), in five rows]
![Page 11: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/11.jpg)
• Can we inspect the situation of a single session?
• Can we query on how every AdPlay session ended?
• Can we query on the platform types of error sessions?
• What about by device? By time?
![Page 12: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/12.jpg)
• Can we inspect the situation of a single session?
• Can we query on how every AdPlay session ended?
• Can we query on the platform types of error sessions?
• What about by device? By time?
![Page 13: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/13.jpg)
1. Merge events into sessions
Requirements
[WHEN] Need to query merged sessions immediately.
[WHAT] Output sessions to a separate dataset.
[HOW] Can’t interfere with current flow.
![Page 14: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/14.jpg)
Storm vs. Spark Streaming
• Benchmark:*
  - Storm: 10,000 records / sec / node
  - Spark Streaming: 400,000 records / sec / node
• Storm => not really that popular
  - version 0.9.x
  - Spark was released later and is already in 1.5.x
• Isn’t Spark Streaming the obvious choice?
* http://www.cs.duke.edu/~kmoses/cps516/dstream.html
![Page 15: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/15.jpg)
Storm vs. Spark Streaming
• Storm: essentially a stream processing framework. Can also do micro-batch processing (with Trident API).
• Spark: essentially a batch processing framework that does stream processing using micro-batch.
[Figure: a spectrum from Batch to Stream with Micro-batch in between. In micro-batch, groups of events (EEEEEEE…) are processed together; in streaming, each event (E) is processed as it arrives]
![Page 16: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/16.jpg)
Storm vs. Spark Streaming
AdPlay’s use case:
stream processing framework: real-time merging before DB landing
[Figure: interleaved events (E) from several sessions, some still missing (?), are merged into ordered per-session sequences (1234567, 123456, 1234); each sequence closes with End before database landing]
Why Storm?
• Streaming fits our use case better.
• Programming model provides general primitives to fit our application logic.
• …
![Page 17: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/17.jpg)
Apache Storm Quick Recap
A processing framework for streams of data. React to data as it happens.
Core concepts - Data Model
Tuple
• Immutable set of K-V pairs
• “Events”
Stream
• Unbounded sequence of tuples
[Figure: a stream drawn as a sequence of tuples, T T T T T T T T T …]
![Page 18: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/18.jpg)
Apache Storm Quick Recap
A processing framework for streams of data. React to data as it happens.
Core concepts - Programming Model
Spout
• Source of data streams (tuples)
• All kinds of sources …
Bolt
• Consumes streams and potentially produces new streams
Topology
• DAG (directed acyclic graph) formed by wiring spouts and bolts
[Figure: example topology in which two spouts feed four bolts]
![Page 19: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/19.jpg)
Apache Storm Quick Recap
Parallelism
• Spout: task #1 … task #N
• Bolt: task #1 … task #M
Grouping
• Fields Grouping: same field value goes to same task
• Shuffle Grouping: random assignment
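The two groupings can be illustrated with a small routing sketch. This is not Storm's implementation: the CRC32 hash, the function names, and the sample tuples are assumptions chosen only to show the routing property (same field value, same task).

```python
import random
import zlib


def fields_grouping(field_value: str, num_tasks: int) -> int:
    """Pick a task index from a stable hash of the grouping field."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Pick a task index at random, ignoring the tuple's contents."""
    return rng.randrange(num_tasks)


NUM_TASKS = 4
# Hypothetical (sid, event) tuples.
tuples = [("s1", "start"), ("s2", "start"), ("s1", "tap"), ("s1", "end")]

# Fields grouping on sid: every tuple of session "s1" lands on one task.
s1_tasks = {fields_grouping(sid, NUM_TASKS) for sid, _ in tuples if sid == "s1"}

# Shuffle grouping: the task is chosen without looking at sid.
rng = random.Random(0)
shuffle_routes = [shuffle_grouping(NUM_TASKS, rng) for _ in tuples]
```

Session merging depends on the first property: all events of one sid must reach the same merger task.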
![Page 20: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/20.jpg)
[Figure: topology. Discrete Events Collection (capped) -> stream by tail query -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging.]
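A minimal sketch of the merging step, assuming each event carries a hypothetical "type" field and that a session closes on an event of type "end". The deck does not show the real event schema or the bolt code, so the names below are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, object]


def merge_sessions(
    stream: Iterable[Tuple[str, Event]]
) -> Iterator[Tuple[str, List[Event]]]:
    """Buffer events per sid and yield (sid, session) when an end event arrives."""
    pending: Dict[str, List[Event]] = defaultdict(list)
    for sid, event in stream:
        pending[sid].append(event)
        if event["type"] == "end":  # hypothetical end-of-session marker
            yield sid, pending.pop(sid)


# Events of two sessions arrive interleaved, each session in its own order.
events = [
    ("a", {"seq": 1, "type": "start"}),
    ("b", {"seq": 1, "type": "start"}),
    ("a", {"seq": 2, "type": "tap"}),
    ("b", {"seq": 2, "type": "end"}),
    ("a", {"seq": 3, "type": "end"}),
]

merged = list(merge_sessions(events))
```

Here `merged` holds session "b" first (it ended first), then session "a" with its three events. The in-memory `pending` map is the bolt state that the reliability section later moves to external storage.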
![Page 21: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/21.jpg)
2. Reliability
• Cluster computing frameworks are prone to failures.
• Two main considerations for reliability:
  - Event process guarantee
  - Failure of stateful computation
![Page 22: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/22.jpg)
Event process guarantee
3 types of guarantees
• At most once [0,1]:
Each event is delivered only once, regardless of success/failure; a failed event is lost.
• At least once [1 … n]:
Each event can be redelivered multiple times to ensure success.
• Exactly once [1]:
Events are never lost and are never redelivered. Perfect delivery.
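The difference between the first two guarantees can be shown with a toy simulation. The processor, its failure behaviour, and all names are assumptions for illustration; no real framework is modelled.

```python
from typing import Callable, List


def at_most_once(events: List[str], process: Callable[[str], bool]) -> None:
    """Hand each event to the processor a single time; failures are dropped."""
    for event in events:
        process(event)


def at_least_once(
    events: List[str], process: Callable[[str], bool], max_tries: int = 5
) -> None:
    """Redeliver each event until the processor reports success."""
    for event in events:
        for _ in range(max_tries):
            if process(event):
                break


def make_processor(
    seen: List[str], flaky_event: str, effect_before_failure: bool
) -> Callable[[str], bool]:
    """Build a processor that fails once on `flaky_event`.

    If `effect_before_failure` is true, the side effect (appending to `seen`)
    happens before the failure is reported, as when an ack is lost.
    """
    failed_once = False

    def process(event: str) -> bool:
        nonlocal failed_once
        if event == flaky_event and not failed_once:
            failed_once = True
            if effect_before_failure:
                seen.append(event)
            return False
        seen.append(event)
        return True

    return process


events = ["e1", "e2", "e3"]

lost: List[str] = []
at_most_once(events, make_processor(lost, "e2", effect_before_failure=False))

retried: List[str] = []
at_least_once(events, make_processor(retried, "e2", effect_before_failure=False))

duplicated: List[str] = []
at_least_once(events, make_processor(duplicated, "e2", effect_before_failure=True))
```

At most once loses "e2"; at least once recovers it, but produces a duplicate when the failure is reported after the side effect has already happened.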
![Page 23: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/23.jpg)
Storm’s built-in fault tolerance?
Tuple Acknowledgment
[Figure: topology of two spouts and four bolts; a spout emits tuple T, anchored with a tuple ID]
![Page 24: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/24.jpg)
Storm’s built-in fault tolerance?
Tuple Acknowledgment
[Figure: topology of two spouts and four bolts; tuple T is acknowledged with ack(tuple)]
![Page 25: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/25.jpg)
Storm’s built-in fault tolerance?
Tuple Acknowledgment
[Figure: topology of two spouts and four bolts; tuple T is failed with fail(tuple)]
Still dependent on the data source for processing guarantees.
![Page 26: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/26.jpg)
Event process guarantee
3 types of event sources
• Unreliable:
No means to replay a previously-received message.
• Reliable:
Can somehow replay a message if processing fails at any point.
• Durable:
Can replay any message or set of messages given selection criteria.
![Page 27: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/27.jpg)
[Figure: topology. Discrete Events Collection (capped), marked "unreliable" -> stream by tail query -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging.]
![Page 28: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/28.jpg)
Kafka Quick Overview
• Publish-subscribe distributed messaging queue.
• Configurable message retention.
• Highly fault tolerant: tolerates N-1 node failures with a replication factor of N.
• High throughput: Producer - 2M msgs / sec. Consumer - 100M/sec.*
* http://kafka.apache.org/07/performance.html
![Page 29: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/29.jpg)
[Figure: topology. Kafka topics (discrete-events-topic: E E E E …) -> consume topic -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1, task #2, … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1, task #2, … task #M) -> Merged Sessions Collection. DB landing after merging.]
![Page 30: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/30.jpg)
Failure of stateful computation
• Bolts hold “state” information.
  - In-memory Map for merging AdPlay sessions
• Although Kafka is a durable data source, deciding the selection criteria on failure can still be very hard.
• Solution: store all stateful info in an external in-memory storage.
![Page 31: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/31.jpg)
[Figure: topology. Kafka topics (discrete-events-topic: E E E E …) -> Events Puller Spout -> (sid, event) -> Session Merger Bolt (task #1 … task #N) -> (sid, session) -> MongoDB Insert Bolt (task #1 … task #M) -> Merged Sessions Collection. The Session Merger Bolt keeps its state in an External Object State Storage (key: sid, value: sorted set of events), writing with ZADD sid time event and reading with ZRANGE sid 0 -1. The storage is also used for online monitoring.]
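The ZADD / ZRANGE pattern from the diagram can be sketched with an in-memory stand-in for a sorted-set store. The class below is an assumption written for illustration and only mimics the two commands named on the slide; a real deployment would issue them against the external store instead.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


class SortedSetStore:
    """In-memory stand-in for the sorted-set behaviour used by the merger."""

    def __init__(self) -> None:
        self._sets: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def zadd(self, key: str, score: float, member: str) -> None:
        """Mimic ZADD key score member."""
        entries = self._sets[key]
        # A sorted set holds each member once; re-adding replaces its score.
        entries[:] = [(s, m) for s, m in entries if m != member]
        bisect.insort(entries, (score, member))

    def zrange_all(self, key: str) -> List[str]:
        """Mimic ZRANGE key 0 -1: all members in ascending score order."""
        return [member for _, member in self._sets.get(key, [])]


store = SortedSetStore()
# ZADD sid time event, with events arriving out of time order.
store.zadd("sid-1", 3.0, "end")
store.zadd("sid-1", 1.0, "start")
store.zadd("sid-1", 2.0, "tap")
store.zadd("sid-1", 2.0, "tap")  # a redelivered event is not stored twice

session = store.zrange_all("sid-1")
```

Because the state is keyed by sid and ordered by event time, a restarted bolt task can rebuild a session from the store, and redelivery under at-least-once processing does not duplicate events.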
![Page 32: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/32.jpg)
3. Security
• Exposing Kafka to the public network can be dangerous.
• Pull input to the internal pipeline.
  - Separate read / write permissions.
  - Ingest only recognizable data.
![Page 33: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/33.jpg)
AWS Kinesis
• PaaS version of Apache Kafka.
• Fixed data retention for 24 hours.
• Producers and consumers using AWS SDK / AWS KCL.
• Fully managed.
• Usage unit: Kinesis Stream
  - Hourly charge per shard.
  - Also charged per 1M PUTs.
![Page 34: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/34.jpg)
[Figure: pipeline with Kinesis in front. A producer user writes events; the Kafka Topic Dispatcher (Kinesis consumer), running as a consumer user, dispatches them to discrete events topics A, B, C, D, … -> Events Puller Spout -> Session Merger Bolt (External Object State Storage, key: sid, value: sorted set of events) -> MongoDB Insert Bolt]
![Page 35: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/35.jpg)
4. Schema Adaptability
• We soon realized that our event log schema changed very frequently.
• Problem: log schema was hardcoded into our stream processing logic.
  - required a separate topology for every different schema
• Stupid, I know ;)
• Solution: need a central serialization system to manage schema.
  - replace JSON after event logs enter stream pipeline.
![Page 36: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/36.jpg)
Apache Avro
• A data serialization system.
• Rich data structures.
  - String
  - Numbers (int, long, float)
  - Bytes
  - Boolean
  - null
  - Nested objects
  - …
• When Avro data is read, the schema used to write it is always present.
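A hand-rolled sketch of why a schema with defaults helps when the event log changes. It does not use the Avro library; the schema, its field names, and the `resolve` helper are hypothetical and only mirror the shape of an .avsc record definition and the rule that a reader fills a missing field from its default.

```python
import json
from typing import Any, Dict

# Hypothetical reader schema in the shape of an .avsc record definition.
EVENT_SCHEMA_V2 = json.loads("""
{
  "type": "record",
  "name": "AdPlayEvent",
  "fields": [
    {"name": "sid", "type": "string"},
    {"name": "time", "type": "long"},
    {"name": "platform", "type": "string", "default": "unknown"}
  ]
}
""")


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a decoded record onto the reader's fields, filling defaults."""
    resolved: Dict[str, Any] = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved


# An event written before the "platform" field existed.
old_event = {"sid": "s1", "time": 1442000000}
upgraded = resolve(old_event, EVENT_SCHEMA_V2)
```

The processing code only depends on the schema it is handed, so a new field does not require a new topology.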
![Page 37: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/37.jpg)
Apache Avro
[Figure: side by side, the .avsc schema definition and the actual event JSON]
![Page 38: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/38.jpg)
[Figure: Kafka Topic Dispatcher -> discrete events topics A, B, C, … -> one topology per topic (topology A, topology B, topology C, …), all using the External object state storage]
![Page 39: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/39.jpg)
[Figure: Kafka Topic Dispatcher converts each event E to Avro (E.avro) using schema_1.avsc, schema_2.avsc, … -> a single discrete events topic -> a single session merge topology, backed by the External Object State Storage]
![Page 40: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/40.jpg)
[Figure: the schema is updated (schema_1_updated.avsc, schema_2.avsc, …). Kafka Topic Dispatcher -> discrete events topic -> single session merge topology, backed by the External Object State Storage; alongside it, a backup events topic feeds a backup session merge topology]
![Page 41: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/41.jpg)
Kappa Architecture*
• Everything is a stream.
• Backup your data into a durable buffer (Kafka).
• Resubmit an updated job that consumes the backup.
* http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
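The replay idea can be sketched in a few lines. The list below stands in for a retained Kafka topic, and the two job functions are hypothetical versions of the processing logic; nothing here talks to a real broker.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# The retained log stands in for a durable buffer that can be re-read.
log: List[Record] = [
    {"sid": "s1", "type": "start"},
    {"sid": "s1", "type": "end"},
    {"sid": "s2", "type": "start"},
]


def run_job(job: Callable[[Record], Record], from_offset: int = 0) -> List[Record]:
    """Consume the log from an offset and build a fresh output table."""
    return [job(record) for record in log[from_offset:]]


def job_v1(record: Record) -> Record:
    """Original logic: keep only the session id."""
    return {"sid": record["sid"]}


def job_v2(record: Record) -> Record:
    """Updated logic: also keep the event type."""
    return {"sid": record["sid"], "type": record["type"]}


table_v1 = run_job(job_v1)
# After a logic or schema change, resubmit the updated job from offset 0.
table_v2 = run_job(job_v2)
```

The old output stays in place until the replayed output has caught up, after which readers can switch to the new table.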
![Page 42: HadoopCon 2015: Building a Stream Processing System for Playable Ads Data at VMFive](https://reader031.vdocument.in/reader031/viewer/2022030312/58ee0af61a28abd72e8b466f/html5/thumbnails/42.jpg)
So, what did we go through?
• Building a stream processing system for ordered event streams.
• Reliability: Kafka + Redis
• Security: AWS Kinesis
• Schema Adaptability: Avro + Kappa architecture
Every use case is unique. Look closely at your needs ;)