Data Stream Processing
Can we finally forget the batches?
Who am I?
Dominik Wagenknecht
Senior Technology Architect
Accenture Vienna / Austria
Dealing with data in many industries
Data needs to move! A → B
To get data where it’s needed
[Diagram] Monolith with one gargantuan DB: teams colliding, low agility, JOIN everything.
Versus services with a DB per team: high agility, data as needed.
To get smarter
[Diagram] Systems feeding a Data Warehouse that drives Reporting, Analytics, Insight.
Every system does its job; the warehouse tells you what to do better.
In short: moving data = integrating services.
Let’s do Batch! (oldie but goldie)
So how does batch work?
[Diagram] Sources → Extract → Processing (Merge, Transform, Enrich) → Load → Target. Runs every night.
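To make the nightly flow concrete, here is a minimal batch ETL sketch in Java: extract full dumps from two hypothetical source files, merge and enrich them in memory, and load the result for the target. File names and the CSV layout are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;

// A nightly batch in miniature: Extract -> Merge/Transform/Enrich -> Load.
// customers.csv and orders.csv are hypothetical source dumps.
public class NightlyBatch {
    public static void main(String[] args) throws IOException {
        // Extract: full dumps pulled from each source system
        List<String> customers = Files.readAllLines(Paths.get("customers.csv"));
        List<String> orders    = Files.readAllLines(Paths.get("orders.csv"));

        // Transform: index customers by id ("id,name" per line)
        Map<String, String> nameById = customers.stream()
            .map(line -> line.split(","))
            .collect(Collectors.toMap(f -> f[0], f -> f[1]));

        // Enrich: attach the customer name to each order
        // ("orderId,customerId,amount" per line)
        List<String> enriched = orders.stream()
            .map(line -> line.split(","))
            .map(f -> String.join(",", f[0], nameById.getOrDefault(f[1], "?"), f[2]))
            .collect(Collectors.toList());

        // Load: write the merged result for the target system
        Files.write(Paths.get("target_orders.csv"), enriched);
    }
}
```

Everything hinges on the full dumps being consistent at extraction time, which is exactly what makes the nightly window so rigid.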
Speed it up?!? Delta-batches…
every hour → ½ hour → ¼ hour
[Diagram] Source → ETL → Target, the same pipeline run ever more frequently.
Enjoy the fun when batches overlap.
Batch is not enough
• It’s basically always too late*
• Bumpy load patterns >> shockwaves in the system
• Mostly in the night
• Testing becomes painful
*exceptions: deliberately timed jobs like interest rate calculations, monthly billing, etc…
You cannot go from batch to stream!
How does the stream look?
[Diagram] Sources → Event-by-Event Processing (there is often some state here!) → Target. Runs continuously.
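As a minimal sketch of event-by-event processing with state, the following Java consumer reads one event at a time from Kafka and keeps a running total per key in local memory. The topic name payments, the consumer group, and the totals logic are assumptions for illustration.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;

// Event-by-event processing: consume, update local state, act per event.
public class StreamProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-totals");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> totals = new HashMap<>(); // "some state here"

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofMillis(500))) {
                    long total = totals.merge(
                        rec.key(), Long.parseLong(rec.value()), Long::sum);
                    System.out.printf("%s now at %d%n", rec.key(), total);
                }
            }
        }
    }
}
```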
But you can go from stream to batch
[Diagram] Sources → Event-by-Event Processing (there is often some state here!) → Target, continuously, with an additional Extract → Load leg feeding batch targets.
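A sketch of the stream-to-batch direction under the same assumptions as above: the continuous consumer buffers events and periodically cuts an extract file that a batch-only target can load. The flush threshold and file naming are made up.

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;

// Stream first, batch derived: buffer events, roll an extract file
// every 10,000 records (or on a timer) for batch-only consumers.
public class BatchExtractor {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "extract-writer");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        List<String> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    buffer.add(rec.value());
                }
                if (buffer.size() >= 10_000) {
                    Path extract = Paths.get(
                        "extract-" + System.currentTimeMillis() + ".csv");
                    Files.write(extract, buffer); // the batch "Extract/Load"
                    buffer.clear();
                }
            }
        }
    }
}
```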
Why now?
• Message queues have existed since forever
• It’s a cost and efficiency thing

What changed?
• LinkedIn / Netflix & friends
• Transaction guarantees of classic MQs not needed
• High-performance message store: Kafka
• Ubiquity of high-performance distributed stream processing: Storm, Samza, Kafka Streams, Spark Streaming, Heron, Flink, …
Performance & Transactional Guarantees
[Diagram] Classic MQ: Source → MQ → Target, fully transactional.
The MQ keeps track of all messages and transactions from all sources; it needs to track the state of every message:
• re-available after timeout
• back-out queues on rollback, etc…
• look-ups by correlation ID, etc…
Enter the simple distributed log
[Diagram] Source → Log file(s) → Target, from oldest data to latest data.
Writing a message just appends to one of the log files; the message essentially has a file position index.
Reading is essentially pulling at an index and then just keeping on reading forward.
Challenge: position tracking. Kafka helps with that; a toy sketch of the log idea follows.
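To see why the log is so simple, here is a toy append-only log in plain Java: writing appends and yields a position index, reading seeks to an index and keeps going forward. This deliberately ignores everything Kafka adds on top (partitioning, replication, consumer-group offset tracking); the file name and messages are invented.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// A toy append-only log. Not Kafka, just the core idea.
public class ToyLog {
    private final File file;

    ToyLog(File file) { this.file = file; }

    // Append one message; its offset is simply the file position.
    synchronized long append(String msg) throws IOException {
        long offset = file.length();
        try (FileOutputStream out = new FileOutputStream(file, true)) {
            out.write((msg + "\n").getBytes(StandardCharsets.UTF_8));
        }
        return offset;
    }

    // Read forward from a given offset. The log tracks nothing;
    // every consumer remembers its own position.
    void readFrom(long offset) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
            in.seek(offset);
            long pos = offset;
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(pos + ": " + line);
                pos = in.getFilePointer();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ToyLog log = new ToyLog(new File("events.log"));
        long first = log.append("order-created");
        log.append("order-paid");
        log.readFrom(first); // replay from the first offset
    }
}
```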
Consequences of the distributed log

We lose
• Full transaction support
• Lookups by correlation ID, etc…

We win
• Very high throughput
• No overfilled queues, so we can batch into it!
• Strict ordering per log file/partition* (see the sketch below)
• Multiple target systems can read independently
• Very simple testing

*which is quite useful given we're replacing batch...
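A sketch of how the per-partition ordering is usually exploited with the Kafka Java producer: records sharing a key hash to the same partition, so they stay in strict order relative to each other. The topic payments and the account key are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.*;

// Keyed writes: same key -> same partition -> strict ordering.
public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events about account-42 land in one partition, in order.
            producer.send(new ProducerRecord<>("payments", "account-42", "debit 10"));
            producer.send(new ProducerRecord<>("payments", "account-42", "credit 5"));
        }
    }
}
```

Choose the key so that everything that must stay ordered shares it; across different keys no order is guaranteed.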
Summary
Technologies in play
Source → Event processing → Target
• Source: DB via change-data-capture or batch export; system via data feed
• Target: DB via plain inserts with a decent commit-size (sketch below); system via REST calls
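For the "just insert with a decent commit-size" target, a plain JDBC sketch that batches inserts and commits every 500 rows. The connection URL, table, and the fetchNextEvents() helper are made-up placeholders for events arriving from the stream.

```java
import java.sql.*;

// Target-side sink: batched inserts, committed in decent-sized chunks.
public class BatchInsertSink {
    private static final int COMMIT_SIZE = 500;

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/app", "app", "secret")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                int count = 0;
                for (String[] event : fetchNextEvents()) {
                    ps.setString(1, event[0]);
                    ps.setString(2, event[1]);
                    ps.addBatch();
                    if (++count % COMMIT_SIZE == 0) {
                        ps.executeBatch();
                        conn.commit(); // the "decent commit-size"
                    }
                }
                ps.executeBatch();
                conn.commit(); // flush the tail
            }
        }
    }

    // Hypothetical stand-in for events handed over by the stream processor.
    private static String[][] fetchNextEvents() {
        return new String[][] { {"e1", "{...}"}, {"e2", "{...}"} };
    }
}
```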
What should you take away from this?
• Think differently: streaming-first
• Go idempotent to keep it simple (see the sketch after this list)
• Partition to go fast & ordered
• Establish governance & standards as you go*
*data formats, naming conventions, operations,…
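On the "go idempotent" takeaway, one common pattern (an assumption here, not something the talk prescribes) is to key every write by the event ID so a replayed event is simply a no-op. The sketch uses PostgreSQL's ON CONFLICT syntax with an illustrative table.

```java
import java.sql.*;

// Idempotent sink: the event id is the key, so replays do nothing.
public class IdempotentSink {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/app", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO payments (event_id, account, amount) " +
                 "VALUES (?, ?, ?) ON CONFLICT (event_id) DO NOTHING")) {
            ps.setString(1, "evt-123");
            ps.setString(2, "account-42");
            ps.setLong(3, 10);
            ps.executeUpdate();
            ps.executeUpdate(); // same event again: no duplicate row
        }
    }
}
```

With that in place, the stream processor can safely re-read from an earlier offset after a crash.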
Thank you
Questions?