1 © Volker Markl© 2013 Berlin Big Data Center • All Rights Reserved1 © Volker Markl
Big Data Management
and Scalable Data Science:
Challenges and (some) SolutionsProf. Dr. Volker Markl
http://www.user.tu-berlin.de/marklv/
http://www.dima.tu-berlin.de
http://www.dfki.de/web/forschung/iam
http://bbdc.berlin
2 © Volker Markl2 © 2013 Berlin Big Data Center • All Rights Reserved
2 © Volker Markl
ML
ML
alg
orith
ms
alg
orith
ms
data too uncertain Veracity Data Mining MATLAB, R, Python
Predictive/Prescriptive MATLAB, R, Python
Data & Analysis: Increasingly Complex!
data volume too large Volume
data rate too fast Velocity
data too heterogeneous Variability
Data
Reporting aggregation, selection
Ad-Hoc Queries SQL, XQuery
ETL/ELT MapReduce
Analysis
DM
DM
sca
lab
ility
sca
lab
ility
3 © Volker Markl3 © 2013 Berlin Big Data Center • All Rights Reserved
3 © Volker Markl
Application
Data
Science
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical Programming
Linear Algebra
Stochastic Gradient Descent
Regression
Statistics
Hashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra / SQL
Scalability
Data Analysis Language
Compiler
Memory Management
Memory Hierarchy
Data Flow
Hardware Adaptation
Indexing
Resource Management
NF2 /XQuery
Data Warehouse/OLAP
“Data Scientist” – “Jack of All Trades!”Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-Time
New Technology to the Rescue!
4 © Volker Markl4
4 © Volker Markl
Big Data Analytics Requires Systems Programming
R/Matlab:
3 million users
Hadoop:
100,000
users
Data Analysis
Statistics
Algebra
Optimization
Machine Learning
NLP
Signal Processing
Image Analysis
Audio-,Video Analysis
Information Integration
Information Extraction
Data Value Chain
Data Analysis Process
Predictive Analytics
Indexing
Parallelization
Communication
Memory Management
Query Optimization
Efficient Algorithms
Resource Management
Fault Tolerance
Numerical StabilityBig Data is now where database systems were in the
70s (prior to relational algebra, query optimization
and a SQL-standard)!
People with Big Data
Analytics Skills
Declarative languages to the rescue!
5 © Volker Markl5 © 2013 Berlin Big Data Center • All Rights Reserved
5 © Volker Markl
Deep Analysis of „Big Data“ is Key!
Small Data Big Data (3V)
Deep A
naly
tics
Sim
ple
Analy
sis
Many new companies and products are emerging to enable deep big data analysis;strong European contenders include Apache Flink, Parstream, and Exasol.„New companies“ are the (b)leading users of these technologies, e.g., in theinformation economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).„Traditional Big companies“ are following and still determining strategies (Industrie4.0, Logistics, Telco, etc.). Most SMEs are not ready yet to capitalize on Big Data.
6 © Volker Markl6 © 2013 Berlin Big Data Center • All Rights Reserved
6 © Volker Markl
Challenge: Technologies for Data Science at the Intersection of
Data Management and Machine Learning
aggregation relational algebra UDF iteration/recursionlinear algebragraph algebraeetc.
DM MLFeature Engineering
Representation
Algorithms (SVM, EM, etc.)
Declarative Languages
Automatic Adaption
Scalable processing
Think ML-algorithms
in a scalable way
Process (iterative)
algorithms
in a scalable way
declarative
7 © Volker Markl7 © Volker Markl
Apache Flink–
a success story
originating in Berlin
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, et al : Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38(4): 28-38 (2015)
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al:The Stratosphere platform for big data analytics. VLDB J. 23(6): 939-964 (2014)
Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, Daniel Warneke:Nephele/PACTs: a programming model and execution framework for web-scale analyticalprocessing. SoCC 2010: 119-130
8 © Volker Markl
• Relational Algebra
• Declarativity
• Query Optimization
• Robust Out-of-core
• Scalability
• User-defined
Functions
• Complex Data Types
• Schema on Read
• Iterations
• Advanced Dataflows
• General APIs
• Native Streaming
8
Draws on
Database Technology
Draws on
MapReduce Technology
Adds
Stratosphere: General Purpose
Programming + Database Execution
9 © Volker Markl9
9 © Volker Markl
Apache Flink is an open source platform for scalable batch and stream
data processing.
What is Apache Flink?
http://flink.apache.org
• The core of Flink is a distributed
streaming dataflow engine.
• Executing dataflows in parallel on
clusters
• Providing a reliable foundation for
various workloads
• DataSet and DataStream programming
abstractions are the foundation for user
programs and higher layers
10 © Volker Markl10 © 2013 Berlin Big Data Center • All Rights Reserved
10 © Volker Markl
Technology inside Flink
case class Path (from: Long, to:Long)val tc = edges.iterate(10) {
paths: DataSet[Path] =>val next = paths
.join(edges)
.where("to")
.equalTo("from") {(path, edge) =>
Path(path.from, edge.to)}.union(paths).distinct()
next}
Cost-based
optimizer
Type extraction
stack
Task
scheduling
Recovery
metadata
Pre-flight (Client)
MasterWorkers
DataSourc
eorders.tbl
Filter
MapDataSourc
elineitem.tbl
JoinHybrid Hash
build
HTprobe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
Memory
manager
Out-of-core
algos
Batch &
Streaming
State &
Checkpoints
deploy
operators
track
intermediate
results
11 © Volker Markl11 © 2013 Berlin Big Data Center • All Rights Reserved
11 © Volker Markl
Effect of optimization
11
Run on a sampleon the laptop
Run a month laterafter the data evolved
Hash vs. SortPartition vs. BroadcastCachingReusing partition/sortExecution
Plan A
ExecutionPlan B
Run on large fileson the cluster
ExecutionPlan C
12 © Volker Markl12 © 2013 Berlin Big Data Center • All Rights Reserved
12 © Volker Markl
Why optimization ?
Do you want to hand-tune that?
13 © Volker Markl13 © 2013 Berlin Big Data Center • All Rights Reserved
13 © Volker Markl
DATA STREAMING ANALYSIS
14 © Volker Markl14
14 © Volker Markl
Life of data streams
• Create: create streams from event sources (machines, databases, logs, sensors, …)
• Collect: collect and make streams available for consumption (e.g., Apache Kafka)
• Process: process streams, possibly generating derived streams (e.g., Apache Flink)
14
15 © Volker Markl15 © 2013 Berlin Big Data Center • All Rights Reserved
15 © Volker Markl
Stream Analysis in Flink
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
16 © Volker Markl16
16 © Volker Markl
Defining windows in Flink
• Trigger policy
– When to trigger the computation on current window
• Eviction policy
– When data points should leave the window
– Defines window width/size
• E.g., count-based policy
– evict when #elements > n
– start a new window every n-th element
• Built-in: Count, Time, Delta policies
17 © Volker Markl17
17 © Volker Markl
Checkpointing / Recovery
• Flink acknowledges batches of records
– Less overhead in failure-free case
– Currently tied to fault tolerant data sources (e.g., Kafka)
• Flink operators can keep state
– State is checkpointed
– Checkpointing and record acks go together
• Exactly one semantics for state
18 © Volker Markl18 © 2013 Berlin Big Data Center • All Rights Reserved
18 © Volker Markl
Checkpointing / Recovery
18
Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriersthrough the data flow
Operator checkpointstarting
Checkpoint done
Data Stream
barrier
Before barrier =part of the snapshot
After barrier =Not in snapshot
Checkpoint done
checkpoint in progress
(backup till next snapshot)
19 © Volker Markl19 © 2013 Berlin Big Data Center • All Rights Reserved
19 © Volker Markl
ITERATIONS IN DATA FLOWS
MACHINE LEARNING
ALGORITHMS
Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl:Spinning Fast Iterative Data Flows. PVLDB 5(11): 1268-1279 (2012)
20 © Volker Markl20
20 © Volker Markl
Iterate by looping
• for/while loop in client submits one job per iteration step
• Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
21 © Volker Markl21
21 © Volker Markl
Iterate in the Dataflow
partial
solution partial
solution X
other
datasets
Y initial
solution
iteration
result
Replace
Step function
22 © Volker Markl22 © 2013 Berlin Big Data Center • All Rights Reserved
22 © Volker Markl
Large-Scale Machine Learning
Factorizing a matrix with28 billion ratings forrecommendations
(Scale of Netflixor Spotify)
More at: http://data-artisans.com/computing-recommendations-with-flink.html
23 © Volker Markl23
23 © Volker Markl
Optimizing iterative programs
Caching Loop-invariant DataPushing work„out of the loop“
Maintain state as index
24 © Volker Markl24 © 2013 Berlin Big Data Center • All Rights Reserved
24 © Volker Markl
STATE IN ITERATIONS
GRAPHS AND MACHINE
LEARNING
25 © Volker Markl25
25 © Volker Markl
Iterate natively with deltas
partial
solution
delta
setX
other
datasets
Y initial
solution
iteration
result
workset A B workset
Merge deltas
Replace
initial
workset
26 © Volker Markl26
26 © Volker Markl
Effect of delta iterations…
0
5000000
10000000
15000000
20000000
25000000
30000000
35000000
40000000
45000000
1 6 11 16 21 26 31 36 41 46 51 56 61
# o
f e
lem
en
ts u
pd
ate
d
iteration
27 © Volker Markl27 © 2013 Berlin Big Data Center • All Rights Reserved
27 © Volker Markl
… very fast graph analysis
… and mix and matchETL-style and graph analysisin one program
Performance competitivewith dedicated graph
analysis systems
More at: http://data-artisans.com/data-analysis-with-flink.html
28 © Volker Markl28 © Volker Markl
Current Benchmark Results
Performed by Yahoo! Engineering,
Dec 16, 2015
[..]Storm 0.10.0, 0.11.0-SNAPSHOT and
Flink 0.10.1 show sub- second latencies
at relatively high throughputs[..]. Spark
streaming 1.5.1 supports high
throughputs, but at a relatively higher
latency.
Flink achieves highest throughput
with competitive low latency!
Source: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
29 © Volker Markl29
29 © Volker Markl
Timeline
30 © Volker Markl30
30 © Volker Markl
(Strictly) Flink European Meetups with
Member Totals (as of 30.5.16)
Country Total Members
Berlin 758
Paris 500
Madrid 384
Stockholm 313
Brussels 279
London 190
Munich 98
Istanbul 56
31 © Volker Markl31
31 © Volker Markl
Meetups By Country Concerning Flink
Apache Flink Meetups Worldwide (Data accurate as of 30.5.16) 6326 members strictly focused on Apache Flink (comprising 57%)4771 members broader in scope, including Flink (comprising 43%)
32 © Volker Markl32
32 © Volker Markl
Distribution of (Strictly) Flink Meetup Group
Members by Country (as of 30.5.16)
Country Total Members
USA 3184
Germany 856
France 500
Spain 384
Sweden 313
Belgium 279
Brazil 233
UK 190
Taiwan 142
India 139
Turkey 56
Mexico 50
33 © Volker Markl33
33 © Volker Markl
> 13 Companies Using Flink
34 © Volker Markl34
34 © Volker Markl
> 6 Software Projects Using Flink
35 © Volker Markl35
35 © Volker Markl
> 10 Research Institutions
Using Flink
36 © Volker Markl36
36 © Volker Markl
Fault tolerance
Pessimistic Recovery:
• Write intermediate state to stable storage
• Restart from such a checkpoint in case of a failure
Problematic:
• High overhead, checkpoint must
be replicated to other machines
• Overhead always incurred, even if no
failures happen!
How can we avoid this overhead in failure-free casesru
nti
me
per
iter
atio
n (
sec)
} actual work
} checkpointingoverhead
20
120
37 © Volker Markl37
37 © Volker Markl
Optimistic recovery
• Many data mining algorithms are fixpoint algorithms
• Optimistic Recovery: jump to a different state in case of a failure,
still converge to solution
• No checkpoints No overhead in absense of failures!
• algorithm-specific compensation function must restore state
pessimistic recovery optimistic recoveryfailure-free
38 © Volker Markl38
38 © Volker Markl
All Roads lead to Rome
If you are interested more, read our CIKM 2013 paper:
Sebastian Schelter, Stephan Ewen, Kostas Tzoumas, Volker
Markl: "All roads lead to Rome": optimistic recovery for
distributed iterative data processing. CIKM 2013: 1919-1928
Sergey Dudoladov, Chen Xu, Sebastian Schelter, Asterios
Katsifodimos, Stephan Ewen, Kostas Tzoumas, Volker Markl:
Optimistic Recovery for Iterative Dataflows in Action.
To appear in SIGMOD 2015
39 © Volker Markl39 © Volker Markl
Declarative Data
Processing
and Big Data
40 © Volker Markl40
40 © Volker Markl
A Billion $$$ Mantra...
Declarative Data Processing
SQL Relations RDBMS
A simple, high-level language for querying data (Chamberlin ’74).
An effective, formal foundation based on relational algebra and calculus (Codd ’71).
An efficient, low-level execution environment tailored towards the data (Selinger ’79).
41 © Volker Markl41
41 © Volker Markl
With 40+ years of success...
Declarative Data Processing
SQL Relations RDBMS
42 © Volker Markl42
42 © Volker Markl
Is Being Revised
Declarative Data Processing
SQL Relations RDBMS
DistributedCollections
Parallel DataflowEngines
Second-OrderFunctions
43 © Volker Markl43
43 © Volker Markl
Overall Vision & Next Steps
• First results– Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, et al:
Implicit Parallelism through Deep Language Embedding. SIGMOD Conference 2015: 47-61
– Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, Volker Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016
– Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)
• Next Steps (Fall 2016)– Open-Source Release
• Vision (Frontend): Multi-model DSL based on type contracts– Collection Processing DataBag[A]
– Linear Algebra Matrix[A], Vector[A]
– Stream Processing Stream[A]
• Vision (Backend): Target more execution engines– Column Stores
– GPUs
44 © Volker Markl44
44 © Volker Markl
Thanks to my team members and students
• Dr. Stephan Ewen
• Sebastian Schelter
• Dr. Kostas Tzoumas
• Dr. Asterios Katsifodimos
• Fabian Hüske
• Alexander Alexandrov
• Max Heimel
and many more members of the Stratosphere Project, the
Berlin Big Data Center, and the Apache Flink community
45 © Volker Markl
Evolution of Big Data Platforms
4G
3G
2G
1G
Relational Databases
Hadoop
Flink
Scale-out, Map/Reduce, UDFs
Spark
In-memory Performance and Improved Programming Model
In-memory + Out of Core Performance, Declarativity, Optimisation of iterative Algorithms, True Streaming/Lambda