![Page 1: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/1.jpg)
Machine Learning for Data Streams
Albert Bifet (@abifet)
Paris TMA Conference 2019
20 June 2019
![Page 2: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/2.jpg)
Internet of Things
IoT: sensors and actuators connected by networks to computing systems.
• Gartner predicts 20.8 billion IoT devices by 2020.
• IDC projects 32 billion IoT devices by 2020.
![Page 3: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/3.jpg)
IoT versus Big Data
![Page 4: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/4.jpg)
AI/Machine Learning is the new Electricity
• Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
• Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
![Page 5: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/5.jpg)
“Over the past decades, computers have broadly automated tasks that programmers could describe with clear rules and algorithms. Modern machine learning techniques now allow us to do the same for tasks where describing the precise rules is much harder.”
–Jeff Bezos
![Page 6: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/6.jpg)
“Machine Learning is a way of solving problems without knowing how to solve them”
–Unknown Author
![Page 7: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/7.jpg)
Computer Science
Imperative Programming
The programmer specifies an explicit sequence of steps to follow to produce a result.
Decision: +, -
Data
Software 1.0
![Page 8: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/8.jpg)
Computer Science
Imperative Programming
while { for { do {..} } }
Decision: +, -
Data
Software 1.0
![Page 9: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/9.jpg)
Machine Learning
Machine Learning Algorithm
Decision: +, -
Data
Software 2.0
![Page 10: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/10.jpg)
5-Minute Course on Machine Learning
AI/ML = Data + Algorithms + Computing Power
![Page 11: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/11.jpg)
Classification of Visitors to a Website
Customer
Non-Customer
Is this visitor a customer?
![Page 12: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/12.jpg)
Customer
Non-Customer
![Page 13: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/13.jpg)
Customer
Non-Customer
Is this visitor a customer?
![Page 14: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/14.jpg)
Classification
Majority Class
Is this visitor a customer?
Non-Customer
![Page 15: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/15.jpg)
Classification
Linear Classifier
Is this visitor a customer?
Customer
![Page 16: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/16.jpg)
Classification
Nearest Neighbour
Is this visitor a customer?
Customer
![Page 17: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/17.jpg)
Classification
Decision Tree
Is this visitor a customer?
Customer
[Figure: decision tree splitting on Island (South, North) and Region (Manawatu-Wanganui, Gisborne, BoP)]
![Page 18: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/18.jpg)
Classification
Random Forest: Ensemble of Random Trees
Is this visitor a customer?
Customer
![Page 19: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/19.jpg)
Classification
Deep Learning
Is this visitor a customer?
Customer
![Page 20: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/20.jpg)
![Page 21: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/21.jpg)
AI Systems
• According to Nikola Kasabov, AI systems should exhibit the following characteristics:
• Accommodate new problem-solving rules incrementally
• Adapt online and in real time
• Be able to analyze themselves in terms of behavior, error and success
• Learn and improve through interaction with the environment (embodiment)
• Learn quickly from large amounts of data (Big Data)
• Have memory-based exemplar storage and retrieval capacities
• Have parameters to represent short- and long-term memory, age, forgetting, etc.
![Page 22: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/22.jpg)
Data Streams
• Maintain models online
• Incorporate data on the fly
• Unbounded training sets
• Resource efficient
• Detect changes and adapt
• Dynamic models
![Page 23: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/23.jpg)
Analytic Standard Approach
Finite training sets
Static models
[Figure: a classifier algorithm builds a model from a data set]
![Page 24: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/24.jpg)
Data Stream Approach
Infinite training sets
Dynamic models
[Figure: each incoming data item (D) updates the model (M)]
![Page 25: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/25.jpg)
Importance of Online Learning
• As spam trends change, it is important to retrain the model with newly judged data
• Previously tested using news comments at Y! Inc
• Over a 29-day period, you can see degradation in performance of the base model (w/o active learning) vs. the online model (AUC stands for Area Under Curve)
• Original paper
Adversarial Learning
• Need to retrain!
• Things change over time
• How often?
• Data unused until next update!
• Value of data wasted
![Page 26: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/26.jpg)
AI Challenges
![Page 27: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/27.jpg)
Cédric Villani and Marc Schoenauer
![Page 28: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/28.jpg)
1. Open AI
![Page 29: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/29.jpg)
MOA
• Massive Online Analysis (MOA; Bifet et al. 2010) is a framework for online learning from data streams.
• It is closely related to WEKA
• It includes a collection of offline and online methods as well as tools for evaluation:
• classification, regression
• clustering, frequent pattern mining
• Easy to extend, design and run experiments
![Page 30: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/30.jpg)
MOA
Richard Kirkby, Software Developer at 11Ants Analytics Ltd
Geoff Holmes, Dean of Computing & Mathematical Sciences, University of Waikato
Bernhard Pfahringer, Computer Science Department, University of Waikato
![Page 31: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/31.jpg)
Main Contributors
• Weka ML Group: Peter Reutemann, Eibe Frank, Mike Mayo
• Jesse Read, Indrė Žliobaitė, Philipp Kranen, Hardy Kremer, Timm Jansen, Marwan Hassani, Thomas Seidl, Dimitris Georgiadis, Anastasios Gounaris, Apostolos N. Papadopoulos, Kostas Tsichlas, Yannis Manolopoulos, Dariusz Brzeziński, Ricard Gavaldà, Alex Catarineu, Joao Gama, Ricardo Sousa, Joao Duarte, Aljaž Osojnik, …
![Page 32: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/32.jpg)
![Page 33: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/33.jpg)
WEKA: the bird
![Page 34: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/34.jpg)
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
![Page 35: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/35.jpg)
![Page 36: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/36.jpg)
![Page 37: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/37.jpg)
Stream Setting
• Process an example at a time, and inspect it only once (at most)
• Use a limited amount of memory
• Work in a limited amount of time
• Be ready to predict at any point
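The four requirements above can be illustrated with a very small learner. The sketch below is not taken from MOA. It assumes the "model" is just a running mean and variance (Welford's method), and the class and method names are hypothetical. Each value is seen once, memory use is constant, and a prediction is available at any point.

```python
class RunningStats:
    """Running mean and variance in constant memory (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of examples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def learn_one(self, value):
        # Each value is inspected exactly once and then discarded.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def predict(self):
        # Ready to answer at any point, even before any data has arrived.
        return self.mean

    def variance(self):
        # Population variance of the values seen so far.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.learn_one(value)
print(stats.predict(), stats.variance())
```

The work per example is a handful of arithmetic operations, so the time bound holds no matter how long the stream runs.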
![Page 38: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/38.jpg)
Stream Evaluation
• Holdout Evaluation
• Interleaved Test-Then-Train or Prequential
![Page 39: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/39.jpg)
Stream Evaluation
Hold out an independent test set
• Apply the current decision model to the test set, at regular time intervals
• The loss estimated in the holdout is an unbiased estimator
![Page 40: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/40.jpg)
Stream Evaluation
Prequential Evaluation
• The error of a model is computed from the sequence of examples.
• For each example in the stream, the current model makes a prediction based only on the example attribute-values.
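A minimal sketch of the test-then-train loop, assuming a stream of (attributes, label) pairs. The `NoChangeLearner` baseline and the function names are hypothetical and only keep the example short. Any learner with `predict` and `learn` methods would fit.

```python
class NoChangeLearner:
    """Hypothetical baseline: predicts the most recently seen label."""

    def __init__(self):
        self.last_label = None

    def predict(self, x):
        return self.last_label

    def learn(self, x, y):
        self.last_label = y


def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: predict first, then learn from the label."""
    correct = 0
    seen = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test: the model sees only the attributes
            correct += 1
        learner.learn(x, y)           # train: the label is revealed afterwards
        seen += 1
    return correct / seen if seen else 0.0


stream = [
    ({"pages": 3}, "customer"),
    ({"pages": 5}, "customer"),
    ({"pages": 1}, "non-customer"),
    ({"pages": 4}, "customer"),
]
print(prequential_accuracy(stream, NoChangeLearner()))
```

Every example is used for testing before it is used for training, so no separate test set is needed.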
![Page 41: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/41.jpg)
Clustering
![Page 42: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/42.jpg)
MOA Algorithms
• Multi-label / Multi-target
• Outlier Detection
• Concept Drift Detection
• Active Learning
• Frequent Itemset Mining
• Frequent Graph Mining
• Recommendation Systems
![Page 43: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/43.jpg)
Command Line
• java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" > dsresult.csv
• This command creates a comma separated values file:
• training the DecisionStump classifier on the WaveformGenerator data,
• using the first 100 thousand examples for testing,
• training on a total of 100 million examples,
• and testing every one million examples
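The protocol these options describe is: hold out the first examples for testing, train on the rest, and test at fixed intervals. It can be sketched in plain Python. This is not MOA's implementation. The `MajorityClass` learner, the function name and the parameter names (`n_test`, `n_train`, `every`, loosely mirroring `-n`, `-i` and `-f`) are assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


class MajorityClass:
    """Hypothetical stand-in learner: always predicts the most frequent label."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


def periodic_holdout(stream, learner, n_test, n_train, every):
    """Hold out the first n_test examples, then evaluate every `every` training steps."""
    stream = iter(stream)
    test_set = list(islice(stream, n_test))      # held-out examples, never trained on
    results = []
    for seen, (x, y) in enumerate(islice(stream, n_train), start=1):
        learner.learn(x, y)
        if seen % every == 0:
            correct = sum(learner.predict(tx) == ty for tx, ty in test_set)
            results.append((seen, correct / len(test_set)))
    return results


# Tiny stream: 3 held-out examples followed by 8 training examples.
labels = ["a", "a", "b"] + ["b", "b", "b", "a", "a", "a", "a", "a"]
stream = [({}, label) for label in labels]
for seen, accuracy in periodic_holdout(stream, MajorityClass(), n_test=3, n_train=8, every=4):
    print(f"{seen},{accuracy:.3f}")
```

Each printed line pairs the number of training examples seen with the accuracy on the held-out set, comparable to one row of a CSV result file.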
![Page 44: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/44.jpg)
![Page 45: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/45.jpg)
ADAMS
Advanced Data Mining And Machine Learning System
![Page 46: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/46.jpg)
OpenML
![Page 47: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/47.jpg)
scikit-multiflow
![Page 48: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/48.jpg)
scikit-multiflow
![Page 49: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/49.jpg)
scikit-multiflow
![Page 50: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/50.jpg)
scikit-multiflow
Jesse Read, Ecole Polytechnique, France
Jacob Montiel, Telecom ParisTech, France
![Page 51: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/51.jpg)
Learning Fast and Slow
![Page 52: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/52.jpg)
![Page 53: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/53.jpg)
![Page 54: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/54.jpg)
Learning Fast and Slow
![Page 55: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/55.jpg)
Learning Fast and Slow
![Page 56: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/56.jpg)
Learning Fast and Slow
![Page 57: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/57.jpg)
Learning Fast and Slow
![Page 58: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/58.jpg)
Learning Fast and Slow
![Page 59: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/59.jpg)
2. Green AI
![Page 60: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/60.jpg)
![Page 61: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/61.jpg)
![Page 62: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/62.jpg)
![Page 63: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/63.jpg)
![Page 64: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/64.jpg)
![Page 65: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/65.jpg)
Green AI
• One pass over the data
• Approximation algorithms: small error ε with high probability 1-δ
• True hypothesis H, and learned hypothesis Ĥ
• Pr[ |H - Ĥ| < ε|H| ] > 1-δ
![Page 66: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/66.jpg)
Approximation Algorithms
• What is the largest number that we can store in 8 bits?
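One standard way to explore this question is Morris-style approximate counting. It is assumed here for illustration; the transcript does not name a technique. The 8-bit register stores an exponent instead of the count itself and is incremented with probability 2^-exponent, so the estimate 2^exponent - 1 is correct in expectation. The class name and the cap at 255 are choices made for this sketch.

```python
import random


class MorrisCounter:
    """Approximate counter whose state fits in a single 8-bit register."""

    MAX_EXPONENT = 255  # largest value an unsigned 8-bit register can hold

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Bump the stored exponent with probability 2^-exponent.
        if self.exponent < self.MAX_EXPONENT and self.rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # Unbiased estimate of the number of increments seen so far.
        return 2 ** self.exponent - 1


counter = MorrisCounter(random.Random(42))
for _ in range(1000):
    counter.increment()
print(counter.exponent, counter.estimate())
```

An exact counter in 8 bits stops at 255. Under this scheme the largest representable estimate is 2^255 - 1, and the price is a large variance for any single counter.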
![Page 67: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/67.jpg)
![Page 68: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/68.jpg)
![Page 69: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/69.jpg)
![Page 70: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/70.jpg)
Green AI
• Transform Big Data into Small Data
• Vertical: reducing features
• Horizontal: reducing instances
• Make data stream methods more energy efficient
• Use Energy as a measure, not time and memory
[Figure: data table with Attributes and Instances as its two dimensions]
![Page 71: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/71.jpg)
Compressed Sensing
Joint Work with:
- Maroua Bahri
- Silviu Maniu
- Nikos Tziortziotis
- Rodrigo Mello
![Page 72: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/72.jpg)
Coresets
![Page 73: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/73.jpg)
3. Explainable AI
![Page 74: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/74.jpg)
![Page 75: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/75.jpg)
![Page 76: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/76.jpg)
![Page 77: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/77.jpg)
LIME: Explaining the predictions of any machine learning classifier
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin, KDD 2016
![Page 78: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/78.jpg)
Decision Tree
• Each node tests a feature
• Each branch represents a value
• Each leaf assigns a class
• Greedy recursive induction
  • Sort all examples through tree
  • xi = most discriminative attribute
  • New node for xi, new branch for each value, leaf assigns majority class
  • Stop if no error | limit on #instances
[Figure: example decision tree for "Car deal?" with test nodes RoadTested? (Yes/No), Mileage? (High/Low), and Age? (Old/Recent); leaves are ✅ or ❌]
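The greedy recursive induction listed on the Decision Tree slide can be sketched roughly as follows. This is a minimal illustration, not the method of any particular library: it assumes categorical attributes, uses information gain as the measure of "most discriminative" (the slide does not name a measure), and the car-deal rows and attribute names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy reduction obtained by splitting on attr (assumed merit measure).
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, min_instances=1):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop if no error (pure node), no attributes left, or too few instances.
    if len(set(labels)) == 1 or not attrs or len(rows) <= min_instances:
        return majority
    # xi = most discriminative attribute.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "branches": {}, "default": majority}
    # New branch for each observed value; recurse on the examples sorted into it.
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
            min_instances,
        )
    return node

def predict(tree, row):
    # Follow branches until a leaf (a class label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical training data, loosely inspired by the "Car deal?" figure.
rows = [
    {"road_tested": "yes", "mileage": "low", "age": "recent"},
    {"road_tested": "yes", "mileage": "low", "age": "old"},
    {"road_tested": "yes", "mileage": "high", "age": "recent"},
    {"road_tested": "yes", "mileage": "high", "age": "old"},
    {"road_tested": "no", "mileage": "low", "age": "recent"},
    {"road_tested": "no", "mileage": "high", "age": "old"},
]
labels = ["deal", "deal", "no deal", "no deal", "no deal", "no deal"]
tree = build_tree(rows, labels, ["road_tested", "mileage", "age"])
```

This batch procedure needs all examples in memory at once, which is the limitation that stream learners such as the Hoeffding tree are designed to avoid.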
![Page 79: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/79.jpg)
HOEFFDING TREE
• Sample of stream enough for near optimal decision
• Estimate merit of alternatives from prefix of stream
• Choose sample size based on statistical principles
• When to expand a leaf?
  • Let x1 be the most informative attribute, x2 the second most informative one
  • Hoeffding bound: split if G(x1) - G(x2) > ε, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

with R the range of G, δ the allowed probability of error, and n the number of examples seen at the leaf.
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
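As a rough illustration of the split test above, the sketch below computes ε for a given number of examples and checks whether the gap between the two best attributes exceeds it. The function names are illustrative, and the values R = 1 (information gain in bits with two classes), δ = 1e-7, and the merit values are assumptions chosen for the example.

```python
import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, value_range, delta, n):
    # Split when the observed gap between the two best attributes exceeds epsilon.
    return gain_best - gain_second > hoeffding_bound(value_range, delta, n)

# Assumed merits of the best and second-best attributes at a leaf.
g1, g2 = 0.30, 0.15

eps_200 = hoeffding_bound(1.0, 1e-7, 200)    # about 0.2007
eps_1000 = hoeffding_bound(1.0, 1e-7, 1000)  # about 0.0898

split_at_200 = should_split(g1, g2, 1.0, 1e-7, 200)    # False: gap 0.15 is below epsilon
split_at_1000 = should_split(g1, g2, 1.0, 1e-7, 1000)  # True: gap 0.15 exceeds epsilon
```

Because ε shrinks as 1/√n, the same observed gap that is inconclusive after 200 examples becomes sufficient after 1000, so the leaf simply waits until the evidence is strong enough.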
![Page 80: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/80.jpg)
![Page 81: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/81.jpg)
Ensembles of Adaptive Model Rules from High-Speed Data Streams
AMRules: Rules
A rule is a set of conditions based on attribute values. If all the conditions are true, a prediction is made based on L. L contains the sufficient statistics to:
• expand a rule,
• make predictions,
• detect changes,
• detect anomalies.
Rules
• Problem: very large decision trees have context that is complex and hard to understand
• Rules: self-contained, modular, easier to interpret, no need to cover universe
• L keeps sufficient statistics to:
  • make predictions
  • expand the rule
  • detect changes and anomalies
![Page 82: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/82.jpg)
Ensembles of Adaptive Model Rules from High-Speed Data Streams
AMRules: Rule sets
Predicting with a rule set
E.g.: x = [4, −1, 1, 2]

$$f(x) = \sum_{R_l \in S(x_i)} \theta_l \, y_l$$

Prediction consists of a weighted average of the predictions made by the rules that cover x. Weights are inversely proportional to the MAE of the prediction functions. The uncertainty of a prediction is the weighted average of the errors.

$$\theta_l = \frac{(e_l + \varepsilon)^{-1}}{\sum_{R_j \in S(x_i)} (e_j + \varepsilon)^{-1}}$$

Adaptive Model Rules
• Ruleset: ensemble of rules
• Rule prediction: mean, linear model
• Ruleset prediction
  • Weighted avg. of predictions of rules covering instance x
  • Weights inversely proportional to error
  • Default rule covers uncovered instances
E. Almeida, C. Ferreira, J. Gama. "Adaptive Model Rules from Data Streams." ECML-PKDD ’13
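The ruleset prediction described above can be sketched as follows. This is a minimal illustration under assumptions, not the AMRules implementation: each covering rule is represented only by a hypothetical (prediction, MAE) pair, `ruleset_predict` and `default_rule` are illustrative names, and the small constant `eps` stands in for ε to avoid division by zero.

```python
def ruleset_predict(rule_outputs, default_rule=(0.0, 1.0), eps=1e-9):
    """Combine the outputs of the rules covering an instance.

    rule_outputs: list of (prediction, mae) pairs, one per covering rule.
    default_rule: (prediction, mae) used when no rule covers the instance.
    """
    if not rule_outputs:
        # The default rule covers uncovered instances.
        rule_outputs = [default_rule]
    # Weights are inversely proportional to each rule's error, then normalised.
    inverse_errors = [1.0 / (mae + eps) for _, mae in rule_outputs]
    total = sum(inverse_errors)
    weights = [w / total for w in inverse_errors]
    # Prediction: weighted average of the rule predictions.
    prediction = sum(w * y for w, (y, _) in zip(weights, rule_outputs))
    # Uncertainty: weighted average of the rule errors.
    uncertainty = sum(w * e for w, (_, e) in zip(weights, rule_outputs))
    return prediction, uncertainty, weights

# Two hypothetical rules cover x: one predicts 10 (MAE 1), the other 20 (MAE 3).
prediction, uncertainty, weights = ruleset_predict([(10.0, 1.0), (20.0, 3.0)])
# weights are about [0.75, 0.25], prediction about 12.5, uncertainty about 1.5
```

The more accurate rule dominates the average, so a single poorly performing rule has limited influence on the ensemble output.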
![Page 83: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/83.jpg)
Adaptive Random Forest
Adaptive random forests for evolving data stream classification. Gomes, H M; Bifet, A; Read, J; Barddal, J P; Enembreck, F; Pfahringer, B; Holmes, G; Abdessalem, T. Machine Learning, Springer, 2017.
• Based on the original Random Forest by Breiman
![Page 84: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/84.jpg)
ADWIN
![Page 85: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/85.jpg)
ADWIN
![Page 86: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/86.jpg)
4. Ethical Issues
![Page 87: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/87.jpg)
![Page 88: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/88.jpg)
![Page 89: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/89.jpg)
Should data have an expiration date?
![Page 90: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/90.jpg)
5. Distributed Machine Learning for Data Streams
![Page 91: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/91.jpg)
Vision
[Figure: IoT Big Data Stream Mining, with the labels Streaming and Distributed]
![Page 92: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/92.jpg)
APACHE SAMOA
http://samoa-project.net
Data Mining
• Distributed
  • Batch: Hadoop; Mahout
  • Stream: Storm, S4, Samza; SAMOA
• Non Distributed
  • Batch: R, WEKA,…
  • Stream: MOA
G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014)
![Page 93: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/93.jpg)
SAMOA ARCHITECTURE
5 Creating a Flink Adapter on Apache SAMOA
Apache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning algorithms, which can run on top of different Data Stream Processing Engines (DSPEs).
As depicted in Figure 20, Apache SAMOA offers the abstractions and APIs for developing new distributed ML algorithms to enrich the existing library of state-of-the-art algorithms [27, 28]. Moreover, SAMOA provides the possibility of integrating new DSPEs, allowing ML programmers to implement an algorithm once and run it in different DSPEs [28].
An adapter for integrating Apache Flink into Apache SAMOA was implemented in the scope of this master thesis, with the main parts of its implementation being addressed in this section. With the use of our adapter, ML algorithms can be executed on top of Apache Flink. The implemented adapter will be used for the evaluation of the ML pipelines and HT algorithm variations.
Figure 20: Apache SAMOA’s high level architecture.
5.1 Apache SAMOA Abstractions
Apache SAMOA offers a number of abstractions which allow users to implement any distributed streaming ML algorithms in a platform independent way. The most important abstractions of Apache SAMOA are presented below [27, 28].
![Page 94: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/94.jpg)
Vertical Partitioning
[Figure: vertical partitioning diagram with the labels Stream, Model, Attributes, Splits, and three Stats nodes]
Single attribute tracked in single node
N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo: “VHT: Vertical Hoeffding Tree”, Big Data Conference 2016
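The idea that each attribute is tracked in a single node can be sketched as below. This is a toy, single-process illustration and not the VHT or SAMOA code: the `StatsWorker` class, the modulo routing rule, and the sample instances are all assumptions made for the example.

```python
from collections import defaultdict

class StatsWorker:
    """Holds the sufficient statistics for the attributes routed to it."""

    def __init__(self):
        # counts[attr_index][(attr_value, label)] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, attr_index, attr_value, label):
        self.counts[attr_index][(attr_value, label)] += 1

def route(attr_index, num_workers):
    # Assumed routing rule: a given attribute always goes to the same worker.
    return attr_index % num_workers

def process_instance(workers, attributes, label):
    # Split the instance vertically: one message per attribute.
    for i, value in enumerate(attributes):
        workers[route(i, len(workers))].update(i, value, label)

workers = [StatsWorker() for _ in range(3)]
process_instance(workers, ["yes", "low", "recent", "red"], "deal")
process_instance(workers, ["yes", "high", "old", "red"], "no deal")

# Attributes 0 and 3 live on worker 0, attribute 1 on worker 1, attribute 2 on worker 2.
tracked = [sorted(w.counts.keys()) for w in workers]
```

Because all statistics for an attribute sit on one worker, that worker can evaluate candidate splits for its attributes locally and report only the best ones back to the model.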
![Page 95: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/95.jpg)
Apache SAMOA Team
Gianmarco De Francisci Morales, Nicolas Kourtellis, Matthieu Morel, Arinto Murdopo, Antonio Severien, and Olivier Van Laere
![Page 97: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/97.jpg)
Summary
• Machine Learning for Data Streams is useful for finding approximate solutions in a reasonable amount of time with limited resources
• Challenges:
  • Open AI
  • Green AI
  • Explainable AI
  • Ethical Issues
  • Distributed Data Stream Mining
![Page 98: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/98.jpg)
![Page 99: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/99.jpg)
Green Data Mining
![Page 100: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/100.jpg)
Thanks!
@abifet
![Page 101: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/101.jpg)
![Page 102: Machine Learning for Data StreamsApache Scalable Advanced Massive Online Analysis (SAMOA) is a platform for mining data streams with the use of distributed streaming Machine Learning](https://reader030.vdocument.in/reader030/viewer/2022040410/5ed1552defd7b2537304c7a6/html5/thumbnails/102.jpg)