Data Analytics Intro, Session 1, 2013
DESCRIPTION
Introduction to Data Analytics
TRANSCRIPT
Introduction to Data Analytics Session 1 Florent Renucci, Data Scientist, Amadeus North America
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
I – You said Data Science ?
“Field of study that gives the computer the ability to learn without being explicitly programmed.”
- Arthur Samuel, 1959
Computer Science: given an input X and a function f, compute the output Y = f(X).
Data Science: given an input X and an output Y, find a model f such that Y = f(X) + ε.
I – Building a model “Numbers have an important story to tell. They rely on us to give them a voice.”
– Stephen Few
I – Using a model “Numbers have an important story to tell. They rely on us to give them a voice.”
– Stephen Few
I – What is it ?
In a nutshell
• Feed a metaheuristic with the features and the explained phenomenon. The output is the link between them.
• Use the link to emulate new decisions (a minimal sketch follows).
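A minimal sketch of that loop in R (the language used for the simulations later in this deck), on simulated data invented purely for illustration: learn the link f from observed (X, Y) pairs, then use it on a new input.

```r
# Simulated observations: Y depends on X plus noise, i.e. Y = f(X) + epsilon
set.seed(1)
X <- runif(200, 0, 10)
Y <- 2 * X + 3 + rnorm(200)

# Building the model: estimate the link f from the data
model <- lm(Y ~ X)

# Using the model: emulate a decision for a new, unseen input
predict(model, newdata = data.frame(X = 7.5))
```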
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
II – What's in it for us? “Those who ignore Statistics are condemned to reinvent it.”
- Brad Efron
Goal: Make predictions
Concept: Classification, Regression
Examples:
• spam detection
• biology/medicine
• fraud detection
• scoring (Google, Meetic)
• weather prediction
• stock prediction
Goal: Understand systems
Concept: Clustering
Examples:
• speech recognition
• e-marketing
• sentiment mining
• recommendation systems: "you would also like" on Amazon
• rare event detection (Obama’s campaign)
Goal: Optimize functions
Concept: Discrete optimization, Reinforcement Learning
Examples:
• automatic investment on financial markets
• game playing
• yield management
• Pavlov’s dog
• those funny things: https://www.youtube.com/watch?v=Lt-KLtkDlh8
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
Introduction to Data Analytics Session 2 Florent Renucci, Data Scientist, Amadeus North America
III – From Data to Knowledge : Typical workflow “You can have data without information, but you cannot have information without data.”
– Daniel Keys Moran
Map (filter, clean, project) → Reduce (count) → Machine Learning (learn) → Business insight
III – How to succeed ?
In a nutshell
• Keep only the variables that are most “linked” with the pattern, based on knowledge about the observed phenomenon. Usually done in a MapReduce framework.
• Clean the data by deleting non-meaningful observations.
• Preprocess the features to “make them talk”.
• Feed a metaheuristic with this data. The output is the model between them.
• Use the link to emulate new decisions (a toy version of these steps follows).
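A toy end-to-end version of those steps in R; the data set, column names and thresholds below are invented purely for illustration.

```r
# Toy data set: two useful features, one irrelevant column, one target
set.seed(2)
df <- data.frame(price     = rnorm(500, 100, 20),
                 lead_time = rpois(500, 30),
                 noise     = rnorm(500),
                 sold      = rbinom(500, 1, 0.4))

# 1. Keep only the variables believed to be linked with the pattern
df <- df[, c("price", "lead_time", "sold")]

# 2. Clean: delete non-meaningful observations (e.g. impossible prices)
df <- df[df$price > 0, ]

# 3. Preprocess the features to "make them talk" (here, standardize)
df$price <- as.numeric(scale(df$price))

# 4. Feed an algorithm with the data: the output is the model
model <- glm(sold ~ price + lead_time, data = df, family = binomial)

# 5. Use the model to emulate new decisions
predict(model, newdata = data.frame(price = 0.5, lead_time = 25),
        type = "response")
```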
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
IV – What is a good predictive model ?
“Essentially, all models are wrong, but some are useful.”
- G. Box
[Figure: predicted variable Y plotted against feature X, with three candidate fits: undersmoothing, a good level of complexity, and overfitting.]
Error rate : how to measure it ?
IV – Choose the right metric ! “What gets measured, gets managed.”
- Peter Drucker
• RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
• adjusted RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k}} (k = number of parameters)
• AIC = 2k - 2 \ln(L)
• BIC = k \ln(n) - 2 \ln(L)
…
No Silver Bullet
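A quick illustration of why the choice matters, in R on simulated data (both models and the data are assumptions for illustration): the raw training error always favors the bigger model, while AIC and BIC charge a price for the extra parameters.

```r
# Two competing models for the same data
set.seed(3)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
simple  <- lm(y ~ x)
complex <- lm(y ~ poly(x, 15))   # many more parameters

rmse <- function(fit) sqrt(mean(residuals(fit)^2))

c(rmse(simple), rmse(complex))   # training RMSE: always lower for 'complex'
c(AIC(simple),  AIC(complex))    # AIC = 2k - 2 ln(L)
c(BIC(simple),  BIC(complex))    # BIC = k ln(n) - 2 ln(L)
```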
IV – What is a good predictive model ?
“If you do not know how to ask the right question, you discover nothing.”
- Deming
[Figure: error plotted against model complexity and against time, on the learning set and on the test set.]
Accuracy, Robustness, …
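A minimal train/test illustration of that trade-off in R (simulated data, with polynomial degree as the complexity knob; all names are assumptions): training error keeps falling as complexity grows, while test error eventually rises.

```r
set.seed(4)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
train <- data.frame(x = x[1:100],   y = y[1:100])    # learning set
test  <- data.frame(x = x[101:200], y = y[101:200])  # test set

errors <- sapply(c(1, 3, 10, 20), function(degree) {
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = train)
  c(train_rmse = sqrt(mean(residuals(fit)^2)),
    test_rmse  = sqrt(mean((predict(fit, newdata = test) - test$y)^2)))
})
colnames(errors) <- paste0("degree_", c(1, 3, 10, 20))
round(errors, 3)
```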
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
Introduction to Data Analytics Session 3 Florent Renucci, Data Scientist, Amadeus North America
II – What's in it for us? “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.”
- Geoffrey Moore
Goal: Make predictions
Concept: Classification, Regression
Algos:
• Random Forest
• Naive Bayes classifier
• Support Vector Machine
• Artificial Neural Network
• Bagging
• Time Series
Goal: Understand systems
Concept: Clustering
Algos:
• K-means
• Fuzzy clustering
• Artificial Neural Network
• DBSCAN
• Expectation-Maximisation
Goal: Optimize functions
Concept: Discrete optimization, Reinforcement Learning
Algos:
• Gradient Descent
• Max-flow min-cut
• Belief propagation
• Temporal difference learning
• Q-learning
No Silver Bullet. No Free Lunch Theorem.
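As one concrete instance from the clustering column, a minimal K-means run in R on the built-in iris measurements (the data set is an assumption, chosen only for illustration):

```r
# Cluster the four iris measurements into 3 groups with K-means
features <- scale(iris[, 1:4])   # standardize before clustering
set.seed(5)
km <- kmeans(features, centers = 3, nstart = 20)

# How well do the discovered clusters line up with the true species?
table(cluster = km$cluster, species = iris$Species)
```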
V – Example 1 (see R simulation) Classification and Regression Tree
[Figure: a cloud of points with coordinates x, y, z and different colors; a tree answers "Color ?" by splitting the space (up/down, then left/right) into regions labelled Green, Purple and Orange.]
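A minimal CART sketch in the same spirit as the R simulation, assuming the rpart package and the built-in iris data (neither comes from the original deck):

```r
library(rpart)

# Grow a classification tree: each node asks a question on one feature
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)   # the fitted splits, i.e. the tree's questions

# Answer the "Color ?"-style question for a new observation
predict(tree,
        newdata = data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                             Petal.Length = 4.5, Petal.Width = 1.5),
        type = "class")
```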
V – Example 1 (see R simulation) Random Forest
• Learning step: building n trees from n random subsets.
• Testing step: can I really trust this tree?
• Boosting step: smartly combining the trees, with weights \alpha_1, \alpha_2, \ldots, \alpha_n, to avoid overfitting.
• Prediction step: what is the color/value of the output? \hat{Y} = \sum_{i=1}^{n} \alpha_i y_i, a weighted combination of the n trees' outputs y_i.
• Evaluation step: how good are my predictions? error rate = ?
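A minimal version of those steps with the randomForest package (an assumption; note that the package combines the trees by majority vote or averaging rather than the weighted sum written above):

```r
library(randomForest)

# Learning step: build many trees, each from a random subset of rows/features
set.seed(6)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Evaluation step: out-of-bag error rate, one answer to "error rate = ?"
rf$err.rate[500, "OOB"]

# Prediction step: combine the trees' votes for a new observation
predict(rf, newdata = iris[1, -5])
```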
V – Example 2 (see R simulation) Univariate Time-Series AutoRegression
The idea
• The process is entirely explained by its past values.
• The goal is to find the link between k successive values and the next one. “k” (the lag order) and the link have to be estimated.
• Each process (travel) is considered independently.
Definition
• Y_t = f(Y_{t-k}, \ldots, Y_{t-1})
• “k”: how many points are related to each other?
• “f”: what is the underlying pattern? What happened from (t-k) to (t-1)?
[Figure: the predicted variable Y_t plotted over time t.]
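A minimal univariate autoregression in R with the built-in ar() function, on the built-in LakeHuron series (chosen only for illustration); ar() estimates both the lag order k and the coefficients:

```r
# Fit an AR model: the order k is chosen by AIC, the link f is linear
fit <- ar(LakeHuron, order.max = 10)
fit$order   # the estimated lag order "k"
fit$ar      # the estimated coefficients alpha_1 ... alpha_k

# Use the link to emulate the future: forecast 5 steps ahead
predict(fit, n.ahead = 5)$pred
```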
V – Example 2 (see R simulation) Multivariate Time-Series AutoRegression
The idea
• The process is entirely explained by its past values AND by the past values of its “neighbors”.
• The goal is to find the link between k successive values and the next ones. “k” (the lag order), the p neighbors, and the link have to be estimated.
• Processes (travels) are considered as correlated.
Definition
• Y_t = f(Y_{t-k}, \ldots, Y_{t-1}, Z_{t-k'}, \ldots, Z_{t-1}, \ldots)
• Again “k” and “f”.
• “p”, “k'”: which neighbors should I consider? How “late” are they?
[Figure: the predicted variable Y_t and Z_t, a "good" neighbor, plotted over time t.]
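A minimal multivariate sketch with the same ar() function, on the built-in BJsales series and its leading indicator as the "good neighbor" (an illustrative assumption); dedicated packages such as vars offer richer VAR tooling.

```r
# Two related series: sales (Y) and a leading indicator (Z)
yz <- ts.union(Y = BJsales, Z = BJsales.lead)

# Each series is regressed on the past values of BOTH series;
# the lag order is again picked by AIC
fit <- ar(yz, order.max = 10)
fit$order   # estimated lag order
fit$ar      # one coefficient matrix per lag

# Joint forecast of (Y, Z) a few steps ahead
predict(fit, n.ahead = 3)$pred
```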
• I – What is it ?
• II – What’s in it for us ?
• III – How to succeed ?
• IV – How to fail ?
• V – Does it really work ?
• VI – Does it really create value ?
Introduction to Data Analytics
IV – Data Analytics “Data really powers everything that we do.”
– Jeff Weiner
For each goal, concept, and algorithm in the table above, three questions remain: Business value ? Mathematical concept ? Implementation ?
IV – Data Analytics “Listening to the data is important… but so is experience and intuition. After all, what is intuition at its best but large amounts of data of all kinds filtered through a human brain rather than a math model ?”
- Steve Lohr
IV – Data Analytics “Information is the oil of the 21st century, and Data Analytics is the combustion engine.”
- P. Sondergaard
[Figure: Business, IT, Data Sciences and High tech, linked by Business Insights, Data processing and Breakthrough Innovation.]
Requirements
• Business expertise.
• Technical expertise.
• Mathematical expertise and links with academic research.
• Big Data, meaningful data, and relevant questions.
• Investments in R&D.
• Consistency with core activities.
• Maturity of the industry.
• Infrastructure.
• Culture of innovation.
Thanks. Any questions ?
“It’s easy to lie with statistics. It’s hard to tell the truth without it.”
- Andrejs Dunkels
Annex A – Univariate Time-Series AutoRegression

The idea
• The process is entirely explained by its past values.
• The goal is to find the link between k successive values and the next one. “k” (the lag order) and the link have to be estimated.
• Each process (travel) is considered independently.

Definition
\exists k \in \mathbb{N}, \exists (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^k such that:
Y_t = \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \ldots + \alpha_k Y_{t-k} + \varepsilon_t
i.e. Y_t = \sum_{i=1}^{k} \alpha_i Y_{t-i} + \varepsilon_t
So \hat{Y}_t = \sum_{i=1}^{k} \alpha_i Y_{t-i} = (Y_{t-1}, \ldots, Y_{t-k}) \, (\alpha_1, \ldots, \alpha_k)^T \stackrel{\text{def}}{=} X A

Estimating k
Penalization for non-parsimonious models:
• (k, A_{optim}) = \operatorname{argmin}_{k, \hat{Y}_t} \, \| Y_t - \hat{Y}_t \|_2^2 + 2k (Mallows, 1973)
• Akaike criterion (same with a log penalty) (Akaike, 1973)
• Student test and forward-backward algorithm

Estimating A (OLS method)
\hat{Y}_t \stackrel{\text{def}}{=} X A, so Y_t - \hat{Y}_t = \varepsilon_t
A_{optim} = \operatorname{argmin}_A \| X A - Y_t \|_2^2
First-order condition: 2 X^T X A_{optim} - 2 X^T Y = 0
So A_{optim} = (X^T X)^{-1} X^T Y
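The closed-form solution can be checked numerically; a small R sketch on a simulated AR(2) process (coefficients and sample size are assumptions chosen for illustration):

```r
# Simulate an AR(2) process: Y_t = 0.5 Y_{t-1} + 0.3 Y_{t-2} + eps_t
set.seed(7)
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 500))

k <- 2
n <- length(y)
# Design matrix X: one row per t, holding (Y_{t-1}, ..., Y_{t-k})
X <- cbind(y[k:(n - 1)], y[(k - 1):(n - 2)])
Y <- y[(k + 1):n]

# A_optim = (X^T X)^{-1} X^T Y
A_optim <- solve(t(X) %*% X, t(X) %*% Y)
A_optim   # close to (0.5, 0.3)
```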
Annex B – Multivariate Time-Series AutoRegression

The idea
• The process is entirely explained by its past values AND by the past values of its “neighbors”.
• The goal is to find the link between k successive values and the next ones. “k” (the lag order), the p neighbors, and the link have to be estimated.
• Processes (travels) are considered as correlated.

Definition
\exists k \in \mathbb{N}, \exists (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^{k \times p} (vectors!) such that:
Y_t = \alpha_1 \cdot Y_{t-1} + \alpha_2 \cdot Y_{t-2} + \ldots + \alpha_k \cdot Y_{t-k} + \varepsilon_t
i.e. Y_t = \sum_{i=1}^{k} \alpha_i \cdot Y_{t-i} + \varepsilon_t
So \hat{Y}_t \stackrel{\text{def}}{=} X A (matrices!)

Estimating k_max
Penalization for non-parsimonious models:
• (k, A_{optim}) = \operatorname{argmin}_{k, \hat{Y}_t} \, \| Y_t - \hat{Y}_t \|_2^2 + 2k (Mallows, 1973)
• Akaike criterion (same with a log penalty) (Akaike, 1973)
• Student test and forward-backward algorithm

Estimating k for each neighbor
cross-correlation(Y_{t_0}, Y_{t_0+i})(\tau) = cov(Y_{t_0, t}, \, Y_{t_0+i, t-\tau})
p = \operatorname{argmax}_{i < k_{max}} \big( cross-correlation(Y_{t_0}, Y_{t_0+i})(0) > \text{threshold} \big)
k_i = \operatorname{argmax}_{\tau \in \mathbb{N}} \big( cross-correlation(Y_{t_0}, Y_{t_0+i})(\tau) > \text{threshold} \big)
Threshold: chosen empirically from a given grid.
Annex C – Gradient descent algorithm in a 3-dimensional space
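A minimal gradient-descent sketch in R on a simple two-variable function (the function, the step size, and the starting point are assumptions chosen for illustration):

```r
# Minimize f(x, y) = (x - 1)^2 + 2 * (y + 2)^2 by gradient descent
f     <- function(p) (p[1] - 1)^2 + 2 * (p[2] + 2)^2
gradf <- function(p) c(2 * (p[1] - 1), 4 * (p[2] + 2))

p    <- c(5, 5)   # starting point
step <- 0.1       # fixed learning rate

for (i in 1:100) {
  p <- p - step * gradf(p)   # move against the gradient
}
p      # converges towards the minimum (1, -2)
f(p)
```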