Introduction to Data Science: Supervised Learning
Benoit Gauzere
INSA Rouen Normandie - Laboratoire LITIS
March 4, 2021
Supervised Learning
Purpose
Given a dataset $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, N$, learn the dependencies between $\mathcal{X}$ and $\mathcal{Y}$.
- Example: learn the link between cardiac risk and food habits. $x_i$ describes one person by $d$ features concerning their food habits; $y_i$ is a binary category (risky, not risky).
- The labels $y_i$ are essential for the learning process.
- Methods: K-Nearest Neighbors, SVM, Decision Trees, . . .
How to Encode Data
$X \in \mathbb{R}^{n \times p}$

Samples
- $n$ samples (number of rows)
- $X(i, :) = x_i^\top$: the $i$-th sample
- $x_i \in \mathbb{R}^p$

Features
- $p$ features (number of columns)
- Each sample is described by $p$ features
- $X(:, j) = x_{\bullet j}$: the $j$-th feature for all samples

$X(i, j)$: the $j$-th feature of the $i$-th sample.
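As a quick illustration of this indexing convention, here is a minimal NumPy sketch (the matrix values are arbitrary):

```python
import numpy as np

# A toy data matrix: n = 4 samples described by p = 3 features.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

n, p = X.shape        # number of samples, number of features
x_i = X[1, :]         # the 2nd sample (a row), a vector of R^p
x_dot_j = X[:, 2]     # the 3rd feature (a column) for all samples
x_ij = X[1, 2]        # feature j of sample i
```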
[Figure: the data matrix, with rows observation 1 . . . n and columns variable 1 . . . p]
Learning Model
Model
$$f : \mathcal{X} \to \mathcal{Y}, \quad x_i \mapsto y$$
We want $f(x_i) \simeq y_i$.
Example
What is the underlying $f$ function?
Example
[Figure: four panels showing candidate functions fitted to the same example data]
How to find a good f?

$$f^\star = \arg\min_f L(f(X), y) + \lambda\, \Omega(f)$$

- $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is the loss (fit-to-data) term
- $\lambda \in \mathbb{R}^+$
- $\Omega : (\mathcal{X} \to \mathcal{Y}) \to \mathbb{R}^+$ is the regularization term
Fit to data term
$$L(f(X), y)$$

- Guarantees that the model fits the data
- Penalizes predictions $f(x_i)$ that are far from $y_i$
Regularization term
$$\Omega(f)$$

- Constrains the complexity of the function $f$
- Occam's razor: simpler is better
- $\lambda$ weights the balance between the two terms
Illustrations I
[Figure: model fitted to the example data]

- Very high $L(f(X), y)$
- Very low $\Omega(f)$
Illustrations II
[Figure: model fitted to the example data]

- High $L(f(X), y)$
- Low $\Omega(f)$
Illustrations III
[Figure: model fitted to the example data]

- $L(f(X), y) = 0$
- High $\Omega(f)$
Illustrations IV
[Figure: model fitted to the example data]

- $L(f(X), y) = 0$
- Very high $\Omega(f)$
Generalization
A good model generalizes well
- Good generalization: good prediction on unseen data
- Hard to evaluate without bias
- Overfitting: the model is too specialized to the training data
- The regularization term helps prevent overfitting
Supervised Learning Tasks I
Binary Classification
- $\mathcal{Y} = \{0, 1\}$
- Dog or cat? Positive or negative?
- Performance: accuracy, recall, precision, . . .

[Figure: dog and cat images]
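As a rough illustration, the metrics listed above can be computed with sklearn.metrics; the label vectors below are made up for the example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = dog, 0 = cat)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # among predicted positives, how many are true
print(recall_score(y_true, y_pred))     # among true positives, how many are found
```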
Supervised Learning Tasks II

Regression
- $\mathcal{Y} = \mathbb{R}$
- Stock market, house prices, boiling points of molecules, . . .
- Performance: RMSE, MSE, MAE, . . .
[Figure: a molecular structure and its boiling point, 120 °C]
Supervised Learning Tasks III
And many others
- Ranking
- Multi-class classification
- Multi-label classification
- . . .
Ridge Regression
Ridge regression
- Also known as Tikhonov regularization or regularized least squares
- A model for regression tasks
- Linear model
- Simple to implement
- Simple to understand
Problem definition
Formal definition
$$w^\star = \arg\min_w J(w) = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

- $f(x) = w^\top x$
- $w \in \mathbb{R}^d$ are the parameters of the model
- Fit to data: squared norm of the differences
- Regularization term: squared L2 norm of $w$
Problem resolution
Convexity
- $J(w)$ is convex
- $\Rightarrow$ a unique, global minimum
- $\arg\min_w J(w) \Leftrightarrow \{ w \mid \nabla_w J(w) = 0 \}$

[Figure: a convex objective function]
Closed Form I
Rewriting J(w)
$$J(w) = \|Xw - y\|_2^2 + \lambda\|w\|_2^2 = (Xw - y)^\top (Xw - y) + \lambda w^\top w$$
$$= w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y + \lambda w^\top w$$

Differential of J(w)
$$\nabla_w J(w) = \nabla_w \left( w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y + \lambda w^\top w \right) = 2 X^\top X w - 2 X^\top y + 2 \lambda w$$
Closed Form II
Setting the gradient to zero:
$$\nabla_w J(w) = 0 \;\Leftrightarrow\; X^\top X w - X^\top y + \lambda w = 0 \;\Leftrightarrow\; (X^\top X + \lambda I)\, w = X^\top y \;\Leftrightarrow\; w = (X^\top X + \lambda I)^{-1} X^\top y$$

- Requires computing a matrix inverse (or solving a linear system)
- $(X^\top X)(i, j) = x_{\bullet i}^\top x_{\bullet j}$
- $(X^\top X + \lambda I)$ is positive definite, hence invertible
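A minimal NumPy sketch of this closed form (the helper name is ours; np.linalg.solve is used instead of forming the inverse explicitly, which is usually more stable):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weights w."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```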
Prediction
The model
- To predict on new (unseen) data $x'$:
$$y = f(x') = w^\top x'$$

Adding a bias
- If we want $f(x) = w^\top x + b$, with $b \in \mathbb{R}$:
- Augment $X$ with a column of ones
- $X_{\text{bias}} = [X \; \mathbf{1}]$
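A short sketch of the bias trick, assuming arrays X, y, X_new and the ridge_closed_form helper from the previous sketch:

```python
import numpy as np

# Augment X with a column of ones so that w = [weights..., b].
n = X.shape[0]
X_bias = np.hstack([X, np.ones((n, 1))])
w = ridge_closed_form(X_bias, y, lam=0.1)   # last coefficient plays the role of b

# Predict on new (unseen) data: augment it the same way.
X_new_bias = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
y_pred = X_new_bias @ w
```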
Principle
Generalities
- Iterative approach to find $w$
- Optimal on a convex objective function; a local optimum on others
- Avoids computing a matrix inverse
Gradient Descent
We want:
- to find a direction in which to update $w$
- this direction to lead to a lower $J(w)$

Which direction should we use?

[Figure: a convex objective function]
Gradient descent
Gradient
- Computed from the derivative of $J(w)$
- Indicates the direction in which $J(w)$ increases locally

Direction
- We want to minimize $J(w)$
- The opposite of the gradient is a good direction
- A better $w$: $w' = w - s\, \nabla_w J(w)$
- $s$ is the step size in that direction (also called the learning rate)
Basic Gradient Descent Algorithm (with a fixed step size)

    function GRADIENT_DESCENT(w(0), X, y, λ)
        s ← 1
        w ← w(0)
        while not converged do
            d ← grad(w, X, y, λ)
            w ← w − s · d
        end while
        return w
    end function
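A possible Python translation of this pseudocode for the ridge objective; the fixed step size s = 0.01 and the gradient-norm stopping test are illustrative choices (convergence criteria are discussed on the next slide):

```python
import numpy as np

def grad(w, X, y, lam):
    """Gradient of J(w) = ||Xw - y||^2 + lam * ||w||^2."""
    return 2 * X.T @ (X @ w - y) + 2 * lam * w

def gradient_descent(w0, X, y, lam, s=0.01, eps=1e-6, ite_max=10_000):
    w = w0.copy()
    for _ in range(ite_max):
        d = grad(w, X, y, lam)
        if np.linalg.norm(d) < eps:   # convergence: vanishing gradient
            break
        w = w - s * d
    return w
```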
When to stop?

Convergence
- Updates of $w$ need to have an end
- We stop when:
  - the loss $J(w)$ is no longer decreasing
  - $\nabla_w J(w)$ is vanishing
  - the maximum number of iterations is reached
- In practice: introduce an $\varepsilon$ threshold and an ite_max parameter

Mathematically

Convergence is reached when:
- $J(w^{(k-1)}) - J(w^{(k)}) < \varepsilon$
- $\|\nabla_w J(w)\| < \varepsilon$
- $k >$ ite_max
Importance of step size
Step s
- The size of the jump in the descent direction is quantified by $s \in \mathbb{R}^+$
- $s$ too small: optimization is slow (small steps)
- $s$ too big: we overshoot the minimum

[Figure: gradient descent steps on a convex objective]
Line Search
What is a good s?
- With $s = 1$, there is no control on the jumps
- We want the biggest $s$ such that $J(w') < J(w)$, with $w' = w - s \cdot d$

Line search
Search for a good $s$ along the direction $d$:

    s ← 1
    while J(w − s · d) > J(w) do
        s ← s · 0.1
    end while

A logarithmic search is faster than a linear one.
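A hedged Python sketch of this backtracking idea for the ridge objective J(w); it assumes d is a descent direction (e.g. the gradient), so that the loop terminates:

```python
import numpy as np

def J(w, X, y, lam):
    """Ridge objective: ||Xw - y||^2 + lam * ||w||^2."""
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def line_search(w, d, X, y, lam):
    """Shrink s by a factor of 10 until the step actually decreases J."""
    s = 1.0
    while J(w - s * d, X, y, lam) > J(w, X, y, lam):
        s *= 0.1
    return s
```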
Extensions
- Newton: take better steps ⇒ faster convergence
- Stochastic: estimate the gradient from a subset of the data ⇒ faster computation of $d$
- Other regularizations:
  - Lasso: L1 norm instead of L2:
  $$w^\star = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$$
  ⇒ sparsity of $w$
  - Elastic-net: somewhere between the L1 and L2 norms:
  $$w^\star = \arg\min_w \|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$
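For reference, all three penalties are available in scikit-learn; note that the scaling of the data-fit term in sklearn's Lasso and ElasticNet objectives differs from the formulas above, so alpha does not map one-to-one onto λ:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                    # L2 penalty
lasso = Lasso(alpha=0.1)                    # L1 penalty -> sparse w
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2 penalties

# All three expose the same fit/predict API:
# ridge.fit(X, y); ridge.predict(X_new)
```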
Introduction
How to learn a “good” model?

- We want good performance
- As simple as possible
- Able to predict unseen data
Empirical Risk
Error on the learning set
- Empirical risk:
$$R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$
- $L$ evaluates the performance of the prediction $f(x_i)$
- The error is computed on the training set
- The model can be too specialized on this particular dataset
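A small sketch of the empirical risk with a squared-error loss, assuming f is a callable that takes one sample x_i and returns a prediction:

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda y_hat, y_true: (y_hat - y_true) ** 2):
    """Average loss L(f(x_i), y_i) over the training set."""
    predictions = np.array([f(x_i) for x_i in X])
    return np.mean(loss(predictions, np.asarray(y)))
```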
Generalization

Tentative definition
- The ability of the model to predict unseen data well
- Hard to evaluate
- The real objective of a model

Regularization
- The regularization term controls the model complexity
- It balances between empirical risk and generalization ability
- The balance ($\lambda$) needs to be tuned
How to evaluate the ability to generalize?

Evaluate on unseen data
- Define and isolate a test set
- Evaluate on the test set

Bias
- Avoid using the same data in train and test
- The test set must be totally isolated
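In practice this isolation is often done with train_test_split; a brief sketch (the 20% test fraction is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# Keep 20% of the data completely aside; it is only touched for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```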
Overfitting vs Underfitting
- Overfitting: low $R_{emp}$, high generalization error
- Underfitting: high $R_{emp}$, medium generalization error

[Figure: prediction error vs. model complexity (low to high), for the learning set and the test set]
Hyperparameters
Parameters outside the model
- Some parameters are not learned by the model
- They are “hyperparameters” and must be tuned
- They are tuned on data outside the test set
- Example: $\lambda$ in ridge regression
How to tune the hyperparameters?

Validation set
- Split the train set into a validation set and a learning set
- Learn the model parameters on the learning set
- Evaluate the performance on the validation set
- The validation set simulates the test set, i.e. unseen data
General framework
[Diagram: the dataset is split into a learning set, a validation set and a test set. The model is learned on the learning set (learning performance), its hyperparameters are tuned by cross-validation on the validation set (validation performance), and the learned model with optimal hyperparameters is evaluated on the test set (final performance).]
Validation strategies
How to split the training and validation sets?
- We need a strategy to split between the training and validation sets
- Training is used to tune the parameters of the model
- Validation is used to evaluate the model according to its hyperparameters
Train/Validation/Test

Single split
- A single model to learn
- May be subject to split bias
- Only one evaluation of the performance

[Figure: the available data split into a learning set, a validation set and a test set]
Leave one out

N splits
- N models to learn
- The validation error is evaluated on a single data point

[Figure: the available data split into splits 1 . . . n, each leaving one data point out]
KFold Cross validation

K splits
- K models to learn
- The validation error is evaluated on N/K data points
- Some splits may be biased

[Figure: the available data split into K = 3 folds]
Shuffle Split Cross validation

K splits
- The learning and validation sets are randomly split
- K models to learn
- Avoids bias
- Some data may never be evaluated

[Figure: the available data randomly split K times]
With scikit-learn
- sklearn.model_selection.train_test_split
- sklearn.model_selection.KFold
- sklearn.model_selection.ShuffleSplit
- sklearn.model_selection.GridSearchCV
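A hedged example combining these tools to tune λ (called alpha in scikit-learn) for ridge regression, assuming arrays X and y:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Isolate the test set first; it is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)        # cross-validation on the training data only
print(search.best_params_)          # selected hyperparameter
print(search.score(X_test, y_test)) # final performance on the isolated test set
```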
Recommendations

Size of splits
- How many splits?
- How many elements per split?
- Depends on the amount of data
- Tradeoff between learning and generalization

Stratified splits
- Splitting may induce imbalanced datasets
- Take care that the distribution of y is the same for all sets
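For classification, scikit-learn also provides stratified variants that preserve the distribution of y in each split; a brief sketch assuming arrays X and y:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified hold-out split: y keeps the same class proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratified K-fold for cross-validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    pass  # fit on X[train_idx], evaluate on X[valid_idx]
```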
Conclusion
- A good protocol avoids bias
- The test set is never used during the tuning of (hyper)parameters
- A perfect protocol doesn't exist