Introduction to Data Science: Supervised Learning
Benoit Gauzere
INSA Rouen Normandie - Laboratoire LITIS
March 4, 2021
Supervised Learning
Purpose
Given a dataset $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, N$, learn the dependencies between $\mathcal{X}$ and $\mathcal{Y}$.
- Example: learn the link between cardiac risk and food habits. $x_i$ describes one person by $d$ features concerning their food habits; $y_i$ is a binary category (risky, not risky).
- The labels $y_i$ are essential for the learning process.
- Methods: K-Nearest Neighbors, SVM, Decision Trees, . . .
How to Encode Data
$X \in \mathbb{R}^{n \times p}$

Samples
- $n$ samples (number of rows)
- $X(i, :) = x_i^\top$: the $i$-th sample
- $x_i \in \mathbb{R}^p$

Features
- $p$ features (number of columns)
- Each sample is described by $p$ features
- $X(:, j) = x_{\bullet j}$: the $j$-th feature for all samples

$X(i, j)$: the $j$-th feature of the $i$-th sample.
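As a quick illustration of this indexing convention, here is a minimal NumPy sketch (the matrix values are arbitrary):

```python
import numpy as np

# A toy data matrix: n = 4 samples described by p = 3 features.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

n, p = X.shape        # number of samples, number of features
x_i = X[1, :]         # the 2nd sample (a row), a vector of R^p
x_dot_j = X[:, 2]     # the 3rd feature (a column) for all samples
x_ij = X[1, 2]        # feature j of sample i
```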
[Figure: the data matrix, with rows observation 1 . . . n and columns variable 1 . . . p]
Learning Model
Model
$$f : \mathcal{X} \to \mathcal{Y}, \quad x_i \mapsto y$$
We want $f(x_i) \simeq y_i$.
Example
What is the underlying $f$ function?
Example
[Figure: four panels showing candidate functions fitted to the same example data]
How to find a good f?

$$f^\star = \arg\min_f L(f(X), y) + \lambda\, \Omega(f)$$

- $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is the loss (fit-to-data) term
- $\lambda \in \mathbb{R}^+$
- $\Omega : (\mathcal{X} \to \mathcal{Y}) \to \mathbb{R}^+$ is the regularization term
Fit to data term
$$L(f(X), y)$$

- Guarantees that the model fits the data
- Penalizes predictions $f(x_i)$ that are far from $y_i$
Regularization term
$$\Omega(f)$$

- Constrains the complexity of the function $f$
- Occam's razor: simpler is better
- $\lambda$ weights the balance between the two terms
Illustrations I
[Figure: model fitted to the example data]

- Very high $L(f(X), y)$
- Very low $\Omega(f)$
Illustrations II
[Figure: model fitted to the example data]

- High $L(f(X), y)$
- Low $\Omega(f)$
Illustrations III
[Figure: model fitted to the example data]

- $L(f(X), y) = 0$
- High $\Omega(f)$
Illustrations IV
[Figure: model fitted to the example data]

- $L(f(X), y) = 0$
- Very high $\Omega(f)$
Generalization
A good model generalizes well
- Good generalization: good prediction on unseen data
- Hard to evaluate without bias
- Overfitting: the model is too specialized to the training data
- The regularization term helps prevent overfitting
Supervised Learning Tasks I
Binary Classification
- $\mathcal{Y} = \{0, 1\}$
- Dog or cat? Positive or negative?
- Performance: accuracy, recall, precision, . . .

[Figure: dog and cat images]
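As a rough illustration, the metrics listed above can be computed with sklearn.metrics; the label vectors below are made up for the example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = dog, 0 = cat)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
print(precision_score(y_true, y_pred))  # among predicted positives, how many are true
print(recall_score(y_true, y_pred))     # among true positives, how many are found
```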
Supervised Learning Tasks II

Regression
- $\mathcal{Y} = \mathbb{R}$
- Stock market, house prices, boiling points of molecules, . . .
- Performance: RMSE, MSE, MAE, . . .
[Figure: a molecular structure and its boiling point, 120 °C]
Supervised Learning Tasks III
And many others
- Ranking
- Multi-class classification
- Multi-label classification
- . . .
Ridge Regression
Ridge regression
- Also known as Tikhonov regularization or regularized least squares
- A model for regression tasks
- Linear model
- Simple to implement
- Simple to understand
Problem definition
Formal definition
$$w^\star = \arg\min_w J(w) = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

- $f(x) = w^\top x$
- $w \in \mathbb{R}^d$ are the parameters of the model
- Fit to data: squared norm of the differences
- Regularization term: squared L2 norm of $w$
Problem resolution
Convexity
- $J(w)$ is convex
- $\Rightarrow$ a unique, global minimum
- $\arg\min_w J(w) \Leftrightarrow \{ w \mid \nabla_w J(w) = 0 \}$

[Figure: a convex objective function]
Closed Form I
Rewriting J(w)
$$J(w) = \|Xw - y\|_2^2 + \lambda\|w\|_2^2 = (Xw - y)^\top (Xw - y) + \lambda w^\top w$$
$$= w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y + \lambda w^\top w$$

Differential of J(w)
$$\nabla_w J(w) = \nabla_w \left( w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y + \lambda w^\top w \right) = 2 X^\top X w - 2 X^\top y + 2 \lambda w$$
Closed Form II
Setting the gradient to zero:
$$\nabla_w J(w) = 0 \;\Leftrightarrow\; X^\top X w - X^\top y + \lambda w = 0 \;\Leftrightarrow\; (X^\top X + \lambda I)\, w = X^\top y \;\Leftrightarrow\; w = (X^\top X + \lambda I)^{-1} X^\top y$$

- Requires computing a matrix inverse (or solving a linear system)
- $(X^\top X)(i, j) = x_{\bullet i}^\top x_{\bullet j}$
- $(X^\top X + \lambda I)$ is positive definite, hence invertible
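A minimal NumPy sketch of this closed form (the helper name is ours; np.linalg.solve is used instead of forming the inverse explicitly, which is usually more stable):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weights w."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```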
Prediction
The model
- To predict on new (unseen) data $x'$:
$$y = f(x') = w^\top x'$$

Adding a bias
- If we want $f(x) = w^\top x + b$, with $b \in \mathbb{R}$:
- Augment $X$ with a column of ones
- $X_{\text{bias}} = [X \; \mathbf{1}]$
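A short sketch of the bias trick, assuming arrays X, y, X_new and the ridge_closed_form helper from the previous sketch:

```python
import numpy as np

# Augment X with a column of ones so that w = [weights..., b].
n = X.shape[0]
X_bias = np.hstack([X, np.ones((n, 1))])
w = ridge_closed_form(X_bias, y, lam=0.1)   # last coefficient plays the role of b

# Predict on new (unseen) data: augment it the same way.
X_new_bias = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
y_pred = X_new_bias @ w
```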
Principle
Generalities
- Iterative approach to find $w$
- Optimal on a convex objective function; a local optimum on others
- Avoids computing a matrix inverse
Gradient Descent
We want:
- to find a direction in which to update $w$
- this direction to lead to a lower $J(w)$

Which direction should we use?

[Figure: a convex objective function]
Gradient descent
Gradient
- Computed from the derivative of $J(w)$
- Indicates the direction in which $J(w)$ increases locally

Direction
- We want to minimize $J(w)$
- The opposite of the gradient is a good direction
- A better $w$: $w' = w - s\, \nabla_w J(w)$
- $s$ is the step size in that direction (also called the learning rate)
Basic Gradient Descent Algorithm (with a fixed step size)

    function GRADIENT_DESCENT(w(0), X, y, λ)
        s ← 1
        w ← w(0)
        while not converged do
            d ← grad(w, X, y, λ)
            w ← w − s · d
        end while
        return w
    end function
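A possible Python translation of this pseudocode for the ridge objective; the fixed step size s = 0.01 and the gradient-norm stopping test are illustrative choices (convergence criteria are discussed on the next slide):

```python
import numpy as np

def grad(w, X, y, lam):
    """Gradient of J(w) = ||Xw - y||^2 + lam * ||w||^2."""
    return 2 * X.T @ (X @ w - y) + 2 * lam * w

def gradient_descent(w0, X, y, lam, s=0.01, eps=1e-6, ite_max=10_000):
    w = w0.copy()
    for _ in range(ite_max):
        d = grad(w, X, y, lam)
        if np.linalg.norm(d) < eps:   # convergence: vanishing gradient
            break
        w = w - s * d
    return w
```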
When to stop?

Convergence
- Updates of $w$ need to have an end
- We stop when:
  - the loss $J(w)$ is no longer decreasing
  - $\nabla_w J(w)$ is vanishing
  - the maximum number of iterations is reached
- In practice: introduce an $\varepsilon$ threshold and an ite_max parameter

Mathematically

Convergence is reached when:
- $J(w^{(k-1)}) - J(w^{(k)}) < \varepsilon$
- $\|\nabla_w J(w)\| < \varepsilon$
- $k >$ ite_max
Importance of step size
Step s
- The size of the jump in the descent direction is quantified by $s \in \mathbb{R}^+$
- $s$ too small: optimization is slow (small steps)
- $s$ too big: we overshoot the minimum

[Figure: gradient descent steps on a convex objective]
Line Search
What is a good s?
- With $s = 1$, there is no control on the jumps
- We want the biggest $s$ such that $J(w') < J(w)$, with $w' = w - s \cdot d$

Line search
Search for a good $s$ along the direction $d$:

    s ← 1
    while J(w − s · d) > J(w) do
        s ← s · 0.1
    end while

A logarithmic search is faster than a linear one.
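A hedged Python sketch of this backtracking idea for the ridge objective J(w); it assumes d is a descent direction (e.g. the gradient), so that the loop terminates:

```python
import numpy as np

def J(w, X, y, lam):
    """Ridge objective: ||Xw - y||^2 + lam * ||w||^2."""
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def line_search(w, d, X, y, lam):
    """Shrink s by a factor of 10 until the step actually decreases J."""
    s = 1.0
    while J(w - s * d, X, y, lam) > J(w, X, y, lam):
        s *= 0.1
    return s
```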
Extensions
- Newton: take better steps ⇒ faster convergence
- Stochastic: estimate the gradient from a subset of the data ⇒ faster computation of $d$
- Other regularizations:
  - Lasso: L1 norm instead of L2:
  $$w^\star = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$$
  ⇒ sparsity of $w$
  - Elastic-net: somewhere between the L1 and L2 norms:
  $$w^\star = \arg\min_w \|Xw - y\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$
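For reference, all three penalties are available in scikit-learn; note that the scaling of the data-fit term in sklearn's Lasso and ElasticNet objectives differs from the formulas above, so alpha does not map one-to-one onto λ:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                    # L2 penalty
lasso = Lasso(alpha=0.1)                    # L1 penalty -> sparse w
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2 penalties

# All three expose the same fit/predict API:
# ridge.fit(X, y); ridge.predict(X_new)
```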
Introduction
How to learn a “good” model?

- We want good performance
- As simple as possible
- Able to predict unseen data
Empirical Risk
Error on the learning set
- Empirical risk:
$$R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$
- $L$ evaluates the performance of the prediction $f(x_i)$
- The error is computed on the training set
- The model can be too specialized on this particular dataset
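A small sketch of the empirical risk with a squared-error loss, assuming f is a callable that takes one sample x_i and returns a prediction:

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda y_hat, y_true: (y_hat - y_true) ** 2):
    """Average loss L(f(x_i), y_i) over the training set."""
    predictions = np.array([f(x_i) for x_i in X])
    return np.mean(loss(predictions, np.asarray(y)))
```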
Generalization

Tentative definition
- The ability of the model to predict unseen data well
- Hard to evaluate
- The real objective of a model

Regularization
- The regularization term controls the model complexity
- It balances between empirical risk and generalization ability
- The balance ($\lambda$) needs to be tuned
How to evaluate the ability to generalize?

Evaluate on unseen data
- Define and isolate a test set
- Evaluate on the test set

Bias
- Avoid using the same data in train and test
- The test set must be totally isolated
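In practice this isolation is often done with train_test_split; a brief sketch (the 20% test fraction is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# Keep 20% of the data completely aside; it is only touched for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```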
Overfitting vs Underfitting
- Overfitting: low $R_{emp}$, high generalization error
- Underfitting: high $R_{emp}$, medium generalization error

[Figure: prediction error vs. model complexity (low to high), for the learning set and the test set]
Hyperparameters
Parameters outside the model
- Some parameters are not learned by the model
- They are “hyperparameters” and must be tuned
- They are tuned on data outside the test set
- Example: $\lambda$ in ridge regression
How to tune the hyperparameters?

Validation set
- Split the train set into a validation set and a learning set
- Learn the model parameters on the learning set
- Evaluate the performance on the validation set
- The validation set simulates the test set, i.e. unseen data
General framework
[Diagram: the dataset is split into a learning set, a validation set and a test set. The model is learned on the learning set (learning performance), its hyperparameters are tuned by cross-validation on the validation set (validation performance), and the learned model with optimal hyperparameters is evaluated on the test set (final performance).]
Validation strategies
How to split the training and validation sets?
- We need a strategy to split between the training and validation sets
- Training is used to tune the parameters of the model
- Validation is used to evaluate the model according to its hyperparameters
Train/Validation/Test

Single split
- A single model to learn
- May be subject to split bias
- Only one evaluation of the performance

[Figure: the available data split into a learning set, a validation set and a test set]
Leave one out

N splits
- N models to learn
- The validation error is evaluated on a single data point

[Figure: the available data split into splits 1 . . . n, each leaving one data point out]
KFold Cross validation

K splits
- K models to learn
- The validation error is evaluated on N/K data points
- Some splits may be biased

[Figure: the available data split into K = 3 folds]
Shuffle Split Cross validation

K splits
- The learning and validation sets are randomly split
- K models to learn
- Avoids bias
- Some data may never be evaluated

[Figure: the available data randomly split K times]
With scikit-learn
- sklearn.model_selection.train_test_split
- sklearn.model_selection.KFold
- sklearn.model_selection.ShuffleSplit
- sklearn.model_selection.GridSearchCV
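A hedged example combining these tools to tune λ (called alpha in scikit-learn) for ridge regression, assuming arrays X and y:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Isolate the test set first; it is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)        # cross-validation on the training data only
print(search.best_params_)          # selected hyperparameter
print(search.score(X_test, y_test)) # final performance on the isolated test set
```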
Recommendations

Size of splits
- How many splits?
- How many elements per split?
- Depends on the amount of data
- Tradeoff between learning and generalization

Stratified splits
- Splitting may induce imbalanced datasets
- Take care that the distribution of y is the same for all sets
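For classification, scikit-learn also provides stratified variants that preserve the distribution of y in each split; a brief sketch assuming arrays X and y:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified hold-out split: y keeps the same class proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratified K-fold for cross-validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(X, y):
    pass  # fit on X[train_idx], evaluate on X[valid_idx]
```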
Conclusion
- A good protocol avoids bias
- The test set is never used during the tuning of (hyper)parameters
- A perfect protocol doesn't exist