A short introduction to statistical learning



DESCRIPTION

Working group of the Axe “Apprentissage statistique et Processus”, INRA, unité MIA-T, October 16th, 2014

TRANSCRIPT

Page 1: A short introduction to statistical learning

A short introduction to statistical learning
Nathalie Villa-Vialaneix
[email protected]
http://www.nathalievilla.org

Axe “Apprentissage et Processus”
October 15th, 2014 - Unité MIA-T, INRA, Toulouse

Page 2: A short introduction to statistical learning

Outline

1 Introduction
   Background and notations
   Underfitting / Overfitting
   Consistency

2 SVM


Page 4: A short introduction to statistical learning

Background

Purpose: predict Y from X;

What we have: n observations of (X, Y): (x1, y1), ..., (xn, yn);

What we want: estimate the unknown Y for new X: xn+1, ..., xm.

X can be:

numeric variables;

or factors;

or a combination of numeric variables and factors.

Y can be:

a numeric variable (Y ∈ R) ⇒ (supervised) regression;

a factor ⇒ (supervised) classification.


Page 9: A short introduction to statistical learning

Basics

From (xi, yi)i, definition of a machine Φn s.t.:

ynew = Φn(xnew).

if Y is numeric, Φn is called a regression function;

if Y is a factor, Φn is called a classifier;

Φn is said to be trained or learned from the observations (xi, yi)i.

Desirable properties

accuracy to the observations: predictions made on known data are close to the observed values;

generalization ability: predictions made on new data are also accurate.

Conflicting objectives!!


Pages 15-21: A short introduction to statistical learning

Underfitting/Overfitting

[Figure sequence, one plot per slide:]

Function x → y to be estimated

Observations we might have

Observations we do have

First estimation from the observations: underfitting

Second estimation from the observations: accurate estimation

Third estimation from the observations: overfitting

Summary
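To make the figure sequence concrete, here is a small R sketch (illustrative, not from the slides) that simulates noisy observations of a smooth function and fits three polynomials of increasing degree: the first underfits, the second is about right, the third overfits. The degrees and the simulated function are arbitrary choices.

# Illustrative sketch: underfitting / accurate fit / overfitting with polynomials.
set.seed(1)
x <- sort(runif(30, 0, 5))
y <- sin(x) + rnorm(30, sd = 0.25)             # the observations we do have

plot(x, y, pch = 19)
xs <- seq(0, 5, length.out = 200)
degrees <- c(1, 4, 20)                         # underfitting / reasonable fit / overfitting
for (k in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[k]))
  lines(xs, predict(fit, data.frame(x = xs)), col = k + 1)
}
legend("topright", lty = 1, col = 2:4, legend = paste("degree", degrees))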

Page 22: A short introduction to statistical learning

Errors

training error (measures the accuracy to the observations)

   if y is a factor: misclassification rate

   #{i : yi ≠ ŷi, i = 1, ..., n} / n

   if y is numeric: mean square error (MSE)

   (1/n) Σ_{i=1}^{n} (yi − ŷi)²

   or root mean square error (RMSE), or pseudo-R²: 1 − MSE / Var((yi)i)

test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:

   1 split the data into training/test sets (usually 80%/20%)
   2 train Φn on the training dataset
   3 compute the test error on the remaining data
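To illustrate the simple-validation recipe, here is a minimal R sketch (not from the slides): an 80%/20% split of simulated data, training and test MSE computed for machines of increasing complexity; the data-generating function and the polynomial degrees are made-up choices.

# Illustrative sketch of simple validation: 80%/20% split, training vs test MSE.
set.seed(42)
n   <- 200
dat <- data.frame(x = runif(n, 0, 5))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

train_id <- sample(seq_len(n), size = floor(0.8 * n))      # 80% for training
train <- dat[train_id, ]
test  <- dat[-train_id, ]                                  # remaining 20% for testing

mse <- function(y, yhat) mean((y - yhat)^2)
for (d in c(1, 4, 20)) {                                   # under-, well-, over-fitted machines
  fit <- lm(y ~ poly(x, d), data = train)
  cat("degree", d,
      " training MSE =", round(mse(train$y, predict(fit, train)), 3),
      " test MSE =",     round(mse(test$y,  predict(fit, test)),  3), "\n")
}
# typically, the high-degree model has the smallest training error but a larger test error.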


Pages 28-31: A short introduction to statistical learning

Example

[Figure sequence, one plot per slide:]

Observations

Training/Test datasets

Training/Test errors

Summary

Page 32: A short introduction to statistical learning

Consistency in the parametric/non-parametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between X and Y:

Y = βᵀX + ε

β is estimated from the observations (x1, y1), ..., (xn, yn) by a given method which computes an estimate βn.

The estimation is said to be consistent if βn → β as n → +∞, possibly under technical assumptions on X, ε, Y.
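As an illustration (not from the slides), the sketch below simulates the linear model above and shows the OLS estimate of β getting closer to the true value as n grows; the true β and the noise level are arbitrary choices.

# Illustrative sketch: consistency of the OLS estimator in Y = beta^T X + eps.
set.seed(1)
beta_true <- c(2, -1)

estimate_beta <- function(n) {
  X   <- cbind(rnorm(n), rnorm(n))
  eps <- rnorm(n, sd = 1)
  y   <- as.vector(X %*% beta_true) + eps
  coef(lm(y ~ X - 1))            # OLS estimate beta_n (no intercept, as in the model)
}

for (n in c(50, 500, 5000, 50000)) {
  cat("n =", n, " beta_n =", round(estimate_beta(n), 3), "\n")
}
# beta_n should get closer to (2, -1) as n grows.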

Page 33: A short introduction to statistical learning

Consistency in the parametric/non-parametric case

Example in the nonparametric framework: the form of the relation between X and Y is unknown:

Y = Φ(X) + ε

Φ is estimated from the observations (x1, y1), ..., (xn, yn) by a given method which computes an estimate Φn.

The estimation is said to be consistent if Φn → Φ as n → +∞, possibly under technical assumptions on X, ε, Y.

Page 34: A short introduction to statistical learning

Consistency from the statistical learning perspective [Vapnik, 1995]

Question: Are we really interested in estimating Φ or...
... rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R+ (which measures an error),

E(R(Φn(X), Y)) → inf_{Φ: X→R} E(R(Φ(X), Y)) as n → +∞,

for any distribution of (X, Y) ∈ X × R.

Definitions: L* = inf_{Φ: X→R} E(R(Φ(X), Y)) and LΦ = E(R(Φ(X), Y)).


Page 36: A short introduction to statistical learning

Desirable properties from a mathematical perspective

Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification)

Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be built using an SVM).

Error decomposition

LΦn − L* ≤ (LΦn − inf_{Φ∈C} LΦ) + (inf_{Φ∈C} LΦ − L*)

with

inf_{Φ∈C} LΦ − L*: the richness of C (i.e., C must be rich enough to ensure that this term is small);

LΦn − inf_{Φ∈C} LΦ ≤ 2 sup_{Φ∈C} |LnΦ − LΦ|, where LnΦ = (1/n) Σ_{i=1}^{n} R(Φ(xi), yi): the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error; C must not be too rich to ensure that this term is small).


Page 38: A short introduction to statistical learning

Outline

1 Introduction
   Background and notations
   Underfitting / Overfitting
   Consistency

2 SVM

Page 39: A short introduction to statistical learning

Basic introduction

Binary classification problem: X ∈ H and Y ∈ {−1, 1}

A training set is given: (x1, y1), ..., (xn, yn)

SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].

Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.


Pages 41-44: A short introduction to statistical learning

Optimal margin classification

[Figure: separating hyperplane with normal vector w, margin 1/‖w‖₂, and the support vectors on the margin boundaries.]

w is chosen such that:

minw ‖w‖² (the margin is the largest),

under the constraints: yi(⟨w, xi⟩ + b) ≥ 1, 1 ≤ i ≤ n (the separation between the two classes is perfect).

⇒ ensures a good generalization capability.

Pages 45-48: A short introduction to statistical learning

Soft margin classification

[Figure: separating hyperplane with normal vector w, margin 1/‖w‖₂, and support vectors.]

w is chosen such that:

minw,ξ ‖w‖² + C Σ_{i=1}^{n} ξi (the margin is the largest),

under the constraints: yi(⟨w, xi⟩ + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n

(the separation between the two classes is almost perfect).

⇒ allowing a few errors improves the richness of the class.

Pages 49-52: A short introduction to statistical learning

Non linear SVM

[Figure: original space X mapped to the feature space H by a non-linear map Ψ.]

w ∈ H is chosen such that (PC,H):

minw,ξ ‖w‖²_H + C Σ_{i=1}^{n} ξi (the margin in the feature space is the largest),

under the constraints: yi(⟨w, Ψ(xi)⟩_H + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n

(the separation between the two classes in the feature space is almost perfect).

Pages 53-55: A short introduction to statistical learning

SVM from different points of view

A regularization problem: (PC,H) ⇔

(P2λ,H): min_{w∈H} (1/n) Σ_{i=1}^{n} R(fw(xi), yi) (error term) + λ ‖w‖²_H (penalization term),

where fw(x) = ⟨Ψ(x), w⟩_H and R(ŷ, y) = max(0, 1 − yŷ) (hinge loss function)

[Figure: errors versus ŷ for y = 1; blue: hinge loss; green: misclassification error.]

A dual problem: (PC,H) ⇔

(DC,X): max_{α∈Rⁿ} Σ_{i=1}^{n} αi − Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj),

with Σ_{i=1}^{n} αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.

There is no need to know Ψ and H:

   choose a function K with a few good properties;

   use it as the dot product in H: ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H.
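The figure mentioned above compares the two losses for a true label y = 1; a small R sketch (illustrative, not from the slides) reproducing such a plot:

# Illustrative sketch: hinge loss vs misclassification error for a true label y = 1.
yhat  <- seq(-2, 2, by = 0.01)                 # predicted values f(x)
hinge <- pmax(0, 1 - 1 * yhat)                 # hinge loss max(0, 1 - y*yhat)
zero_one <- as.numeric(sign(yhat) != 1)        # misclassification error 1{sign(yhat) != y}
plot(yhat, hinge, type = "l", col = "blue", xlab = "predicted value", ylab = "loss")
lines(yhat, zero_one, col = "green")
legend("topright", lty = 1, col = c("blue", "green"),
       legend = c("hinge loss", "misclassification error"))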

Page 56: A short introduction to statistical learning

Which kernels?

Minimum properties that a kernel should fulfil

symmetry: K(u, u') = K(u', u)

positivity: ∀ N ∈ N, ∀ (αi)i ⊂ R^N, ∀ (xi)i ⊂ X^N, Σ_{i,j} αi αj K(xi, xj) ≥ 0.

[Aronszajn, 1950]: there exist a Hilbert space (H, ⟨., .⟩_H) and a function Ψ: X → H such that:

∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H

Examples

the Gaussian kernel: ∀ x, x' ∈ R^d, K(x, x') = e^{−γ‖x−x'‖²} (it is universal on every bounded subset of R^d);

the linear kernel: ∀ x, x' ∈ R^d, K(x, x') = xᵀx' (it is not universal).
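As a quick illustration (not from the slides), the sketch below builds the Gaussian kernel matrix on a few random points and checks numerically that it is symmetric and positive semi-definite (all eigenvalues ≥ 0, up to rounding); the value of gamma is an arbitrary choice.

# Illustrative sketch: Gaussian kernel matrix and a numerical check of its properties.
gaussian_kernel <- function(x, xp, gamma = 0.5) exp(-gamma * sum((x - xp)^2))

set.seed(3)
X <- matrix(rnorm(20), ncol = 2)               # 10 points in R^2
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ])))

all.equal(K, t(K))                             # symmetry
min(eigen(K, symmetric = TRUE)$values)         # >= 0 (up to numerical precision): positivity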


Page 58: A short introduction to statistical learning

In summary, what does the solution look like????

Φn(x) = Σ_i αi yi K(xi, x)

where only a few αi ≠ 0. The i such that αi ≠ 0 are the support vectors!
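To connect this formula with the R session that follows, here is a hedged sketch (not from the slides) that fits a radial-kernel SVM with e1071 and rebuilds the decision function by hand from the fitted support vectors. It relies on the documented components SV, coefs (which store the products αi yi) and rho of an e1071::svm object, and uses scale = FALSE so that the support vectors stay in the original coordinates; results may differ by a sign depending on the class ordering.

# Sketch (under the stated assumptions): rebuild sum_i alpha_i y_i K(x_i, x) - rho
# from an e1071 svm fit with a Gaussian (radial) kernel.
library(e1071)
data(iris)
iris2 <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])

fit <- svm(Species ~ ., data = iris2, kernel = "radial", gamma = 0.5, cost = 4,
           scale = FALSE)

decision_by_hand <- function(x) {
  k <- apply(fit$SV, 1, function(sv) exp(-fit$gamma * sum((sv - x)^2)))  # K(x_i, x)
  sum(fit$coefs * k) - fit$rho                                           # coefs hold alpha_i y_i
}

x_new <- as.numeric(iris2[1, 1:4])
decision_by_hand(x_new)
# should match (up to sign conventions) the decision value returned by predict():
attr(predict(fit, iris2[1, , drop = FALSE], decision.values = TRUE), "decision.values")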

Page 59: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
# Species keeps its three levels, so versicolor/virginica map to colours 2 and 3
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))

Page 60: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   cost
#   0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#          kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21

Page 61: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))

Page 62: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   gamma cost
#   0.5   4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#          cost = 2^(2:4))
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: radial
#   cost: 4
#   gamma: 0.5
# Number of Support Vectors: 32

Page 63: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
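Note that the tables above are computed on the fitted (training) values, so they measure the training error only. The sketch below (illustrative, not from the slides) estimates the test error instead, following the simple-validation recipe from the Errors slide with an 80%/20% split of the same two-class iris data; the object names (train_id, iris.train, iris.test) are made up.

# Illustrative sketch: estimate the test error of the tuned SVM with simple validation.
set.seed(7)
train_id   <- sample(seq_len(nrow(iris)), size = floor(0.8 * nrow(iris)))
iris.train <- iris[train_id, ]
iris.test  <- iris[-train_id, ]

res.tune <- tune.svm(Species ~ ., data = iris.train, gamma = 2^(-1:1), cost = 2^(2:4))
pred <- predict(res.tune$best.model, newdata = iris.test)

mean(pred != iris.test$Species)          # test misclassification rate
table(pred, iris.test$Species)           # confusion matrix on the test set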

Page 64: A short introduction to statistical learning

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768-791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

and more can be found on my website: http://nathalievilla.org/learning.html