A short introduction to statistical learning



DESCRIPTION

Working group of the Axe “Apprentissage statistique et Processus”, INRA, unité MIA-T, October 16th, 2014

TRANSCRIPT

Page 1: A short introduction to statistical learning

A short introduction to statistical learning
Nathalie Villa-Vialaneix
[email protected]
http://www.nathalievilla.org

Axe “Apprentissage et Processus”
October 15th, 2014 - Unité MIA-T, INRA, Toulouse

Page 2: A short introduction to statistical learning

Outline

1 Introduction
   Background and notations
   Underfitting / Overfitting
   Consistency

2 SVM


Page 4: A short introduction to statistical learning

Background

Purpose: predict Y from X;

What we have: n observations of (X, Y): (x1, y1), ..., (xn, yn);

What we want: estimate the unknown Y for new X: xn+1, ..., xm.

X can be:

numeric variables;

or factors;

or a combination of numeric variables and factors.

Y can be:

a numeric variable (Y ∈ R) ⇒ (supervised) regression;

a factor ⇒ (supervised) classification.


Page 9: A short introduction to statistical learning

Basics

From (xi, yi)i, definition of a machine Φn s.t.:

ynew = Φn(xnew).

if Y is numeric, Φn is called a regression function;

if Y is a factor, Φn is called a classifier;

Φn is said to be trained or learned from the observations (xi, yi)i.

Desirable properties

accuracy to the observations: predictions made on known data are close to the observed values;

generalization ability: predictions made on new data are also accurate.

Conflicting objectives!!


Pages 15-21: A short introduction to statistical learning

Underfitting/Overfitting

[Figure sequence, one plot per slide:]

Function x → y to be estimated

Observations we might have

Observations we do have

First estimation from the observations: underfitting

Second estimation from the observations: accurate estimation

Third estimation from the observations: overfitting

Summary
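To make the figure sequence concrete, here is a small R sketch (illustrative, not from the slides) that simulates noisy observations of a smooth function and fits three polynomials of increasing degree: the first underfits, the second is about right, the third overfits. The degrees and the simulated function are arbitrary choices.

# Illustrative sketch: underfitting / accurate fit / overfitting with polynomials.
set.seed(1)
x <- sort(runif(30, 0, 5))
y <- sin(x) + rnorm(30, sd = 0.25)             # the observations we do have

plot(x, y, pch = 19)
xs <- seq(0, 5, length.out = 200)
degrees <- c(1, 4, 20)                         # underfitting / reasonable fit / overfitting
for (k in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[k]))
  lines(xs, predict(fit, data.frame(x = xs)), col = k + 1)
}
legend("topright", lty = 1, col = 2:4, legend = paste("degree", degrees))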

Page 22: A short introduction to statistical learning

Errors

training error (measures the accuracy to the observations)

   if y is a factor: misclassification rate

   #{i : yi ≠ ŷi, i = 1, ..., n} / n

   if y is numeric: mean square error (MSE)

   (1/n) Σ_{i=1}^{n} (yi − ŷi)²

   or root mean square error (RMSE), or pseudo-R²: 1 − MSE / Var((yi)i)

test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:

   1 split the data into training/test sets (usually 80%/20%)
   2 train Φn on the training dataset
   3 compute the test error on the remaining data
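To illustrate the simple-validation recipe, here is a minimal R sketch (not from the slides): an 80%/20% split of simulated data, training and test MSE computed for machines of increasing complexity; the data-generating function and the polynomial degrees are made-up choices.

# Illustrative sketch of simple validation: 80%/20% split, training vs test MSE.
set.seed(42)
n   <- 200
dat <- data.frame(x = runif(n, 0, 5))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

train_id <- sample(seq_len(n), size = floor(0.8 * n))      # 80% for training
train <- dat[train_id, ]
test  <- dat[-train_id, ]                                  # remaining 20% for testing

mse <- function(y, yhat) mean((y - yhat)^2)
for (d in c(1, 4, 20)) {                                   # under-, well-, over-fitted machines
  fit <- lm(y ~ poly(x, d), data = train)
  cat("degree", d,
      " training MSE =", round(mse(train$y, predict(fit, train)), 3),
      " test MSE =",     round(mse(test$y,  predict(fit, test)),  3), "\n")
}
# typically, the high-degree model has the smallest training error but a larger test error.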


Pages 28-31: A short introduction to statistical learning

Example

[Figure sequence, one plot per slide:]

Observations

Training/Test datasets

Training/Test errors

Summary

Page 32: A short introduction to statistical learning

Consistency in the parametric/non-parametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between X and Y:

Y = βᵀX + ε

β is estimated from the observations (x1, y1), ..., (xn, yn) by a given method which computes an estimate βn.

The estimation is said to be consistent if βn → β as n → +∞, possibly under technical assumptions on X, ε, Y.
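As an illustration (not from the slides), the sketch below simulates the linear model above and shows the OLS estimate of β getting closer to the true value as n grows; the true β and the noise level are arbitrary choices.

# Illustrative sketch: consistency of the OLS estimator in Y = beta^T X + eps.
set.seed(1)
beta_true <- c(2, -1)

estimate_beta <- function(n) {
  X   <- cbind(rnorm(n), rnorm(n))
  eps <- rnorm(n, sd = 1)
  y   <- as.vector(X %*% beta_true) + eps
  coef(lm(y ~ X - 1))            # OLS estimate beta_n (no intercept, as in the model)
}

for (n in c(50, 500, 5000, 50000)) {
  cat("n =", n, " beta_n =", round(estimate_beta(n), 3), "\n")
}
# beta_n should get closer to (2, -1) as n grows.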

Page 33: A short introduction to statistical learning

Consistency in the parametric/non-parametric case

Example in the nonparametric framework: the form of the relation between X and Y is unknown:

Y = Φ(X) + ε

Φ is estimated from the observations (x1, y1), ..., (xn, yn) by a given method which computes an estimate Φn.

The estimation is said to be consistent if Φn → Φ as n → +∞, possibly under technical assumptions on X, ε, Y.

Page 34: A short introduction to statistical learning

Consistency from the statistical learning perspective [Vapnik, 1995]

Question: Are we really interested in estimating Φ or...
... rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R+ (which measures an error),

E(R(Φn(X), Y)) → inf_{Φ: X→R} E(R(Φ(X), Y)) as n → +∞,

for any distribution of (X, Y) ∈ X × R.

Definitions: L* = inf_{Φ: X→R} E(R(Φ(X), Y)) and LΦ = E(R(Φ(X), Y)).


Page 36: A short introduction to statistical learning

Desirable properties from a mathematical perspective

Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification)

Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be built using an SVM).

Error decomposition

LΦn − L* ≤ (LΦn − inf_{Φ∈C} LΦ) + (inf_{Φ∈C} LΦ − L*)

with

inf_{Φ∈C} LΦ − L*: the richness of C (i.e., C must be rich enough to ensure that this term is small);

LΦn − inf_{Φ∈C} LΦ ≤ 2 sup_{Φ∈C} |LnΦ − LΦ|, where LnΦ = (1/n) Σ_{i=1}^{n} R(Φ(xi), yi): the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error; C must not be too rich to ensure that this term is small).


Page 38: A short introduction to statistical learning

Outline

1 Introduction
   Background and notations
   Underfitting / Overfitting
   Consistency

2 SVM

Page 39: A short introduction to statistical learning

Basic introduction

Binary classification problem: X ∈ H and Y ∈ {−1, 1}

A training set is given: (x1, y1), ..., (xn, yn)

SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].

Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.


Pages 41-44: A short introduction to statistical learning

Optimal margin classification

[Figure: separating hyperplane with normal vector w, margin 1/‖w‖₂, and the support vectors on the margin boundaries.]

w is chosen such that:

minw ‖w‖² (the margin is the largest),

under the constraints: yi(⟨w, xi⟩ + b) ≥ 1, 1 ≤ i ≤ n (the separation between the two classes is perfect).

⇒ ensures a good generalization capability.

Pages 45-48: A short introduction to statistical learning

Soft margin classification

[Figure: separating hyperplane with normal vector w, margin 1/‖w‖₂, and support vectors.]

w is chosen such that:

minw,ξ ‖w‖² + C Σ_{i=1}^{n} ξi (the margin is the largest),

under the constraints: yi(⟨w, xi⟩ + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n

(the separation between the two classes is almost perfect).

⇒ allowing a few errors improves the richness of the class.

Pages 49-52: A short introduction to statistical learning

Non linear SVM

[Figure: original space X mapped to the feature space H by a non-linear map Ψ.]

w ∈ H is chosen such that (PC,H):

minw,ξ ‖w‖²_H + C Σ_{i=1}^{n} ξi (the margin in the feature space is the largest),

under the constraints: yi(⟨w, Ψ(xi)⟩_H + b) ≥ 1 − ξi, 1 ≤ i ≤ n, and ξi ≥ 0, 1 ≤ i ≤ n

(the separation between the two classes in the feature space is almost perfect).

Pages 53-55: A short introduction to statistical learning

SVM from different points of view

A regularization problem: (PC,H) ⇔

(P2λ,H): min_{w∈H} (1/n) Σ_{i=1}^{n} R(fw(xi), yi) (error term) + λ ‖w‖²_H (penalization term),

where fw(x) = ⟨Ψ(x), w⟩_H and R(ŷ, y) = max(0, 1 − yŷ) (hinge loss function)

[Figure: errors versus ŷ for y = 1; blue: hinge loss; green: misclassification error.]

A dual problem: (PC,H) ⇔

(DC,X): max_{α∈Rⁿ} Σ_{i=1}^{n} αi − Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj),

with Σ_{i=1}^{n} αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.

There is no need to know Ψ and H:

   choose a function K with a few good properties;

   use it as the dot product in H: ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H.
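The figure mentioned above compares the two losses for a true label y = 1; a small R sketch (illustrative, not from the slides) reproducing such a plot:

# Illustrative sketch: hinge loss vs misclassification error for a true label y = 1.
yhat  <- seq(-2, 2, by = 0.01)                 # predicted values f(x)
hinge <- pmax(0, 1 - 1 * yhat)                 # hinge loss max(0, 1 - y*yhat)
zero_one <- as.numeric(sign(yhat) != 1)        # misclassification error 1{sign(yhat) != y}
plot(yhat, hinge, type = "l", col = "blue", xlab = "predicted value", ylab = "loss")
lines(yhat, zero_one, col = "green")
legend("topright", lty = 1, col = c("blue", "green"),
       legend = c("hinge loss", "misclassification error"))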

Page 56: A short introduction to statistical learning

Which kernels?

Minimum properties that a kernel should fulfil

symmetry: K(u, u') = K(u', u)

positivity: ∀ N ∈ N, ∀ (αi)i ⊂ R^N, ∀ (xi)i ⊂ X^N, Σ_{i,j} αi αj K(xi, xj) ≥ 0.

[Aronszajn, 1950]: there exist a Hilbert space (H, ⟨., .⟩_H) and a function Ψ: X → H such that:

∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H

Examples

the Gaussian kernel: ∀ x, x' ∈ R^d, K(x, x') = e^{−γ‖x−x'‖²} (it is universal on every bounded subset of R^d);

the linear kernel: ∀ x, x' ∈ R^d, K(x, x') = xᵀx' (it is not universal).
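As a quick illustration (not from the slides), the sketch below builds the Gaussian kernel matrix on a few random points and checks numerically that it is symmetric and positive semi-definite (all eigenvalues ≥ 0, up to rounding); the value of gamma is an arbitrary choice.

# Illustrative sketch: Gaussian kernel matrix and a numerical check of its properties.
gaussian_kernel <- function(x, xp, gamma = 0.5) exp(-gamma * sum((x - xp)^2))

set.seed(3)
X <- matrix(rnorm(20), ncol = 2)               # 10 points in R^2
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) gaussian_kernel(X[i, ], X[j, ])))

all.equal(K, t(K))                             # symmetry
min(eigen(K, symmetric = TRUE)$values)         # >= 0 (up to numerical precision): positivity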


Page 58: A short introduction to statistical learning

In summary, what does the solution look like????

Φn(x) = Σ_i αi yi K(xi, x)

where only a few αi ≠ 0. The i such that αi ≠ 0 are the support vectors!
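To connect this formula with the R session that follows, here is a hedged sketch (not from the slides) that fits a radial-kernel SVM with e1071 and rebuilds the decision function by hand from the fitted support vectors. It relies on the documented components SV, coefs (which store the products αi yi) and rho of an e1071::svm object, and uses scale = FALSE so that the support vectors stay in the original coordinates; results may differ by a sign depending on the class ordering.

# Sketch (under the stated assumptions): rebuild sum_i alpha_i y_i K(x_i, x) - rho
# from an e1071 svm fit with a Gaussian (radial) kernel.
library(e1071)
data(iris)
iris2 <- droplevels(iris[iris$Species %in% c("versicolor", "virginica"), ])

fit <- svm(Species ~ ., data = iris2, kernel = "radial", gamma = 0.5, cost = 4,
           scale = FALSE)

decision_by_hand <- function(x) {
  k <- apply(fit$SV, 1, function(sv) exp(-fit$gamma * sum((sv - x)^2)))  # K(x_i, x)
  sum(fit$coefs * k) - fit$rho                                           # coefs hold alpha_i y_i
}

x_new <- as.numeric(iris2[1, 1:4])
decision_by_hand(x_new)
# should match (up to sign conventions) the decision value returned by predict():
attr(predict(fit, iris2[1, , drop = FALSE], decision.values = TRUE), "decision.values")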

Page 59: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

data(iris)
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
# Species keeps its three levels, so versicolor/virginica map to colours 2 and 3
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))

Page 60: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   cost
#   0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#          kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21

Page 61: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))

Page 62: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#   gamma cost
#   0.5   4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#          cost = 2^(2:4))
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: radial
#   cost: 4
#   gamma: 0.5
# Number of Support Vectors: 32

Page 63: A short introduction to statistical learning

I’m almost dead with all this stuff on my mind!!! What to do in practice?

table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
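Note that the tables above are computed on the fitted (training) values, so they measure the training error only. The sketch below (illustrative, not from the slides) estimates the test error instead, following the simple-validation recipe from the Errors slide with an 80%/20% split of the same two-class iris data; the object names (train_id, iris.train, iris.test) are made up.

# Illustrative sketch: estimate the test error of the tuned SVM with simple validation.
set.seed(7)
train_id   <- sample(seq_len(nrow(iris)), size = floor(0.8 * nrow(iris)))
iris.train <- iris[train_id, ]
iris.test  <- iris[-train_id, ]

res.tune <- tune.svm(Species ~ ., data = iris.train, gamma = 2^(-1:1), cost = 2^(2:4))
pred <- predict(res.tune$best.model, newdata = iris.test)

mean(pred != iris.test$Species)          # test misclassification rate
table(pred, iris.test$Species)           # confusion matrix on the test set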

Page 64: A short introduction to statistical learning

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768-791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

and more can be found on my website: http://nathalievilla.org/learning.html