A short introduction to statistical learning

DESCRIPTION
Working group of the Axe "Apprentissage statistique et Processus", INRA, unité MIA-T, October 16th, 2014

TRANSCRIPT
A short introduction to statistical learning
Nathalie Villa-Vialaneix
[email protected]
http://www.nathalievilla.org
Axe "Apprentissage et Processus", October 15th, 2014 - Unité MIA-T, INRA, Toulouse
Outline

1 Introduction
   Background and notations
   Underfitting / Overfitting
   Consistency
2 SVM
Background

Purpose: predict Y from X.
What we have: n observations of (X, Y): (x1, y1), ..., (xn, yn).
What we want: estimate the unknown Y for new observations of X: xn+1, ..., xm.

X can be:
- numeric variables;
- or factors;
- or a combination of numeric variables and factors.

Y can be:
- a numeric variable (Y ∈ R) ⇒ (supervised) regression;
- a factor ⇒ (supervised) classification.
Basics

From (xi, yi)i, a machine Φn is defined such that:
    ŷnew = Φn(xnew).

- if Y is numeric, Φn is called a regression function;
- if Y is a factor, Φn is called a classifier;
- Φn is said to be trained or learned from the observations (xi, yi)i.

Desirable properties
- accuracy to the observations: predictions made on known data are close to the observed values;
- generalization ability: predictions made on new data are also accurate.

Conflicting objectives!!
Underfitting / Overfitting

[Series of figures: the function x → y to be estimated; observations we might have; observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate estimation); a third estimation (overfitting); summary of the three fits.]
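To make the three situations concrete, here is a minimal R sketch of this kind of illustration, assuming a toy setting (the true function sin(2πx), the noise level and the polynomial degrees 1, 4 and 20 are arbitrary choices): a low-degree fit underfits, a moderate one is accurate, a high-degree one overfits.

## Illustrative sketch: polynomial fits of increasing degree on noisy
## observations of a smooth function.
set.seed(1)
n <- 30
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)      # observations = true function + noise

plot(x, y, pch = 19)
curve(sin(2 * pi * x), add = TRUE, lty = 2)    # function x -> y to be estimated

grid <- data.frame(x = seq(0, 1, length.out = 200))
degs <- c(1, 4, 20)                            # underfitting / accurate / overfitting
cols <- c("red", "forestgreen", "blue")
for (k in seq_along(degs)) {
  fit <- lm(y ~ poly(x, degree = degs[k]))
  lines(grid$x, predict(fit, newdata = grid), col = cols[k], lwd = 2)
}
legend("topright", legend = paste("degree", degs), col = cols, lwd = 2)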
Errors

training error (measures the accuracy to the observations):
- if y is a factor: misclassification rate
  $$\frac{\#\{i:\ \hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n};$$
- if y is numeric: mean squared error (MSE)
  $$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
  or root mean squared error (RMSE), or pseudo-R²: 1 − MSE / Var((yi)i).

test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
1. split the data into training/test sets (usually 80%/20%);
2. train Φn from the training dataset;
3. compute the test error from the remaining data.
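A minimal R sketch of this simple-validation scheme, assuming simulated data and a deliberately over-flexible degree-20 fit (both arbitrary choices, used only to make the training/test gap visible):

## Simple validation: 80%/20% split, training MSE vs test MSE.
set.seed(2)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

train <- sample(n, size = 0.8 * n)                        # 80% training set
fit <- lm(y ~ poly(x, degree = 20), data = dat[train, ])  # an overly flexible machine

mse <- function(obs, pred) mean((obs - pred)^2)
mse(dat$y[train], predict(fit))                           # training error (small)
mse(dat$y[-train], predict(fit, newdata = dat[-train, ])) # test error (usually much larger: overfitting)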
Example

[Series of figures: the observations; the training/test datasets; the training/test errors; summary.]
Consistency in the parametric / nonparametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between X and Y:
    Y = β^T X + ε.
β is estimated from the observations (x1, y1), ..., (xn, yn) by a given method, which computes an estimate βn.
The estimation is said to be consistent if βn → β as n → +∞, under (possibly) technical assumptions on X, ε, Y.
Consistency in the parametric / nonparametric case

Example in the nonparametric framework: the form of the relation between X and Y is unknown:
    Y = Φ(X) + ε.
Φ is estimated from the observations (x1, y1), ..., (xn, yn) by a given method, which computes an estimate Φn.
The estimation is said to be consistent if Φn → Φ as n → +∞, under (possibly) technical assumptions on X, ε, Y.
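A heuristic R illustration of this kind of nonparametric consistency (a sketch, not a proof; the target Φ, the noise and the bandwidth sequence are arbitrary choices): a kernel regression estimate computed on larger and larger samples tends to get closer to Φ.

## Heuristic sketch: a kernel smoother estimating Phi(x) = sin(2*pi*x)
## from n noisy observations, for increasing n.
set.seed(3)
Phi  <- function(x) sin(2 * pi * x)
grid <- seq(0.05, 0.95, length.out = 100)

for (n in c(50, 500, 5000)) {
  x <- runif(n)
  y <- Phi(x) + rnorm(n, sd = 0.3)
  est <- ksmooth(x, y, kernel = "normal",
                 bandwidth = 0.5 * n^(-1/5),   # bandwidth shrinking with n
                 x.points = grid)
  cat("n =", n, "   mean |Phi_n - Phi| =",
      round(mean(abs(est$y - Phi(grid)), na.rm = TRUE), 3), "\n")
}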
Consistency from the statistical learning perspective [Vapnik, 1995]

Question: are we really interested in estimating Φ, or rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine Φn from the observations is said to be (universally) consistent if, given a risk function R : R × R → R+ (which measures an error),
    $$\mathbb{E}\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi: \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right),$$
for any distribution of (X, Y) ∈ X × R.

Definitions: L* = inf_{Φ: X → R} E(R(Φ(X), Y)) and L_Φ = E(R(Φ(X), Y)).
Desirable properties from a mathematical perspective

Simplified framework: X ∈ X and Y ∈ {−1, 1} (binary classification).
Learning process: choose a machine Φn in a class of functions C ⊂ {Φ : X → R} (e.g., C is the set of all functions that can be built using an SVM).

Error decomposition
    $$L_{\Phi_n} - L^* \leq \left(L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi\right) + \left(\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*\right)$$
with
- inf_{Φ∈C} L_Φ − L*: the richness of C (i.e., C must be rich enough to ensure that this term is small);
- L_{Φn} − inf_{Φ∈C} L_Φ ≤ 2 sup_{Φ∈C} |L^n_Φ − L_Φ|, where L^n_Φ = (1/n) Σ_{i=1}^n R(Φ(xi), yi): the generalization capability of C (i.e., in the worst case, the empirical error must be close to the true error: C must not be too rich, to ensure that this term is small).
2 SVM
Basic introduction

Binary classification problem: X ∈ H and Y ∈ {−1, 1}. A training set is given: (x1, y1), ..., (xn, yn).

SVM is a method based on kernels. It is a universally consistent method, provided that the kernel is universal [Steinwart, 2002].

Extensions to the regression case exist (SVR or LS-SVM); they are also universally consistent when the kernel is universal.
Optimal margin classification

[Figure: a linearly separable dataset, the separating hyperplane defined by w, the margin 1/‖w‖₂, and the support vectors lying on the margin.]

w is chosen such that:
- min_w ‖w‖² (the margin is the largest),
- under the constraints yi(⟨w, xi⟩ + b) ≥ 1, 1 ≤ i ≤ n (the separation between the two classes is perfect).
⇒ ensures a good generalization capability.
Soft margin classification

[Figure: a nearly separable dataset, the separating hyperplane defined by w, the margin 1/‖w‖₂, the support vectors and a few points violating the margin.]

w is chosen such that:
- min_{w,ξ} ‖w‖² + C Σ_{i=1}^n ξi (the margin is the largest),
- under the constraints yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n (the separation between the two classes is almost perfect).
⇒ allowing a few errors improves the richness of the class.
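A short R sketch of the role of the cost parameter C, assuming the e1071 package (also used at the end of this talk) and simulated two-class data (an arbitrary choice): a small C tolerates many margin violations, a large C penalises them heavily.

## Sketch: effect of the cost C on a linear soft-margin SVM (e1071).
library(e1071)
set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.8) > 0, "+1", "-1"))
dat <- data.frame(x1 = x1, x2 = x2, y = y)

for (C in c(0.01, 1, 100)) {
  fit <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = C, scale = FALSE)
  cat("C =", C,
      "  number of support vectors:", fit$tot.nSV,
      "  training error:", mean(fitted(fit) != dat$y), "\n")
}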
Non linear SVM

[Figure: the original space X is mapped into a feature space H by a non linear map Ψ, in which the separation between the two classes becomes almost perfect.]

w ∈ H is chosen such that (P_{C,H}):
- min_{w,ξ} ‖w‖²_H + C Σ_{i=1}^n ξi (the margin in the feature space is the largest),
- under the constraints yi(⟨w, Ψ(xi)⟩_H + b) ≥ 1 − ξi and ξi ≥ 0, 1 ≤ i ≤ n (the separation between the two classes in the feature space is almost perfect).
SVM from different points of view

A regularization problem: (P_{C,H}) ⇔
    $$(P^2_{\lambda,\mathcal{H}}):\quad \min_{w \in \mathcal{H}} \ \frac{1}{n}\sum_{i=1}^{n} \underbrace{R(f_w(x_i), y_i)}_{\text{error term}} \ + \ \lambda \underbrace{\|w\|^2_{\mathcal{H}}}_{\text{penalization term}},$$
where f_w(x) = ⟨Ψ(x), w⟩_H and R(ŷ, y) = max(0, 1 − yŷ) (the hinge loss).

[Figure: losses as a function of ŷ for y = 1; blue: hinge loss; green: misclassification error.]

A dual problem: (P_{C,H}) ⇔
    $$(D_{C,\mathcal{X}}):\quad \max_{\alpha \in \mathbb{R}^n} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \Psi(x_i), \Psi(x_j)\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
subject to Σ_{i=1}^n αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.

There is no need to know Ψ and H explicitly (the "kernel trick"):
- choose a function K with a few good properties;
- use it as the dot product in H: ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H.
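A small R sketch of the two losses mentioned above, plotting the hinge loss max(0, 1 − yŷ) and the 0/1 misclassification error as functions of ŷ for y = 1, with the same colour convention as the figure:

## Hinge loss versus 0/1 misclassification error, for y = 1.
hinge    <- function(yhat, y = 1) pmax(0, 1 - y * yhat)
zero_one <- function(yhat, y = 1) as.numeric(sign(yhat) != y)

yhat <- seq(-2, 2, length.out = 400)
plot(yhat, hinge(yhat), type = "l", col = "blue", lwd = 2,
     xlab = "predicted value", ylab = "loss")
lines(yhat, zero_one(yhat), col = "green", lwd = 2)
legend("topright", legend = c("hinge loss", "misclassification error"),
       col = c("blue", "green"), lwd = 2)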
Which kernels?

Minimum properties that a kernel should fulfil:
- symmetry: K(u, u′) = K(u′, u);
- positivity: ∀ N ∈ N, ∀ (αi)_{i=1,...,N} ⊂ R, ∀ (xi)_{i=1,...,N} ⊂ X, Σ_{i,j} αi αj K(xi, xj) ≥ 0.

[Aronszajn, 1950]: there exist a Hilbert space (H, ⟨·,·⟩_H) and a map Ψ : X → H such that
    ∀ u, v ∈ X, K(u, v) = ⟨Ψ(u), Ψ(v)⟩_H.

Examples
- the Gaussian kernel: ∀ x, x′ ∈ R^d, K(x, x′) = e^{−γ‖x−x′‖²} (it is universal on every bounded subset of R^d);
- the linear kernel: ∀ x, x′ ∈ R^d, K(x, x′) = xᵀx′ (it is not universal).
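A quick numerical check of these two properties for the Gaussian kernel (an R sketch; the sample size, dimension and γ are arbitrary choices): the kernel matrix built on any finite set of points should be symmetric with non-negative eigenvalues.

## Sketch: the Gaussian kernel matrix is symmetric and positive semi-definite.
set.seed(5)
X <- matrix(rnorm(20 * 3), ncol = 3)      # 20 arbitrary points in R^3
gamma <- 0.5

D2 <- as.matrix(dist(X))^2                # squared Euclidean distances ||x - x'||^2
K  <- exp(-gamma * D2)                    # K(x, x') = exp(-gamma * ||x - x'||^2)

all.equal(K, t(K))                        # symmetry
min(eigen(K, symmetric = TRUE)$values)    # positivity: >= 0 (up to rounding errors)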
In summary, what does the solution look like?

    $$\Phi_n(x) = \sum_{i} \alpha_i y_i K(x_i, x),$$

where only a few αi ≠ 0: the observations xi with αi ≠ 0 are the support vectors!
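To see this expansion at work, here is a hedged R sketch that rebuilds the decision value of a fitted SVM from its support vectors. It assumes the e1071::svm object stores the support vectors in SV, the products αi·yi in coefs, the Gaussian-kernel parameter in gamma and the intercept in rho, so that the decision value is Σ coefs·K(xi, x) − rho; the check against predict() is there precisely because this is an assumption about e1071's internals.

## Sketch: reconstructing sum_i alpha_i y_i K(x_i, x) + b from an e1071::svm
## fit, and comparing it with predict()'s decision values.
library(e1071)
data(iris)
iris2 <- iris[iris$Species %in% c("versicolor", "virginica"), ]
iris2$Species <- droplevels(iris2$Species)

fit <- svm(Species ~ ., data = iris2, kernel = "radial", scale = FALSE)

rbf <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))   # Gaussian kernel
x.new <- unlist(iris2[1, 1:4])                              # one observation, as an example

f.manual <- sum(fit$coefs *
                apply(fit$SV, 1, rbf, v = x.new, gamma = fit$gamma)) - fit$rho
f.manual
predict(fit, iris2[1, ], decision.values = TRUE)            # should agree with f.manual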
I'm almost dead with all this stuff on my mind!!! What about practice?

data(iris)
# keep only the two overlapping species
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
library(e1071)
# tune the cost of a linear SVM by 10-fold cross validation
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
res.tune
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#    cost
#     0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4), kernel = "linear")
#
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  linear
#        cost:  0.5
#       gamma:  0.25
#
# Number of Support Vectors:  21
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
# tune a Gaussian (radial) SVM: both gamma and cost are selected
res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
res.tune
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#    gamma cost
#      0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1), cost = 2^(2:4))
#
# Parameters:
#    SVM-Type:  C-classification
#  SVM-Kernel:  radial
#        cost:  4
#       gamma:  0.5
#
# Number of Support Vectors:  32
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50

plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

More can be found on my website: http://nathalievilla.org/learning.html