FDA and statistical learning theory

DESCRIPTION
Short courses on functional data analysis and statistical learning, part 3. CENATAV, Havana, Cuba, September 17th, 2008.

TRANSCRIPT
FDA and statistical learning theory
Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France
Havana, September 17th, 2008
Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
Purpose of statistical learning theory

In the previous presentations, the aim was to find an estimator that is "close" to the model. The aim of statistical learning theory is slightly different: find a regression function that has a small error.

More precisely, in the binary classification case:
- we are given a pair of random variables $(X, Y)$ taking values in $\mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is any topological space;
- we observe $n$ i.i.d. realizations of $(X, Y)$, $(x_1, y_1), \ldots, (x_n, y_n)$, called the learning set;
- we intend to find a function $\Psi_n : \mathcal{X} \to \{-1, 1\}$, built from $(x_1, y_1), \ldots, (x_n, y_n)$, that minimizes
$$P\left(\Psi_n(X) \neq Y\right).$$
First remarks on the aim

1. $\inf_{\Psi : \mathcal{X} \to \{-1, 1\}} P(\Psi(X) \neq Y)$ is the "target" for the expected error. This lower bound on the expected error is called the Bayes risk, denoted by $L^*$.
2. Generally, $\Psi_n$ is chosen in a restricted class $\mathcal{C}$ of functions from $\mathcal{X}$ to $\{-1, 1\}$; the performance of $\Psi_n$ can then be quantified by:
$$P(\Psi_n(X) \neq Y) - L^* = \underbrace{\Big( P(\Psi_n(X) \neq Y) - \inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y) \Big)}_{\text{error due to the training method}} + \underbrace{\Big( \inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y) - L^* \Big)}_{\text{error due to the choice of } \mathcal{C}}$$
Consistency

From this last remark, we can define:

Definition: Weak consistency
An algorithm building the classifier $\Psi_n$ is said to be (weakly universally) consistent if, for every distribution of the random pair $(X, Y)$,
$$E(L\Psi_n) \xrightarrow{n \to +\infty} L^*,$$
where $L\Psi_n := P(\Psi_n(X) \neq Y \mid (x_i, y_i)_i)$.

Definition: Strong consistency
Moreover, it is said to be strongly (universally) consistent if, for every distribution of the random pair $(X, Y)$,
$$L\Psi_n \xrightarrow{n \to +\infty} L^* \quad \text{a.s.}$$
Choice of C and of Ψn

1. The choice of $\mathcal{C}$ is of major importance to obtain good performance for $\Psi_n$:
- a class $\mathcal{C}$ that is too small (not rich enough) gives a poor value of $\inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y) - L^*$,
- but a class $\mathcal{C}$ that is too rich gives a poor value of $P(\Psi_n(X) \neq Y) - \inf_{\Psi \in \mathcal{C}} P(\Psi(X) \neq Y)$, because the learning algorithm tends to overfit the data.
2. A naive approach to finding a good $\Psi_n$ over the class $\mathcal{C}$ is to minimize the empirical risk over $\mathcal{C}$ (see the sketch below):
$$\Psi_n := \arg\min_{\Psi \in \mathcal{C}} L^n\Psi, \quad \text{where } L^n\Psi := \frac{1}{n} \sum_{i=1}^{n} I_{\{\Psi(x_i) \neq y_i\}}.$$
The work of [Vapnik, 1995, Vapnik, 1998] links the choice of $\mathcal{C}$ to the accuracy of the empirical risk.
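To make the empirical risk minimization principle concrete, here is a minimal sketch (not from the original slides): it searches a small, hypothetical finite class of threshold classifiers on simulated one-dimensional data and returns the member minimizing the empirical risk $L^n\Psi$; the class and the simulated distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated learning set: X ~ U[0, 1], and Y = +1 with higher probability when X > 0.5.
n = 200
x = rng.uniform(0, 1, n)
y = np.where(rng.uniform(0, 1, n) < np.where(x > 0.5, 0.8, 0.2), 1, -1)

# A small finite class C of threshold classifiers Psi_t(x) = sign(x - t).
thresholds = np.linspace(0, 1, 21)

def empirical_risk(t):
    """L^n_Psi = (1/n) sum_i I{Psi(x_i) != y_i} for Psi = sign(. - t)."""
    predictions = np.where(x > t, 1, -1)
    return np.mean(predictions != y)

# Empirical risk minimization over C.
risks = [empirical_risk(t) for t in thresholds]
t_star = thresholds[int(np.argmin(risks))]
print(f"selected threshold: {t_star:.2f}, empirical risk: {min(risks):.3f}")
```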
VC-dimension

A way to quantify the "richness" of a class of functions is to compute its VC-dimension:

Definition: VC-dimension
A class of classifiers (functions from $\mathcal{X}$ to $\{-1, 1\}$), $\mathcal{C}$, is said to shatter a set of data points $z_1, z_2, \ldots, z_d \in \mathcal{X}$ if, for every assignment of labels $m_1, m_2, \ldots, m_d \in \{-1, 1\}$ to those points, there exists $\Psi \in \mathcal{C}$ such that
$$\forall\, i = 1, \ldots, d, \quad \Psi(z_i) = m_i.$$
The VC-dimension of the class of functions $\mathcal{C}$ is the maximum number of points that can be shattered by $\mathcal{C}$.
Example: VC-dimension of hyperplanes

Suppose that $\mathcal{X} = \mathbb{R}^2$ and
$$\mathcal{C} = \big\{ \Psi : x \in \mathbb{R}^2 \to \pm\,\mathrm{Sign}(a^T x + b),\ a \in \mathbb{R}^2 \text{ and } b \in \mathbb{R} \big\}.$$
Then:
- 2 points are shattered by $\mathcal{C}$;
- 3 points are shattered by $\mathcal{C}$;
- but 4 points cannot be shattered by $\mathcal{C}$: no $\Psi \in \mathcal{C}$ can take the value 1 on the red circles and $-1$ on the black ones (in the slide's figure, omitted here);
- hence, the VC-dimension of $\mathcal{C}$ is 3.

More generally, the VC-dimension of hyperplanes in $\mathbb{R}^d$ is $d + 1$.
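The shattering argument can be checked numerically. The sketch below (an illustration, not part of the original slides) tests, for a given configuration of points in $\mathbb{R}^2$, whether every $\pm 1$ labelling can be realized by an affine hyperplane; separability of one labelling is decided exactly as a linear feasibility problem with scipy's linprog. The point configurations are illustrative assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Exact feasibility check: does (a, b) exist with y_i (a^T x_i + b) >= 1 for all i?"""
    # Variables z = (a_1, a_2, b); constraints -y_i * (a^T x_i + b) <= -1.
    A_ub = np.array([-y * np.append(x, 1.0) for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every +/-1 labelling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                 # generic position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # square (XOR labelling fails)
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False, so VC-dim of lines in R^2 is 3
```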
Relationship between VC-dimension and empirical error

Theorem [Vapnik, 1995, Vapnik, 1998]
With probability at least $1 - \eta$,
$$\sup_{\Psi \in \mathcal{C}} \big| E(L\Psi) - L^n\Psi \big| \le \sqrt{\frac{VC(\mathcal{C}) - \log(\eta/4)}{n}}.$$
An alternative to VC-dimension

Remark: In most cases, the VC-dimension is not precise enough. Another quantity can then also be considered:

Definition: Shatter coefficient
The $n$-th shatter coefficient of the set of functions $\mathcal{C}$ is the maximum number of partitions of $n$ points into two sets that can be obtained from $\mathcal{C}$. This number, denoted by $S(\mathcal{C}, n)$, is at most equal to $2^n$.

Example: If $\mathcal{C}$ is the space of hyperplanes in $\mathbb{R}^d$,
$$S(\mathcal{C}, n) = \begin{cases} 2^n & \text{if } n \le d \\ 2^{d+1} = 2^{VC(\mathcal{C})} & \text{if } n \ge d + 1 \end{cases}$$

Remark: For all $n > 2$, $S(\mathcal{C}, n) \le n^{VC(\mathcal{C})}$.
Vapnik-Chervonenkis inequality

Theorem [Vapnik, 1995, Vapnik, 1998]
$$P\Big( \sup_{\Psi \in \mathcal{C}} \big| L^n\Psi - E(L\Psi) \big| > \epsilon \Big) \le S(\mathcal{C}, n)\, e^{-n\epsilon^2/32}.$$

Consequences for the learning error on $\mathcal{C}$: if $\Psi_n$ has been chosen by minimizing the empirical risk, i.e.,
$$\Psi_n := \arg\min_{\Psi \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} I_{\{\Psi(x_i) \neq y_i\}},$$
then, since
$$P\Big( E(L\Psi_n) - \inf_{\Psi \in \mathcal{C}} E(L\Psi) > \epsilon \Big) \le P\Big( 2 \sup_{\Psi \in \mathcal{C}} \big| L^n\Psi - E(L\Psi) \big| > \epsilon \Big),$$
we obtain
$$P\Big( E(L\Psi_n) - \inf_{\Psi \in \mathcal{C}} E(L\Psi) > \epsilon \Big) \le S(\mathcal{C}, n)\, e^{-n\epsilon^2/128}.$$
Additional notes for the regression case

The same theory can be developed for the regression case, under additional assumptions. To summarize, let $(X, Y)$ be a random pair taking its values in $\mathcal{X} \times \mathbb{R}$ and $(x_1, y_1), \ldots, (x_n, y_n)$ a training set of $n$ i.i.d. realizations of $(X, Y)$. Then we can introduce:
- the risk, for example the mean square error: for $\Psi : \mathcal{X} \to \mathbb{R}$, $L\Psi = E\big( (\Psi(X) - Y)^2 \mid (x_i, y_i)_i \big)$;
- the Bayes risk: $L^* = \inf_{\Psi : \mathcal{X} \to \mathbb{R}} E\big( (\Psi(X) - Y)^2 \big)$. In this case, $L^* = E(L\Psi^*)$ where $\Psi^* = E(Y \mid X)$;
- the empirical risk: for $\Psi : \mathcal{X} \to \mathbb{R}$, $L^n\Psi = \frac{1}{n} \sum_{i=1}^{n} (y_i - \Psi(x_i))^2$.
Hence, in this case, a consistent regression scheme $\Psi_n$ satisfies
$$\lim_{n \to +\infty} E(L\Psi_n) = L^*,$$
and a strongly consistent regression scheme $\Psi_n$ satisfies
$$\lim_{n \to +\infty} L\Psi_n = L^* \quad \text{a.s.}$$
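As a small numerical illustration of these definitions (not from the slides), the sketch below simulates $Y = \Psi^*(X) + \varepsilon$ with a known regression function, so that the Bayes risk $L^*$ equals the noise variance, and compares the empirical risk of $\Psi^* = E(Y \mid X)$ with $L^*$; the model and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 5000, 0.3

# Model: Y = psi_star(X) + eps with psi_star = E(Y | X) and Var(eps) = sigma^2,
# so the Bayes risk for the mean square error is L* = sigma^2.
x = rng.uniform(0, 1, n)
psi_star = np.sin(2 * np.pi * x)
y = psi_star + rng.normal(0, sigma, n)

# Empirical risk L^n_Psi of the true regression function: close to L* for large n.
empirical_risk = np.mean((y - psi_star) ** 2)
print(f"empirical risk of Psi*: {empirical_risk:.4f}   (Bayes risk L* = {sigma**2:.4f})")
```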
Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
Reminder: functional multilayer perceptron by the projection approach

Data: Suppose that we are given a random pair $(X, Y)$ taking its values in $\mathcal{X} \times \mathbb{R}$, where $(\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$ is a Hilbert space. Suppose also that we have $n$ i.i.d. observations of $(X, Y)$: $(x_1, y_1), \ldots, (x_n, y_n)$.

Functional MLP: The projection approach is based on the knowledge of a Hilbert basis of $\mathcal{X}$, denoted $(\phi_k)_{k \ge 1}$. The data $(x_i)_i$ and also the weights of the MLP are projected on this basis truncated at $q$:
$$\mathcal{C}^n_q = \Big\{ \Psi : \mathcal{X} \to \mathbb{R} :\ \forall\, x \in \mathcal{X},\ \Psi(x) = \sum_{l=1}^{p_n} w^{(2)}_l\, G\Big( w^{(0)}_l + \sum_{k=1}^{q} \beta^{(1)}_{lk} (P_q(x))_k \Big),\ \sum_{l=1}^{p_n} |w^{(2)}_l| \le \alpha_n \Big\}$$
where $(p_n)_n$ is a sequence of integers, $(\alpha_n)_n$ is a sequence of positive real numbers, $G$ is a given continuous function, and the weights $(w^{(2)}_l)_l$, $(w^{(0)}_l)_l$ and $(\beta^{(1)}_{lk})_{l,k}$ have to be learned from the data set in $\mathbb{R}$ (see Presentation 2 for further details).
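A minimal sketch of the projection idea follows (illustrative only, not the estimator studied in [Rossi and Conan-Guez, 2006]): simulated curves are projected on the first q functions of a sine basis and a standard one-hidden-layer perceptron is fitted on the coefficients. Note that scikit-learn's MLPRegressor does not enforce the weight constraint $\sum_l |w^{(2)}_l| \le \alpha_n$ defining $\mathcal{C}^n_q$; the simulation and network size are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, m, q = 300, 100, 5                       # n curves observed on m points, q basis functions
t = np.linspace(0, 1, m)

# Simulated functional data: random combinations of sines, with a scalar response.
scores = rng.normal(size=(n, q))
basis = np.array([np.sqrt(2) * np.sin(np.pi * (k + 1) * t) for k in range(q)])   # (q, m)
curves = scores @ basis + 0.1 * rng.normal(size=(n, m))
y = scores[:, 0] ** 2 + scores[:, 1] + 0.1 * rng.normal(size=n)

# Projection P_q(x): empirical inner products of each curve with the basis functions.
coefs = curves @ basis.T / m                # (n, q) coordinates on the truncated basis

# One-hidden-layer perceptron with a sigmoid activation, fitted on the projected data.
mlp = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                   max_iter=5000, random_state=0)
mlp.fit(coefs[:200], y[:200])
print("test MSE:", np.mean((mlp.predict(coefs[200:]) - y[200:]) ** 2))
```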
Assumptions for consistency of the functional MLP

Denote
$$\Psi^p_n = \arg\min_{\Psi \in \mathcal{C}^n_q} L^n\Psi$$
(where the truncation dimension $q$ is taken equal to $p$), and suppose that:

(A1) $G : \mathbb{R} \to [0, 1]$ is monotone, non-decreasing, with $\lim_{t \to +\infty} G(t) = 1$ and $\lim_{t \to -\infty} G(t) = 0$;
(A2) $\lim_{n \to +\infty} \frac{p_n \alpha_n \log(p_n \log \alpha_n)}{n} = 0$ and $\exists\, \delta > 0$: $\lim_{n \to +\infty} \frac{\alpha_n^2}{n^{1-\delta}} = 0$;
(A3) $Y$ is square integrable.
Strong consistency of the projection-based functional MLP

Theorem [Rossi and Conan-Guez, 2006]
Under assumptions (A1)-(A3),
$$\lim_{p \to +\infty} \lim_{n \to +\infty} L\Psi^p_n = L^* \quad \text{a.s.}$$

Sketch of the proof: The proof is divided into two parts.
1. The first one shows that
$$L^*_p = \inf_{\Psi : \mathbb{R}^p \to \mathbb{R}} E\big( (\Psi(P_p(X)) - Y)^2 \big) \xrightarrow{p \to +\infty} L^*.$$
2. The second one shows that, for any fixed $p$,
$$\lim_{n \to +\infty} L\Psi^p_n = L^*_p.$$

Remark: The limitation of this result is that it is a double limit and that no indication is given on the way $n$ and $p$ should be linked.

Remark 2: The principle of the proof is very general and can be applied to any other consistent method in $\mathbb{R}^p$.
Presentation of k-nearest neighbors for functional classification

This method was introduced in [Biau et al., 2005] for the binary classification case, and a regression version exists in the work of [Laloë, 2008].

Context: We are given a random pair $(X, Y)$ taking its values in $\mathcal{X} \times \{-1, 1\}$, where $(\mathcal{X}, \langle \cdot, \cdot \rangle_{\mathcal{X}})$ is a Hilbert space. Moreover, we are given $n$ i.i.d. observations of $(X, Y)$, denoted $(x_1, y_1), \ldots, (x_n, y_n)$.

Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis $(\phi_j)_{j \ge 1}$: denote $x^d_i = (x_{i1}, \ldots, x_{id})$ where $\forall\, i = 1, \ldots, n$ and $\forall\, j = 1, \ldots, d$, $x_{ij} = \langle x_i, \phi_j \rangle_{\mathcal{X}}$.

k-nearest neighbors for d-dimensional data is then performed on the dataset $(x^d_1, y_1), \ldots, (x^d_n, y_n)$: if, for all $u \in \mathbb{R}^d$,
$$V_k(u) := \big\{ i \in \{1, \ldots, n\} : \|x^d_i - u\|_{\mathbb{R}^d} \text{ is among the } k \text{ smallest of the values } \|x^d_1 - u\|_{\mathbb{R}^d}, \ldots, \|x^d_n - u\|_{\mathbb{R}^d} \big\},$$
then
$$\Psi_n : x \in \mathcal{X} \mapsto \begin{cases} -1 & \text{if } \sum_{i \in V_k(x^d)} I_{\{y_i = -1\}} > \sum_{i \in V_k(x^d)} I_{\{y_i = 1\}} \\ +1 & \text{otherwise.} \end{cases}$$
(A sketch of this classifier on simulated curves is given below.)
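Here is a minimal sketch of this projection-based classifier (the simulated curves and the sine basis are illustrative assumptions, not the experimental setting of [Biau et al., 2005]): each curve is represented by its first d basis coefficients and a standard k-nearest-neighbour vote is applied to these coefficients.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
n, m, d, k = 200, 128, 6, 5
t = np.linspace(0, 1, m)

# Two classes of noisy curves differing in their dominant frequency.
labels = rng.choice([-1, 1], size=n)
curves = (np.where(labels[:, None] == 1, np.sin(2 * np.pi * t), np.sin(4 * np.pi * t))
          + 0.5 * rng.normal(size=(n, m)))

# Projection on the first d functions of a sine basis: x_i^d = (<x_i, phi_1>, ..., <x_i, phi_d>).
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d)])   # (d, m)
x_d = curves @ basis.T / m                                                        # (n, d)

# k-NN vote on the d-dimensional coefficients, as in the definition of Psi_n above.
knn = KNeighborsClassifier(n_neighbors=k).fit(x_d[:150], labels[:150])
print("validation error:", np.mean(knn.predict(x_d[150:]) != labels[150:]))
```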
Selection of the dimension of projection and of the parameter k

$d$ and $k$ are then automatically selected from the dataset by a validation strategy (see the sketch after this slide):

1. For all $k \in \mathbb{N}^*$ and all $d \in \mathbb{N}^*$, compute the k-nearest neighbors classifier $\Psi^{d,l,k}_n$ from the data $\{(x^d_i, y_i)\}_{i=1,\ldots,l}$.
2. Choose
$$(d_n, k_n) = \arg\min_{k \in \mathbb{N}^*,\ d \in \mathbb{N}^*} \frac{1}{n-l} \sum_{i=l+1}^{n} I_{\{\Psi^{d,l,k}_n(x_i) \neq y_i\}} + \frac{\lambda_d}{\sqrt{n-l}},$$
where $\lambda_d$ is a penalization term that avoids the selection of (possibly overfitting) very large dimensions.

Then define $\Psi_n = \Psi^{d_n, l, k_n}_n$.
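The sketch below mirrors this penalized validation rule on simulated curves; the simulation, the grids for d and k, and the choice $\lambda_d = \sqrt{\log(d+1)}$ (which keeps $\sum_d e^{-2\lambda_d^2}$ finite) are illustrative assumptions, not values from the slides.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
n, m, l = 300, 128, 200                       # l training curves, n - l validation curves
t = np.linspace(0, 1, m)
labels = rng.choice([-1, 1], size=n)
curves = (np.where(labels[:, None] == 1, np.sin(2 * np.pi * t), np.sin(4 * np.pi * t))
          + 0.5 * rng.normal(size=(n, m)))

def coefficients(d):
    """Coordinates of all curves on the first d sine basis functions."""
    basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d)])
    return curves @ basis.T / m

best, best_crit = None, np.inf
for d in range(1, 16):
    x_d = coefficients(d)
    lam_d = np.sqrt(np.log(d + 1))            # penalization term (illustrative choice)
    for k in range(1, 21):
        knn = KNeighborsClassifier(n_neighbors=k).fit(x_d[:l], labels[:l])
        val_err = np.mean(knn.predict(x_d[l:]) != labels[l:])
        crit = val_err + lam_d / np.sqrt(n - l)
        if crit < best_crit:
            best, best_crit = (d, k), crit
print("selected (d, k):", best, " penalized criterion:", round(best_crit, 3))
```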
An oracle inequality

Oracle inequality [Biau et al., 2005]
Let $\Delta = \sum_{d=1}^{+\infty} e^{-2\lambda_d^2} < +\infty$. Then there exists $C > 0$, depending only on $\Delta$, such that $\forall\, l > 1/\Delta$,
$$E(L\Psi_n) - L^* \le \inf_{d \ge 1} \Big[ (L^*_d - L^*) + \inf_{1 \le k \le l} \big( E\big( L\Psi^{l,k,d}_n \big) - L^*_d \big) + \frac{\lambda_d}{\sqrt{n-l}} \Big] + C \sqrt{\frac{\log l}{n-l}}.$$

Then, we have:
- by a martingale property: $\lim_{d \to +\infty} L^*_d = L^*$;
- by consistency of k-nearest neighbors in $\mathbb{R}^d$: for all $d \ge 1$, $\inf_{1 \le k \le l} \big( E\big( L\Psi^{l,k,d}_n \big) - L^*_d \big) \xrightarrow{l \to +\infty} 0$;
- the rest of the right-hand side of the inequality can be made to converge to 0 as $n$ grows to infinity, for suitable choices of $n$, $l$ and $\lambda_d$.
Consistency of functional k-nearest neighbors

Theorem [Biau et al., 2005]
Suppose that
$$\lim_{n \to +\infty} l = +\infty, \qquad \lim_{n \to +\infty} (n - l) = +\infty, \qquad \lim_{n \to +\infty} \frac{\log l}{n - l} = 0;$$
then
$$\lim_{n \to +\infty} E(L\Psi_n) = L^*.$$
Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
A binary classification problem

- Suppose that we are given a random pair of variables $(X, Y)$ where $X$ takes its values in $\mathbb{R}^d$ and $Y$ takes its values in $\{-1, 1\}$.
- Moreover, we know $n$ i.i.d. realizations of the random pair $(X, Y)$, denoted $(x_1, y_1), \ldots, (x_n, y_n)$.
- We try to learn a classification machine $\Psi_n$ of the form $x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$ or, more precisely, of the form $x \mapsto \mathrm{Sign}(\langle \phi(x), w \rangle_{\mathcal{X}} + b)$, where the exact nature of $\phi$ and $\mathcal{X}$ will be discussed later.
Linear discrimination with optimal margin

Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$.

[Figure: two linearly separable groups of points, the optimal separating hyperplane with normal vector $w$, its margin and the support vectors.]

$w$ is such that:
$$\min_{w, b} \|w\|_{\mathbb{R}^d} \quad \text{such that } y_i(w^T x_i + b) \ge 1,\ 1 \le i \le n.$$
Linear discrimination with soft margin

Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle x, w \rangle_{\mathbb{R}^d} + b)$.

[Figure: two overlapping groups of points, the separating hyperplane with normal vector $w$, its margin and the support vectors.]

$w$ is such that:
$$\min_{w, b, \xi} \|w\|_{\mathbb{R}^d} + C \sum_{i=1}^{n} \xi_i \quad \text{where } y_i(w^T x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ 1 \le i \le n.$$
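A minimal numerical illustration of the soft-margin problem (not from the slides): scikit-learn's SVC with a linear kernel solves the usual parametrization $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, which differs from the formulation above only in how the norm is penalized; the toy data are an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Two overlapping Gaussian clouds in R^2, labelled -1 / +1.
n = 100
x = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)), rng.normal(1.0, 1.0, size=(n, 2))])
y = np.array([-1] * n + [1] * n)

# Soft-margin linear SVM: C controls the trade-off between the margin and the slack variables.
clf = SVC(kernel="linear", C=1.0).fit(x, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", np.round(w, 3), " b =", round(float(b), 3))
print("number of support vectors:", len(clf.support_))
print("training error:", np.mean(clf.predict(x) != y))
```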
Mapping the data onto a high dimensional space

Learn $\Psi_n : x \mapsto \mathrm{Sign}(\langle \phi(x), w \rangle_{\mathcal{X}} + b)$.

[Figure: the original space $\mathbb{R}^d$ is mapped onto the feature space $\mathcal{X}$ by a nonlinear map $\phi$.]

$w$ is such that:
$$(P_{C,\mathcal{X}}) \quad \min_{w, b, \xi} \|w\|_{\mathcal{X}} + C \sum_{i=1}^{n} \xi_i \quad \text{where } y_i(\langle w, \phi(x_i) \rangle_{\mathcal{X}} + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ 1 \le i \le n.$$
Details about the feature space: a regularization framework

Regularization framework: $(P_{C,\mathcal{X}})$ is equivalent to
$$(R_{\lambda,\mathcal{X}}) \quad \min_{F \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i F(x_i)) + \lambda \|F\|_{\mathcal{X}}.$$

Dual problem: $(P_{C,\mathcal{X}})$ is equivalent to
$$(D_{C,\mathcal{X}}) \quad \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{X}}$$
where $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, $1 \le i \le n$.

Inner product in $\mathcal{X}$: for all $u, v$ in the original space, $K(u, v) = \langle \phi(u), \phi(v) \rangle_{\mathcal{X}}$.
Examples of useful kernels

Provided that
$$\forall\, m \in \mathbb{N}^*,\ (u_i)_{i=1,\ldots,m} \in \mathbb{R}^d,\ (\alpha_i)_{i=1,\ldots,m} \in \mathbb{R}, \qquad \sum_{i,j=1}^{m} \alpha_i \alpha_j K(u_i, u_j) \ge 0,$$
$K$ can be used as a kernel mapping the original data onto a high dimensional feature space [Aronszajn, 1950]. For example:
- the Gaussian kernel: $K(u, v) = e^{-\sigma^2 \|u - v\|^2_{\mathbb{R}^d}}$ for $\sigma > 0$;
- the exponential kernel: $K(u, v) = e^{\langle u, v \rangle_{\mathbb{R}^d}}$;
- Vovk's real infinite polynomial: $K(u, v) = (1 - \langle u, v \rangle_{\mathbb{R}^d})^{-\alpha}$ for $\alpha > 0$;
- ...
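As a small illustration of the positive semi-definiteness condition and of the Gaussian kernel (an illustrative sketch, not part of the slides), the kernel matrix is computed explicitly, its eigenvalues are checked to be non-negative, and the same matrix is passed to an SVM through scikit-learn's precomputed-kernel interface; the toy data and the value of $\sigma$ are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
sigma = 1.0

# Toy data in R^2 with a nonlinear (circular) decision boundary.
x = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(x ** 2, axis=1) < 0.5, 1, -1)

# Gaussian kernel K(u, v) = exp(-sigma^2 ||u - v||^2), computed as a full n x n matrix.
sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K = np.exp(-sigma ** 2 * sq_dists)

# Positive semi-definiteness check: the eigenvalues of K are non-negative (up to rounding).
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())

# SVM trained directly on the precomputed kernel matrix.
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training error:", np.mean(clf.predict(K) != y))
```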
Assumptions for consistency of SVM in $\mathbb{R}^d$

Suppose that:
(A1) $X$ takes its values in a compact subset $\mathcal{W}$ of $\mathbb{R}^d$;
(A2) the kernel $K$ is universal on $\mathcal{W}$, i.e., the set of all functions $\{u \in \mathcal{W} \to \langle w, \phi(u) \rangle_{\mathcal{X}},\ w \in \mathcal{X}\}$ is dense in $\mathcal{C}^0(\mathcal{W})$;
(A3) $\forall\, \epsilon > 0$, the $\epsilon$-covering number of $\phi(\mathcal{W})$, that is, the minimum number of balls of radius $\epsilon$ needed to cover $\phi(\mathcal{W})$, satisfies $\mathcal{N}(K, \epsilon) = O(\epsilon^{-\alpha})$ for some $\alpha > 0$;
(A4) the regularization parameter $C$ depends on $n$ with $\lim_{n \to +\infty} n C_n = +\infty$ and $C_n = O(n^{\beta - 1})$ for some $0 < \beta < 1/\alpha$.

Remark: The Gaussian kernel satisfies all these assumptions, with $\mathcal{N}(K, \epsilon) = O(\epsilon^{-d})$.
Consistency of SVM in Rd
Theorem [Steinwart, 2002]
Under assumptions (A1)-(A4), SVMs are consistent.
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, 〈., .〉X).
1 We already talk about the advantages of regularization orprojection of the functional data as a pre-processing;
2 The consistency result can’t be directly applied with infinitedimensional data because the condition of covering number forinfinite dimensional Gaussian kernel is not valid.
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, 〈., .〉X).
1 We already talk about the advantages of regularization orprojection of the functional data as a pre-processing;
2 The consistency result can’t be directly applied with infinitedimensional data because the condition of covering number forinfinite dimensional Gaussian kernel is not valid.
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
Why SVM can’t be directly applied to functional data?
Suppose now that X takes its values in a Hilbert space (X, 〈., .〉X).
1 We already talk about the advantages of regularization orprojection of the functional data as a pre-processing;
2 The consistency result can’t be directly applied with infinitedimensional data because the condition of covering number forinfinite dimensional Gaussian kernel is not valid.
Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39
A consistent approach based on the ideas of [Biau et al., 2005]

1. $(\psi_j)_j$ is a Hilbert basis of $\mathcal{X}$: project the data on $(\psi_j)_{j=1,\ldots,d}$.
2. Choice of the parameters $a \equiv (d \in \mathbb{N},\ K \in \mathcal{J}_d,\ C \in [0; C_d])$ (a sketch of this scheme is given below):
- split the data into $B_1 = (x_1, y_1), \ldots, (x_l, y_l)$ and $B_2 = (x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)$;
- learn a SVM on $B_1$: $\Psi^{l,a}_n$;
- validate on $B_2$:
$$a^* = \arg\min_a\ \hat{L}^{n-l}\Psi^{l,a}_n + \frac{\lambda_d}{\sqrt{n-l}}, \quad \text{with } \hat{L}^{n-l}\Psi^{l,a}_n = \frac{1}{n-l} \sum_{i=l+1}^{n} I_{\{\Psi^{l,a}_n(x_i) \neq y_i\}}.$$
⇒ The obtained classifier is denoted $\Psi_n$.
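A compact sketch of this split/validate scheme is given below; the simulated curves, the grids of dimensions, kernels and values of C, and the penalty $\lambda_d$ are illustrative assumptions, and this is not the exact experimental procedure of [Rossi and Villa, 2006].

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
n, m, l = 300, 128, 200                               # l curves in B1, n - l curves in B2
t = np.linspace(0, 1, m)
labels = rng.choice([-1, 1], size=n)
curves = (np.where(labels[:, None] == 1, np.sin(2 * np.pi * t), np.sin(4 * np.pi * t))
          + 0.5 * rng.normal(size=(n, m)))

def project(d):
    """Coordinates of the curves on the first d functions of a sine basis."""
    basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d)])
    return curves @ basis.T / m

best, best_crit = None, np.inf
for d in (2, 4, 8):                                   # projection dimensions
    x_d = project(d)
    lam_d = np.sqrt(np.log(d + 1))                    # penalty term (illustrative choice)
    for kernel in ("linear", "rbf"):                  # J_d: a finite set of kernels
        for C in (0.1, 1.0, 10.0):                    # C in [0, C_d]
            clf = SVC(kernel=kernel, C=C).fit(x_d[:l], labels[:l])   # learn on B1
            val_err = np.mean(clf.predict(x_d[l:]) != labels[l:])    # validate on B2
            crit = val_err + lam_d / np.sqrt(n - l)
            if crit < best_crit:
                best, best_crit = (d, kernel, C), crit
print("selected (d, kernel, C):", best)
```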
Assumptions

Assumption on $X$:
(A1) $X$ takes its values in a bounded subset of $\mathcal{X}$.

Assumptions on the parameters: $\forall\, d \ge 1$,
(A2) $\mathcal{J}_d$ is a finite set;
(A3) $\exists\, K_d \in \mathcal{J}_d$ such that $K_d$ is universal on any compact subset of $\mathbb{R}^d$ and $\exists\, \nu_d > 0 : \mathcal{N}(K_d, \epsilon) = O(\epsilon^{-\nu_d})$;
(A4) $C_d > 1$;
(A5) $\sum_{d \ge 1} |\mathcal{J}_d|\, e^{-2\lambda_d^2} < +\infty$.

Assumptions on the training/validation sets:
(A6) $\lim_{n \to +\infty} l = +\infty$;
(A7) $\lim_{n \to +\infty} (n - l) = +\infty$;
(A8) $\lim_{n \to +\infty} \frac{l \log(n-l)}{n-l} = 0$.
Consistency

Theorem [Rossi and Villa, 2006]
Under assumptions (A1)-(A8), $\Psi_n$ is consistent:
$$E(L\Psi_n) \xrightarrow{n \to +\infty} L^*.$$

Ideas of the proof: The proof follows a sketch similar to the one in [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter $C$), based on the shatter coefficient of a class of functions that includes SVM.
Application 1: Voice recognition

Description of the data and methods:
- 3 problems and, for each problem, 100 records sampled at 8 192 points;
- consistent approach:
  - projection on a trigonometric basis;
  - splitting of the data base into 50 curves (training) / 49 curves (validation);
  - performances calculated by leave-one-out.

Results (error rates):

Prob.     | k-nn | QDA | SVM gau. (proj) | SVM lin. (proj) | SVM lin. (direct)
yes/no    | 10%  |  7% | 10%             | 19%             | 58%
boat/goat | 21%  | 35% |  8%             | 29%             | 46%
sh/ao     | 16%  | 19% | 12%             | 25%             | 47%
Regression by SVM

- Suppose that we are given a random pair of variables $(X, Y)$ where $X$ takes its values in $\mathbb{R}^d$ and $Y$ takes its values in $\mathbb{R}$.
- Moreover, we know $n$ i.i.d. realizations of the random pair $(X, Y)$, denoted $(x_1, y_1), \ldots, (x_n, y_n)$.
- Once again, we try to learn a regression machine $\Psi_n$ of the form $x \mapsto \langle \phi(x), w \rangle_{\mathcal{X}} + b$, where the exact nature of $\phi$ and $\mathcal{X}$ will be discussed later.
Generalization of the classification case to regression

$w$ and $b$ minimize
$$C \|w\|^2_{\mathcal{X}} + \sum_{i=1}^{n} L^{\epsilon}_k(x_i, y_i, w)$$
where $L^{\epsilon}_k$, for $k = 1, 2$ and $\epsilon \ge 0$, is the $\epsilon$-insensitive loss function
$$L^{\epsilon}_k(x_i, y_i, w) = \max\big(0, |y_i - \langle \phi(x_i), w \rangle_{\mathcal{X}}|^k - \epsilon\big),$$
or any other loss function.

Remark: A dual version, which is a quadratic optimization problem in $\mathbb{R}^n$, also exists.
A kernel ridge regression

When $\epsilon$ is equal to 0 and $k = 2$, the previous problem becomes: find $w$ and $b$ that minimize
$$\Upsilon \|w\|^2_{\mathcal{X}} + \sum_{i=1}^{n} (y_i - \langle \phi(x_i), w \rangle_{\mathcal{X}})^2,$$
which can be viewed as a kernel ridge regression. This method is also known under the name Least Squares SVM, or LS-SVM.

A multidimensional consistency result is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used for the regression case!
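A minimal kernel ridge regression sketch follows (illustrative; scikit-learn's KernelRidge solves the regularized least-squares problem in dual form without the intercept term b, a small departure from the LS-SVM formulation above, and the toy data and hyperparameters are assumptions).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(8)

# Toy regression problem: Y = sin(2 pi X) + noise.
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.2 * rng.normal(size=200)

# Kernel ridge regression with a Gaussian (RBF) kernel: dual-form solution of
# min_w  Upsilon * ||w||^2 + sum_i (y_i - <phi(x_i), w>)^2, with Upsilon = alpha.
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=10.0).fit(x[:150], y[:150])
pred = model.predict(x[150:])
print("test MSE:", np.mean((pred - y[150:]) ** 2))
```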
Table of contents
1 Basics in statistical learning theory
2 Examples of consistent methods for FDA
3 SVM
4 References
References

Further details for the references are given in the joint document.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Biau, G., Bunea, F., and Wegkamp, M. (2005). Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51:2163–2172.

Christmann, A. and Steinwart, I. (2007). Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799–819.

Laloë, T. (2008). A k-nearest neighbor approach for functional regression. Statistics and Probability Letters, 78(10):1189–1193.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing, 69(7-9):730–742.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.