FDA and statistical learning theory


Description: Short courses on functional data analysis and statistical learning, part 3. CENATAV, Havana, Cuba, September 17th, 2008.

Transcript

Page 1: FDA and Statistical learning theory

FDA and Statistical learning theory

Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org

Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France

La Havane, September 17th, 2008

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 1 / 39

Page 2: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 2 / 39

Page 3: FDA and Statistical learning theory

Purpose of statistical learning theory

In the previous presentations, the aim was to find an estimator that is "close" to the model.

The aim of statistical learning theory is slightly different: find a regression function that has a small error. More precisely, in the binary classification case:

we are given a pair of random variables, (X, Y), taking values in X × {−1, 1}, where X is any topological space;

we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called the learning set;

we intend to find a function Ψn : X → {−1, 1}, built from (x1, y1), . . . , (xn, yn), that minimizes

P(Ψn(X) ≠ Y).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
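To make this setup concrete, here is a small Python sketch (not part of the original slides, assuming NumPy is available): a toy pair (X, Y), a simple classifier Ψn built from a learning set, and a Monte Carlo estimate of its error P(Ψn(X) ≠ Y). The toy model and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X ~ N(0, 1) on the real line, Y = +1 with probability p(X), -1 otherwise.
def p_pos(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def sample(n):
    x = rng.normal(size=n)
    y = np.where(rng.uniform(size=n) < p_pos(x), 1, -1)
    return x, y

# A classifier built from a learning set: threshold at the midpoint of the two
# empirical class means (a deliberately simple choice of Psi_n).
x_train, y_train = sample(200)
threshold = 0.5 * (x_train[y_train == 1].mean() + x_train[y_train == -1].mean())
psi_n = lambda x: np.where(x > threshold, 1, -1)

# Monte Carlo estimate of P(Psi_n(X) != Y) on fresh, independent data.
x_test, y_test = sample(100_000)
print("estimated error P(Psi_n(X) != Y):", np.mean(psi_n(x_test) != y_test))
```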


Page 8: FDA and Statistical learning theory

First remarks on the aim

1. inf_{Ψ:X→{−1,1}} P(Ψ(X) ≠ Y) is the "target" for the expected error. This lower bound on the expected error is called the Bayes risk, denoted by L∗.

2. Generally, Ψn is chosen in a restricted class C of functions from X to {−1, 1}; the performance of Ψn can then be quantified by:

P(Ψn(X) ≠ Y) − L∗ = [ P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y) ]   (error due to the training method)
                  + [ inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗ ]              (error due to the choice of C)

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39


Page 12: FDA and Statistical learning theory

Consistency

From this last remark, we can define:

Definition: Weak consistency

An algorithm building the classifier Ψn is said to be (weakly universally) consistent if, for every distribution of the random pair (X, Y), we have

E(LΨn) → L∗ as n → +∞,

where LΨn := P(Ψn(X) ≠ Y | (xi, yi)_i).

Definition: Strong consistency

Moreover, it is said to be strongly (universally) consistent if, for every distribution of the random pair (X, Y), we have

LΨn → L∗ a.s. as n → +∞.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 5 / 39


Page 14: FDA and Statistical learning theory

Choice of C and of Ψn

1. The choice of C is of major importance for obtaining good performance of Ψn:

a class C that is too small (not rich enough) has a poor value of

inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗,

but a class C that is too rich has a poor value of

P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y)

because the learning algorithm tends to overfit the data.

2. A naive approach to finding a good Ψn over the class C is to minimize the empirical risk over C:

Ψn := arg min_{Ψ∈C} LnΨ,   where LnΨ := (1/n) Σ_{i=1}^{n} I{Ψ(xi) ≠ yi}.

The work of [Vapnik, 1995, Vapnik, 1998] links the choice of C to the accuracy of the empirical risk.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 6 / 39
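A minimal sketch of empirical risk minimization over a very small class C (threshold classifiers on the real line), assuming NumPy; the grid of thresholds and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Learning set: X real-valued, Y in {-1, +1}, with P(Y = 1 | X = x) increasing in x.
n = 200
x = rng.normal(size=n)
y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * x)), 1, -1)

# A small class C: classifiers x -> sign(x - t) for thresholds t on a grid.
thresholds = np.linspace(-2.0, 2.0, 81)

def empirical_risk(t):
    # L_n(Psi) = (1/n) * sum_i I{Psi(x_i) != y_i}
    return np.mean(np.where(x > t, 1, -1) != y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]          # Psi_n := argmin over C of the empirical risk
print("selected threshold:", round(float(t_hat), 2), "empirical risk:", risks.min())
```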


Page 17: FDA and Statistical learning theory

VC-dimension

A way to quantify the "richness" of a class of functions is to calculate its VC-dimension:

Definition: VC-dimension

A class of classifiers (functions from X to {−1, 1}), C, is said to shatter a set of data points z1, z2, . . . , zd ∈ X if, for all assignments of labels to those points, m1, m2, . . . , md ∈ {−1, 1}, there exists a Ψ ∈ C such that:

∀ i = 1, . . . , d, Ψ(zi) = mi.

The VC-dimension of a class of functions C is the maximum number of points that can be shattered by C.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 7 / 39


Page 19: FDA and Statistical learning theory

Example: VC-dimension of hyperplanes

Suppose that X = R² and C = {Ψ : x ∈ R² → ±Sign(aᵀx + b), a ∈ R² and b ∈ R}. Then:

2 points are shattered by C;

3 points are shattered by C;

4 points cannot be shattered by C: no Ψ ∈ C can have value 1 on the red circles and −1 on the black ones (figure omitted in this transcript); hence the VC-dimension of C is 3.

More generally, the VC-dimension of hyperplanes in R^d is d + 1. (A numerical check of these shattering claims is sketched below.)

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
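The shattering claims above can be checked numerically. The sketch below (assuming NumPy and SciPy) enumerates every ±1 labelling of a point set and tests, via a linear-programming feasibility problem, whether some hyperplane realizes it; the point configurations are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether some hyperplane sign(a.x + b) realizes the given labels,
    by testing feasibility of y_i (a.x_i + b) >= 1 as a linear program."""
    n, d = points.shape
    # Variables: (a_1, ..., a_d, b); constraints: -y_i (a.x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """True if every labelling of the points is realized by some hyperplane."""
    n = len(points)
    return all(linearly_separable(points, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False
```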


Page 32: FDA and Statistical learning theory

Relationship between VC-dimension and empirical error

Theorem [Vapnik, 1995, Vapnik, 1998]

With probability at least 1 − η,

sup_{Ψ∈C} | E(LΨ) − LnΨ | ≤ √( (VC(C) − log(η/4)) / n ).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 9 / 39
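As a rough numerical illustration of how the right-hand side of this bound shrinks with n (assuming NumPy; the values of VC(C) and η are illustrative):

```python
import numpy as np

def vc_bound(vc_dim, n, eta):
    """Right-hand side of the bound quoted above: sqrt((VC(C) - log(eta/4)) / n)."""
    return np.sqrt((vc_dim - np.log(eta / 4.0)) / n)

# VC(C) = 3 corresponds to the class of lines in R^2 from the previous example.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(float(vc_bound(vc_dim=3, n=n, eta=0.05)), 4))
```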

Page 33: FDA and Statistical learning theory

An alternative to VC-dimension

Remark: In most cases, the VC-dimension is not precise enough, so another quantity can also be considered:

Definition: Shatter coefficient

The n-th shatter coefficient of the class of functions C is the maximum number of partitions of n points into two sets that can be obtained with C. This number, denoted by S(C, n), is at most equal to 2^n.

Example: If C is the space of hyperplanes in R^d,

S(C, n) = 2^n if n ≤ d,   and   S(C, n) = 2^{d+1} = 2^{VC(C)} if n ≥ d + 1.

Remark: For all n > 2, S(C, n) ≤ n^{VC(C)}.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 10 / 39


Page 36: FDA and Statistical learning theory

Vapnik-Chervonenkis inequality

Theorem [Vapnik, 1995, Vapnik, 1998]

P( sup_{Ψ∈C} | LnΨ − E(LΨ) | > ε ) ≤ S(C, n) e^{−nε²/32}.

Consequences for the learning error on C: if Ψn has been chosen by minimizing the empirical risk, i.e.,

Ψn := arg min_{Ψ∈C} (1/n) Σ_{i=1}^{n} I{Ψ(xi) ≠ yi},

then, since

P( E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε ) ≤ P( 2 sup_{Ψ∈C} | LnΨ − E(LΨ) | > ε ),

we obtain

P( E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε ) ≤ S(C, n) e^{−nε²/128}.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39


Page 40: FDA and Statistical learning theory

Additional notes for the regression case

The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce:

the risk, for example the mean square error: for Ψ : X → R, LΨ = E( (Ψ(X) − Y)² | (xi, yi)_i );

the Bayes risk: L∗ = inf_{Ψ:X→R} E( (Ψ(X) − Y)² ). In this case, L∗ = E(LΨ∗) where Ψ∗ = E(Y | X);

the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σ_{i=1}^{n} (yi − Ψ(xi))².

Hence, in this case, a consistent regression scheme Ψn satisfies

lim_{n→+∞} E(LΨn) = L∗,

and a strongly consistent regression scheme Ψn satisfies

lim_{n→+∞} LΨn = L∗ a.s.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
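A toy illustration of these regression-case quantities, assuming NumPy. Here Ψ∗(x) = E(Y | X = x) = sin(x) is known by construction, so the Bayes risk L∗ equals the noise variance; everything in the snippet is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression model: Y = sin(X) + noise, so Psi*(x) = E(Y | X = x) = sin(x)
# and the Bayes risk L* equals the noise variance.
noise_std = 0.3
x = rng.uniform(-np.pi, np.pi, size=5_000)
y = np.sin(x) + rng.normal(scale=noise_std, size=x.size)

def empirical_risk(psi):
    return np.mean((y - psi(x)) ** 2)         # L_n(Psi) = (1/n) sum (y_i - Psi(x_i))^2

print("risk of Psi*(x) = sin(x):      ", round(empirical_risk(np.sin), 4))
print("risk of a poor Psi(x) = x/2:   ", round(empirical_risk(lambda t: t / 2), 4))
print("Bayes risk L* = noise variance:", noise_std ** 2)
```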


Page 44: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 13 / 39

Page 45: FDA and Statistical learning theory

Reminders on the functional multilayer perceptron: projection approach

Data: Suppose that we are given a random pair (X, Y) taking its values in X × R where (X, ⟨., .⟩_X) is a Hilbert space. Suppose also that we have n i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn).

Functional MLP: The projection approach is based on the knowledge of a Hilbert basis of X, denoted by (φk)_{k≥1}. The data (xi)_i and also the weights of the MLP are projected on this basis truncated at q:

C^n_q = { Ψ : X → R : ∀ x ∈ X, Ψ(x) = Σ_{l=1}^{pn} w^{(2)}_l G( w^{(0)}_l + Σ_{k=1}^{q} β^{(1)}_{lk} (Pq(x))_k ),  with Σ_{l=1}^{pn} |w^{(2)}_l| ≤ αn }

where (pn)_n is a sequence of integers, (αn)_n is a sequence of positive real numbers, G is a given continuous function, and the weights (w^{(2)}_l)_l, (w^{(0)}_l)_l and (β^{(1)}_{lk})_{l,k} have to be learned from the data set in R (see Presentation 2 for further details).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 14 / 39
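A hedged sketch of the projection approach, assuming NumPy and scikit-learn: curves are projected on the first q functions of a Fourier-type basis and a one-hidden-layer perceptron is fitted on the coefficients. Note that MLPRegressor is only a stand-in for the class C^n_q above: it does not enforce the weight constraint Σ|w^{(2)}_l| ≤ αn, and the basis, sample and architecture are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Toy functional data: n curves observed on a regular grid of [0, 1].
n, n_grid, q = 300, 100, 5
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (k + 1) * t) for k in range(q)])  # Fourier-type basis
scores = rng.normal(size=(n, q))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))
y = np.sin(scores[:, 0]) + 0.5 * scores[:, 1] + 0.1 * rng.normal(size=n)

# Projection P_q(x): inner products with the first q basis functions,
# approximated by a Riemann sum over the observation grid.
coeffs = curves @ basis.T / n_grid

# One-hidden-layer perceptron on the projected coefficients (no weight constraint here).
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(coeffs[:200], y[:200])
print("test MSE:", np.mean((mlp.predict(coeffs[200:]) - y[200:]) ** 2))
```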


Page 47: FDA and Statistical learning theory

Assumptions for consistency of functional MLP

Note

Ψ^p_n = arg min_{Ψ ∈ C^n_q} LnΨ,

and suppose that:

(A1) G : R → [0, 1] is monotone, non-decreasing, with lim_{t→+∞} G(t) = 1 and lim_{t→−∞} G(t) = 0;

(A2) lim_{n→+∞} pn αn log(pn log αn) / n = 0 and ∃ δ > 0 : lim_{n→+∞} α_n² / n^{1−δ} = 0;

(A3) Y is squared integrable.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39


Page 51: FDA and Statistical learning theory

Strong consistency of the projection-based functional MLP

Theorem [Rossi and Conan-Guez, 2006]

Under assumptions (A1)-(A3),

lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s.

Sketch of the proof: The proof is divided into two parts:

1. The first one shows that L∗_p = inf_{Ψ: R^p → R} E( (Ψ(Pp(X)) − Y)² ) converges to L∗ as p → +∞.

2. The second one shows that, for any fixed p, lim_{n→+∞} LΨ^p_n = L∗_p.

Remark: The limitation of this result lies in the fact that it is a double limit and that no indication is given on the way n and p should be linked.

Remark 2: The principle of the proof is very general and can be applied to any other consistent method in R^p.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39


Page 56: FDA and Statistical learning theory

Presentation of k-nearest neighbors for functional classification

This method has been introduced in [Biau et al., 2005] for the binary classification case, and a regression version exists in the work of [Laloë, 2008].

Context: We are given a random pair (X, Y) taking its values in X × {−1, 1} where (X, ⟨., .⟩_X) is a Hilbert space. Moreover, we are given n i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn).

Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis (φj)_{j≥1}: denote x^d_i = (x_{i1}, . . . , x_{id}) where, ∀ i = 1, . . . , n and ∀ j = 1, . . . , d, x_{ij} = ⟨xi, φj⟩_X.

k-nearest neighbors for d-dimensional data is then performed on the dataset (x^d_1, y1), . . . , (x^d_n, yn): if, for all u ∈ R^d,

Vk(u) := { i ∈ [[1, n]] : ‖x^d_i − u‖_{R^d} is among the k smallest of these n distances },

then

Ψn : x ∈ X → −1 if Σ_{i∈Vk(x^d)} I{yi=−1} > Σ_{i∈Vk(x^d)} I{yi=1}, and +1 otherwise.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
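A minimal sketch of functional k-nearest neighbors, assuming NumPy and scikit-learn: project each curve on the first d basis coefficients and run a standard k-NN classifier on the projected data. The toy curves, basis and parameter values are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

# Toy functional binary classification: class +1 curves load more heavily on the
# first basis function than class -1 curves.
n, n_grid, d = 400, 100, 4
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d)) + np.outer(labels, [1.0, 0.0, 0.0, 0.0])
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

# Projection step: x_i^d = (<x_i, phi_1>, ..., <x_i, phi_d>), approximated on the grid.
coeffs = curves @ basis.T / n_grid

# Standard k-NN on the d-dimensional coefficients.
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(coeffs[:300], labels[:300])
print("validation accuracy:", knn.score(coeffs[300:], labels[300:]))
```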


Page 60: FDA and Statistical learning theory

Selection of the dimension of projection and of the parameter k

d and k are then automatically selected from the dataset by a validation strategy:

1. For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors classifier Ψ^{d,l,k}_n from the data {(x^d_i, yi)}_{i=1,...,l}.

2. Choose

(dn, kn) = arg min_{k∈N∗, d∈N∗} [ (1/(n−l)) Σ_{i=l+1}^{n} I{Ψ^{d,l,k}_n(xi) ≠ yi} + λd / √(n−l) ]

where λd is a penalization term that avoids selecting (possibly overfitting) very large dimensions.

Then, define Ψn = Ψ^{dn,l,kn}_n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
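A hedged sketch of this split/validation selection of (d, k), assuming NumPy and scikit-learn; the grids, the choice λd = 0.1·d and the toy data are illustrative, not the authors' settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)

# Toy functional sample (same construction as in the previous sketch).
n, n_grid, d_max = 400, 100, 8
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d_max)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d_max)) + np.outer(labels, [1.0] + [0.0] * (d_max - 1))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

l = 300                                   # training part uses the first l pairs
best = None
for d in range(1, d_max + 1):
    coeffs = curves @ basis[:d].T / n_grid            # projection on the first d coefficients
    for k in range(1, 22, 2):
        knn = KNeighborsClassifier(n_neighbors=k).fit(coeffs[:l], labels[:l])
        val_error = np.mean(knn.predict(coeffs[l:]) != labels[l:])
        penalty = 0.1 * d / np.sqrt(n - l)            # illustrative lambda_d = 0.1 * d
        score = val_error + penalty
        if best is None or score < best[0]:
            best = (score, d, k)

print("selected (d, k):", best[1:], "penalized validation error:", round(best[0], 3))
```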


Page 64: FDA and Statistical learning theory

An oracle inequality

Oracle inequality [Biau et al., 2005]

Note ∆ = Σ_{d=1}^{+∞} e^{−2λ_d²} < +∞. Then there exists C > 0, depending only on ∆, such that, ∀ l > 1/∆,

E(LΨn) − L∗ ≤ inf_{d≥1} [ (L∗_d − L∗) + inf_{1≤k≤l} ( E(LΨ^{l,k,d}_n) − L∗_d ) + λd / √(n−l) ] + C √( log l / (n − l) ).

Then, we have:

by a martingale property: lim_{d→+∞} L∗_d = L∗;

by consistency of k-nearest neighbors in R^d: for all d ≥ 1, inf_{1≤k≤l} ( E(LΨ^{l,k,d}_n) − L∗_d ) → 0 as l → +∞;

the rest of the right-hand side of the inequality can be made to converge to 0 as n grows to infinity, for suitable choices of n, l and λd.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 19 / 39


Page 66: FDA and Statistical learning theory

Consistency of functional k-nearest neighbors

Theorem [Biau et al., 2005]

Suppose that

lim_{n→+∞} l = +∞,    lim_{n→+∞} (n − l) = +∞,    lim_{n→+∞} log l / (n − l) = 0;

then

lim_{n→+∞} E(LΨn) = L∗.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 20 / 39

Page 67: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 21 / 39

Page 68: FDA and Statistical learning theory

A binary classification problem

Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in {−1, 1}.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted (x1, y1), . . . , (xn, yn).

We try to learn a classification machine Ψn of the form x → Sign(⟨x, w⟩_{R^d} + b) or, more precisely, of the form

x → Sign(⟨φ(x), w⟩_X + b)

where the exact nature of φ and X will be discussed later.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 22 / 39


Page 71: FDA and Statistical learning theory

Linear discrimination with optimal margin

Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b).

[Figure: a separating hyperplane in R^d, its normal vector w, the margin 1/‖w‖, and the support vectors.]

w is such that:

min_{w,b} ‖w‖_{R^d},
such that: yi(wᵀxi + b) ≥ 1, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39


Page 75: FDA and Statistical learning theory

Linear discrimination with soft margin

Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b).

[Figure: a separating hyperplane in R^d, its normal vector w, the margin 1/‖w‖, and the support vectors.]

w is such that:

min_{w,b,ξ} ‖w‖_{R^d} + C Σ_{i=1}^{n} ξi,
where: yi(wᵀxi + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
       ξi ≥ 0, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
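A minimal sketch of the soft-margin linear SVM, assuming NumPy and scikit-learn. SVC with a linear kernel solves the usual min ½‖w‖² + C Σ ξi formulation, which is the standard parametrization of the problem displayed above; the toy data and the value of C are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)

# Two slightly overlapping Gaussian clouds in R^2, labels in {-1, +1}.
n = 200
y = rng.choice([-1, 1], size=n)
x = rng.normal(size=(n, 2)) + np.outer(y, [1.5, 0.0])

# Soft-margin linear SVM; C weighs the slack variables xi_i.
clf = SVC(kernel="linear", C=1.0).fit(x, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("margin 1/||w||:", 1.0 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
print("training error:", np.mean(clf.predict(x) != y))
```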


Page 79: FDA and Statistical learning theory

Mapping the data onto a high dimensional space

Learn Ψn : x → Sign(⟨φ(x), w⟩_X + b).

[Figure: the nonlinear map φ sends the original space R^d to the feature space X.]

w is such that:

(P_{C,X}) min_{w,b,ξ} ‖w‖_X + C Σ_{i=1}^{n} ξi,
where: yi(⟨w, φ(xi)⟩_X + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
       ξi ≥ 0, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39


Page 83: FDA and Statistical learning theory

Details about the feature space: a regularization framework

Regularization framework: (P_{C,X}) ⇔

(R_{λ,X}) min_{F∈X} (1/n) Σ_{i=1}^{n} max(0, 1 − yi F(xi)) + λ ‖F‖_X.

Dual problem: (P_{C,X}) ⇔

(D_{C,X}) max_α Σ_{i=1}^{n} αi − Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj ⟨φ(xi), φ(xj)⟩_X
where Σ_{i=1}^{n} αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.

Inner product in X: ∀ u, v ∈ X, K(u, v) = ⟨φ(u), φ(v)⟩_X.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 26 / 39


Page 86: FDA and Statistical learning theory

Examples of useful kernels

Provided that

∀ m ∈ N∗, ∀ (ui)_{i=1,...,m} ∈ (R^d)^m, ∀ (αi)_{i=1,...,m} ∈ R^m,   Σ_{i,j=1}^{m} αi αj K(ui, uj) ≥ 0,

K can be used as a kernel mapping the original data onto a high-dimensional feature space [Aronszajn, 1950]. For example:

The Gaussian kernel: K(u, v) = e^{−σ²‖u−v‖²_{R^d}} for σ > 0;

The exponential kernel: K(u, v) = e^{⟨u,v⟩_{R^d}};

Vovk's real infinite polynomial: K(u, v) = (1 − ⟨u, v⟩_{R^d})^{−α} for α > 0;

. . .

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
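A small numerical illustration of the positivity condition above for the Gaussian kernel, assuming NumPy: the condition Σ αi αj K(ui, uj) ≥ 0 for all α is equivalent to the Gram matrix being positive semi-definite, which can be checked through its eigenvalues. The points and the value of σ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-sigma^2 * ||u - v||^2)."""
    diff = u[:, None, :] - v[None, :, :]
    return np.exp(-sigma**2 * np.sum(diff**2, axis=-1))

# Gram matrix on m random points of R^d and a numerical PSD check:
# all eigenvalues >= 0 iff sum_{i,j} alpha_i alpha_j K(u_i, u_j) >= 0 for all alpha.
u = rng.normal(size=(30, 2))
gram = gaussian_kernel(u, u)
eigenvalues = np.linalg.eigvalsh(gram)
print("smallest eigenvalue of the Gram matrix:", eigenvalues.min())
```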


Page 90: FDA and Statistical learning theory

Assumptions for consistency of SVM in R^d

Suppose that:

(A1) X takes its values in a compact subset W of R^d;

(A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W));

(A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), satisfies N(K, ε) = O(ε^{−α}) for some α > 0;

(A4) the regularization parameter C depends on n with lim_{n→+∞} nCn = +∞ and Cn = O(n^{β−1}) for some 0 < β < 1/α.

Remark: The Gaussian kernel satisfies all these assumptions with N(K, ε) = O(ε^{−d}).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39


Page 95: FDA and Statistical learning theory

Consistency of SVM in R^d

Theorem [Steinwart, 2002]

Under assumptions (A1)-(A4), SVMs are consistent.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 29 / 39

Page 96: FDA and Statistical learning theory

Why can't SVM be directly applied to functional data?

Suppose now that X takes its values in a Hilbert space (X, ⟨., .⟩_X).

1. We have already discussed the advantages of regularization or projection of the functional data as a pre-processing step;

2. The consistency result cannot be directly applied to infinite-dimensional data because the covering-number condition is not valid for the Gaussian kernel in infinite dimension.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39


Page 99: FDA and Statistical learning theory

A consistent approach based on the ideas of [Biau et al., 2005]

1. (ψj)_j is a Hilbert basis of X: project on (ψj)_{j=1,...,d}.

2. Choice of the parameters a ≡ (d ∈ N, K ∈ J_d, C ∈ [0; C_d]):

Split the data: B1 = (x1, y1), . . . , (xl, yl) and B2 = (x_{l+1}, y_{l+1}), . . . , (xn, yn);

Learn an SVM on B1: Ψ^{l,a}_n;

Validate on B2:

a∗ = arg min_a [ L̂^{n−l}Ψ^{l,a}_n + λd / √(n−l) ]   with   L̂^{n−l}Ψ^{l,a}_n = (1/(n−l)) Σ_{i=l+1}^{n} I{Ψ^{l,a}_n(xi) ≠ yi}.

⇒ The obtained classifier is denoted Ψn.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
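A hedged end-to-end sketch of this procedure, assuming NumPy and scikit-learn: project on the first d coefficients, learn a Gaussian-kernel SVM on B1 for each candidate a = (d, σ, C), and select a by the penalized validation error on B2. In scikit-learn's RBF kernel exp(−γ‖u−v‖²), γ plays the role of σ² in the Gaussian kernel above; the grids, λd and toy data are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)

# Toy functional binary classification sample (same construction as earlier sketches).
n, n_grid, d_max = 400, 100, 6
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d_max)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d_max)) + np.outer(labels, [1.2] + [0.0] * (d_max - 1))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

l = 300                                    # B1 = first l pairs, B2 = the remaining n - l
best = None
for d in range(1, d_max + 1):
    coeffs = curves @ basis[:d].T / n_grid             # projection on (psi_1, ..., psi_d)
    for sigma in (0.1, 1.0, 10.0):                     # candidate Gaussian kernels in J_d
        for C in (0.1, 1.0, 10.0):                     # candidate values of C in [0, C_d]
            svm = SVC(kernel="rbf", gamma=sigma**2, C=C).fit(coeffs[:l], labels[:l])
            val_error = np.mean(svm.predict(coeffs[l:]) != labels[l:])
            penalized = val_error + 0.1 * d / np.sqrt(n - l)   # illustrative lambda_d
            if best is None or penalized < best[0]:
                best = (penalized, d, sigma, C)

print("selected (d, sigma, C):", best[1:], "penalized validation error:", round(best[0], 3))
```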


Page 105: FDA and Statistical learning theory

Assumptions

Assumptions on X

(A1) X takes its values in a bounded subset of X.

Assumptions on the parameters: for all d ≥ 1,
(A2) J_d is a finite set;
(A3) there exists K_d ∈ J_d such that K_d is universal on any compact subset of R^d and there exists ν_d > 0 with N(K_d, ε) = O(ε^{-ν_d});
(A4) C_d > 1;
(A5) ∑_{d≥1} |J_d| e^{-2λ_d²} < +∞.

Assumptions on training/validation sets

(A6) l → +∞ as n → +∞;
(A7) n − l → +∞ as n → +∞;
(A8) l log(n−l)/(n−l) → 0 as n → +∞.
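
For instance (an illustrative choice, not taken from the slide), assumption (A5) holds as soon as the sets J_d have bounded cardinality and λ_d = √d, since ∑_{d≥1} e^{-2d} < +∞.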


Consistency

Theorem [Rossi and Villa, 2006]: Under assumptions (A1)-(A8), Ψ_n is consistent:

E(L_{Ψ_n}) → L* as n → +∞.

Idea of the proof: the proof follows a sketch similar to that of [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter C), thanks to a bound on the shatter coefficient of a class of functions that includes SVM.


Application 1: Voice recognition

Description of the data and methods:

3 problems; for each problem, 100 records sampled at 8 192 points;

Consistent approach: projection on a trigonometric basis; splitting of the database into 50 curves (training) / 49 curves (validation); performances estimated by leave-one-out.

Results

Prob.       k-nn    QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
yes/no      10%     7%     10%               19%               58%
boat/goat   21%     35%    8%                29%               46%
sh/ao       16%     19%    12%               25%               47%


Regression by SVM

Suppose that we are given a random pair of variables (X, Y), where X takes its values in R^d and Y takes its values in R.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted by (x_1, y_1), ..., (x_n, y_n).

Once again, we try to learn a regression machine, Ψ_n, of the form

x → 〈φ(x), w〉_X + b,

where the exact nature of φ and X will be discussed later.


Generalization of the classification case to regression

w and b minimize

C ‖w‖²_X + ∑_{i=1}^{n} L^ε_k(x_i, y_i, w)

where L^ε_k, for k = 1, 2 and ε ≥ 0, is the ε-insensitive loss function:

L^ε_k(x_i, y_i, w) = max(0, |y_i − 〈φ(x_i), w〉_X|^k − ε),

or any other loss function.

Remark: A dual version, which is a quadratic optimization problem in R^n, also exists.
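
As an illustration (not from the slides), the case k = 1 with ε > 0 is standard ε-insensitive support vector regression, available for instance in scikit-learn as SVR; the synthetic data and parameter values below are assumptions.

```python
# Sketch: epsilon-insensitive SVM regression (the k = 1 case of the loss above).
# Synthetic data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon-insensitive loss, Gaussian kernel
reg.fit(X, y)
print("training MSE:", round(float(np.mean((reg.predict(X) - y) ** 2)), 4))
```

For a linear kernel, the k = 2 variant (squared ε-insensitive loss) is also available as LinearSVR(loss="squared_epsilon_insensitive").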


A kernel ridge regression

When ε is equal to 0 and k = 2, the previous problem becomes: find w and b that minimize

C ‖w‖²_X + ∑_{i=1}^{n} (y_i − 〈φ(x_i), w〉_X)²,

which can be viewed as a kernel ridge regression. This method is also known under the name of Least-Squares SVM (LS-SVM).

A consistency result in the multidimensional setting is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used in the regression case!
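
As an illustration (not from the slides), this ε = 0, k = 2 case corresponds to kernel ridge regression, implemented for instance in scikit-learn as KernelRidge; the kernel and the regularization value below are assumptions.

```python
# Sketch: kernel ridge regression / LS-SVM (epsilon = 0 and k = 2 above).
# alpha plays the role of the regularization constant; all values are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)   # squared loss + RKHS-norm penalty
krr.fit(X, y)
print("training MSE:", round(float(np.mean((krr.predict(X) - y) ** 2)), 4))
```

Note that, unlike the formulation above, KernelRidge does not fit an intercept b, which does not change the nature of the estimator.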


Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References


References

Further details on these references are given in the accompanying document.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Biau, G., Bunea, F., and Wegkamp, M. (2005). Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51:2163–2172.

Christmann, A. and Steinwart, I. (2007). Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799–819.

Laloë, T. (2008). A k-nearest neighbor approach for functional regression. Statistics and Probability Letters, 78(10):1189–1193.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing, 69(7-9):730–742.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
