FDA and statistical learning theory


Description: Short courses on functional data analysis and statistical learning, part 3. CENATAV, Havana, Cuba, September 17th, 2008.

Transcript

Page 1: FDA and Statistical learning theory

FDA and Statistical learning theory

Nathalie Villa-Vialaneix - [email protected] - http://www.nathalievilla.org

Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France

La Havane, September 17th, 2008

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 1 / 39

Page 2: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 2 / 39

Page 3: FDA and Statistical learning theory

Purpose of statistical learning theory

In the previous presentations, the aim was to find an estimator that is "close" to the model.

The aim of statistical learning theory is slightly different: find a regression function that has a small error. More precisely, in the binary classification case:

we are given a pair of random variables, (X, Y), taking values in X × {−1, 1}, where X is any topological space;

we observe n i.i.d. realizations of (X, Y), (x1, y1), . . . , (xn, yn), called the learning set;

we intend to find a function Ψn : X → {−1, 1}, built from (x1, y1), . . . , (xn, yn), that minimizes

P(Ψn(X) ≠ Y).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 3 / 39
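To make this setup concrete, here is a small Python sketch (not part of the original slides, assuming NumPy is available): a toy pair (X, Y), a simple classifier Ψn built from a learning set, and a Monte Carlo estimate of its error P(Ψn(X) ≠ Y). The toy model and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X ~ N(0, 1) on the real line, Y = +1 with probability p(X), -1 otherwise.
def p_pos(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def sample(n):
    x = rng.normal(size=n)
    y = np.where(rng.uniform(size=n) < p_pos(x), 1, -1)
    return x, y

# A classifier built from a learning set: threshold at the midpoint of the two
# empirical class means (a deliberately simple choice of Psi_n).
x_train, y_train = sample(200)
threshold = 0.5 * (x_train[y_train == 1].mean() + x_train[y_train == -1].mean())
psi_n = lambda x: np.where(x > threshold, 1, -1)

# Monte Carlo estimate of P(Psi_n(X) != Y) on fresh, independent data.
x_test, y_test = sample(100_000)
print("estimated error P(Psi_n(X) != Y):", np.mean(psi_n(x_test) != y_test))
```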


Page 8: FDA and Statistical learning theory

First remarks on the aim

1. inf_{Ψ:X→{−1,1}} P(Ψ(X) ≠ Y) is the "target" for the expected error. This lower bound on the expected error is called the Bayes risk, denoted by L∗.

2. Generally, Ψn is chosen in a restricted class C of functions from X to {−1, 1}; the performance of Ψn can then be quantified by:

P(Ψn(X) ≠ Y) − L∗ = [ P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y) ]   (error due to the training method)
                  + [ inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗ ]              (error due to the choice of C)

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 4 / 39


Page 12: FDA and Statistical learning theory

Consistency

From this last remark, we can define:

Definition: Weak consistency

An algorithm building the classifier Ψn is said to be (weakly universally) consistent if, for every distribution of the random pair (X, Y), we have

E(LΨn) → L∗ as n → +∞,

where LΨn := P(Ψn(X) ≠ Y | (xi, yi)_i).

Definition: Strong consistency

Moreover, it is said to be strongly (universally) consistent if, for every distribution of the random pair (X, Y), we have

LΨn → L∗ a.s. as n → +∞.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 5 / 39


Page 14: FDA and Statistical learning theory

Choice of C and of Ψn

1. The choice of C is of major importance for obtaining good performance of Ψn:

a class C that is too small (not rich enough) has a poor value of

inf_{Ψ∈C} P(Ψ(X) ≠ Y) − L∗,

but a class C that is too rich has a poor value of

P(Ψn(X) ≠ Y) − inf_{Ψ∈C} P(Ψ(X) ≠ Y)

because the learning algorithm tends to overfit the data.

2. A naive approach to finding a good Ψn over the class C is to minimize the empirical risk over C:

Ψn := arg min_{Ψ∈C} LnΨ,   where LnΨ := (1/n) Σ_{i=1}^{n} I{Ψ(xi) ≠ yi}.

The work of [Vapnik, 1995, Vapnik, 1998] links the choice of C to the accuracy of the empirical risk.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 6 / 39
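A minimal sketch of empirical risk minimization over a very small class C (threshold classifiers on the real line), assuming NumPy; the grid of thresholds and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Learning set: X real-valued, Y in {-1, +1}, with P(Y = 1 | X = x) increasing in x.
n = 200
x = rng.normal(size=n)
y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * x)), 1, -1)

# A small class C: classifiers x -> sign(x - t) for thresholds t on a grid.
thresholds = np.linspace(-2.0, 2.0, 81)

def empirical_risk(t):
    # L_n(Psi) = (1/n) * sum_i I{Psi(x_i) != y_i}
    return np.mean(np.where(x > t, 1, -1) != y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]          # Psi_n := argmin over C of the empirical risk
print("selected threshold:", round(float(t_hat), 2), "empirical risk:", risks.min())
```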


Page 17: FDA and Statistical learning theory

VC-dimension

A way to quantify the "richness" of a class of functions is to calculate its VC-dimension:

Definition: VC-dimension

A class of classifiers (functions from X to {−1, 1}), C, is said to shatter a set of data points z1, z2, . . . , zd ∈ X if, for all assignments of labels to those points, m1, m2, . . . , md ∈ {−1, 1}, there exists a Ψ ∈ C such that:

∀ i = 1, . . . , d, Ψ(zi) = mi.

The VC-dimension of a class of functions C is the maximum number of points that can be shattered by C.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 7 / 39


Page 19: FDA and Statistical learning theory

Example: VC-dimension of hyperplanes

Suppose that X = R² and C = {Ψ : x ∈ R² → ±Sign(aᵀx + b), a ∈ R² and b ∈ R}. Then:

2 points are shattered by C;

3 points are shattered by C;

4 points cannot be shattered by C: no Ψ ∈ C can have value 1 on the red circles and −1 on the black ones (figure omitted in this transcript); hence the VC-dimension of C is 3.

More generally, the VC-dimension of hyperplanes in R^d is d + 1. (A numerical check of these shattering claims is sketched below.)

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 8 / 39
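The shattering claims above can be checked numerically. The sketch below (assuming NumPy and SciPy) enumerates every ±1 labelling of a point set and tests, via a linear-programming feasibility problem, whether some hyperplane realizes it; the point configurations are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether some hyperplane sign(a.x + b) realizes the given labels,
    by testing feasibility of y_i (a.x_i + b) >= 1 as a linear program."""
    n, d = points.shape
    # Variables: (a_1, ..., a_d, b); constraints: -y_i (a.x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """True if every labelling of the points is realized by some hyperplane."""
    n = len(points)
    return all(linearly_separable(points, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False
```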


Page 32: FDA and Statistical learning theory

Relationship between VC-dimension and empirical error

Theorem [Vapnik, 1995, Vapnik, 1998]

With probability at least 1 − η,

sup_{Ψ∈C} | E(LΨ) − LnΨ | ≤ √( (VC(C) − log(η/4)) / n ).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 9 / 39
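As a rough numerical illustration of how the right-hand side of this bound shrinks with n (assuming NumPy; the values of VC(C) and η are illustrative):

```python
import numpy as np

def vc_bound(vc_dim, n, eta):
    """Right-hand side of the bound quoted above: sqrt((VC(C) - log(eta/4)) / n)."""
    return np.sqrt((vc_dim - np.log(eta / 4.0)) / n)

# VC(C) = 3 corresponds to the class of lines in R^2 from the previous example.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(float(vc_bound(vc_dim=3, n=n, eta=0.05)), 4))
```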

Page 33: FDA and Statistical learning theory

An alternative to VC-dimension

Remark: In most cases, the VC-dimension is not precise enough, so another quantity can also be considered:

Definition: Shatter coefficient

The n-th shatter coefficient of the class of functions C is the maximum number of partitions of n points into two sets that can be obtained with C. This number, denoted by S(C, n), is at most equal to 2^n.

Example: If C is the space of hyperplanes in R^d,

S(C, n) = 2^n if n ≤ d,   and   S(C, n) = 2^{d+1} = 2^{VC(C)} if n ≥ d + 1.

Remark: For all n > 2, S(C, n) ≤ n^{VC(C)}.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 10 / 39


Page 36: FDA and Statistical learning theory

Vapnik-Chervonenkis inequality

Theorem [Vapnik, 1995, Vapnik, 1998]

P( sup_{Ψ∈C} | LnΨ − E(LΨ) | > ε ) ≤ S(C, n) e^{−nε²/32}.

Consequences for the learning error on C: if Ψn has been chosen by minimizing the empirical risk, i.e.,

Ψn := arg min_{Ψ∈C} (1/n) Σ_{i=1}^{n} I{Ψ(xi) ≠ yi},

then, since

P( E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε ) ≤ P( 2 sup_{Ψ∈C} | LnΨ − E(LΨ) | > ε ),

we obtain

P( E(LΨn) − inf_{Ψ∈C} E(LΨ) > ε ) ≤ S(C, n) e^{−nε²/128}.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 11 / 39


Page 40: FDA and Statistical learning theory

Additional notes for the regression case

The same theory can be developed for the regression case under additional assumptions. To summarize, let (X, Y) be a random pair taking its values in X × R and (x1, y1), . . . , (xn, yn) a training set of n i.i.d. realizations of (X, Y). Then, we can introduce:

the risk, for example the mean square error: for Ψ : X → R, LΨ = E( (Ψ(X) − Y)² | (xi, yi)_i );

the Bayes risk: L∗ = inf_{Ψ:X→R} E( (Ψ(X) − Y)² ). In this case, L∗ = E(LΨ∗) where Ψ∗ = E(Y | X);

the empirical risk: for Ψ : X → R, LnΨ = (1/n) Σ_{i=1}^{n} (yi − Ψ(xi))².

Hence, in this case, a consistent regression scheme Ψn satisfies

lim_{n→+∞} E(LΨn) = L∗,

and a strongly consistent regression scheme Ψn satisfies

lim_{n→+∞} LΨn = L∗ a.s.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 12 / 39
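A toy illustration of these regression-case quantities, assuming NumPy. Here Ψ∗(x) = E(Y | X = x) = sin(x) is known by construction, so the Bayes risk L∗ equals the noise variance; everything in the snippet is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression model: Y = sin(X) + noise, so Psi*(x) = E(Y | X = x) = sin(x)
# and the Bayes risk L* equals the noise variance.
noise_std = 0.3
x = rng.uniform(-np.pi, np.pi, size=5_000)
y = np.sin(x) + rng.normal(scale=noise_std, size=x.size)

def empirical_risk(psi):
    return np.mean((y - psi(x)) ** 2)         # L_n(Psi) = (1/n) sum (y_i - Psi(x_i))^2

print("risk of Psi*(x) = sin(x):      ", round(empirical_risk(np.sin), 4))
print("risk of a poor Psi(x) = x/2:   ", round(empirical_risk(lambda t: t / 2), 4))
print("Bayes risk L* = noise variance:", noise_std ** 2)
```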


Page 44: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 13 / 39

Page 45: FDA and Statistical learning theory

Reminders on the functional multilayer perceptron: projection approach

Data: Suppose that we are given a random pair (X, Y) taking its values in X × R where (X, ⟨., .⟩_X) is a Hilbert space. Suppose also that we have n i.i.d. observations of (X, Y), (x1, y1), . . . , (xn, yn).

Functional MLP: The projection approach is based on the knowledge of a Hilbert basis of X, denoted by (φk)_{k≥1}. The data (xi)_i and also the weights of the MLP are projected on this basis truncated at q:

C^n_q = { Ψ : X → R : ∀ x ∈ X, Ψ(x) = Σ_{l=1}^{pn} w^{(2)}_l G( w^{(0)}_l + Σ_{k=1}^{q} β^{(1)}_{lk} (Pq(x))_k ),  with Σ_{l=1}^{pn} |w^{(2)}_l| ≤ αn }

where (pn)_n is a sequence of integers, (αn)_n is a sequence of positive real numbers, G is a given continuous function, and the weights (w^{(2)}_l)_l, (w^{(0)}_l)_l and (β^{(1)}_{lk})_{l,k} have to be learned from the data set in R (see Presentation 2 for further details).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 14 / 39
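A hedged sketch of the projection approach, assuming NumPy and scikit-learn: curves are projected on the first q functions of a Fourier-type basis and a one-hidden-layer perceptron is fitted on the coefficients. Note that MLPRegressor is only a stand-in for the class C^n_q above: it does not enforce the weight constraint Σ|w^{(2)}_l| ≤ αn, and the basis, sample and architecture are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Toy functional data: n curves observed on a regular grid of [0, 1].
n, n_grid, q = 300, 100, 5
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (k + 1) * t) for k in range(q)])  # Fourier-type basis
scores = rng.normal(size=(n, q))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))
y = np.sin(scores[:, 0]) + 0.5 * scores[:, 1] + 0.1 * rng.normal(size=n)

# Projection P_q(x): inner products with the first q basis functions,
# approximated by a Riemann sum over the observation grid.
coeffs = curves @ basis.T / n_grid

# One-hidden-layer perceptron on the projected coefficients (no weight constraint here).
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(coeffs[:200], y[:200])
print("test MSE:", np.mean((mlp.predict(coeffs[200:]) - y[200:]) ** 2))
```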


Page 47: FDA and Statistical learning theory

Assumptions for consistency of functional MLP

Note

Ψ^p_n = arg min_{Ψ ∈ C^n_q} LnΨ,

and suppose that:

(A1) G : R → [0, 1] is monotone, non-decreasing, with lim_{t→+∞} G(t) = 1 and lim_{t→−∞} G(t) = 0;

(A2) lim_{n→+∞} pn αn log(pn log αn) / n = 0 and ∃ δ > 0 : lim_{n→+∞} α_n² / n^{1−δ} = 0;

(A3) Y is squared integrable.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 15 / 39


Page 51: FDA and Statistical learning theory

Strong consistency of the projection-based functional MLP

Theorem [Rossi and Conan-Guez, 2006]

Under assumptions (A1)-(A3),

lim_{p→+∞} lim_{n→+∞} LΨ^p_n = L∗ a.s.

Sketch of the proof: The proof is divided into two parts:

1. The first one shows that L∗_p = inf_{Ψ: R^p → R} E( (Ψ(Pp(X)) − Y)² ) converges to L∗ as p → +∞.

2. The second one shows that, for any fixed p, lim_{n→+∞} LΨ^p_n = L∗_p.

Remark: The limitation of this result lies in the fact that it is a double limit and that no indication is given on the way n and p should be linked.

Remark 2: The principle of the proof is very general and can be applied to any other consistent method in R^p.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 16 / 39


Page 56: FDA and Statistical learning theory

Presentation of k-nearest neighbors for functional classification

This method has been introduced in [Biau et al., 2005] for the binary classification case, and a regression version exists in the work of [Laloë, 2008].

Context: We are given a random pair (X, Y) taking its values in X × {−1, 1} where (X, ⟨., .⟩_X) is a Hilbert space. Moreover, we are given n i.i.d. observations of (X, Y), denoted (x1, y1), . . . , (xn, yn).

Functional k-nearest neighbors also consists in using the projection of the data on a Hilbert basis (φj)_{j≥1}: denote x^d_i = (x_{i1}, . . . , x_{id}) where, ∀ i = 1, . . . , n and ∀ j = 1, . . . , d, x_{ij} = ⟨xi, φj⟩_X.

k-nearest neighbors for d-dimensional data is then performed on the dataset (x^d_1, y1), . . . , (x^d_n, yn): if, for all u ∈ R^d,

Vk(u) := { i ∈ [[1, n]] : ‖x^d_i − u‖_{R^d} is among the k smallest of these n distances },

then

Ψn : x ∈ X → −1 if Σ_{i∈Vk(x^d)} I{yi=−1} > Σ_{i∈Vk(x^d)} I{yi=1}, and +1 otherwise.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 17 / 39
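A minimal sketch of functional k-nearest neighbors, assuming NumPy and scikit-learn: project each curve on the first d basis coefficients and run a standard k-NN classifier on the projected data. The toy curves, basis and parameter values are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

# Toy functional binary classification: class +1 curves load more heavily on the
# first basis function than class -1 curves.
n, n_grid, d = 400, 100, 4
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d)) + np.outer(labels, [1.0, 0.0, 0.0, 0.0])
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

# Projection step: x_i^d = (<x_i, phi_1>, ..., <x_i, phi_d>), approximated on the grid.
coeffs = curves @ basis.T / n_grid

# Standard k-NN on the d-dimensional coefficients.
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(coeffs[:300], labels[:300])
print("validation accuracy:", knn.score(coeffs[300:], labels[300:]))
```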


Page 60: FDA and Statistical learning theory

Selection of the dimension of projection and of the parameter k

d and k are then automatically selected from the dataset by a validation strategy:

1. For all k ∈ N∗ and all d ∈ N∗, compute the k-nearest neighbors classifier Ψ^{d,l,k}_n from the data {(x^d_i, yi)}_{i=1,...,l}.

2. Choose

(dn, kn) = arg min_{k∈N∗, d∈N∗} [ (1/(n−l)) Σ_{i=l+1}^{n} I{Ψ^{d,l,k}_n(xi) ≠ yi} + λd / √(n−l) ]

where λd is a penalization term that avoids selecting (possibly overfitting) very large dimensions.

Then, define Ψn = Ψ^{dn,l,kn}_n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 18 / 39
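A hedged sketch of this split/validation selection of (d, k), assuming NumPy and scikit-learn; the grids, the choice λd = 0.1·d and the toy data are illustrative, not the authors' settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)

# Toy functional sample (same construction as in the previous sketch).
n, n_grid, d_max = 400, 100, 8
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d_max)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d_max)) + np.outer(labels, [1.0] + [0.0] * (d_max - 1))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

l = 300                                   # training part uses the first l pairs
best = None
for d in range(1, d_max + 1):
    coeffs = curves @ basis[:d].T / n_grid            # projection on the first d coefficients
    for k in range(1, 22, 2):
        knn = KNeighborsClassifier(n_neighbors=k).fit(coeffs[:l], labels[:l])
        val_error = np.mean(knn.predict(coeffs[l:]) != labels[l:])
        penalty = 0.1 * d / np.sqrt(n - l)            # illustrative lambda_d = 0.1 * d
        score = val_error + penalty
        if best is None or score < best[0]:
            best = (score, d, k)

print("selected (d, k):", best[1:], "penalized validation error:", round(best[0], 3))
```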


Page 64: FDA and Statistical learning theory

An oracle inequality

Oracle inequality [Biau et al., 2005]

Note ∆ = Σ_{d=1}^{+∞} e^{−2λ_d²} < +∞. Then there exists C > 0, depending only on ∆, such that, ∀ l > 1/∆,

E(LΨn) − L∗ ≤ inf_{d≥1} [ (L∗_d − L∗) + inf_{1≤k≤l} ( E(LΨ^{l,k,d}_n) − L∗_d ) + λd / √(n−l) ] + C √( log l / (n − l) ).

Then, we have:

by a martingale property: lim_{d→+∞} L∗_d = L∗;

by consistency of k-nearest neighbors in R^d: for all d ≥ 1, inf_{1≤k≤l} ( E(LΨ^{l,k,d}_n) − L∗_d ) → 0 as l → +∞;

the rest of the right-hand side of the inequality can be made to converge to 0 as n grows to infinity, for suitable choices of n, l and λd.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 19 / 39


Page 66: FDA and Statistical learning theory

Consistency of functional k-nearest neighbors

Theorem [Biau et al., 2005]

Suppose that

lim_{n→+∞} l = +∞,    lim_{n→+∞} (n − l) = +∞,    lim_{n→+∞} log l / (n − l) = 0;

then

lim_{n→+∞} E(LΨn) = L∗.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 20 / 39

Page 67: FDA and Statistical learning theory

Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 21 / 39

Page 68: FDA and Statistical learning theory

A binary classification problem

Suppose that we are given a random pair of variables (X, Y) where X takes its values in R^d and Y takes its values in {−1, 1}.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted (x1, y1), . . . , (xn, yn).

We try to learn a classification machine Ψn of the form x → Sign(⟨x, w⟩_{R^d} + b) or, more precisely, of the form

x → Sign(⟨φ(x), w⟩_X + b)

where the exact nature of φ and X will be discussed later.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 22 / 39


Page 71: FDA and Statistical learning theory

Linear discrimination with optimal margin

Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b).

[Figure: a separating hyperplane in R^d, its normal vector w, the margin 1/‖w‖, and the support vectors.]

w is such that:

min_{w,b} ‖w‖_{R^d},
such that: yi(wᵀxi + b) ≥ 1, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 23 / 39


Page 75: FDA and Statistical learning theory

Linear discrimination with soft margin

Learn Ψn : x → Sign(⟨x, w⟩_{R^d} + b).

[Figure: a separating hyperplane in R^d, its normal vector w, the margin 1/‖w‖, and the support vectors.]

w is such that:

min_{w,b,ξ} ‖w‖_{R^d} + C Σ_{i=1}^{n} ξi,
where: yi(wᵀxi + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
       ξi ≥ 0, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 24 / 39
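A minimal sketch of the soft-margin linear SVM, assuming NumPy and scikit-learn. SVC with a linear kernel solves the usual min ½‖w‖² + C Σ ξi formulation, which is the standard parametrization of the problem displayed above; the toy data and the value of C are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)

# Two slightly overlapping Gaussian clouds in R^2, labels in {-1, +1}.
n = 200
y = rng.choice([-1, 1], size=n)
x = rng.normal(size=(n, 2)) + np.outer(y, [1.5, 0.0])

# Soft-margin linear SVM; C weighs the slack variables xi_i.
clf = SVC(kernel="linear", C=1.0).fit(x, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("margin 1/||w||:", 1.0 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
print("training error:", np.mean(clf.predict(x) != y))
```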


Page 79: FDA and Statistical learning theory

Mapping the data onto a high dimensional space

Learn Ψn : x → Sign(⟨φ(x), w⟩_X + b).

[Figure: the nonlinear map φ sends the original space R^d to the feature space X.]

w is such that:

(P_{C,X}) min_{w,b,ξ} ‖w‖_X + C Σ_{i=1}^{n} ξi,
where: yi(⟨w, φ(xi)⟩_X + b) ≥ 1 − ξi, 1 ≤ i ≤ n,
       ξi ≥ 0, 1 ≤ i ≤ n.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 25 / 39


Page 83: FDA and Statistical learning theory

Details about the feature space: a regularization framework

Regularization framework: (P_{C,X}) ⇔

(R_{λ,X}) min_{F∈X} (1/n) Σ_{i=1}^{n} max(0, 1 − yi F(xi)) + λ ‖F‖_X.

Dual problem: (P_{C,X}) ⇔

(D_{C,X}) max_α Σ_{i=1}^{n} αi − Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj ⟨φ(xi), φ(xj)⟩_X
where Σ_{i=1}^{n} αi yi = 0 and 0 ≤ αi ≤ C, 1 ≤ i ≤ n.

Inner product in X: ∀ u, v ∈ X, K(u, v) = ⟨φ(u), φ(v)⟩_X.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 26 / 39


Page 86: FDA and Statistical learning theory

Examples of useful kernels

Provided that

∀ m ∈ N∗, ∀ (ui)_{i=1,...,m} ∈ (R^d)^m, ∀ (αi)_{i=1,...,m} ∈ R^m,   Σ_{i,j=1}^{m} αi αj K(ui, uj) ≥ 0,

K can be used as a kernel mapping the original data onto a high-dimensional feature space [Aronszajn, 1950]. For example:

The Gaussian kernel: K(u, v) = e^{−σ²‖u−v‖²_{R^d}} for σ > 0;

The exponential kernel: K(u, v) = e^{⟨u,v⟩_{R^d}};

Vovk's real infinite polynomial: K(u, v) = (1 − ⟨u, v⟩_{R^d})^{−α} for α > 0;

. . .

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 27 / 39
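A small numerical illustration of the positivity condition above for the Gaussian kernel, assuming NumPy: the condition Σ αi αj K(ui, uj) ≥ 0 for all α is equivalent to the Gram matrix being positive semi-definite, which can be checked through its eigenvalues. The points and the value of σ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-sigma^2 * ||u - v||^2)."""
    diff = u[:, None, :] - v[None, :, :]
    return np.exp(-sigma**2 * np.sum(diff**2, axis=-1))

# Gram matrix on m random points of R^d and a numerical PSD check:
# all eigenvalues >= 0 iff sum_{i,j} alpha_i alpha_j K(u_i, u_j) >= 0 for all alpha.
u = rng.normal(size=(30, 2))
gram = gaussian_kernel(u, u)
eigenvalues = np.linalg.eigvalsh(gram)
print("smallest eigenvalue of the Gram matrix:", eigenvalues.min())
```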


Page 90: FDA and Statistical learning theory

Assumptions for consistency of SVM in R^d

Suppose that:

(A1) X takes its values in a compact subset W of R^d;

(A2) the kernel K is universal on W (i.e., the set of all functions {u ∈ W → ⟨w, φ(u)⟩_X, w ∈ X} is dense in C⁰(W));

(A3) ∀ ε > 0, the ε-covering number of φ(W), that is, the minimum number of balls of radius ε needed to cover φ(W), satisfies N(K, ε) = O(ε^{−α}) for some α > 0;

(A4) the regularization parameter C depends on n with lim_{n→+∞} nCn = +∞ and Cn = O(n^{β−1}) for some 0 < β < 1/α.

Remark: The Gaussian kernel satisfies all these assumptions with N(K, ε) = O(ε^{−d}).

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 28 / 39


Page 95: FDA and Statistical learning theory

Consistency of SVM in R^d

Theorem [Steinwart, 2002]

Under assumptions (A1)-(A4), SVMs are consistent.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 29 / 39

Page 96: FDA and Statistical learning theory

Why can't SVM be directly applied to functional data?

Suppose now that X takes its values in a Hilbert space (X, ⟨., .⟩_X).

1. We have already discussed the advantages of regularization or projection of the functional data as a pre-processing step;

2. The consistency result cannot be directly applied to infinite-dimensional data because the covering-number condition is not valid for the Gaussian kernel in infinite dimension.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 30 / 39


Page 99: FDA and Statistical learning theory

A consistent approach based on the ideas of [Biau et al., 2005]

1. (ψj)_j is a Hilbert basis of X: project on (ψj)_{j=1,...,d}.

2. Choice of the parameters a ≡ (d ∈ N, K ∈ J_d, C ∈ [0; C_d]):

Split the data: B1 = (x1, y1), . . . , (xl, yl) and B2 = (x_{l+1}, y_{l+1}), . . . , (xn, yn);

Learn an SVM on B1: Ψ^{l,a}_n;

Validate on B2:

a∗ = arg min_a [ L̂^{n−l}Ψ^{l,a}_n + λd / √(n−l) ]   with   L̂^{n−l}Ψ^{l,a}_n = (1/(n−l)) Σ_{i=l+1}^{n} I{Ψ^{l,a}_n(xi) ≠ yi}.

⇒ The obtained classifier is denoted Ψn.

Nathalie Villa (IMT & UPVD) Presentation 3 La Havane, Sept. 17th, 2008 31 / 39
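A hedged end-to-end sketch of this procedure, assuming NumPy and scikit-learn: project on the first d coefficients, learn a Gaussian-kernel SVM on B1 for each candidate a = (d, σ, C), and select a by the penalized validation error on B2. In scikit-learn's RBF kernel exp(−γ‖u−v‖²), γ plays the role of σ² in the Gaussian kernel above; the grids, λd and toy data are illustrative choices, not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)

# Toy functional binary classification sample (same construction as earlier sketches).
n, n_grid, d_max = 400, 100, 6
t = np.linspace(0, 1, n_grid)
basis = np.array([np.sqrt(2) * np.sin(np.pi * (j + 1) * t) for j in range(d_max)])
labels = rng.choice([-1, 1], size=n)
scores = rng.normal(size=(n, d_max)) + np.outer(labels, [1.2] + [0.0] * (d_max - 1))
curves = scores @ basis + 0.1 * rng.normal(size=(n, n_grid))

l = 300                                    # B1 = first l pairs, B2 = the remaining n - l
best = None
for d in range(1, d_max + 1):
    coeffs = curves @ basis[:d].T / n_grid             # projection on (psi_1, ..., psi_d)
    for sigma in (0.1, 1.0, 10.0):                     # candidate Gaussian kernels in J_d
        for C in (0.1, 1.0, 10.0):                     # candidate values of C in [0, C_d]
            svm = SVC(kernel="rbf", gamma=sigma**2, C=C).fit(coeffs[:l], labels[:l])
            val_error = np.mean(svm.predict(coeffs[l:]) != labels[l:])
            penalized = val_error + 0.1 * d / np.sqrt(n - l)   # illustrative lambda_d
            if best is None or penalized < best[0]:
                best = (penalized, d, sigma, C)

print("selected (d, sigma, C):", best[1:], "penalized validation error:", round(best[0], 3))
```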


Page 105: FDA and Statistical learning theory

Assumptions

Assumptions on X

(A1) X takes its values in a bounded subset of X.

Assumptions on the parameters: for all d ≥ 1,
(A2) J_d is a finite set;
(A3) there exists K_d ∈ J_d such that K_d is universal on any compact subset of R^d and there exists ν_d > 0 with N(K_d, ε) = O(ε^{-ν_d});
(A4) C_d > 1;
(A5) ∑_{d≥1} |J_d| e^{-2λ_d²} < +∞.

Assumptions on training/validation sets

(A6) l → +∞ as n → +∞;
(A7) n − l → +∞ as n → +∞;
(A8) l log(n−l)/(n−l) → 0 as n → +∞.
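
For instance (an illustrative choice, not taken from the slide), assumption (A5) holds as soon as the sets J_d have bounded cardinality and λ_d = √d, since ∑_{d≥1} e^{-2d} < +∞.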


Consistency

Theorem [Rossi and Villa, 2006]: Under assumptions (A1)-(A8), Ψ_n is consistent:

E(L_{Ψ_n}) → L* as n → +∞.

Idea of the proof: the proof follows a sketch similar to that of [Biau et al., 2005], but the result allows the use of a continuous parameter (the regularization parameter C), thanks to a bound on the shatter coefficient of a class of functions that includes SVM.


Application 1: Voice recognition

Description of the data and methods:

3 problems; for each problem, 100 records sampled at 8 192 points;

Consistent approach: projection on a trigonometric basis; splitting of the database into 50 curves (training) / 49 curves (validation); performances estimated by leave-one-out.

Results

Prob.       k-nn    QDA    SVM gau. (proj)   SVM lin. (proj)   SVM lin. (direct)
yes/no      10%     7%     10%               19%               58%
boat/goat   21%     35%    8%                29%               46%
sh/ao       16%     19%    12%               25%               47%


Regression by SVM

Suppose that we are given a random pair of variables (X, Y), where X takes its values in R^d and Y takes its values in R.

Moreover, we know n i.i.d. realizations of the random pair (X, Y), denoted by (x_1, y_1), ..., (x_n, y_n).

Once again, we try to learn a regression machine, Ψ_n, of the form

x → 〈φ(x), w〉_X + b,

where the exact nature of φ and X will be discussed later.


Generalization of the classification case to regression

w and b minimize

C ‖w‖²_X + ∑_{i=1}^{n} L^ε_k(x_i, y_i, w)

where L^ε_k, for k = 1, 2 and ε ≥ 0, is the ε-insensitive loss function:

L^ε_k(x_i, y_i, w) = max(0, |y_i − 〈φ(x_i), w〉_X|^k − ε),

or any other loss function.

Remark: A dual version, which is a quadratic optimization problem in R^n, also exists.
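
As an illustration (not from the slides), the case k = 1 with ε > 0 is standard ε-insensitive support vector regression, available for instance in scikit-learn as SVR; the synthetic data and parameter values below are assumptions.

```python
# Sketch: epsilon-insensitive SVM regression (the k = 1 case of the loss above).
# Synthetic data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon-insensitive loss, Gaussian kernel
reg.fit(X, y)
print("training MSE:", round(float(np.mean((reg.predict(X) - y) ** 2)), 4))
```

For a linear kernel, the k = 2 variant (squared ε-insensitive loss) is also available as LinearSVR(loss="squared_epsilon_insensitive").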


A kernel ridge regression

When ε is equal to 0 and k = 2, the previous problem becomes: find w and b that minimize

C ‖w‖²_X + ∑_{i=1}^{n} (y_i − 〈φ(x_i), w〉_X)²,

which can be viewed as a kernel ridge regression. This method is also known under the name of Least-Squares SVM (LS-SVM).

A consistency result in the multidimensional setting is available in [Christmann and Steinwart, 2007]: the same method as for SVM classifiers can then be used in the regression case!
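
As an illustration (not from the slides), this ε = 0, k = 2 case corresponds to kernel ridge regression, implemented for instance in scikit-learn as KernelRidge; the kernel and the regularization value below are assumptions.

```python
# Sketch: kernel ridge regression / LS-SVM (epsilon = 0 and k = 2 above).
# alpha plays the role of the regularization constant; all values are illustrative.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)   # squared loss + RKHS-norm penalty
krr.fit(X, y)
print("training MSE:", round(float(np.mean((krr.predict(X) - y) ** 2)), 4))
```

Note that, unlike the formulation above, KernelRidge does not fit an intercept b, which does not change the nature of the estimator.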


Table of contents

1 Basics in statistical learning theory

2 Examples of consistent methods for FDA

3 SVM

4 References


References

Further details on these references are given in the accompanying document.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Biau, G., Bunea, F., and Wegkamp, M. (2005). Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51:2163–2172.

Christmann, A. and Steinwart, I. (2007). Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799–819.

Laloë, T. (2008). A k-nearest neighbor approach for functional regression. Statistics and Probability Letters, 78(10):1189–1193.

Rossi, F. and Conan-Guez, B. (2006). Theoretical properties of projection based multilayer perceptrons with functional inputs. Neural Processing Letters, 23(1):55–70.

Rossi, F. and Villa, N. (2006). Support vector machine for functional data classification. Neurocomputing, 69(7-9):730–742.

Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768–791.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
