
Pattern Recognition Letters 31 (2010) 511–516


Representing functional data using support vector machines

Alberto Muñoz, Javier González *

Universidad Carlos III de Madrid, c/Madrid 126, 28903 Getafe, Spain

Article info

Article history: Available online 25 July 2009

Keywords: Functional Data Analysis (FDA); Kernel methods; Support vector machines; Cluster; Classification

0167-8655/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2009.07.014

* Corresponding author. Fax: +34 91 624 98 49. E-mail addresses: [email protected] (A. Muñoz), [email protected] (J. González).

Abstract

Functional data are difficult to manage for most classical statistical techniques, given their very high (or intrinsically infinite) dimensionality. The reason is that functional data are functions, while most algorithms are designed to work with low-dimensional vectors. In this paper we propose a functional analysis technique to obtain finite-dimensional representations of functional data. The key idea is to consider each functional datum as a point in a general function space and then to project these points onto a Reproducing Kernel Hilbert Space with the aid of a support vector machine. We show some theoretical properties of the method and illustrate its performance in some classification examples.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The field of Functional Data Analysis (FDA) (Ramsay and Silverman, 2006; Ferraty and Vieu, 2003) deals naturally with data of very high (or intrinsically infinite) dimensionality. Typical examples are functions describing processes of interest, such as physical processes, genetic data, quality control charts or spectra in chemometrics. In practice, a functional datum is given as a set of discrete measured values. FDA methods first convert these values to a function and then apply some generalized multivariate procedure able to cope with functions.

A common problem in FDA is to find a (generally low-dimensional) representation of a set of curves by their projections onto the span of an orthogonal basis of functions $\{\phi_1, \ldots, \phi_d\}$, where $d \in \mathbb{N}$. This approach has been extensively studied, and many papers in FDA deal with the choice of the best basis (Ramsay and Silverman, 2006): Fourier analysis, wavelets, B-spline bases and Functional Principal Component Analysis (FPCA) constitute some common examples.

The key idea in our proposal is to see each function as a point in a given function space, and then to project these points onto some finite-dimensional function subspace. In particular, we propose a finite-dimensional representation for functional data based on a particular projection of the original functions onto a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Wahba, 1990; Moguerza and Muñoz, 2006; Boser et al., 1992).

The idea is to consider a general covariance function (called Kernel in the RKHS framework) and then approximate the function of interest by means of linear combinations of the Kernel evaluated at the points in the sample. We will adopt the point of view of Regularization Theory (Tikhonov and Arsenin, 1977) to cope with RKHSs. The role of the Support Vector Machine (SVM) in this approach will be made clear in the next section.

An interesting fact here is the close relationship between covariance functions and similarity functions. It often happens that the context information available for the problem can be embedded into a similarity function. For instance, in handwriting recognition we can choose different distance functions to reflect particular features of the data set. The RKHS approach is quite appropriate in this context, given the natural correspondence between Kernel and distance similarity evaluations (Burges, 1998).

In Section 2 we formulate the functional data representation in the context of regularization theory for the ε-insensitive loss function (the RKHS approach to SVM theory). We also show how to approximate the eigenfunctions of the Kernel when working with finite samples and/or sample Kernel matrices. In Section 3 we illustrate the performance of the proposed functional data representation in two real classification problems. Section 4 outlines some future research lines and concludes.

2. Representing functional data in a reproducing Kernel Hilbert Space

We want to transform each curve (functional datum) into a point of an RKHS. Let $\{\hat{c}_1, \ldots, \hat{c}_m\}$ denote the available sample of curves. Each sampled curve $\hat{c}_l$ is identified with a data set $\{(x_i, y_{il}) \in X \times Y\}_{i=1}^{n}$. $X$ is the space of input variables and, in most cases, $Y = \mathbb{R}$. We assume that, for each $\hat{c}_l$, there exists a continuous function $c_l : X \to Y$ such that $E[y_l \mid x] = c_l(x)$ (with respect to some probability measure). Thus $\hat{c}_l$ is the sample version of $c_l$. Notice that, for simplicity of notation, we assume that the $x_i$ are common for all the curves, as is usually the case in the literature (Ramsay and Silverman, 2006).

There are several ways to introduce RKHSs (see Moguerza and Muñoz, 2006; Aronszajn, 1950; Cucker and Smale, 2002; Wahba, 1990). In a nutshell, the essential ingredient for a Hilbert function space $H$ to be an RKHS is the existence of a symmetric positive definite function $K : X \times X \to \mathbb{R}$, named Mercer Kernel (or reproducing Kernel) for $H$ (Aronszajn, 1950). The elements of $H$, called $H_K$ in the sequel, can be expressed as finite linear combinations of the form $h = \sum_s \lambda_s K(x_s, \cdot)$, where $\lambda_s \in \mathbb{R}$ and $x_s \in X$.

Consider the linear integral operator $T_K$ associated to the Kernel $K$, defined by $T_K(f) = \int_X K(\cdot, s) f(s)\,ds$. If we impose that $\int\!\int K^2(x, y)\,dx\,dy < \infty$, then $T_K$ has a countable sequence of eigenvalues $\{\lambda_j\}$ and (orthogonal) eigenfunctions $\{\phi_j\}$, and $K$ can be expressed as $K(x, y) = \sum_j \lambda_j \phi_j(x) \phi_j(y)$ (where the convergence is absolute and uniform).

Given a function $f$ in a general function space (that contains $H_K$ as a subspace), it will be projected onto $H_K$ using the operator $T_K$. Thus, the projection $f^*$ will belong to the range of $T_K$: $f^* = T_K(f)$. Applying the Spectral Theorem to $T_K$ we get:

$$f^* = T_K(f) = \sum_j \lambda_j \langle f, \phi_j \rangle \phi_j. \qquad (1)$$
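To fix ideas, the following minimal Python sketch (ours, not part of the paper; the sample points, the test function and the Gaussian Kernel bandwidth are arbitrary choices) computes the empirical counterpart of the Mercer expansion on a finite sample and checks that the discrete analogue of the spectral projection (1) amounts to applying the Kernel matrix to the sampled function.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=50))              # sample points x_1, ..., x_n
f_vals = np.sin(2 * np.pi * x)                       # a sampled function f

K_S = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # Gaussian Kernel matrix on the sample

# Empirical Mercer expansion: K_S = sum_j lambda_j v_j v_j^T
lam, V = np.linalg.eigh(K_S)                         # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]                       # reorder to descending
assert np.allclose(K_S, (V * lam) @ V.T)

# Finite-sample analogue of Eq. (1): sum_j lambda_j <f, phi_j> phi_j,
# which at the sample points coincides with K_S applied to the sampled f.
f_star = sum(lam[j] * (f_vals @ V[:, j]) * V[:, j] for j in range(len(x)))
assert np.allclose(f_star, K_S @ f_vals)
print("leading eigenvalues:", np.round(lam[:5], 3))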

Next we want to obtain $c_l^*$ for each $\hat{c}_l$ (the function corresponding to the sample functional data point $\hat{c}_l \equiv \{(x_i, y_{il}) \in X \times Y\}_{i=1}^{n}$). To find the coefficients of $c_l^*$ in Eq. (1), we use an SVM to express the approximation of $c_l$ in terms of a Kernel expansion. To this aim, the SVM seeks the function $c_l^*$ that solves the following optimization problem (Cucker and Smale, 2002; Moguerza and Muñoz, 2006):

$$\arg\min_{c \in H_K} \ \frac{1}{n} \sum_{i=1}^{n} L(y_i, c(x_i)) + \gamma \|c\|_K^2, \qquad (2)$$

where $\gamma > 0$, $\|c\|_K$ is the norm of the function $c$ in $H_K$, $y_i = y_{il}$ and $L(y_i, c(x_i)) = (|c(x_i) - y_i| - \varepsilon)_+$, $\varepsilon \geq 0$, in the SVM approach (Moguerza and Muñoz, 2006). Expression (2) measures the trade-off between the fit of the function to the data and the complexity of the solution (measured by $\|c\|_K^2$). By the representer theorem (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001), the solution $c_l^*$ to problem (2) exists, is unique and admits a representation of the form

$$c_l^*(x) = \sum_{i=1}^{n} \alpha_{il} K(x_i, x) \quad \forall x \in X, \ \text{where } \alpha_{il} \in \mathbb{R}. \qquad (3)$$

In practice, the solution to (2) is obtained by solving a quadratic optimization problem, and efficient methods specific to SVMs have been developed in the literature. In addition, due to the definition of the loss function (the SVM choice), the solution to (2) generally depends on a small number of data points called support vectors. Thus, the Kernel expansion (3) that represents the projected curve $c_l^*$ will contain only a small number of non-null $\alpha_{il}$ coefficients.
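As an illustration of the Kernel expansion (3), the following hedged Python sketch fits an ε-insensitive support vector regression to one sampled curve and reads off the sparse coefficients α_il. We use scikit-learn's SVR as a stand-in quadratic-programming solver; note that, unlike problem (2), SVR also fits an intercept term, which we carry along explicitly. The curve, noise level and parameter values are illustrative choices, not the paper's.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)[:, None]                     # common design points x_i
y = np.sin(4 * np.pi * x).ravel() + 0.05 * rng.standard_normal(200)  # one sampled curve

sigma = 10.0
svr = SVR(kernel="rbf", gamma=sigma, C=100.0, epsilon=0.1).fit(x, y)

# Coefficients alpha_il of Eq. (3): non-null only on the support vectors.
alpha = np.zeros(len(x))
alpha[svr.support_] = svr.dual_coef_.ravel()
print("non-null coefficients:", np.count_nonzero(alpha), "out of", len(x))

# Projected curve c_l^*(x) = sum_i alpha_il K(x_i, x) (plus the SVR intercept).
K = np.exp(-sigma * (x - x.T) ** 2)                     # K(x_i, x_j)
c_star = K @ alpha + svr.intercept_
print("max gap vs. svr.predict:", float(np.abs(c_star - svr.predict(x)).max()))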

In the next section we study the properties of this representation and we define a new one starting from Eq. (3).

2.1. Functional data projections onto the eigenfunction space

By minimizing the risk functional (2) we obtain the function points $c_1^*, \ldots, c_m^*$ in $H_K$ corresponding to the original curves $\{\hat{c}_1, \ldots, \hat{c}_m\}$. Eq. (3) gives a first approximation to the representation of each curve $c_l$, namely the set of coefficients $\alpha_{1l}, \ldots, \alpha_{nl}$. However, this representation has a serious drawback: it is not a continuous function of the input variables and, therefore, if we have a slightly different sample $(x_i')$, it may be that the corresponding $y_{il}'$ are quite different, making the representation system invalid for pattern recognition purposes. From an intuitive point of view, consider that if there is a small change in the sample, then one or more support vectors can change and, therefore, the representation of $c_l$ given by $\alpha_l = (\alpha_{1l}, \ldots, \alpha_{nl})$ can be dramatically different (see Example 1 for a practical illustration). However, a new continuous representation can be obtained from the elements of Eq. (3).

Proposition 1. Let $c$ be a curve whose sample version is $\hat{c} = \{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$, and let $K$ be a Kernel with eigenfunctions $\{\phi_1, \ldots, \phi_d, \ldots\}$ (basis of $H_K$). Then the projected curve $c^*(x)$, given by the minimization of (2), can be expressed as

$$c_l^*(x) = \sum_{j=1}^{d} \lambda_j^* \phi_j(x), \qquad (4)$$

where the $\lambda_j^*$ are the weights of the projection of $c^*(x)$ onto the function space generated by the eigenfunctions of $K$ ($\mathrm{Span}\{\phi_1, \ldots, \phi_d\}$) and $d$ is the dimension of $H_K$ (possibly infinite). In practice (where a finite sample is available), $\lambda_j^*$ can be estimated by

$$\lambda_j^* = \lambda_j \sum_{i=1}^{n} \alpha_i \phi_{ji} \quad \text{for } j = 1, \ldots, \hat{d}, \qquad (5)$$

where $\lambda_j$ is the $j$th eigenvalue corresponding to the eigenvector $\phi_j$ of the matrix $K_S = (K(x_i, x_j))_{i,j}$, $\hat{d} = r(K_S)$ (an estimator of $d$), and $\alpha_i$ is the solution to (2).

Although the sequence of eigenvalues of $T_K$ could be infinite (but countable), in practice we are never able to estimate more eigenfunctions than $r(K_S)$. Thus, in practice, we will assume $d$ to be finite.
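The estimate (5) is straightforward to compute once the SVM coefficients and the Kernel matrix are available. The sketch below is our own and hedged: the variable names are illustrative and the coefficient vector alpha is assumed to come from a fit such as the one sketched after Eq. (3). It eigendecomposes K_S, estimates d as the numerical rank of K_S, and returns the λ*_j coordinates.

import numpy as np

def rkhs_representation(K_S, alpha, tol=1e-10):
    """Estimate the lambda*_j coordinates of one projected curve via Eq. (5)."""
    lam, V = np.linalg.eigh(K_S)                # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]              # descending order
    d_hat = int(np.sum(lam > tol * lam[0]))     # numerical rank r(K_S), estimator of d
    # lambda*_j = lambda_j * sum_i alpha_i * phi_j(x_i), for j = 1, ..., d_hat
    return lam[:d_hat] * (V[:, :d_hat].T @ alpha)

# Illustrative call with a Gaussian Kernel matrix and arbitrary coefficients:
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
K_S = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
alpha = rng.standard_normal(30)
lam_star = rkhs_representation(K_S, alpha)
print("retained dimension:", lam_star.shape[0])
print("first coordinates:", np.round(lam_star[:5], 3))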

The RKHS representation estimated by (5) has several interesting properties. Given a sample curve $\hat{c} \equiv \{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$, if we consider a 'close' curve $\hat{c}' \equiv \{(x_i', y_i') \in X \times Y\}_{i=1}^{n}$ such that $d(x, x') < \epsilon$, then $c(x) \simeq c'(x)$ and, given that the $\phi_j$ are a basis for $H_K$, it must happen that $\lambda_j^* \simeq \lambda_j^{\prime *}$. Therefore, this representation is continuous in the input data and it allows us to study each $c_l^*$ in terms of the basis of the RKHS. In addition, the number of non-null values in $\{\lambda_1^*, \ldots, \lambda_d^*\}$ can be used to estimate the dimension of the projected curves in the basis of $H_K$. Furthermore, given a finite basis of bounded functions in $X$, it is always possible to estimate the associated Kernel (Rakotomamonjy and Canu, 2005), which allows us to consider usual FDA techniques as particular cases of our approach. Notice that the basis of functions that generates the function projection space is indirectly given by $K$ (as the eigenfunctions of $K$). This means that in our approach we have more basis functions available for FDA techniques (for instance the eigenfunctions of exponential Kernels), including ones that cannot be analytically expressed.

Next, we illustrate the behavior of the Kernel expansion (3) and the proposed representation (4) in a real example.

Example 1. The data in this example are the X–Y coordinates of 20 replications of writing the script "fda" (Ramsay and Silverman, 2006). Each replication is represented by 1401 coordinate values, and each string "fda" can be analyzed as two curves corresponding to the values of the X and Y coordinates in the written sequence.

Now we consider two particular X-curves of two different but similar "fda" sequences (see Fig. 1). We use the Gaussian Kernel $K(x, y) = \exp\{-0.5\|x - y\|^2\}$ and obtain, for the two curves, the vectors $\alpha_1$, $\alpha_2$ corresponding to the Kernel expansion and $\lambda_1$, $\lambda_2$ for the RKHS representation. The dimensions of the two representation systems are 1401 (the number of data) and 25 (the rank of the Kernel matrix), respectively. Results are shown in Fig. 2.

Notice that both the "fda" strings and the curves representing them are similar and, as the RKHS representation does, any continuous projection should transform them into close points in the feature space. However, their $\alpha_j$ values are totally different, which makes this representation inappropriate for pattern recognition purposes.

Fig. 1. Two handwritten samples of the word "fda" (left) and the functional curves associated to the X (center) and Y (right) components, respectively.

Fig. 2. Bar plots of the Kernel expansion components and RKHS representations for two particular curves. Left: Kernel expansion (top) and RKHS representation (bottom) for curve 1. Middle: the same plots for curve 2. Right: differences $\alpha_1 - \alpha_2$ (top) and $\lambda_1 - \lambda_2$ (bottom).


2.2. Truncation error analysis

Given the Kernel expansion $K(x, y) = \sum_{j=1} \lambda_j \phi_j(x) \phi_j(y)$, it contains $d = r(T_K)$ non-null terms ($T_K$ being the integral operator associated to $K$). In practice, we work with a finite sample of size $n$, and the approximation errors that appear when we estimate the $\lambda_j^*$ values by (4) must be taken into account.

We wonder about the quality of the approximation of the truncated Kernel $K^{[d]}(x, y) = \sum_{j=1}^{d} \lambda_j \phi_j(x) \phi_j(y)$ to $K(x, y)$, where $K_S$ is the Kernel matrix and $d = r(K_S)$ is an estimator of $r(T_K)$. If $r(K_S) = r(T_K)$ then $K^{[d]} = K$ and there is no loss in using $K^{[d]}$ (all the eigenfunctions of $K$ can be approximated). If $r(K_S) < r(T_K)$ (which can only happen when $r(K_S) = n$), then the number of eigenfunctions of $K$ is larger than the number of data points and $K^{[d]}$ takes into account only the first $n$ eigenvalues of $K$.

Let $c^*(x) = \sum_{j=1} \lambda_j^* \phi_j(x)$ and $c^{*[d]}(x) = \sum_{j=1}^{d} \lambda_j^* \phi_j(x)$ be a curve and its truncated version. It is immediate to prove that the truncation error is given by

$$E_d = \| c^* - c^{*[d]} \|^2 = \sum_{j=d+1} \lambda_j^{*2}. \qquad (6)$$

However, as $n$, the number of data points, increases, $\lambda_j \to 0$ (because the sum that defines $K$ converges) and so does the truncation error (since all the non-null $\lambda_j^*$ can be approximated). By applying the Tchebychev inequality (see Casella and Berger, 2002, p. 233, Theorem 5.5.2, for an equivalent proof) and taking $n \to \infty$, it can be shown that

$$\lim_{n \to \infty} P\!\left( \frac{1}{n} \sum_{i=1}^{n} \| c^{*[d]}(x_i) - c^*(x_i) \| \geq \epsilon \right) = 0 \qquad (7)$$

for any random sample $\{x_1, \ldots, x_n\}$. As a consequence, the estimation of a curve given by $c^{*[d]}(x) = \sum_{j=1}^{d} \lambda_j^* \phi_j(x)$ converges to $c^*(x) = \sum_j \lambda_j^* \phi_j(x)$ as the sample size increases.
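As a quick numerical illustration of (6), the following sketch evaluates the truncation error for a hypothetical, decaying vector of λ*_j coordinates (the decay rate is ours, chosen only for illustration); the error drops rapidly as d grows, in line with the convergence argument above.

import numpy as np

lam_star = 0.8 ** np.arange(25)                 # hypothetical decaying lambda*_j coordinates
E = [float(np.sum(lam_star[d:] ** 2)) for d in range(1, lam_star.size + 1)]
for d in (1, 5, 10, 20):
    print(f"E_{d} = {E[d - 1]:.3e}")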


3. Experiments

In this section we check the performance of the RKHS projection for classification purposes. We classify different sets of curves following two main steps:

1. Project the data curves onto an RKHS. That is, estimate $\lambda_l^*$, for $l = 1, \ldots, m$, by projecting the curves onto the eigenfunctions of some RKHSs via Eqs. (4) and (5).
2. Classify the curves by applying a linear SVM to the set of sample curves $\hat{c}_l$ represented by $\lambda_l^* = \{\lambda_{1l}^*, \ldots, \lambda_{dl}^*\}$ (a schematic sketch of this two-step pipeline is given below).
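Below is a hedged, end-to-end sketch of the two-step scheme on synthetic curves (it is not the paper's code; the toy data, the Gaussian Kernel parameter and the SVM settings are placeholders). Step 1 reuses the projection of Eqs. (3)–(5), implemented with scikit-learn's SVR on a precomputed Kernel matrix; step 2 trains a linear SVM on the resulting λ* coordinates.

import numpy as np
from sklearn.svm import SVR, SVC

def curve_to_lambda_star(x, y, sigma, C=100.0, epsilon=0.1, tol=1e-10):
    """Project one sampled curve and return its lambda*_j coordinates (Eqs. (3)-(5))."""
    K_S = np.exp(-sigma * (x[:, None] - x[None, :]) ** 2)
    svr = SVR(kernel="precomputed", C=C, epsilon=epsilon).fit(K_S, y)
    alpha = np.zeros(len(x))
    alpha[svr.support_] = svr.dual_coef_.ravel()
    lam, V = np.linalg.eigh(K_S)
    lam, V = lam[::-1], V[:, ::-1]
    d_hat = int(np.sum(lam > tol * lam[0]))
    return lam[:d_hat] * (V[:, :d_hat].T @ alpha)

# Toy two-class curve data on a common grid (noisy sines vs. noisy cosines).
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 60)
curves, labels = [], []
for k in range(40):
    cls = k % 2
    base = np.sin(2 * np.pi * x) if cls == 0 else np.cos(2 * np.pi * x)
    curves.append(base + 0.2 * rng.standard_normal(x.size))
    labels.append(cls)

feats = np.array([curve_to_lambda_star(x, y, sigma=5.0) for y in curves])
d = 6                                           # number of eigenfunctions retained
clf = SVC(kernel="linear", C=10.0).fit(feats[:30, :d], labels[:30])
print("toy test accuracy:", clf.score(feats[30:, :d], labels[30:]))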

Fig. 4. Cross-validation errors (for 100 runs) over the number of components of four different RKHSs, for Gaussian Kernels with parameters $\sigma \in \{0.0025, 0.0050, 0.0075, 0.01\}$.

Table 1. Comparative results for the phoneme data after 100 runs.

Method                       Dim   Test error   Std. dev.
Raw data + SVM                –     0.0869       0.0013
SVM – FPCA                    5     0.0854       0.0014
RKHS (σ = 0.0025) + SVM       8     0.0798       0.0011
RKHS (σ = 0.0050) + SVM      22     0.0788       0.0012
RKHS (σ = 0.0075) + SVM      18     0.0770       0.0012
RKHS (σ = 0.0100) + SVM      18     0.0778       0.0012
PSR                           –     0.2319       0.0018
NPCD/MPLSR                    7     0.0814       0.0013

3.1. Phoneme data analysis

In this example we test the RKHS representation system in a classification problem. The data set corresponds to 2000 discretized log-periodograms of length 150 of the phonemes "sh", "iy", "dcl", "aa" and "ao", which define five different classes. We used 80% of the data for training and the remaining 20% for testing. A plot of 25 series of each class is shown in Fig. 3.

In order to test our methodology, we consider several RKHSs corresponding to Gaussian Kernels $K_\sigma(x, y) = \exp\{-\sigma \|x - y\|^2\}$. In particular, we selected four different Kernels (with parameters $\sigma \in \{0.0025, 0.0050, 0.0075, 0.01\}$) that induce projections of the data curves with a good fit to the original series. We take $C = 1/(2n\gamma) = 100$ and $\varepsilon = 0.1$ in Eq. (2), and we decide the number of eigenfunctions of each space to retain by cross-validation of the misclassification error over the dimension of the space (see Fig. 4). In this case, the best errors for each Kernel were obtained for dimensions 8, 22, 18, and 18, respectively; in any case, these errors stabilize for dimensions larger than 6. In addition, we include the results for an SVM (with linear Kernel) trained on the raw data, in order to compare classification results with a technique that does not preprocess the data.
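A sketch of this model-selection step under our assumptions: the feature matrix of λ* coordinates and the class labels below are stand-ins, and the grid of candidate dimensions and the SVM cost are illustrative. The function simply cross-validates a linear SVM on the first d coordinates and keeps the d with the smallest error.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_dimension(feats, labels, d_grid, C=100.0, cv=5):
    """Return the dimension d with the smallest cross-validated error, plus the errors."""
    errors = {}
    for d in d_grid:
        scores = cross_val_score(SVC(kernel="linear", C=C), feats[:, :d], labels, cv=cv)
        errors[d] = 1.0 - scores.mean()
    best_d = min(errors, key=errors.get)
    return best_d, errors

# Illustrative call with random features standing in for the lambda* coordinates:
rng = np.random.default_rng(4)
feats = rng.standard_normal((100, 22))
labels = np.arange(100) % 2
best_d, errors = select_dimension(feats, labels, d_grid=range(2, 23, 4))
print("selected dimension:", best_d)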

Fig. 3. Phoneme curves by classes. The projection of the curves onto the two first supervised Fisher discriminant components is also shown. There is a clear overlapping of the classes.


Fig. 5. Two classes of growth data. Left: boys. Right: girls.


Next we compare our methodology with two specific techniques designed to deal with functional data that have been proven to obtain very competitive results: P-spline signal regression (PSR) (Marx and Eilers, 1999), and NPCD/MPLSR with the PLS semimetric (Ferraty and Vieu, 2003). Finally, we estimate the Functional Principal Components of the data. We consider the first 5 of them (selected by cross-validation) as an alternative representation system to classify the curves with a linear-Kernel SVM ($C = 100$). Results are shown in Table 1. The linear SVM using the proposed RKHS representation technique ($\sigma = 0.0075$, 7.70% error) significantly improves on the performance of both PSR and NPCD/MPLSR, with p-values for the (paired) difference-of-means t-tests of 0.000 and 0.0270 respectively, the difference being more significant (lower p-value) for PSR (23.19% error) than for NPCD/MPLSR (8.14%). Furthermore, notice that the SVM results are similar for the four projections, which shows that this methodology is robust against small changes of the $\sigma$ parameter. Notice also that the RKHS representation is able to improve the performance of a linear SVM in a case that is particularly favorable for the linear SVM (Moguerza and Muñoz, 2006, p. 384).

3.2. Growth data and FPCA projections

The data set of this experiment consists of 93 growth curves for a sample of 54 boys and 39 girls (Ramsay and Silverman, 2006) (see Fig. 5). The observations are measured at a set of 29 ages from 1 to 18 years old. The data were originally smoothed by using a spline basis. We divide the data set into a training set (60 data) and a test set (33 data) and regularize the curves using the Kernels $K(x, y) = (x'y + 1)^p$ with $p = 2, 3$, $C = 100$ and $\varepsilon = 0.01$. We used the first two eigenfunctions of each space (the most discriminant in this example), and we classified the curves by using an SVM with a linear Kernel and parameter $C = 10$ (following the scheme of the first example). We also included the PSR and the NPCD/MPLSR procedures to compare the classification results. After 100 runs of the experiment, the test errors are shown in Table 2.

Table 2. Comparative results for the growth data after 100 runs.

Method                Dim   Test error   Std. dev.
Raw data + SVM         –     0.0667       0.0034
SVM – FPCA             2     0.0409       0.0039
RKHS poly(2) + SVM     2     0.0227       0.0021
RKHS poly(3) + SVM     2     0.0288       0.0027
PSR                    –     0.0521       0.0045
NPCD/MPLSR             5     0.0494       0.0040

The first two eigenfunctions of the polynomial Kernels achieve the best classification results, with misclassification errors of 2.27% and 2.88% (for $p = 2, 3$ respectively). It is of special interest to comment on the poor performance of the FPCA projection compared with the polynomial Kernels. FPCA searches for new components that maximize the variability of the data, which can be, as is the case here, inappropriate in classification problems.

In addition, PSR and NPCD/MPLSR are also outperformed, which confirms the validity of our methodology for the task of curve classification.

4. Conclusions

In this work we have proposed a system to represent functional data by projecting the original functions onto the eigenfunctions of a Mercer Kernel. The projection is achieved by using support vector machines. A main advantage is that we do not have to specify the basis of eigenfunctions; instead, we can concentrate on the Kernel, following the general philosophy of Kernel methods. The proposed representation seems to work well in the experiments, capturing the interesting features of functional data and performing well in classification tasks.

Regarding future work, we want to investigate the choice of Kernels appropriate for prespecified tasks or data sets. The idea is to specify objective functions in terms of distance criteria (as happens, for instance, for principal component analysis). Given the direct relationship between Kernel functions and distance functions, this gives us a method to specify optimal Kernels in advance and hence to obtain optimal representation systems for given tasks.

Acknowledgement

We thank the support of Spanish Grant Nos. 2007/04438001 and 2008/000591002, and the three anonymous reviewers for their very helpful comments.

Appendix A. Proofs

Proof (Proposition 1). The projected curve $c^*(x)$ admits a representation as a linear combination of the eigenfunctions $\{\phi_1, \ldots, \phi_d\}$ as in (4), since it belongs to $H_K$. Denote by $d = r(T_K)$ the number of non-null terms in the sum $K(x, y) = \sum_j^d \lambda_j \phi_j(x) \phi_j(y)$, where $r(T_K)$ is the rank of the operator $T_K$ ($\lambda_j = 0$ for $j > r(T_K)$). Then, operating from Eq. (3), we have that

$$c^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) = \sum_{i=1}^{n} \alpha_i \left( \sum_{j=1}^{d} \lambda_j \phi_j(x_i) \phi_j(x) \right) = \sum_{j=1}^{d} \lambda_j \left( \sum_{i=1}^{n} \alpha_i \phi_j(x_i) \right) \phi_j(x) = \sum_{j=1}^{d} \lambda_j^* \phi_j(x).$$

In an ideal case, where we know the expressions for both the eigenfunctions and the eigenvalues of the Kernel function $K$, we have just shown that $\lambda_j^* = \sum_{i=1}^{n} \lambda_j \alpha_i \phi_j(x_i)$. However, often we only know the Kernel matrix $K_S$ (not the analytical expression of $K$), obtained by evaluating the Kernel at the sample, and we cannot know the real eigenvalues $\lambda_j$ and their corresponding eigenfunctions $\phi_j$.

To end the proof, we only need to show that the eigenvalues and eigenvectors of $K_S$ converge, respectively, to the eigenvalues and eigenfunctions of $T_K$: $\hat{\lambda}_j \to \lambda_j$ and $\hat{\phi}_j \to \phi_j$. This is the case because this convergence always holds for positive-definite matrices, including Kernel functions (see Schlesinger, 1957); for more specific theorems restricted to the context of Kernel functions, see Bengio et al. (2004). □
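As an empirical companion to this convergence argument (our own check, with an arbitrary sampling distribution and Gaussian Kernel; the 1/n scaling is the usual normalization relating Kernel matrix eigenvalues to those of T_K in the literature cited above), the leading scaled eigenvalues of K_S stabilize as the sample size grows:

import numpy as np

rng = np.random.default_rng(5)
for n in (50, 200, 800):
    x = rng.uniform(0, 1, size=n)
    K_S = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
    lam = np.linalg.eigvalsh(K_S)[::-1] / n     # scaled eigenvalues, descending
    print(f"n = {n:4d}  leading scaled eigenvalues: {np.round(lam[:4], 4)}")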

References

Aronszajn, N., 1950. Theory of reproducing Kernels. Trans. Amer. Math. Soc. 68 (3), 337–404.

Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., Ouimet, M., 2004. Learning eigenfunctions links spectral embedding and Kernel PCA. Neural Comput. 16, 2197–2219.

Boser, B.E., Guyon, I., Vapnik, V., 1992. A training algorithm for optimal margin classifiers. In: Proc. Fifth ACM Workshop on Computational Learning Theory (COLT). ACM Press, New York, pp. 144–152.

Burges, C., 1998. Geometry and invariance in Kernel based methods. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, USA.

Casella, G., Berger, R.L., 2002. Statistical Inference. Duxbury Advanced Series.

Cucker, F., Smale, S., 2002. On the mathematical foundations of learning. Bull. Amer. Math. Soc. 39 (1), 1–49.

Ferraty, F., Vieu, P., 2003. Curves discrimination: A nonparametric functional approach. Comput. Statist. Data Anal. 44, 161–173.

Kimeldorf, G.S., Wahba, G., 1971. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 2, 495–502.

Marx, B., Eilers, P., 1999. Generalized linear regression on sampled signals and curves: A P-spline approach. Technometrics 41 (1), 1–13.

Moguerza, J.M., Muñoz, A., 2006. Support vector machines with applications. Statist. Sci. 21 (3), 322–357.

Rakotomamonjy, A., Canu, S., 2005. Frames, reproducing Kernels, regularization and learning. J. Machine Learn. Res. 6, 1485–1515.

Ramsay, J.O., Silverman, B.W., 2006. Functional Data Analysis, second ed. Springer, New York.

Schlesinger, S., 1957. Approximating eigenvalues and eigenfunctions of symmetric Kernels. J. Soc. Ind. Appl. Math. 6 (1), 1–14.

Schölkopf, B., Herbrich, R., Smola, A.J., Williamson, R.C., 2001. A generalized representer theorem. Lecture Notes in Artificial Intelligence, vol. 2111. Springer, pp. 416–426.

Tikhonov, A.N., Arsenin, V.Y., 1977. Solutions of Ill-Posed Problems. John Wiley & Sons, New York.

Wahba, G., 1990. Spline Models for Observational Data. Series in Applied Mathematics, vol. 59. SIAM, Philadelphia.