Functional Principal Component Analysis of fMRI Data

Roberto Viviani,* Georg Grön, and Manfred Spitzer

Department of Psychiatry III, University of Ulm, Ulm, Germany


Abstract: We describe a principal component analysis (PCA) method for functional magnetic resonance imaging (fMRI) data based on functional data analysis, an advanced nonparametric approach. The data delivered by the fMRI scans are viewed as continuous functions of time sampled at the interscan interval and subject to observational noise, and are used accordingly to estimate an image in which smooth functions replace the voxels. The techniques of functional data analysis are used to carry out PCA directly on these functions. We show that functional PCA is more effective than is its ordinary counterpart in recovering the signal of interest, even if limited or no prior knowledge of the form of the hemodynamic function or the structure of the experimental design is specified. We discuss the rationale and advantages of the proposed approach relative to other exploratory methods, such as clustering or independent component analysis, as well as the differences from methods based on expanded design matrices. Hum Brain Mapp 24:109-129, 2005. © 2004 Wiley-Liss, Inc.

Key words: principal component analysis (PCA); functional data analysis; independent component analysis (ICA); multivariate linear models (MLM); explorative methods


INTRODUCTION

Principal component analysis (PCA) is a technique to individuate important modes of variation in high-dimensional data as a set of orthogonal directions in space [Jolliffe, 1986]. Usually, only the few directions where most of the variation occurs are considered to be of interest. When PCA is used as an explorative technique, these directions are adopted as a new coordinate system to reveal the underlying structure of the data. The uses of PCA are not limited to exploration. When the data are composed of a set of interrelated variables, it is often useful to transform them with PCA to avoid problems arising from collinearity in multivariate regression settings. Because of the orthogonality of the directions of variation, in the new coordinate system the transformed data are uncorrelated, while retaining most of their variance. In fact, common techniques to carry out PCA formulate the problem as one of identifying the eigenvalues (amount of variation) and eigenvectors (direction of variation) of the covariance matrix of the data. We focus, however, on PCA as an explorative technique to visualize the variance introduced into the data by alternating experimental conditions without using much information from the experimental setting or making specific assumptions on the form of the regressor. As in other explorative techniques, the objective is that of providing an initial assessment that will give the data a chance "to speak for themselves" before an appropriate model is chosen. For example, it might be desirable to analyze data from a sample of patients without making assumptions on the form of the blood oxygenation level-dependent (BOLD) response, to verify that possible differences in the activations are genuine and do not depend on altered BOLD responses in the clinical condition [Callicott and Weinberger, 1999]. Another possible application of explorative techniques consists of the analysis of functional imaging data when there is uncertainty as to the duration of a mental state induced by an experimental stimulus, for example an emotional reaction, or as to the moment of its occurrence.

*Correspondence to: Roberto Viviani, Department of Psychiatry III, University of Ulm, Leimgrubenweg 12, 89075 Ulm, Germany. E-mail: [email protected]
Received for publication 17 December 2002; Accepted 21 June 2004
DOI: 10.1002/hbm.20074
Published online in Wiley InterScience (www.interscience.wiley.com).


In general, recourse to explorative techniques may be justified by questions about differences in the form and shape of the hemodynamic response [Friston et al., 1995a], or when the shape of the hemodynamic response is itself the object of enquiry.

When applied to magnetic resonance images, ordinary PCA runs into serious difficulties because of the extremely high number of dimensions in the data relative to the number of observations. Even if the geometric properties of PCA remain valid and numerical techniques deliver stable results, the covariance matrix on which the analysis is carried out is sometimes a poor estimate of the real population covariance.

Because of these difficulties, ordinary PCA is often limited to regions of interest identified previously by the experimental model [F-masking; Friston et al., 1993; Friston, 1997]. This means that the PCA is carried out mostly in a subspace where the experimental regressors dominate, which will be much smaller than the space of the original data. Although this is a meaningful way of employing ordinary PCA, it also limits its scope in exploratory analyses, because the signal of interest has already been identified by other means. Another approach to constrain the outcome of PCA is partial least squares [McIntosh et al., 1996], in which the directions in space are selected that both maximize the variance of the data and correlate with the predictors of the design matrix. As has been noted, the results are not invariant under changes of the design matrix [Petersson et al., 1999], which demonstrates that, as in F-masking, information about the experimental setting is used directly by the method.

The approach adopted here stems from a field of statistics known as functional data analysis [Ramsay and Silverman, 1997], which has its roots in much earlier studies on growth curves [Rao, 1958] and in nonparametric regression [Eubank, 1988]. Functional data analysis exploits the fact that functions defined on a specific domain form an inner product vector space, and in most circumstances can be treated algebraically like vectors. Counterparts of conventional multivariate statistical methods are carried out in functional space rather than in the space spanned by vectors of individual observations.

Consequently, the data delivered by the functional magnetic resonance imaging (fMRI) scans will here be considered as continuous functions of time sampled at the interscan interval and subject to observational noise. These functions may be estimated by fitting a set of basis functions to each voxel time series. Collectively, the functions replace the voxels of a series of images with a single "functional image." In functional PCA, the eigenanalysis is carried out directly on these functions.

As a consequence, the eigenanalysis takes place in the space spanned by the basis function set. The decisive advantage of this approach consists in the possibility of specifying a set of assumptions in the choice of the basis function set and in the error functional minimized by the fit. These assumptions will be weaker than the specification of a predefined hemodynamic function and a set of events or conditions as in F-masking, thus preserving the exploratory character of the procedure; however, the assumptions might remain stringent enough to overcome the difficulties of ordinary PCA. At a minimum, these assumptions will include the continuity and some degree of smoothness of the estimated functions, but in principle any set of differential equations may be used to specify a restriction on the functional estimation space. In the following, we will sometimes adopt periodic boundary conditions as a further restriction on the estimated functions to exploit knowledge of the periodicity of the experimental design.

As a multivariate technique, functional PCA differs from other multivariate approaches such as multivariate linear models (MLM) [Worsley et al., 1997] in the way the above-mentioned assumptions are harnessed to constrain the estimate. This point will be expanded upon further in the final discussion.

As an explorative tool, functional PCA also has to contend with other advanced explorative techniques such as independent component analysis (ICA) [McKeown et al., 1998] and clustering [Wismüller et al., 2002]. We comment briefly on the reasons that justify the use of functional PCA in addition or as an alternative to such methods. As a preliminary remark, we note that in an experimental setting the controlled application of the experimental conditions or treatments constitutes a source of systematic variance in the data. A technique that selects components by capturing variance therefore seems a plausible complement to techniques such as ICA that recover a signal based on statistical independence and the extent of its departure from Gaussian normality [Hyvärinen and Oja, 2000; Lee et al., 1999]. In contrast, we make no distributional assumption on the form taken by the systematic variance introduced by the experimental manipulations.

Because the aims of functional PCA are exploratory, we will not be concerned with inferential issues such as producing a parametric image of significance values.

SUBJECTS AND METHODS

The fMRI data were obtained from three healthy young volunteers with a 1.5-Tesla Magnetom Vision (Siemens, Erlangen, Germany) whole-body MRI system equipped with a head volume coil after obtaining written consent. For block-design data (working memory and episodic memory encoding), images were obtained using echo-planar imaging (EPI) in axial orientation (T2*-weighted, TR/TE = 3,980/66 ms). Image size was 64 × 64 pixels (3.6 × 3.6 mm). For each volume, 32 slices were acquired. Slice thickness was 3.0 mm with a gap of 0.6 mm; thus, voxel size was isotropic. After discarding the first 6 volumes to allow for equilibration effects, 96 volumes were acquired. For further details on the tasks, see Walter et al. [2003a] and Grön et al. [2001]. For event-related data (TR/TE = 2,496/66 ms), in-plane resolution of the axial images was the same as above; 22 slices were acquired for each volume, and slice thickness was again 3.6 mm. During the event-related paradigm, the subject had to press a response button with his right index finger after a visual cue. After discarding the first 6 volumes, within the time series of the remaining 102 volumes there were nine repetitions with a mean interstimulus interval (ISI) of 28.1 s.

All code was developed on MATLAB 6.1 R12 (MathWorks, Natick, MA) installed on a Pentium PC running Windows 2000 (Microsoft, Redmond, WA). Our implementation is written as a statistical parametric mapping (SPM) toolbox (SPM99; Wellcome Department of Cognitive Neurology, London; online at http://www.fil.ion.ucl.ac.uk). We therefore made use of the routines in that package for realignment, stereotactic normalization, segmentation, smoothing, application of a high-pass filter, and visualization of data [Ashburner and Friston, 1997], as well as for the comparison analyses using SPMs obtained from an experimental regressor [Holmes et al., 1997]. For the functional data analysis methods, our code builds on the library developed by Ramsay and Silverman [2001]. To display slices of principal component images, software developed by Brett [2000] was used.

Estimation of the Functional Image

Usually, the first step of any functional data analysis consists in subjecting the images to some preprocessing such as realignment, stereotactic normalization, and smoothing [Ashburner and Friston, 1997]. Unless low-frequency signals are of specific interest, a high-pass temporal filter should also be applied. In the case studies presented here, we applied standard normalization procedures and spatially smoothed the data with a Gaussian kernel of 8 mm. To reduce the influence of sources of undesired variance, a gray matter mask was applied to the image. To prepare the mask, the normalized image was segmented into gray matter, white matter, and cerebrospinal fluid (CSF). The mask was obtained by thresholding the parametric image of gray matter at about 0.1. Finally, the average of each voxel time series was calculated and subtracted from each voxel of the series.

The preprocessed N scans constitute the N × M matrix of data Y, in which each row d_1', d_2', ..., d_N' is one volume, and each column y_1, y_2, ..., y_M is a time series composed of one of the M voxels sampled in each scan. The estimated functional image consists of M image functions f_1, f_2, ..., f_M, each obtained from independently fitting the basis function set to the columns of Y. In the fit, therefore, the predictor variable is an index representing the time point at which the scan was taken. The response variable is the signal recorded by the scanner at a specific voxel, after the preprocessing steps described above.

The ways in which the predictors are obtained may be classified into two broad groups. When the experimental design consists of a periodic repetition of a sequence of conditions (as is commonly the case in a block design, for example, or in an event-related design where the stimuli are presented at sufficiently large intervals), the time indices of each point of the voxel time series may be folded around the period of the repetition. In this case, the range of the predictor variable is not the time span of the whole series, but that of the period itself. A variation of this scheme, applicable to relatively rare events, consists in defining a time window that is applied repeatedly in correspondence with each event, then discarding the other scans. This first group of "periodic" schemes is characterized by the fact that knowledge about the occurrence of an experimentally controlled condition is incorporated in the fitting process.

The second type of scheme for the generation of the predictor vector applies when the events or conditions are not known, occur irregularly, or occur so close to each other that the resulting hemodynamic response cannot be expected to be repeated periodically. In this "nonperiodic" case, the predictor variable ranges over the time points of the scans. Finally, in all schemes the indices of the voxel time series are shifted to take interslice acquisition time into account, unless the volumes were preprocessed previously to correct for the interslice acquisition delays.

Formally, these schemes define a function t(n), mapping the elements of the scan sequence n = 1, 2, ..., N to the elements of a predictor vector t. If we denote with a and b the minimum and maximum of the elements of t, the closed interval [a, b] represents the domain over which the estimated functions will be defined. (More precisely, if the predictor vector has been shifted to account for the interslice acquisition time and is not periodic, not all voxel time series will start and end at the same time point. In this case, the functions will be estimated on the interval containing the points shared by all intervals. In the interest of clarity, we will ignore this largely implementational detail in the rest of the discussion.) Within the interval [a, b], P will be the number of distinct points at which a scan is taken. For periodic schemes, P < N if repeated observations are taken at the same point of the period; in all other cases, P = N.
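As an illustration, the following MATLAB fragment sketches a periodic predictor vector of this kind; the numerical values are taken from the block-design acquisition described above, and all variable names are ours for illustration rather than part of the actual toolbox.

    % Sketch: predictor vector t(n) for a periodic scheme, obtained by folding
    % the scan acquisition times around the period of the experimental block.
    N      = 96;                    % scans retained after discarding dummies
    TR     = 3.98;                  % interscan interval in seconds
    period = 64;                    % period of the block design in seconds
    scanTimes = (0:N-1)' * TR;      % acquisition time of each scan
    t = mod(scanTimes, period);     % folded predictor: t(n) lies in [0, period)
    % For a nonperiodic scheme, t would simply be scanTimes itself.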

We can now specify our model as follows:

y_{nm} = f_m(t(n)) + \varepsilon_{nm}, \quad n = 1, 2, \ldots, N; \; m = 1, 2, \ldots, M     (1)

where the value y_nm of the mth voxel in the nth volume is given by a function f_m sampled at the time points of the predictor vector, and the error term ε_nm is subject to the usual assumptions of being distributed independently and identically with zero mean and constant finite variance. To ensure their tractability by the mathematical machinery that follows, we will also need to make some preliminary minimal assumptions on all f_m of little practical consequence, such as being continuously differentiable with square integrable second derivative.

Each function is estimated separately in each voxel time series; hence, to simplify notation, the subscript m indexing the voxel will not be specified when the context allows it. To estimate each f as a continuous function of t, t ∈ [a, b], we will express it as a linear combination of a complete set of K suitable basis functions φ_k:

f(t) = \sum_{k=1}^{K} c_k \phi_k(t),

so that the outcome of the estimate will be, as in a standard linear model, the set of coefficients c_k. The best-known basis set is probably the Fourier series

f(t) = c_0 + \sum_{k=1}^{K} (c_{2k-1} \sin k\omega t + c_{2k} \cos k\omega t),

with period 2π/ω. The advantage of this basis set is that for uniformly spaced arguments, the basis functions are orthogonal, and k has an explicit interpretation as the index of the increasing frequency of the basis functions. This basis set, however, tends to impose a uniform curvature on the estimated f across its domain. For this reason, most case studies presented here make use of regression splines, which are functions obtained by joining segments of polynomials smoothly at points called knots. The smoothness of the joints is ensured by imposing equality of a number of derivatives (usually equal to the degree of the polynomial) at the interior knots. Several schemes have been devised to represent splines conveniently [see de Boor, 1978]; the implementation adopted here makes use of B-splines. In principle, other basis sets can be used to model f, but not all can be used subsequently for PCA with the same ease.

The function f is estimated independently at each voxel time series by minimizing the penalized functional

\mathrm{SSPE}_\lambda = \sum_{n=1}^{N} [y_n - f(t(n))]^2 + \lambda \int_a^b [Lf(t)]^2 \, dt = \mathrm{SSE} + \lambda \, \mathrm{PEN}(f).     (2)

This functional is the sum of two terms: the first term SSE is the ordinary sum of square residuals; the second term PEN(f) is a penalty regulating the amount of smoothness of the estimated function f, defined by the linear differential operator L. The positive smoothness parameter λ regulates the trade-off between minimizing the square error of the fit and the penalty term [for details on the statistical properties of this type of estimator, see Eubank, 1988].

By far the most common penalized functional, which was also adopted here, imposes a smoothness constraint by penalizing the square integral of the second derivative of the estimated function:

\mathrm{PEN}_2(f) = \int_a^b [f''(t)]^2 \, dt.     (3)

In the periodic case with repeated observations, and especially when B-splines are used, it is usually necessary to constrain the solution further by excluding functions not satisfying the condition

f^{(u)}(a) = f^{(u)}(b), \quad u = 0, 1, \ldots, l,     (4)

where f^{(u)} is the uth derivative of the function f evaluated at the boundaries a, b of the function domain, and l is the degree of the spline polynomials. This condition specifies that the estimated function at a is the smooth continuation of the function at b. If this condition is not specified, the spline fit tends to be underconstrained in proximity of the boundaries, and the estimate to have high variance in those regions. Given that the BOLD signal generated by neural activity in the gray matter is small, this variance will inevitably dominate in the PCA. After de Boor [1978], satisfaction of this constraint was approximated in our implementation by duplicating the data beyond the two boundaries and fitting the B-spline on the enlarged interval.
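A minimal sketch of this duplication step is given below, assuming t is the folded predictor vector of the earlier fragment and y one voxel time series; the 8-s margin is an arbitrary choice for illustration, not the value used in our implementation.

    % Approximate periodic boundary conditions by duplicating data beyond the
    % boundaries before fitting, so that the estimate wraps around smoothly.
    margin  = 8;                                 % width of the duplicated band (s)
    tailIdx = t > (period - margin);             % points near the right boundary
    headIdx = t < margin;                        % points near the left boundary
    tExt = [t(tailIdx) - period; t; t(headIdx) + period];
    yExt = [y(tailIdx);          y; y(headIdx)];
    % The B-spline is then fitted to (tExt, yExt) and evaluated only on [0, period).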

A remarkable property of the penalized functional in equation (2) is that because we have only a finite number of observations, its unique minimizer can be shown to lie in a finite-dimensional functional subspace, even if its penalty term is specified more generally in a space of infinite dimensions [Poggio and Girosi, 1990; Schoenberg, 1964; Wahba, 1990]. More precisely, the minimizer of the penalized functional SSPE_λ may be expressed as a linear combination of only a finite number K of basis functions φ_k, where K is the minimum number of basis functions that suffice to interpolate a set of P observations occurring at the points of [a, b] when the signal was sampled, and as many linear coefficients c_k:

f(t) = \sum_{k=1}^{K} c_k \phi_k(t).     (5)

Given this finite set of basis functions, the penalized functional may be rewritten in vector notation:

\mathrm{SSPE}_\lambda = \| y - Fc \|^2 + \lambda \, c'Pc,     (6)

where c is the vector of the coefficients of the basis expansion to be estimated, F = {φ_k(t(n))} is the N × K matrix of the basis functions evaluated at the time points of t, and P is the K × K "penalty matrix":

P = \left[ \int_a^b L\phi_q(t) \, L\phi_r(t) \, dt \right], \quad q, r = 1, 2, \ldots, K.     (7)

Differentiating SSPE_λ with respect to c and setting to zero as in ordinary regression gives the estimate of the coefficients

c = (F'F + \lambda P)^{-1} F'y.     (8)

Quite efficient algorithms exist for inverting the matrix in equation (8), because the compact support of many basis function sets (and notably that of B-splines) implies that F'F is banded. For many basis function sets, the matrix P will also possess properties that facilitate its computation. In the general case, P may be computed by discretizing the integrals at sufficiently small intervals [Press et al., 1988]. For further details, see de Boor [1978], Green and Silverman [1994], or Ramsay and Silverman [1997].
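The following MATLAB sketch illustrates the fit of equations (2)-(8) for a single voxel time series y with predictor vector t. For brevity it uses a small Fourier basis and a penalty matrix obtained by discretizing the integrals; our toolbox uses B-splines instead, and the basis size and λ here are arbitrary illustrative choices.

    % Penalized basis expansion fit for one voxel time series (y, t assumed given).
    N  = numel(t);
    K  = 7;                                       % constant plus three harmonics
    omega = 2*pi / (max(t) - min(t));
    F  = ones(N, K);                              % basis evaluated at the scan times
    tg = linspace(min(t), max(t), 200)';          % fine grid for the penalty integral
    G2 = zeros(numel(tg), K);                     % second derivatives on the grid
    for j = 1:(K-1)/2
        F(:, 2*j)    = sin(j*omega*t);    F(:, 2*j+1)  = cos(j*omega*t);
        G2(:, 2*j)   = -(j*omega)^2 * sin(j*omega*tg);
        G2(:, 2*j+1) = -(j*omega)^2 * cos(j*omega*tg);
    end
    P = G2' * G2 * (tg(2) - tg(1));               % discretized penalty matrix, eq. (7)
    lambda = 100;                                 % smoothness parameter (to be tuned)
    c    = (F'*F + lambda*P) \ (F'*y);            % penalized coefficients, eq. (8)
    yhat = F * c;                                 % smooth fit evaluated at the scans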

Principal Component Analysis

Given a vector of interrelated random variables x_i with zero mean, i = 1, 2, ..., Z, a step-by-step procedure to carry out ordinary PCA is described below. To obtain the first component, find the direction in space such that the projections of the occurrences of the random vector x_i on it achieve maximal variation. Representing this direction as a vector of unitary length g, we have

h_i = \langle x_i, g \rangle,     (9)

where we will choose g so as to maximize Z^{-1} Σ_i h_i^2 = Z^{-1} g'X'Xg subject to the constraint ||g||^2 = 1. The matrix X is formed by the random vectors x_i' taken row-wise. For the subsequent components, we repeat this step, subject to the additional constraint that each new component be orthogonal to all previous components. We cannot determine the components uniquely, because the same direction in space is represented by two vectors having the same elements but opposite signs. Standard linear algebra arguments show that these directions can be found as the solutions of the eigenequation

Z^{-1} X'X g = \rho g,     (10)

where the eigenvalue ρ is the variation of the random vector projected on g [Anderson, 1984].
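In MATLAB, these steps reduce to an eigendecomposition of the sample covariance matrix; a minimal sketch follows, with X a hypothetical Z × D matrix of centered observations taken row-wise.

    % Ordinary PCA via the eigenequation (10).
    [Z, D] = size(X);
    S = (X' * X) / Z;                        % covariance matrix of the centered data
    [G, Lam] = eig((S + S') / 2);            % symmetrize against round-off
    [rho, order] = sort(diag(Lam), 'descend');
    G = G(:, order);                         % columns of G: principal directions g
    H = X * G;                               % scores on each component; the first
                                             % column gives the h_i of equation (9)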

As mentioned above, one leading theme in functional data analysis exploits the fact that functions defined on an interval [a, b] form an inner product vector space [Kolmogorov and Fomin, 1968]. Multiplication by a scalar and addition of functions are defined as is usually the case for scalars and functions, and in the inner product between two functions f, g defined on [a, b] summation is replaced by integration:

\langle f, g \rangle = \int_a^b f(x) g(x) \, dx,     (11)

implying that

\| f \|^2 = \langle f, f \rangle = \int_a^b [f(x)]^2 \, dx.     (12)

With these definitions in mind, it is easy to check that all properties defining a vector space are satisfied. The basis functions that are used to estimate the functional image are in fact the equivalent in functional space of a set of Euclidean coordinate vectors spanning a discrete estimation space.

To define functional PCA, we only need to replace the inner product and norm of vectors of equation (9) with their equivalent in functional space. In the fMRI setting, the realizations of the random variable are replaced by the image functions f_m that have been estimated from the voxel time series. The first principal component is therefore the eigenfunction g maximizing the score

h_m = \langle f_m, g \rangle = \int_a^b f_m(t) g(t) \, dt     (13)

over the image functions f_m, m = 1, 2, ..., M. The constraint on the length of the eigenfunction becomes

\| g \|^2 = \int_a^b [g(t)]^2 \, dt = 1     (14)

and the orthogonality constraint on any two eigenfunctions g_q, g_r is expressed by

\langle g_q, g_r \rangle = \int_a^b g_q(t) g_r(t) \, dt = 0.     (15)

Like the directions found by its ordinary counterpart, functional PCA produces a set of functions that, when added to and subtracted from the mean image function, indicate the modes of variation in the functional image. As in discrete PCA, it will be convenient to center the functions to avoid the first component mainly representing average activation levels. The eigenfunctions then capture the covariance structure of a symmetric bivariate function ν, which may be defined in analogy with its discrete counterpart:

\nu(y, x) = M^{-1} \sum_{m=1}^{M} (f_m(y) - \bar{f}(y))(f_m(x) - \bar{f}(x)), \quad x, y \in [a, b],     (16)

where \bar{f} = M^{-1} \sum_{m=1}^{M} f_m.
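For display purposes (as in the covariance plots of Fig. 1), this bivariate function can be evaluated on a grid directly from the centered basis coefficients. The sketch below assumes a matrix Ctilde of centered coefficients (M × K), the domain endpoints a and b, and a hypothetical helper evalBasis that returns the K basis functions evaluated at a vector of grid points; neither name is part of the actual toolbox.

    % Evaluate the covariance function of equation (16) on a regular grid.
    M   = size(Ctilde, 1);
    tg  = linspace(a, b, 100)';                   % grid over the function domain
    Phi = evalBasis(tg);                          % 100 x K matrix of basis values
    nuGrid = Phi * (Ctilde' * Ctilde) * Phi' / M; % nu(y, x) sampled on the grid
    imagesc(tg, tg, nuGrid); axis square; colormap gray   % shades-of-gray display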

There are two approaches to carry out PCA in functional space. The older approach is to evaluate the functions on a fine grid, carry out ordinary PCA on the obtained values, and estimate the principal functions by fitting the basis set to the discrete components thus obtained [Rao, 1958]. More recently, however, it was shown that functional PCA can be carried out on finite matrices, finding the eigenvectors and eigenvalues of a positive definite matrix as in the discrete case. Again, this is because, being the outcome of the estimation process, the image functions in question are linear combinations of only a finite number of basis functions. Because the eigenfunctions belong to the same space spanned by the image functions, they will be expressed by a linear combination of the same basis set. Define the K × K matrix U:

U = \left[ \int_a^b \phi_q(t) \phi_r(t) \, dt \right], \quad q, r = 1, 2, \ldots, K.     (17)


This matrix is a particular case of the matrix P in equation (7), and can be computed in the same way. In particular, note that for an orthonormal basis set, U = I. Then it is possible to express the eigenequation of principal component analysis in terms of discrete matrices:

M^{-1} C'CU g = \rho g,     (18)

where C is the M × K matrix of the coefficients of the basis expansion of the centered image functions. Solving equation (18) for ρ and g returns the eigenvalues and the coefficients of the eigenfunctions, respectively. For details, see Ramsay and Silverman [1997].

Once the eigenfunctions have been determined, the principal component scores of all image functions give a set of parametric images, providing a graphic representation of the areas of the brain that load on each eigenfunction. For an eigenfunction g,

h_m = \langle f_m, g \rangle = \tilde{c}_m' U g, \quad m = 1, 2, \ldots, M,     (19)

where the tilde indicates the centered function coefficients. At most min(M, K) eigenfunctions and functional component score images can be computed, but usually only the first few are of interest, as is the case with data contaminated by noise.
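A compact way to solve equation (18) in MATLAB is to turn it into a symmetric eigenproblem through the Cholesky factor of U. The sketch below assumes Ctilde (M × K centered coefficients) and U (K × K, eq. 17) are available; it is an illustration of the computation, not the toolbox code itself.

    % Functional PCA: eigenfunctions and component score images, eq. (18)-(19).
    [M, K] = size(Ctilde);
    R = chol(U);                            % U = R'*R turns (18) into a symmetric problem
    A = (R * (Ctilde' * Ctilde) * R') / M;
    [V, Dg] = eig((A + A') / 2);
    [rho, order] = sort(diag(Dg), 'descend');
    V = V(:, order);
    G = R \ V;                              % columns: basis coefficients of the
                                            % eigenfunctions, normalized so g'*U*g = 1
    scores = Ctilde * U * G;                % M x K component scores, eq. (19); the first
                                            % column reshaped to the voxel grid gives the
                                            % first component score image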

Selection of the Smoothness Parameter

The specification of a smoothness constraint is crucial for the success of the method, and is justified by the notion that the signal recorded by the scanner is contaminated by noise. Another reason to impose a smoothness constraint in nonperiodic settings is that if the estimated function is allowed to interpolate the observations exactly, there will be little difference between functional and ordinary PCA. It is also clear, however, that the amount of imposed smoothness cannot be arbitrarily high. If the estimated functions are excessively constrained, details of the signal of interest or even the signal itself may not be captured by the fit. For this reason, it is important to make a reasonably good choice of the smoothness parameter λ. Usually, however, we have no direct information to assist us in choosing its value. In the following, we will consider three approaches for the selection of the smoothness parameter. In so doing, we will also clarify the theoretical justifications for preferring the functional to the discrete approach in estimating the structure of the covariance of the data, and characterize the advantages of functional PCA with respect to other explorative approaches.

Uniform setting of the smoothness parameter

The first approach is to use visual judgment, trying different values of λ until one is found that seems to reflect the course of the signal well, and apply it uniformly to the whole functional image. This approach should not be dismissed out of hand on the grounds of its subjective character. In many cases, it produces entirely acceptable results. Furthermore, its analysis leads to a more precise formulation of the way in which our smoothness assumptions bear on the outcome of the PCA. Theoretically at least, this approach is therefore of fundamental importance.

We will consider the case of interpolating splines, when the matrix F is full rank and hence invertible, and the data have already been centered. For a given value of λ, let H_λ be the hat matrix from the penalized regression of equation (8):

H_\lambda = F (F'F + \lambda P)^{-1} F'.     (20)

This matrix defines the estimation space of the regression; in fact, it is a linear operator that, when applied to the data vector y, transforms it into the estimate lying in the subspace spanned by the columns of F. Clearly, the subspace of H_λ could be spanned by another basis, and we will express it in terms of the coefficients of its own eigenfunctions [Hastie et al., 2001]. Being positive definite, H_λ has a positive real eigendecomposition:

H_\lambda = QWQ',     (21)

where the columns of Q define the eigenfunctions of H_λ, and W is a diagonal matrix containing the eigenvalues. The columns of Q, appropriately scaled according to their eigenvalues, constitute the alternative "Demmler-Reinsch" basis for the estimation space [Demmler and Reinsch, 1975]. It is important here to mention three things about this basis set. First, whereas the curves of a usual basis set have norm one but are not necessarily orthogonal, the curves of the Demmler-Reinsch basis are orthogonal, but their norm is equal to the respective eigenvalue. Second, when the penalty is PEN_2, the eigenvalues order the eigenvectors according to their smoothness, i.e., smoother curves have larger eigenvalues. Expressed in this form, splines show their affinity with Fourier series, because they are orthogonal and indexed according to the "number of oscillations" [Eubank, 1988]. Third, although the relative order of eigenvalues and eigenvectors remains unaltered, changes of λ affect the absolute size of the eigenvalues. In fact, it can be shown that W = (I + λB)^{-1}, where B is the diagonal matrix of the eigendecomposition (FP^{-1}F')^{-1} = QBQ' (see the Appendix for details).

Rewriting equation (20) in terms of Q, λ, and B, we have

H_\lambda = Q (Q'Q + \lambda B)^{-1} Q',     (22)

a ridge regression [Draper and Smith, 1998; Hoerl and Kennard, 1970] in which the estimate is shrunk differentially according to the value of the diagonal elements of B. Remembering that these latter are also an index of the number of oscillations of the corresponding eigenfunction, we see that the penalty term acts as a form of shrinkage toward smoother curves, rather than toward smaller coefficients as in ordinary ridge regression. The amount of shrinkage is determined by λ through its indirect effect on the size of the trace of W. Note that those functions that satisfy Lf = 0 in equation (2), and therefore incur no penalty, are not shrunk at all.
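This shrinkage behavior is easy to verify numerically. Reusing the F, P, and lambda of the fitting sketch above (our own illustrative names, not toolbox variables), the eigenvalues of the hat matrix lie between 0 and 1, with rougher directions shrunk more and unpenalized directions left untouched.

    % Shrinkage view of the penalized fit: eigendecompose the hat matrix of eq. (20).
    H = F * ((F'*F + lambda*P) \ F');         % hat matrix H_lambda
    w = sort(eig((H + H') / 2), 'descend');   % shrinkage factors, between 0 and 1
    % The direction with Lf = 0 (here the constant basis function) has w close to 1,
    % i.e., it is not shrunk; rougher directions have smaller w, and increasing
    % lambda pushes all penalized factors toward zero.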

The connection with ridge regression has several important implications.

First, as in ridge regression, the direction of shrinkage is interpretable as a Bayesian prior on the distribution of the estimate [Kimeldorf and Wahba, 1970]. The differential operator L therefore specifies a prior belief on the result of the estimate, which favors smooth function estimates in the sense specified by the penalty term. This prior is appropriate for the recovery of a hemodynamic response function, because typically data in an fMRI experiment are collected at higher frequencies than are those of the reconstructed BOLD response [Turner et al., 1997].

Second, the estimate of the covariance resulting from the penalized fits differs from the maximum likelihood estimate in that it too has been subjected to shrinkage toward lower frequencies. Lower-frequency signals will therefore tend to be overrepresented in the first components of the eigendecomposition, even if they explain proportionally less variance in the original data.

Third, because of the bias toward lower frequencies, the application of a high-pass filter may be used to investigate the presence of specific signals in the data, because the roughness penalty will preferentially extract the frequencies immediately below the filter threshold and place them among the first components (if these frequencies have enough support in the data). The standard approach of applying a high-pass filter with a threshold at twice the period of the experimental paradigm to remove signals of no interest, such as a low-frequency drift, before analysis [Turner et al., 1997] is thus all that is required to investigate the presence of a signal of interest generated by the alternation of the experimental conditions.

This is in contrast with other explorative techniques, which generate a number of "components" with little information about how to select from them. In the case of clustering [Wismüller et al., 2002], for example, the number of voxels assigned to a time-course cluster according to a nearest-neighbor scheme cannot be used to select the signal of interest, because there is generally no reason to assume that the signal of interest is present in any but a small number of voxels. Similarly, in ICA [McKeown et al., 1998], it is unclear how the component that represents the signal of interest should be selected.

Automatic selection by generalized cross-validation

In the second approach, an attempt is made to assign the value of λ automatically from the data themselves using generalized cross-validation [for a justification of the procedure, see Green and Silverman, 1994; Wahba, 1990]. The aim is to set a value of λ that reflects an estimate of what in the data is signal, discarding higher-frequency variance. Importantly, this approach allows setting λ to a different value in each voxel series.

The generalized cross-validation score is given by

\mathrm{GCV}_\lambda = \frac{N \, \| (I - H_\lambda) y \|^2}{[\mathrm{trace}(I - H_\lambda)]^2} = \frac{N \, \mathrm{SSPE}_\lambda}{(\text{df residuals})^2},     (23)

where trace(·) is the matrix trace. An intuitive understanding of how this score works can perhaps be gained if it is viewed as a corrected ratio between the sum of square errors and the square of a value representing the degrees of freedom of the subspace of the residuals. The optimal λ is given by the minimum cross-validation score, by which the least square errors sum is obtained with the largest number of degrees of freedom in error space. Unlike direct cross-validation, GCV_λ does not require resampling the data for its computation. The score is calculated from a series of fits on a range of λ values, after which the value of λ may be chosen according to the lowest score. For a more exact determination of λ, this procedure may be iterated restricting the range of λ to the lowest two scores.
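In code, the selection amounts to evaluating equation (23) over a grid of candidate values; the sketch below reuses the F, P, and y of the fitting sketch above (illustrative names, arbitrary grid).

    % Generalized cross-validation over a grid of smoothness parameters, eq. (23).
    lambdas = 10 .^ (-2:0.25:4);                   % candidate values of lambda
    gcv = zeros(size(lambdas));
    N = size(F, 1);
    for i = 1:numel(lambdas)
        H   = F * ((F'*F + lambdas(i)*P) \ F');    % hat matrix of eq. (20)
        res = (eye(N) - H) * y;                    % residuals at this lambda
        gcv(i) = N * (res' * res) / trace(eye(N) - H)^2;
    end
    [gcvMin, best] = min(gcv);
    lambdaGCV = lambdas(best);                     % value used to refit this voxel series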

An interesting aspect of the generalized cross-validation score is that the related ratio SSPE_λ/trace(I − H_λ) for the selected λ may be interpreted as an estimate of the variance of the error term ε of our model, under the assumption that the signal is smooth [Green and Silverman, 1994]. In fact, under this assumption, generalized cross-validation represents a theoretically motivated attempt to separate genuine signal from noise [Carew et al., 2003; Craven and Wahba, 1979]. Because the smoothed estimates of the activation functions translate into a smooth covariance matrix, generalized cross-validation eventually determines what part of the lower frequencies should be retained in the covariance matrix to be later recovered by the eigendecomposition, and what part should be discarded.

The justification for estimating a separate smoothing parameter for each voxel series is that we do not expect the proportion of noise and genuine signal to be uniform over the fMRI image. The variance of the voxels of the ventricles is usually larger, and through spatial smoothing or imprecise segmentation, some of this variance may spill over to the gray matter. Equally important is variance originating from aliased signal, whose spatial distribution is also unlikely to be uniform. In keeping with this expectation, we observed that both the GCV_λ minimum and the related estimate of the error variance vary across the image.

One possibly problematic aspect of generalized cross-validation is that, even when calculated from spatially smoothed data, the GCV_λ minimum does not always vary smoothly across adjacent voxels. This might be because, as Wahba [1990] notes, the theory justifying generalized cross-validation is an asymptotic one, and therefore departures from a good estimate of the smoothness parameter may occur with a finite number of observations. Wahba [1990] reported that even when the observations are in the hundreds, two or three obviously wrong estimates are produced every thousand repetitions of the sampling. Because in the fMRI setting the number of repeated applications of generalized cross-validation is in the order of tens of thousands, we should expect the occurrence of quite a few wrong smoothness parameters in the estimate. Nevertheless, the results presented below demonstrate the practical effectiveness of generalized cross-validation when it is desirable not to impose explicit assumptions on the range of frequencies of the signal of interest. Confidence in the effectiveness of generalized cross-validation is also bolstered by the results of Carew et al. [2003], who successfully applied generalized cross-validation to fMRI datasets with the purpose of correcting for the autocorrelation of the signal.

A practical problem of generalized cross-validation is that it can require rather lengthy computations. For the interactive exploration of the data, an acceptable procedure might consist of carrying out generalized cross-validation once, taking note of the value of λ for the estimated functions with the highest score for the component of interest, and then applying this value uniformly to other similar datasets, relying on the shrinkage effect of the penalized regression to emphasize the signal of the corresponding frequency across the whole dataset.

Cross-validation has been used here on the data to calculate a different smoothing parameter for each image function, but there are other interesting approaches that need to be explored. One possibility is that of fitting the data with an interpolating function, incorporating the penalty in the PCA, and cross-validating the result at this final stage [Ramsay and Silverman, 1997]. Another approach is that of estimating the smoothness parameter separately for each component [Rice and Silverman, 1991].

Linear smoothers from restricted basis expansions

When the recording sessions are long and nonperiodic predictor vectors are specified, generalized cross-validation becomes impractical, and indeed even just fitting an interpolating B-spline to the whole time series may require memory resources that exceed the capability of the computer. In this case, it is possible to regularize the functional image by specifying a smaller basis function set, i.e., one in which K < P. Because the knots of the B-splines are spaced regularly, the remaining degrees of freedom are distributed evenly across the function domain, resulting in a uniformly smoother estimate. It is then not necessary to specify any penalty term, and the fitting process minimizes the ordinary sum of error squares:

\mathrm{SSE} = \| y - Fc \|^2,     (24)

leading to the familiar estimate

c = (F'F)^{-1} F'y.     (25)

The hat matrix is in this case an ordinary projection matrix, and there is no shrinkage of the estimated coefficients. Clearly, for low enough values of K, the fitted function will not be able to capture the signal of interest even when this latter is present. The advantage of this approach is that its computation is fast, and that of the PCA that follows even more so. If a Fourier basis set is used, it is easy to bind the choice of the number of basis functions to an a priori specification of the dominant frequency of the signal of interest.
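As an illustration of this last point, the sketch below builds a restricted Fourier basis whose highest harmonic is tied to the block period of the working-memory data and fits it by ordinary least squares; the number of harmonics is an arbitrary illustrative choice, and scanTimes, period, y, and N are the variables of the earlier sketches.

    % Restricted Fourier basis tied to the dominant (block) frequency, eq. (24)-(25).
    nHarm = 3;                                     % harmonics of the block frequency
    Kres  = 2*nHarm + 1;                           % constant plus sine/cosine pairs
    Fres  = ones(N, Kres);
    for j = 1:nHarm
        Fres(:, 2*j)   = sin(2*pi*j*scanTimes/period);
        Fres(:, 2*j+1) = cos(2*pi*j*scanTimes/period);
    end
    c = (Fres'*Fres) \ (Fres'*y);                  % ordinary least squares fit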

RESULTS

First Case Study: Working Memory

In the first case study, functional PCA was applied to realigned, stereotactically normalized, smoothed, and high-pass filtered fMRI data from a working memory block design ("two-back") [Cohen et al., 1994]. After Turner et al. [1997], the filter threshold was set to twice the period of the block. In the first half of the block, subjects were required to press a button indicating if the presented stimulus was identical to a predefined letter. In the second half of the block, the same decision was taken regarding the sameness of the present stimulus (a letter again) and the stimulus two presentations earlier. The image functions were estimated with B-splines using a periodic scheme for the generation of the predictors, and periodic constraints on the estimated image functions. The number of basis functions was the smallest needed to interpolate the data. The amount of smoothing was determined through generalized cross-validation.

Figure 1 displays the results of the functional eigenanalysis (top), compared to those of the ordinary version of PCA (bottom). The data on which ordinary PCA was carried out were subjected to the same preprocessing (including application of a high-pass filter) as that in the functional version of the algorithm. For a fair comparison, in the calculation of the ordinary version of PCA, the raw data were also folded around the period of the block and averaged before being used to form the discrete variance matrix (for details on the procedure applied to carry out ordinary PCA, see the Appendix). The covariance function of the fit (equation [16]) is displayed by drawing the values taken by the function in shades of gray (top left). Here, the two phases of the block of the experimental paradigm are clearly visible as a pattern of rough squares of higher covariance, shifted to the right and downward because of the BOLD delay. These two areas of higher covariance arise because, in the centered functions from the voxels affected by the paradigm, the off and on phases of the block create correlated deviations relative to the average zero signal. Note that the main diagonal has two drops in intensity corresponding to the points where the signal inverts at the boundaries of the phases, when it crosses zero. For comparison, the discrete covariance matrix of the centered and filtered data is displayed at the bottom left. It is very difficult here to discern the pattern of variance/covariance induced by the paradigm. In the discrete covariance matrix, there are high-frequency signals that obscure the biphasic structure of the experimental block. On the diagonal, four data points give rise to outliers, in contrast to the much more regular distribution of the variance in the functional counterpart.

The scree plots of the eigenanalyses, displayed under the covariance structures, indicate the existence of a very clear first component in the functional version (top), whereas the discrete PCA produces at least two components that might refer to signal in the data (bottom).


Figure 1. Top: Functional eigenanalysis of the data from the working memory paradigm. Bottom: Ordinary principal component analysis of the same dataset. Left: Covariance function and matrices from which the eigenanalyses are carried out, below which the respective scree plots are displayed. Top center: The first eigenfunction for the memory-encoding paradigm. For display, the direction was chosen that makes the coefficients positive relative to the expected pattern of activation. Bottom center: The first and second eigenvectors, displayed using the same convention. Top right: Signal (black dots) and fitted function (gray) from the voxel with the highest first functional component score, located in BA6. Here and in the following plots of the signal and the fitted function, the Talairach coordinates of the voxel are displayed in the upper left corner of the plot. When calculated by generalized cross-validation and varying in each voxel, the smoothing coefficient λ of the fit is displayed in the lower right corner of the plot. Bottom right: Signal and fitted functions from the voxels with the highest scores on the first and second components from ordinary PCA; the second voxel is here the same as that selected through the first functional component. In this and all subsequent case studies, the data were subjected to the same preprocessing steps before application of both methods. The block design had a periodicity of 64 s. A high-pass filter of 128 s was applied to the series of 96 scans acquired with an interscan interval of 4 s, and the data were subsequently smoothed with a Gaussian kernel of 8 mm. A mask selecting gray matter was applied to select voxels of interest.


Figure 1 also displays the first eigenfunction and the first two eigenvectors resulting from the respective eigenanalyses. The first eigenfunction shows deviations from the average signal corresponding to the two phases of the reconstructed BOLD response function, delayed by about 5 s relative to the boundaries of the block (top center). This eigenfunction captures the pattern of variation clearly visible in the covariance function. In contrast, the first eigenvector from the ordinary PCA shows little relation to the expected BOLD response; only the second eigenvector presents some resemblance to the expected signal (bottom center).

On the right of Figure 1, the signal and fitted functions from the voxels with the highest component scores are displayed. Because they correspond to the maximum of the component scores, these voxels contain the highest variation in the direction detected by the respective principal component. The course of the signal in these voxels therefore has the highest correlation with the respective eigenfunction or eigenvector. This means that the course of the signal deviates from zero and has the same overall shape as the respective eigenfunction or eigenvector.

In functional PCA (top), the eigenanalysis is carried out on all fitted functions. As a result, the displayed fitted function (rather than the raw signal) has the highest correlation with the eigenfunction in the voxel with the highest component score. The extent to which the fitted function follows the variation of the raw signal (black dots in the plot on the right) depends on the smoothing coefficient, which is determined here automatically in each voxel using generalized cross-validation. If the smoothing coefficient is zero, the fit attempts to follow the signal as closely as possible; if it is very large, the fit approximates a straight line irrespective of the variation of the signal over time. The smoothing coefficient found with generalized cross-validation is small enough for the fit to reproduce most of the variation in the signal. In this voxel, therefore, the eigenfunction reproduces the fitted function and the course of the signal well.

In the case of ordinary PCA (bottom), the eigenanalysis takes place over the averaged raw signal. As a result, the average raw signal (rather than the fitted function) has the highest correlation with the respective eigenvector in the voxel with the highest component score. The fitted function here fails to reproduce the eigenvector or the raw signal well. This is due to the quite large smoothing coefficient selected by generalized cross-validation. The voxel with the highest score on the second ordinary principal component is the same as the voxel with the highest score on the first functional principal component, where, as we have seen, the penalized fit follows the signal quite faithfully. The second ordinary eigenvector does not, however, and contains much high-frequency variation that is not prominent in the raw signal or the fitted function. The first functional component and the second ordinary component locate the peak of activation in Brodmann's area (BA) 6 [Talairach and Tournoux, 1988], a part of prefrontal cortex involved in working memory. We could not give a functional interpretation of the activation located by the first ordinary component.

Functional PCA gave good results also when applied to the same data without folding them around the period of the block (Fig. 2). The task became more difficult because no knowledge about the phase and number of the blocks was used. As before, the functions were fitted using B-splines with the amount of smoothing determined by generalized cross-validation. The number of basis functions was chosen to enable the interpolation of the data. The covariance function displays a more complex patterned structure corresponding to the six blocks of the experimental design (top left), which are clearly visible in the first eigenfunction (top center). This pattern is much more difficult to recognize in the maximum likelihood estimate of the covariance matrix of the raw data (bottom left), where some single data points look like outliers. Consequently, it is difficult to identify the activation induced by the blocks in the eigenvectors produced by ordinary PCA (bottom center). In contrast, the blocks are clearly retrieved by the first functional component, which comes close to replicating the experimental regressor produced by the SPM package [Friston et al., 1995b]. The voxel with the highest functional component score (top right) is situated, as in the previous analysis, in BA6. As in the previous analysis with folded data in Figure 1, one can see that raw signal, fitted function, and eigenfunction in the voxel with the highest first component score are in good agreement with each other and with the experimental regressor (top center and right). We therefore conclude that the variation identified by the first functional component represents well the systematic variation that must have been introduced by the alternation of the conditions of the experimental setting (the blocks of the paradigm). By contrast, the picture resulting from the comparison of raw signal, fitted function, and eigenvectors resulting from the first two components of ordinary PCA (bottom center and right) is much more difficult to interpret. Even if not immediately apparent to the naked eye, there is necessarily some correlation between the raw signal and the respective eigenvector. Generalized cross-validation discarded all variation in these voxels, however, so that the fitted function is a flat line (i.e., there is no signal left). This seems appropriate for our purposes, because the course of the signal in these voxels seems to have little in common with the experimental regressor.

Figure 3 displays the comparison between the “beta im-age” produced by the experimental regressor using the SPMpackage (top row) and the images composed of the firstfunctional component scores (second and third rows fromtop). The beta image contains the coefficients of a regressorobtained by convoluting a boxcar function representing theconditions of the block with a standard hemodynamic func-tion (Fig. 2, center). The beta image from the experimentalregressor and the functional component score images arevery similar to each other. Moreover, the transverse slicesdemonstrate parts of the frontal-parietal network usuallyobserved during two-back tasks in fMRI [Cohen et al., 1994;


Moreover, the transverse slices demonstrate parts of the frontal-parietal network usually observed during two-back tasks in fMRI [Cohen et al., 1994; Walter et al., 2003a,b], constituting parts of the phonological loop as described by Baddeley [1992]. Activation in BA6 is particularly prominent here, with a lack of diffuse activation in the functional component score image from the data that were not folded around the period of the block (third row from top). This is a consequence of generalized cross-validation, which discarded most variance in these voxels. Ordinary PCA (not folded, bottom row) was unable to demonstrate relevant foci of activation. Folded ordinary PCA did (fourth row); however, it was necessary to select the second component after visual inspection.
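For reference, the following sketch shows how a block regressor of this kind can be built by convolving a boxcar with a double-gamma hemodynamic response. The gamma parameters, TR, and block timing are illustrative assumptions, not the exact SPM defaults or the timing of this paradigm.

import numpy as np
from scipy.stats import gamma

def canonical_hrf(t):
    # Double-gamma hemodynamic response: positive lobe peaking near 5 s,
    # undershoot near 15 s; parameter values are the commonly used canonical
    # ones and should be treated as illustrative.
    peak = gamma.pdf(t, a=6)
    undershoot = gamma.pdf(t, a=16)
    h = peak - undershoot / 6.0
    return h / h.sum()

# Illustrative block design: 30 s on / 30 s off, TR = 3 s, 120 scans
TR, n_scans = 3.0, 120
scan_times = np.arange(n_scans) * TR
boxcar = ((scan_times % 60.0) < 30.0).astype(float)

# Convolve the boxcar with the HRF sampled at the TR and truncate to the run length
hrf = canonical_hrf(np.arange(0.0, 32.0, TR))
regressor = np.convolve(boxcar, hrf)[:n_scans]

# A "beta image" is then obtained by regressing each voxel time series on this
# regressor (plus confounds), one voxel at a time, with ordinary least squares.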

Second Case Study: Episodic Memory Encoding

In the second case study, functional PCA was carried out on fMRI data from a block design for a memory encoding paradigm [Gron et al., 2001].

Figure 2. Eigenanalysis of the working memory data of Figure 1 without folding around the period of the block. As in Figure 1, the covariance function and matrix and the scree plots from the respective eigenanalyses are on the left, and the first eigenfunction and the first two eigenvectors (gray) are in the center. Eigenfunction and eigenvectors have been displayed with the experimental regressor created by SPM by convolving the blocks of the paradigm with a canonical BOLD response (black). The signal and fitted-function charts on the right are chosen according to the highest component score of the respective PCA. The voxel individuated by functional PCA is the same as that in the analysis with folded data. The two voxels from ordinary PCA are situated in the frontal lobe. In both cases the smoothing coefficient is the highest allowed by the implementation.


In the first part of the block, two abstract geometric patterns were presented, each for 6 s. In the second part of the block, the subject viewed red concentric circles for 20 s. The subject was required to intentionally memorize the stimuli for later recall. Each block was presented five times. As in the first case study, the data were first realigned, stereotactically normalized, smoothed, and high-pass filtered. As in the previous case study, the image functions were estimated with B-splines, using a periodic scheme for the generation of the predictors and periodic constraints on the estimated image functions after folding the data around the period of the block. Enough basis functions were used to allow interpolation of the sampling points, whereas the smoothness parameter was estimated by generalized cross-validation (Fig. 4).
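The following sketch only illustrates what folding the data around the period of the block amounts to: scan times are taken modulo the block period, so that all repetitions of the block overlay a single cycle, to which a smoother with periodic constraints can then be fitted. The TR, number of scans, and 32-s period are taken from the description of this dataset; everything else is an illustrative assumption, not the authors' implementation.

import numpy as np

def fold_series(y, tr, period):
    # Overlay all cycles of a voxel time series onto one block period.
    # Returns the within-period sampling times (sorted) and the corresponding
    # observations; a periodic smoother is then fitted to these points.
    t = np.arange(len(y)) * tr
    phase = t % period                  # position of each scan within the block
    order = np.argsort(phase)
    return phase[order], y[order]

# Illustrative use: 40 scans, interscan interval 4 s, block period 32 s
rng = np.random.default_rng(1)
tr, period, n_scans = 4.0, 32.0, 40
t = np.arange(n_scans) * tr
y = np.sin(2 * np.pi * t / period) + 0.5 * rng.normal(size=n_scans)

phase, y_folded = fold_series(y, tr, period)
# phase now runs over a single 32-s cycle, with every block contributing points.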

In this case study, we find again that the covariance function from the smoothed fit reveals the biphasic structure of the block more clearly than does the covariance matrix of the raw data. Unlike the previous case, the first eigenvector from ordinary PCA displays a good resemblance to the expected activation, although it tends to detect the variance in the middle of the block to the detriment of the boundaries. In contrast, variance at the boundaries is captured by the second eigenvector. As a result, the eigenfunction is again a much better reconstruction of the signal in the area of interest (top right). The comparison of the component score images reveals that this difference is enough to lead to some minor mismatch in the detection of the activated areas.

To explore the robustness of the method, we changed the basis set to carry out the functional PCA on the same data without folding them around the period of the block. We used a nonperiodic scheme to generate the predictors and a Fourier series as the basis function set.

Figure 3. Working memory paradigm. Top row: slices from the beta image from the regression obtained with the SPM package. Second row: slices from the first functional principal component scores image, using a periodic scheme for the generation of the predictors after folding the data around the period of the experimental block. Third row: slices from the first functional principal component scores image, using a nonperiodic scheme for the generation of the predictors. Fourth row: second principal component scores image from an ordinary PCA on the data folded around the period of the block. Bottom row: first principal component scores image from an ordinary PCA on the not folded data. The component score images derive from the component direction that best matches the beta image of the SPM linear model.


The use of a Fourier series implies the periodicity of the modeled function but no knowledge of the phase of the block. Unlike the previous analyses, we used a fixed smoothness parameter to obtain a quick computation of the functional fits (Fig. 5).
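A minimal sketch of a penalized fit in a Fourier basis with a fixed smoothness coefficient follows. The curvature penalty, which is diagonal in this basis, and the synthetic data are assumptions of the sketch rather than the implementation used here; only the fixed coefficient of 120 echoes the value mentioned in the legend to Figure 5.

import numpy as np

def fourier_basis(t, period, n_harmonics):
    # Constant term plus sine/cosine pairs up to n_harmonics
    cols, freqs = [np.ones_like(t)], [0.0]
    for k in range(1, n_harmonics + 1):
        w = 2 * np.pi * k / period
        cols += [np.sin(w * t), np.cos(w * t)]
        freqs += [w, w]
    return np.column_stack(cols), np.asarray(freqs)

def penalized_fourier_fit(t, y, period, n_harmonics, lam):
    # Least squares in the Fourier basis with a roughness penalty that
    # integrates the squared second derivative of the fit; in this basis the
    # penalty is diagonal, each coefficient weighted by its frequency**4.
    F, w = fourier_basis(t, period, n_harmonics)
    P = np.diag(w ** 4) * period / 2.0
    coef = np.linalg.solve(F.T @ F + lam * P, F.T @ y)
    return F @ coef

# Illustrative usage: the basis period is set to the run length, so no knowledge
# of the phase or period of the block is assumed
rng = np.random.default_rng(2)
t = np.arange(40) * 4.0
y = np.sin(2 * np.pi * t / 160.0) + 0.5 * rng.normal(size=t.size)
smooth = penalized_fourier_fit(t, y, period=160.0, n_harmonics=10, lam=120.0)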

As in the working memory case study, the structure of the block emerges more clearly from the functional version of the eigenanalysis. The first eigenfunction closely replicates the experimental regressor (top center). Although dominated by an outlier, the first eigenvector manages to capture some of the variation induced by the experimental paradigm. In contrast, the second eigenvector retrieves variance in voxels where it seems to consist mostly of noise (bottom right).

The analysis of the component score images (Fig. 6) reveals that although the activation from functional PCA corresponds closely to that retrieved by SPM, ordinary PCA sometimes leads to incorrectly active areas and sometimes misses important activations. The SPM analysis (top) demonstrates a subset of clusters of active voxels in regions expected to be recruited during episodic learning of visual material [Desgranges et al., 1998; Gabrieli, 1998]: lingual gyrus, inferior occipital gyrus, inferior and superior parietal lobule, and anterior cingulate gyrus. The same clusters of activation were identified by functional PCA (folded and not folded; second and third rows).

Figure 4. Top: functional eigenanalysis of the data from the episodic memory paradigm. Bottom: ordinary principal component analysis of the same dataset. For a detailed explanation, see the legend to Figure 1. The signal and fitted function for functional PCA are from a voxel situated in BA18. The voxels for ordinary PCA are those with the highest scores on the respective eigenvector. The block design had a periodicity of 32 s. A high-pass filter of 64 s was applied to the series of 40 scans, acquired with an interscan interval of 4 s and subsequently smoothed with a Gaussian kernel of 8 mm. A mask selecting gray matter was applied to select voxels of interest.


In contrast, ordinary PCA (folded, fourth row) fails to locate the lingual gyrus activation. Ordinary PCA on the not folded data (bottom row) demonstrates various additional regions that were not expected to be relevant in episodic learning, especially the insular cortex bilaterally and residual posterior white matter at the level z = 35.

Third Case Study: Finger Tapping

To investigate the effectiveness of functional PCA in a different setting, we applied it to an event-related design in which the subject was required to press a button with her right index finger upon presentation of a visual stimulus.

Because the stimuli in this experiment do not occur regularly, a periodic scheme for the regressor with data folding around the period of the stimulus is not applicable. In this case study, instead of comparing folded and not folded data, we investigate the role of generalized cross-validation by comparing it to a setting in which smoothing is carried out by reducing the number of bases and applying a predefined smoothness coefficient.

The signal is strong enough to lead to its individuation in both the functional and ordinary versions of PCA. The slightly superior performance of functional PCA with generalized cross-validation, however, can still be appreciated in the shape of the scree plot.

Figure 5. Eigenanalysis of the episodic memory data without folding around the period of the block, using a Fourier basis set with a fixed smoothness coefficient of 120. The signal and fitted function for functional PCA are situated in the same position as in the preceding figure. The voxels for traditional PCA are those with the highest scores on the respective eigenvector. For an explanation of the figure, see the legend to Figure 1.


Although ordinary PCA retrieves several components that seem to explain some portion of variance, generalized cross-validation eliminates any plausible candidate component after the first (Fig. 7). This is particularly remarkable in the case of the second component of ordinary PCA, where all variance has been discarded by generalized cross-validation.

For a large dataset like this one, computation of the generalized cross-validation smoothing parameter for a fit with interpolating splines is computationally intensive (about 3.5 hr on a Pentium PC). In contrast, fitting the same data using B-splines and half the number of bases that would have been required to interpolate the data took about 3.5 min. Although the covariance function here includes variance that partially obscures the pattern of the experimental events, the first eigenfunction still displays a very good similarity to the expected BOLD response (Fig. 8).

Figure 9 shows the comparison between the beta image obtained with the SPM package and the component score images obtained with the different methods. For the tapping task, the reference analysis (top), applying the general linear model, demonstrates a network of motor and premotor areas (contralateral precentral gyrus, left lateral middle frontal gyrus, and dorsal superior frontal gyrus; BA6). This pattern could be replicated with minor variations in all forms of PCA (second to bottom rows). Although all images display fairly concordant results, the areas of activation in the component score image deriving from the application of generalized cross-validation are isolated, instead of being surrounded by a penumbra of weakly activated voxels.

Figure 6. Episodic memory paradigm. Top row: slices from the contrast image from the regression obtained with the SPM package. Second row: slices from the first functional principal component scores image, using a periodic scheme for the generation of the predictors and folding of the data. Third row: slices from the first functional principal component scores image, using a nonperiodic scheme for generation of the predictors. Fourth row and bottom row: slices from the first ordinary component scores images of the equivalent traditional PCA eigenanalyses.


These weakly activated voxels are especially apparent in the beta image from SPM and in the component score image from ordinary PCA.

The elimination of diffuse activation in the functional component images is an effect of smoothing, which discards variance at high frequencies, and is most pronounced when generalized cross-validation is used. We have noted above (see comments regarding Figs. 1, 2, and 7) that the fitted functions in the voxels with maximum component scores in the traditional PCA are so smooth that most or all variance is discarded by the penalized fit. This is actually a more general phenomenon induced by generalized cross-validation in these datasets, affecting not only the voxels with a high ordinary component score, but also most voxels outside the areas loading on the main functional component. Figure 10 displays a comparison of selected slices of the component score images where the areas of activation are prominent, and corresponding slices in images composed of the smoothing coefficient estimated by generalized cross-validation.

Figure 7. Top: functional eigenanalysis of the data from the finger-tapping paradigm. Bottom: ordinary principal component analysis of the same dataset. For a detailed explanation, see the legend to Figure 1. The signal and fitted function for functional PCA and the first principal component of ordinary PCA are taken from the same voxel in the precentral gyrus. The signal and fitted function for the second ordinary principal component are from the voxel with the highest score. The smoothing coefficient was the largest allowed by the implementation. A high-pass filter of 60 s was applied to the series of 102 scans, acquired with an interscan interval of 2.49 s and subsequently smoothed with a Gaussian kernel of 8 mm. A mask selecting gray matter was applied to select voxels of interest.


The high values of the smoothing coefficient in the areas showing no activation in the component score images mean that in these voxels the variance has been discarded by generalized cross-validation. The component score is therefore approximately zero in all these voxels.

Although Figure 10 is necessarily limited to displaying selected slices, a rough idea of the extent of this phenomenon is given by the scree plots of the functional PCAs from the generalized cross-validation fits (Figs. 2 and 7). These scree plots show that, apart from the first component, little variance remains in the data.

DISCUSSION

There are two concurring reasons why functional PCA is effective in recovering the signal generated by the experimental manipulations. The first is that, as discussed in the section on smoothing amount selection, in the typical fMRI experimental setting the frequency of the BOLD signal is low relative to the sampling rate of the scans [Turner et al., 1997]. This means that the signal at adjacent time points is likely to be correlated. In the presence of noise, the functions from which the data are sampled may therefore be better estimated by adding a roughness penalty and thus biasing the estimate toward smooth curves [Eubank, 1988]. The resulting covariance function is also comparatively smoother. The second reason is that, as in all experimental settings, the alternation of the experimental conditions is a source of systematic variance in the data. It therefore seems justified to use a PCA method, which selects components based on the amount of variance, to detect it. Taken together, these two considerations provide the rationale for looking for the signal arising from the experimental manipulations in the first components individuated by functional PCA on the penalized fits.

Comparison With Ordinary PCA

The case studies presented here confirm these theoretical suggestions. Systematic variance introduced by the alternation of experimental conditions is always retrieved by the first functional component. In contrast, the success of traditional PCA in retrieving this variance is much less consistent.

Although these results demonstrate that functional PCA was usually more effective than its ordinary counterpart in retrieving the variance associated with the experimental conditions, one may wonder whether ordinary PCA retrieved some genuine signal that the functional version was missing. It is interesting, in this respect, that the signal retrieved by ordinary PCA is often discarded in a fit carried out with generalized cross-validation. This observation casts doubt on the nature of the signal recovered by ordinary PCA.

The effects of generalized cross-validation are apparent in the improved interpretability of the scree plots. Our case studies indicate, however, that the use of other forms of smoothing (such as a fixed smoothness parameter or a reduced number of basis functions) is also beneficial. Because of their lower computational requirements, these techniques are probably indicated as a first quick explorative approach to the data. As an objective method, generalized cross-validation removes possible interpretation uncertainties generated by a manual selection of the smoothness coefficient in a final analysis.
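For completeness, the sketch below spells out the generalized cross-validation criterion of Craven and Wahba [1979] for a generic penalized linear smoother; the cosine basis and curvature penalty are stand-ins for the B-spline implementation used here, and the data are synthetic.

import numpy as np

def gcv_score(y, F, P, lam):
    # GCV(lam) = (1/N) * ||(I - H) y||^2 / (1 - trace(H)/N)^2,
    # with hat matrix H = F (F'F + lam P)^{-1} F'  [Craven and Wahba, 1979]
    N = len(y)
    H = F @ np.linalg.solve(F.T @ F + lam * P, F.T)
    resid = y - H @ y
    return (resid @ resid / N) / (1.0 - np.trace(H) / N) ** 2

def choose_lambda(y, F, P, grid):
    # Pick the smoothing coefficient that minimizes GCV over a grid
    return grid[int(np.argmin([gcv_score(y, F, P, lam) for lam in grid]))]

# Illustrative stand-in smoother: cosine basis on [0, 1] with a curvature penalty
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 60)
y = np.sin(4 * np.pi * t) + 0.3 * rng.normal(size=t.size)
k = np.arange(12)
F = np.cos(np.pi * np.outer(t, k))           # ordered, increasingly oscillating bases
P = np.diag((np.pi * k) ** 4)                # penalizes high-frequency coefficients
lam_hat = choose_lambda(y, F, P, grid=np.logspace(-8, 2, 41))
print("GCV-selected smoothing coefficient:", lam_hat)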

Comparison With ICA

Because the assumptions behind functional PCA and ICA are quite different, it is helpful to review them to contrast the two approaches and determine their respective appropriate domains of application.

Figure 8. Left: covariance function and scree plot resulting from the eigenanalysis of the finger-tapping experiment, fitted with half the number of bases necessary to interpolate the data together with a fixed smoothing coefficient of 12. Center: the first eigenfunction (gray) is displayed together with the regressor obtained from the SPM package (black). Right: signal (black dots) and fitted function (gray) for the voxel in the dorsolateral prefrontal cortex also displayed in the previous figure.


ICA assumes the components to be independent [Bell and Sejnowski, 1995; McKeown et al., 1998] and non-Gaussian [Hyvarinen and Oja, 2000]. This is a powerful and attractive set of assumptions that makes ICA very competitive for exploratory tasks; however, ICA treats the observed signals as a set of random variables without considering the dependency of adjacent time points. In contrast, functional PCA exploits this dependency by imposing a smoothing constraint that can be fixed or estimated through generalized cross-validation. The resulting shrinkage of the data helps bring the signal of interest to the fore in the first few components.

If the task is instead one of detecting signal unrelated to the experimental paradigm, or if the assumption of smoothness of the signal no longer applies, the rationale for adopting functional PCA is no longer present. In this case, it may be helpful to use different approaches and compare the results in light of their different underlying assumptions.

Comparison With Other Methods Based on the Design Matrix

We address here the difference between functional PCA and other forms of model selection in multivariate methods, such as multivariate linear models (MLM) [Worsley et al., 1997] or parametric mapping with an extended design matrix [Kherif et al., 2002]. The first fundamental difference is that, whereas these methods address the issue of inference on the effects of the experimental paradigm, functional PCA is purely explorative. By specification, our method is required not to use information from the experimental design matrix. On the contrary, use of such information is appropriate and necessary in MLM, a form of canonical correlation analysis, and in SPM.

A further general difference between functional PCA and MLM lies in the estimate of the covariance structure of the data (we use the word structure because in the functional case it is a bivariate function, not a matrix). This point can be easily appreciated by comparing the graphic renderings of the two covariance structures in the preceding figures. Because of the different character of the covariance structures, there is a significant difference between the functional approach and any multivariate method that makes use of the maximum likelihood estimate of the covariance matrix of the data. Whether the use of penalized covariance structures may also be of advantage in functional MRI when applying multivariate methods in general is an issue that lies beyond the focus of the present work.

Figure 9. Finger-tapping paradigm. Top row: slices from the beta image from the regression obtained with the SPM package. Second row: slices from the first functional principal component scores image, obtained with generalized cross-validation. Third row: slices from the first functional principal component scores image, with smoothing obtained by fitting half the number of bases necessary to interpolate the data and a fixed smoothing coefficient of 12. Bottom row: slices from the first ordinary component score image.


At least when the dependent variable is categorical (classification tasks), there are studies suggesting that the functional approach is advantageous [Hastie et al., 1995].
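The difference between the two covariance estimates can be made concrete with a small sketch: the maximum likelihood covariance matrix of the raw, double-centered series is computed alongside the covariance of penalized fits evaluated on the same grid, a discretization of the covariance function. The synthetic data and the use of a GCV smoothing spline as the penalized fit are assumptions of the sketch.

import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(4)
T, V = 60, 300
t = np.arange(T, dtype=float)
signal = np.sin(2 * np.pi * t / 20.0)
Y = 0.3 * signal[:, None] * rng.uniform(0.0, 1.0, V) + rng.normal(size=(T, V))

def double_center(A):
    return A - A.mean(axis=0) - A.mean(axis=1, keepdims=True) + A.mean()

Yc = double_center(Y)
cov_raw = Yc @ Yc.T / V                      # maximum likelihood estimate from the raw data

fits = np.column_stack([make_smoothing_spline(t, Y[:, v])(t) for v in range(V)])
Fc = double_center(fits)
cov_smooth = Fc @ Fc.T / V                   # covariance of the penalized fits on the same grid

# The leading eigenvectors of cov_smooth are typically much smoother than those
# of cov_raw, which is the qualitative difference discussed in the text.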

In the literature stemming from the SPM approach, it has been proposed that uncertainties in the form of the BOLD function may be overcome by forming an expanded design matrix that includes, for example, derivatives of the experimental regressor [Kherif et al., 2002]. To avoid overfitting the data, one then needs to carry out model selection on the expanded design matrix [Kherif et al., 2002]. There is a point of similarity with our approach here, because in both cases the detection of a signal is accomplished by means of model selection (implemented in functional PCA through the shrinkage given by smoothing and generalized cross-validation) instead of using precise information from the experimental paradigm. However, the type and direction of model selection differ in several respects between functional PCA and the expanded design matrix approaches.

Firstly, in functional PCA shrinkage is carried out on the data, not on the design matrix as in Kherif et al. [2002]. Although the motivation for the adoption of PCA and ridge methods in statistical applications is collinearity of the predictors, this typically happens in the context of observed predictor variables [Hastie et al., 2001], not in the context of experimental variables as in most fMRI design matrices. In a multivariate approach to fMRI data, it is the dependent variables that are highly multidimensional and contaminated by noise, so that shrinking the dependent variables seems justified. Furthermore, because it makes very little or no use of the experimental paradigm, functional PCA casts a wider net than do methods based on an expanded design matrix. For example, the expanded design matrix approach would be difficult to apply in a situation in which the onset of the response is not known at all.

Secondly, in the functional approach the direction of the shrinkage is determined by a notion of frequency (coefficients of basis functions with many "oscillations" are penalized by the fit).

Figure 10. Selected slices of the beta images from ordinary (left) and functional PCA (center) for the working memory (top two rows) and finger-tapping (bottom two rows) paradigms, calculated without folding the data around the period of the block. To clarify the reasons for the different distribution of activation in these two methods, the corresponding slices from the images composed of the smoothing coefficients estimated through generalized cross-validation are displayed on the right. The gray patches near the midline in these coefficient images are due to the segmentation process, which classified these voxels as belonging to the white matter compartment. To facilitate comparison, the beta and component score values have been normalized to the unit scale. To obtain adequate contrast in the display, the smoothing coefficient images are on a logarithmic scale.


By contrast, discrete multivariate methods have no notion of frequency. This direction of shrinkage is motivated by extensive experience in the field of nonparametric curve estimation under the assumption that the signal is smooth in the presence of measurement noise. In the comparative illustrations of the covariance structures, one can see that this direction of shrinkage results in qualitative differences in the estimate of the covariance structure, and hence in any multivariate estimate that depends on it.

REFERENCES

Anderson TW (1984): An introduction to multivariate statistical analysis. Second ed. New York: John Wiley and Sons. 675 pp.

Ashburner J, Friston KJ (1997): Spatial transformation of images. In: Frackowiak RSJ, Friston KJ, Frith CD, Dolan RJ, Mazziotta JC, editors. Human brain function. London: Academic Press. p 43–58.

Baddeley AD (1992): Working memory. Science 255:556–559.

Bell AJ, Sejnowski TJ (1995): An information-maximization approach to blind separation and blind deconvolution. Neural Comput 7:1129–1159.

Brett M (2000): Slice display software. Online at http://www.mrc-cbu.cam.ac.uk/Imaging/display_slices.html (accession date 2 November 2002).

Callicott JH, Weinberger DR (1999): Functional MRI in psychiatry. In: Moonen CTW, Bandettini PA, editors. Functional MRI. Berlin: Springer. p 501–512.

Carew JD, Wahba G, Xie X, Nordheim EV, Meyerand ME (2003): Optimal spline smoothing of fMRI time series by generalized cross-validation. Neuroimage 18:950–961.

Cohen JD, Forman SD, Braver TS, Casey BJ, Servan-Schreiber D, Noll DC (1994): Activation of prefrontal cortex in a nonspatial working memory task with functional MRI. Hum Brain Mapp 1:293–304.

Craven P, Wahba G (1979): Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math 31:377–403.

de Boor C (1978): A practical guide to splines. Berlin: Springer. 392 pp.

Demmler A, Reinsch C (1975): Oscillation matrices with spline smoothing. Numer Math 24:375–382.

Desgranges B, Baron JC, Eustache F (1998): The functional neuroanatomy of episodic memory: the role of the frontal lobes, the hippocampal formation, and other areas. Neuroimage 8:198–213.

Draper NR, Smith H (1998): Applied regression analysis. Third ed. New York: John Wiley and Sons. 672 pp.

Eubank RL (1988): Spline smoothing and nonparametric regression. New York: Marcel Dekker. 438 pp.

Friston KJ (1997): Characterising distributed functional systems. In: Frackowiak RSJ, Friston KJ, Frith CD, Dolan RJ, Mazziotta JC, editors. Human brain function. London: Academic Press. p 107–126.

Friston KJ, Frith CD, Liddle PF, Frackowiak RSJ (1993): Functional connectivity: the principal-component analysis of large (PET) data sets. J Cereb Blood Flow Metab 13:5–14.

Friston KJ, Frith CD, Frackowiak RSJ, Turner R (1995a): Characterizing dynamic brain responses with fMRI: a multivariate approach. Neuroimage 2:166–172.

Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ (1995b): Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp 2:189–210.

Gabrieli JD (1998): Cognitive neuroscience of human memory. Ann Rev Psychol 49:87–115.

Green PJ, Silverman BW (1994): Nonparametric regression and generalized linear models: a roughness penalty approach. London: Chapman and Hall. 182 pp.

Gron G, Bittner D, Schmitz B, Wunderlich AP, Tomczak R, Riepe MW (2001): Hippocampal activations during repetitive learning and recall of geometric patterns. Learn Mem 8:336–345.

Hastie TJ, Buja A, Tibshirani RJ (1995): Penalized discriminant analysis. Ann Statist 23:73–102.

Hastie TJ, Tibshirani RJ, Friedman J (2001): The elements of statistical learning. Data mining, inference, and prediction. New York: Springer. 533 pp.

Hoerl AE, Kennard RW (1970): Ridge regression: biased estimation for non-orthogonal problems. Technometrics 12:55–67.

Holmes A, Poline JB, Friston KJ (1997): Characterising brain images with the general linear model. In: Frackowiak RSJ, Friston KJ, Frith CD, Dolan RJ, Mazziotta JC, editors. Human brain function. London: Academic Press. p 141–159.

Hyvarinen A, Oja E (2000): Independent component analysis: algorithms and applications. Neural Netw 13:411–430.

Jolliffe IT (1986): Principal component analysis. Heidelberg: Springer. 271 pp.

Kherif F, Poline JB, Flandin G, Benali H, Simon O, Dehaene S, Worsley KJ (2002): Multivariate model specification for fMRI data. Neuroimage 16:1068–1083.

Kimeldorf GS, Wahba G (1970): A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Statist 41:495–502.

Kolmogorov AN, Fomin SV (1968): Introductory real analysis [English translation]. Englewood Cliffs, NJ: Prentice-Hall. 403 pp.

Lee TW, Girolami M, Bell AJ, Sejnowski TJ (1999): A unifying information-theoretic framework for independent component analysis. Neural Comput 10:2103–2144.

McKeown MJ, Makeig S, Brown GG, Jung T-P, Kindermann SS, Bell AJ, Sejnowski TJ (1998): Analysis of fMRI data by blind separation into independent spatial components. Hum Brain Mapp 6:160–188.

McIntosh AR, Bookstein FL, Haxby JV, Grady CL (1996): Spatial pattern analysis of functional brain images using partial least squares. Neuroimage 3:143–157.

Petersson KM, Nichols TE, Poline J-B, Holmes AP (1999): Statistical limitations in functional neuroimaging I. Non-inferential methods and statistical models. Philos Trans R Soc Lond B Biol Sci 354:1239–1260.

Poggio T, Girosi F (1990): Networks for approximation and learning. Proc IEEE 78:1481–1497.

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1988): Numerical recipes in C. Cambridge: Cambridge University Press. 735 pp.

Ramsay JO, Silverman BW (1997): Functional data analysis. Berlin: Springer. 328 pp.

Ramsay JO, Silverman BW (2001): Functional data analysis software, MATLAB edition. Online at http://www.psych.mcgill.ca/faculty/ramsay/software.html (accession date 2 November 2002).

Rao RC (1958): Some statistical methods for comparison of growth curves. Biometrics 14:1–17.


Rice JA, Silverman BW (1991): Estimating the mean and covariance structure nonparametrically when the data are curves. J R Stat Soc B 53:233–243.

Schoenberg I (1964): Spline functions and the problem of graduation. Proc Natl Acad Sci USA 52:947–950.

Talairach J, Tournoux P (1988): Co-planar stereotaxic atlas of the human brain. Stuttgart: Thieme. 122 pp.

Turner R, Howseman A, Rees G, Josephs O (1997): Functional imaging with magnetic resonance. In: Frackowiak RSJ, Friston KJ, Frith CD, Dolan RJ, Mazziotta JC, editors. Human brain function. London: Academic Press. p 467–486.

Wahba G (1990): Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics. 169 pp.

Walter H, Wunderlich AP, Blankenhorn M, Schafer S, Tomczak R, Spitzer M, Gron G (2003a): No hypofrontality, but absence of prefrontal lateralization comparing verbal and spatial working memory in schizophrenia. Schizophr Res 61:175–184.

Walter H, Bretschneider V, Gron G, Zurowski B, Wunderlich AP, Tomczak R, Spitzer M (2003b): Evidence for quantitative domain dominance for verbal and spatial working memory in frontal and parietal cortex. Cortex 39:897–911.

Wismuller A, Lange O, Dersch D, Leinsinger G, Hahn K, Putz B, Auer D (2002): Cluster analysis of biomedical image time-series. Int J Comp Vision 46:102–128.

Worsley KJ, Poline JB, Friston KJ, Evans AC (1997): Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage 6:305–319.

APPENDIX

Ordinary PCA on fMRI Data

Remembering that the $N \times M$ matrix of data $\mathbf{Y}$ is composed of the rows $\mathbf{d}'_1, \mathbf{d}'_2, \ldots, \mathbf{d}'_N$, each being one volume, and of the columns $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M$, each being the time series of one of the $M$ voxels sampled in each scan, the first step in carrying out ordinary PCA is to double-center the data, thus obtaining the centered data matrix $\tilde{\mathbf{Y}}$. The double-centering is appropriate because the data are not composed of "observations" and "variables," but rather of variation in time and space coordinates [Jolliffe, 1986]. It is also the direct counterpart of the procedure adopted here for the functional version of the algorithm. The eigenvectors and eigenvalues may be obtained by carrying out the singular value decomposition of the $N \times N$ covariance matrix $M^{-1}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}'$, which retrieves the solutions of the eigenequation

$$M^{-1}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}'\,\mathbf{g} = \lambda\,\mathbf{g} \qquad (26)$$

in $\mathbf{g}$ and $\lambda$, subject to the constraint $\|\mathbf{g}\|^2 = 1$ (compare with equation [10]). The matrix $M^{-1}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}'$ is the maximum likelihood estimate of the real covariance matrix [Anderson, 1984].
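A direct numerical reading of this procedure is sketched below. The definition of the component scores as projections of the centered data onto each eigenvector is an assumption consistent with the component score images discussed in the text, and the random input stands in for a preprocessed run.

import numpy as np

def ordinary_fmri_pca(Y):
    # Ordinary PCA of an N x M fMRI data matrix (N scans, M voxels):
    # double-center the data, then eigendecompose the N x N matrix (1/M) Yc Yc',
    # as in equation (26); eigenvectors are returned with unit norm.
    N, M = Y.shape
    Yc = Y - Y.mean(axis=0) - Y.mean(axis=1, keepdims=True) + Y.mean()
    C = Yc @ Yc.T / M                        # maximum likelihood covariance estimate
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Yc.T @ eigvecs                  # M x N matrix of component scores
    return eigvals, eigvecs, scores

# Illustrative use with random data in place of a realigned, filtered run
Y = np.random.default_rng(5).normal(size=(60, 1000))
eigvals, eigvecs, scores = ordinary_fmri_pca(Y)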

Derivation of the Demmler-Reinsch Form of Spline Regression

We consider here the case in which the matrix $\mathbf{F}$ is square and of full rank. This will generally obtain when the minimum number of splines is used that can interpolate the data exactly (and would do so, if no penalty term were used). The main difficulty consists in reparametrizing the hat matrix taking into account the penalty matrix $\mathbf{P}$ and the smoothness parameter $\lambda$ separately. To achieve this, we first rearrange the expression for the hat matrix:

$$\mathbf{H}_\lambda = \mathbf{F}(\mathbf{F}'\mathbf{F} + \lambda\mathbf{P})^{-1}\mathbf{F}' = \left[(\mathbf{F}')^{-1}(\mathbf{F}'\mathbf{F} + \lambda\mathbf{P})\mathbf{F}^{-1}\right]^{-1} = \left[(\mathbf{F}')^{-1}\mathbf{F}'\mathbf{F}\mathbf{F}^{-1} + \lambda(\mathbf{F}')^{-1}\mathbf{P}\mathbf{F}^{-1}\right]^{-1} = \left[\mathbf{I} + \lambda(\mathbf{F}')^{-1}\mathbf{P}\mathbf{F}^{-1}\right]^{-1}.$$

If we eigendecompose the quadratic form $(\mathbf{F}')^{-1}\mathbf{P}\mathbf{F}^{-1} = \mathbf{Q}\mathbf{B}\mathbf{Q}'$ and substitute into the expression above, then, remembering that $\mathbf{Q}\mathbf{Q}' = \mathbf{Q}'\mathbf{Q} = \mathbf{I}$,

$$\mathbf{H}_\lambda = (\mathbf{I} + \lambda\mathbf{Q}\mathbf{B}\mathbf{Q}')^{-1} = \mathbf{Q}(\mathbf{I} + \lambda\mathbf{B})^{-1}\mathbf{Q}' = \mathbf{Q}\mathbf{W}\mathbf{Q}',$$

where we let $(\mathbf{I} + \lambda\mathbf{B})^{-1} = \mathbf{W}$. Because $(\mathbf{I} + \lambda\mathbf{B})^{-1}$ is diagonal, we conclude that $\mathbf{Q}\mathbf{W}\mathbf{Q}'$ is also the eigendecomposition of $\mathbf{H}_\lambda$, and therefore that $\mathbf{H}_\lambda$ and $(\mathbf{F}')^{-1}\mathbf{P}\mathbf{F}^{-1}$ share the same eigenvectors $\mathbf{Q}$. From the equation above, we have

$$\mathbf{H}_\lambda = \mathbf{Q}(\mathbf{I} + \lambda\mathbf{B})^{-1}\mathbf{Q}' = \mathbf{Q}(\mathbf{Q}'\mathbf{Q} + \lambda\mathbf{B})^{-1}\mathbf{Q}'.$$

The proof of the common ordering of eigenvalues and of the number of oscillations of the eigenfunctions can be found in Demmler and Reinsch [1975].
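The reparametrization can be checked numerically, as in the following sketch; the second-difference penalty used for P is a generic stand-in for the spline roughness penalty.

import numpy as np

# For a square, full-rank basis matrix F, a symmetric penalty P and lambda > 0,
# the hat matrix F (F'F + lam P)^{-1} F' should equal Q (I + lam B)^{-1} Q',
# where Q B Q' is the eigendecomposition of (F')^{-1} P F^{-1}.
rng = np.random.default_rng(6)
n, lam = 8, 0.7
F = rng.normal(size=(n, n))                  # square and, almost surely, full rank
D = np.diff(np.eye(n), n=2, axis=0)
P = D.T @ D                                  # a symmetric roughness-type penalty

H_direct = F @ np.linalg.solve(F.T @ F + lam * P, F.T)

K = np.linalg.solve(F.T, P) @ np.linalg.inv(F)   # (F')^{-1} P F^{-1}
K = (K + K.T) / 2.0                              # symmetrize against round-off
B, Q = np.linalg.eigh(K)
H_dr = Q @ np.diag(1.0 / (1.0 + lam * B)) @ Q.T

print(np.allclose(H_direct, H_dr))           # True: the two forms agree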
