

NeuroImage 58 (2011) 560–571

Contents lists available at ScienceDirect
NeuroImage
journal homepage: www.elsevier.com/locate/ynimg

Utilizing temporal information in fMRI decoding: Classifier using kernel regression methods

Carlton Chu a,b,⁎, Janaina Mourão-Miranda c,d, Yu-Chin Chiu e, Nikolaus Kriegeskorte f, Geoffrey Tan b, John Ashburner b

a Section on Functional Imaging Methods, Laboratory of Brain and Cognition, National Institute of Mental Health, NIH, USA
b Wellcome Trust Centre for Neuroimaging, University College London, UK
c Centre for Computational Statistics and Machine Learning, Department of Computer Science, UCL, UK
d Department of Neuroimaging, Institute of Psychiatry, KCL, UK
e Department of Psychology, University of California, San Diego, CA, USA
f Medical Research Council, Cognition and Brain Sciences Unit, 15 Chaucer Road, Cambridge CB2 7EF, UK

⁎ Corresponding author at: Section on Functional Imaging Methods, Room 1D80, Building 10, National Institute of Health, 9000 Rockville Pike, Bethesda, Maryland 20892, USA. Fax: +1 301 4021370.

E-mail address: [email protected] (C. Chu).

1053-8119/$ – see front matter. Published by Elsevier Inc. doi:10.1016/j.neuroimage.2011.06.053

Article info

Article history: Received 4 January 2011; Revised 7 June 2011; Accepted 20 June 2011; Available online 25 June 2011.

Keywords: Kernel methods; Machine learning; Kernel ridge regression (KRR); fMRI prediction; Support vector machine (SVM); Relevance vector machines (RVM); Multi-class; Temporal compression; Temporal compacting

Abstract

This paper describes a general kernel regression approach to predict experimental conditions from activity patterns acquired with functional magnetic resonance imaging (fMRI). The standard approach is to use classifiers that predict conditions from activity patterns. Our approach involves training different regression machines for each experimental condition, so that a predicted temporal profile is computed for each condition. A decision function is then used to classify the responses from the testing volumes into the corresponding category, by comparing the predicted temporal profile elicited by each event against a canonical hemodynamic response function. This approach utilizes the temporal information in the fMRI signal and maintains more training samples, in order to improve the classification accuracy over an existing strategy. This paper also introduces efficient techniques of temporal compaction, which operate directly on kernel matrices for kernel classification algorithms such as the support vector machine (SVM). Temporal compacting can convert the kernel computed from each fMRI volume directly into the kernel computed from beta-maps, averages of volumes, or a spatial–temporal kernel. The proposed method was applied to three different datasets. The first is a block-design experiment with three conditions of image stimuli. The method outperformed SVM classifiers with three different types of temporal compaction in single-subject leave-one-block-out cross-validation. Our method achieved 100% classification accuracy for six of the subjects and an average of 94% accuracy across all 16 subjects, exceeding the best SVM classification result, which was 83% accuracy (p=0.008). The second dataset is also a block-design experiment, with two conditions of visual attention (left or right). Our method yielded 96% accuracy and SVM yielded 92% (p=0.005). The third dataset is from a fast event-related experiment with two categories of visual objects. Our method achieved 77% accuracy, compared with 72% using SVM (p=0.0006).



Published by Elsevier Inc.

Introduction

There has been increasing interest in pattern classification of fMRI data. In contrast with the conventional mass-univariate approach, which aims to find the brain regions that correlate with the experimental conditions, the goal of the pattern recognition approach, also known as the decoding framework, is to predict the experimental stimuli and cognitive state from the brain activation patterns. Many studies have shown high accuracies of decoding fMRI patterns with machine learning methods (Chu et al., 2011; Friston et al., 2008; Haxby et al., 2001; Haynes and Rees, 2006; Mitchell et al., 2004; Strother et al., 2004).

Many of the benefits of neuroimaging may come from its potential to make predictions, rather than from the estimates of statistical parameters. Attempts have been made to apply fMRI decoding to practical and clinical applications (Craddock et al., 2009; Davatzikos et al., 2005; Fu et al., 2008). Unlike in decoding work that uses classification accuracy as a surrogate for hypothesis testing (Hassabis et al., 2009; Haynes et al., 2007), improving prediction accuracy is always important in practical applications, for example brain–computer interfaces (BCI). Higher prediction accuracy also plays a major role in the design of real-time fMRI and neural-feedback experiments (Weiskopf et al., 2007). If the experimental stimulus depends on the feedback of the prediction, near-perfect prediction may be necessary.



One common technique to improve predictive accuracy is to apply feature selection over the spatial domain (Chu et al., 2010a; De Martino et al., 2008; Mourao-Miranda et al., 2006). Common methods include filtering out non-informative voxels with univariate statistical maps (e.g. SPM maps), recursive feature elimination (RFE) (Guyon and Elisseeff, 2003), and knowledge-based spatial priors. In the PBAIC 2007 competition (Chu et al., 2011), we found that feature selection based on prior knowledge of brain function significantly improved performance in predicting two of the ratings (barking and inside/outside). In this manuscript, we introduce a different method that improves classification accuracy by utilizing the temporal information. Temporal and spatial information are orthogonal to each other, so one can apply both to obtain the best performance. However, the current work only focuses on optimizing the use of temporal information.

Several studies of fMRI decoding have employed the support vector machine (SVM) or other binary classifiers (LaConte et al., 2005; Mourao-Miranda et al., 2005). In these approaches, fMRI volumes are treated as the input features and the patterns are the strength of the Blood Oxygenation Level Dependent (BOLD) signal. However, there is strong temporal correlation in the fMRI time series, especially due to the delay and smoothing from the hemodynamic response function (HRF). For a block design, a temporal shift is often applied to account for the hemodynamic delay, and the volumes are averaged over each block (Cox and Savoy, 2003). Such strategies ignore the temporal profiles caused by the hemodynamic response. An alternative method, which preserves the HRF information, involves applying regression to obtain parameter maps, sometimes referred to as "beta maps" (Eger et al., 2008; Kriegeskorte et al., 2008). However, all these methods involving temporal compaction greatly reduce the number of training samples, from the number of time points to the number of stimulus trials, and hence hinder the training process, especially when the repeated conditions are few. Inspired by the Pittsburgh Brain Activity Interpretation Competition (PBAIC) (Chu et al., 2011), we propose a novel approach that treats the problem as one of regression. We used a matching function to compare the predicted time series with canonical time series patterns for each target class, selecting the closest match as our prediction. The matching function can also be considered as a simple classifier. This new method greatly improves the prediction accuracy of experimental conditions within a single subject, over the conventional SVM with different temporal compression, especially for small numbers of training trials. We applied the proposed method to three different datasets, with multi-class classification and binary classification. The first and second datasets are from experiments using block designs, whereas the third dataset is from a fast event-related design experiment.

In addition, we also introduce a convenient way to derive a temporally compacted kernel directly from the original input kernel. Conventionally, temporal compaction is applied before generating the kernel matrix from the input features. Because this is a linear operation, we show how to compact the input kernel matrix into three common forms: boxcar averaging, beta-map, and spatial–temporal (temporal concatenation).

Methods

Kernel methods

Kernel methods are a collection of algorithms that, instead of evaluating the parameters in the space of input features, transform the problem into its dual representation. Solutions are therefore sought in the kernel space, and the complexity of the algorithm is bounded by the number of training samples. A simple linear kernel can be calculated from the dot products of pair-wise vectors with equal numbers of elements. We define the input matrix as X = [x1 x2 ⋯ xn]^T, where each row of X is one fMRI volume with d voxels. The linear kernel matrix is calculated by K = XX^T. An advantage of kernel methods is the use of the so-called "kernel trick". Because the algorithm only requires the similarity measures described by the kernel matrix, implicit feature mapping into a higher or even infinite dimensional space is possible, such as when using the radial basis function (RBF) kernel (Shawe-Taylor and Cristianini, 2004). Therefore a nonlinear pattern in the original input space may appear linear in the higher dimensional feature space. Practically, since the images are already in a high-dimensional space, we did not apply any non-linear kernel in this work. Typical kernel algorithms include the support vector machine (SVM), the relevance vector machine (RVM) and Gaussian Process models (GP) (Cristianini and Shawe-Taylor, 2000; Rasmussen and Williams, 2006; Tipping, 2001; Vapnik, 1998).

Intuitively, the kernel matrix can be conceptualized as a matrix of similarity measures between each pair of input features. It contains all the information available about the relative positions of the inputs in the feature space. In other words, if we rotate and translate the data points in feature space, the information contained in the kernel matrix will not change, although the values of the kernel matrix may change. Because the information is encoded in the relative similarity, kernel methods require the training data when making predictions. If a sparse kernel method is used, at least the signature samples (e.g. support vectors or relevance vectors) are required for the predicting phase. An exception is when the linear kernel is used, when it is possible to project the training results back into the original feature space. The general equation for making predictions with kernel methods is

t* = Σ_{i=1}^{N} a_i K(x_i, x*) + b.   (1)

Here, t* is the predicted score for regression; for a classification algorithm, it is the distance to the decision boundary. N is the number of training samples, x_i is the feature vector of the i-th training sample, and x* is the feature vector of the testing sample. Each kernel weight is encoded in a_i, and b is a constant offset, both of which are learnt from the training samples. K(·,·) is the kernel function, which contains dot products for a linear kernel.
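As a concrete illustration, the linear kernel and the prediction rule of Eq. (1) can be sketched in a few lines of NumPy. The data and dual weights here are random placeholders, not weights learnt by any of the algorithms discussed below:

```python
import numpy as np

def linear_kernel(X, Y):
    """Linear kernel between the rows of X (n1 x d) and the rows of Y (n2 x d)."""
    return X @ Y.T

def kernel_predict(a, b, K_test):
    """Eq. (1): t* = sum_i a_i K(x_i, x*) + b.
    K_test has shape (n_test, n_train): test-vs-training similarities."""
    return K_test @ a + b

# Toy "volumes": rows are samples, columns are voxels
rng = np.random.default_rng(0)
X_train = rng.standard_normal((6, 10))
x_test = rng.standard_normal((2, 10))
K = linear_kernel(X_train, X_train)        # 6 x 6 training kernel
K_star = linear_kernel(x_test, X_train)    # 2 x 6 test-vs-train kernel
a = rng.standard_normal(6)                 # dual weights (would be learnt in practice)
t_star = kernel_predict(a, 0.5, K_star)
```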

Detrend using residual forming matrix

Simple linear regression is often used to remove low-frequency components of the fMRI time series at each voxel. This linear procedure can be reformulated as applying a residual forming matrix (Chu et al., 2011), R = I − CC⁺, to the fMRI time series, where C is a set of any basis functions that model the low-frequency components. For example, a linear basis set could model a constant term as well as a linear drift, with C containing one column of the values 1…N and one column of ones. C⁺ = (C^T C)^{−1} C^T denotes the pseudo-inverse of C. For the same input matrix X defined previously, the detrended data can be computed by X_detrend = RX. Recall that the linear kernel is calculated as K = XX^T. To compute the linear kernel from data with drifts removed:

K_detrend = RX(RX)^T = RXX^T R^T = RKR^T.   (2)

This operation is simpler than detrending each voxel separately. In this work, we only use simple linear detrending.
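A minimal sketch of Eq. (2) in NumPy, checking that detrending the kernel directly agrees with detrending the data first. The drifted toy data is invented for illustration:

```python
import numpy as np

def residual_forming_matrix(N):
    """R = I - C C+, with C modelling a linear drift plus a constant term."""
    C = np.column_stack([np.arange(1, N + 1), np.ones(N)])
    return np.eye(N) - C @ np.linalg.pinv(C)

rng = np.random.default_rng(1)
N, d = 20, 50
# Toy data with an additive linear drift over time
X = rng.standard_normal((N, d)) + np.linspace(0, 5, N)[:, None]
R = residual_forming_matrix(N)
K = X @ X.T
K_detrend = R @ K @ R.T            # Eq. (2): operate on the kernel directly
K_direct = (R @ X) @ (R @ X).T     # detrend the data, then form the kernel
assert np.allclose(K_detrend, K_direct)
```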

Temporal compaction of kernel matrices

Because the parameter maps, i.e. beta maps, are obtained from linear regression, it is possible to formulate some forms of temporal compaction as matrix operations. In fact, both "average of volumes" and "beta maps" are weighted linear combinations of the images in the time series. So far, a square residual forming matrix has been described for removing uninteresting signal from the kernel, but other forms of the matrix may


Fig. 1. The pipeline of normalizing the fMRI volumes into a common space (population average). The mean echo-planar images (EPIs) were first segmented using unified segmentation; DARTEL was then used to warp all 16 subjects iteratively to the evolving population average; the final deformation parameters were applied to all the EPIs of each subject, without modulation, followed by smoothing with an 8 mm FWHM Gaussian.


also be used. Instead, it can be a matrix for converting a kernel matrix generated from the original data into the kernel that would be obtained by generating dot products from the parameter images.

Mathematically, we can define a vector of weighting coefficients, p, which has the same number of elements as the number of images in the time series. This weighting vector is generated by taking the pseudo-inverse of the regressor in the design matrix of the corresponding block.

Fig. 2. Illustration of temporal compression using matrix operations. (The original kernel is 252 × 252, from 3 conditions × 6 blocks × 14 volumes each; the "average of volumes" and "beta map" forming matrices each compress it to 18 × 18. The corresponding weighting vectors p are plotted.)

Usually, the regressor is the HRF-convolved block (see Fig. 2) or a boxcar function with six seconds of delay (two TRs in our experiment) after the onset of the stimulus, which accounts for the delay of the peak of the HRF. If every block has the same length, we can use the Kronecker product to generate the "average forming matrix" or "beta map forming matrix" (temporal compressing matrix) by P = I ⊗ p^T, where I is the (number of blocks) × (number of blocks) identity matrix. This approach



can be extended to event-related fMRI as well. If each event is modeled as a separate regressor in the design matrix, the temporal compaction matrix, P, is simply the pseudo-inverse of the design matrix. The new data matrix can be evaluated by X̃ = PX, and the compressed kernel can also be evaluated directly from the original linear kernel generated from all image volumes: K̃ = X̃X̃^T = PXX^T P^T = PKP^T. The dimension of this new kernel will be the number of blocks or events, rather than the number of fMRI volumes in the series (see Fig. 2).
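The Kronecker-product construction P = I ⊗ p^T and the compaction K̃ = PKP^T can be sketched as follows, here for the "average of volumes" weighting. The sizes (6 blocks of 14 volumes) echo the paper's block length but the data is random; for beta maps, p would instead come from the pseudo-inverse of the block's regressor:

```python
import numpy as np

n_blocks, n_vols = 6, 14
p = np.full(n_vols, 1.0 / n_vols)             # "average of volumes" weighting vector
P = np.kron(np.eye(n_blocks), p[None, :])     # P = I (kron) p^T, shape (6, 84)

rng = np.random.default_rng(2)
X = rng.standard_normal((n_blocks * n_vols, 30))
K = X @ X.T                                   # original kernel, 84 x 84
K_compact = P @ K @ P.T                       # compacted kernel, 6 x 6
# Same result as averaging the volumes first, then forming the kernel
assert np.allclose(K_compact, (P @ X) @ (P @ X).T)
```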

There is also another formulation, called "spatial–temporal" (Mourao-Miranda et al., 2007). In this formulation, the images in each block are concatenated into one long vector, hence the input features contain both spatial and temporal information, i.e. the temporal information is not averaged. Unfortunately, this formulation cannot be arranged into the same matrix operation. Therefore, we express each element in the condensed kernel as a sum of weighted kernel elements in the original kernel (Fig. 3).

k̃_ol = Σ_{i=1}^{n} Σ_{j=1}^{n} d_ij k_{n(o−1)+i, n(l−1)+j}   (3)

where k̃_ol is the element at row o and column l in the compressed kernel, and n is the number of image volumes in each block. D is the n by n weighting matrix containing the coefficients d_ij for each of the elements in the original kernel. Because the kernel matrix K is symmetric, the weighting matrix is also symmetric. The weighting matrices for the "beta map" and "average of volumes" can be computed directly from their weighting vector p, by D = pp^T. For the spatial–temporal operation, the weighting matrix is a partial diagonal matrix, such that D_i,i = 0 if i ∉ S and D_i,i = 1 if i ∈ S, where S is the set of images concatenated in the block; it is often selected to be the same set as in the averaging operation. Generally speaking, the full kernel matrix from the entire time series is often utilized in the kernel regression framework,

Fig. 3. Illustration of temporal compression and manipulation using a more general operation: 14 × 14 block operators D compress the original 252 × 252 kernel (3 conditions × 6 blocks × 14 volumes each) into 18 × 18 kernels for the average of volumes, beta map, and spatial–temporal forms. Spatial–temporal can only be derived using this operation.

whereas the compacted kernel matrix is used in classification problems, where the objective is to categorize events.
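Eq. (3) can be sketched as a weighted sum over the n × n sub-blocks of the original kernel. The check below confirms that a diagonal D reproduces the spatial–temporal (concatenation) kernel; the toy sizes and the selection set S are invented for illustration:

```python
import numpy as np

def compress_kernel(K, D, n):
    """Eq. (3): each element of the compressed kernel is a D-weighted sum
    over one n-by-n sub-block of the original kernel."""
    m = K.shape[0] // n                        # number of blocks
    Kc = np.zeros((m, m))
    for o in range(m):
        for l in range(m):
            block = K[o * n:(o + 1) * n, l * n:(l + 1) * n]
            Kc[o, l] = np.sum(D * block)
    return Kc

rng = np.random.default_rng(3)
n, m, d = 4, 3, 10                             # 4 volumes per block, 3 blocks
X = rng.standard_normal((n * m, d))
K = X @ X.T
# Spatial-temporal: D selects (on its diagonal) the volumes concatenated per block
S = [1, 2, 3]                                  # e.g. skip the first volume of each block
D = np.diag([1.0 if i in S else 0.0 for i in range(n)])
K_st = compress_kernel(K, D, n)
# Check against explicit concatenation of the selected volumes
X_cat = np.stack([X[o * n + np.array(S)].ravel() for o in range(m)])
assert np.allclose(K_st, X_cat @ X_cat.T)
```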

Support Vector Machine (SVM)

In this work, SVM specifically refers to support vector classification (SVC). SVC is also known as the maximum margin classifier (Cristianini and Shawe-Taylor, 2000), and has shown superior performance in practical applications with high-dimensional data. Motivated by statistical learning theory (Vapnik, 1995), the decision boundary is chosen so that it achieves the maximum separation between the two classes. The standard formulation of optimizing the hard-margin SVC is

minimize (1/2) w^T w
subject to t_i (w^T x_i + b) ≥ 1, i = 1,…,N.   (4)

This is often known as the primal form of the SVC optimization. Here, w is a vector of feature weights, t_i ∈ {−1, 1} is the class label, x_i is a vector of input features, and b is an offset. Although it is possible to solve this optimization in the primal form (Chapelle, 2007), the optimization is often solved in the dual formulation by introducing Lagrange multipliers:

maximize −(1/2) a^T H a + Σ_{i=1}^{N} a_i
subject to Σ_{i=1}^{N} t_i a_i = 0, and a_i ≥ 0, i = 1,…,N.   (5)

Here, a is the vector of Lagrange multipliers, and H is an N by N matrix defined by h_ij = t_i t_j x_i^T x_j, i, j = 1,…,N. More generally, we can replace x_i^T x_j by the kernel K(x_i, x_j). This formulation makes SVM a kernel algorithm. It can be optimized by standard quadratic programming, but is also often solved by more efficient algorithms, e.g. Sequential Minimal Optimization (SMO) (Platt, 1999). The computational complexity of SMO is between linear and quadratic in the training size, and depends on the sparsity of the data. Once the Lagrange multipliers a_i and the offset parameter b are computed, the prediction for a new testing sample can be made based on Eq. (1), and the decision boundary is the line where Eq. (1) yields zero. In this study, we used the implementation of hard-margin SVC in LIBSVM (Chang and Lin, 2001).
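The dual of Eq. (5) can be sketched with a simple projected-gradient ascent. For brevity the offset b is dropped, which removes the equality constraint Σ t_i a_i = 0; this is a common simplification for a sketch, and is not the paper's actual solver (the paper uses LIBSVM/SMO):

```python
import numpy as np

def hard_margin_svc_dual(K, t, lr=0.001, n_iter=5000):
    """Projected-gradient ascent on the SVC dual of Eq. (5), without offset b."""
    N = len(t)
    H = np.outer(t, t) * K                   # h_ij = t_i t_j K(x_i, x_j)
    a = np.zeros(N)
    for _ in range(n_iter):
        grad = 1.0 - H @ a                   # gradient of -1/2 a'Ha + sum(a)
        a = np.maximum(0.0, a + lr * grad)   # project onto the constraint a_i >= 0
    return a

rng = np.random.default_rng(4)
# Two linearly separable toy classes
X = np.vstack([rng.standard_normal((10, 3)) + 3.0,
               rng.standard_normal((10, 3)) - 3.0])
t = np.array([1.0] * 10 + [-1.0] * 10)
a = hard_margin_svc_dual(X @ X.T, t)
decision = (X @ X.T) @ (a * t)               # Eq. (1) with b = 0
assert np.all(np.sign(decision) == t)        # training points on the correct side
```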

Multi-class SVC

The original SVC is a binary classifier. To perform multi-class classification, the common approach in the neuroimaging field is to combine a number of binary classifiers (Hassabis et al., 2009; Mourao-Miranda et al., 2006). For situations with H classes, there are two commonly used approaches. One is the "one vs the rest" classifier. This works by training H classifiers, each of them trained with one class vs the other H−1 classes. The classification of a testing point is determined by the classifier that achieves the highest classification score, i.e. furthest away from the decision boundary toward the particular class. The other approach is the "one vs one" classifier, which works by introducing H(H−1)/2 classifiers. Each of the classifiers is trained with one class vs another class only. A testing point is assigned by majority vote; in other words, it is assigned to the most frequent class to which it is classified by all the classifiers. Ambiguous cases may occur in this approach. For example, if we have three classes 1, 2, and 3, then we will have three classifiers (1 vs 2, 1 vs 3, 2 vs 3). The testing point may be classified into class 1, class 3, and class 2 by the three classifiers respectively.
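The one-vs-one majority vote, including the ambiguous case described above, can be sketched as follows. The `one_vs_one_vote` helper is hypothetical, for illustration only:

```python
from collections import Counter

def one_vs_one_vote(pairwise_predictions):
    """Majority vote over H(H-1)/2 pairwise classifiers.
    `pairwise_predictions` maps each class pair to that classifier's prediction."""
    votes = Counter(pairwise_predictions.values())
    winner, count = votes.most_common(1)[0]
    ambiguous = list(votes.values()).count(count) > 1   # tie for the top vote count
    return winner, ambiguous

# Clear case: class 1 wins both of its pairwise contests
print(one_vs_one_vote({(1, 2): 1, (1, 3): 1, (2, 3): 2}))   # → (1, False)
# Ambiguous case from the text: each class receives exactly one vote
print(one_vs_one_vote({(1, 2): 1, (1, 3): 3, (2, 3): 2}))   # → (1, True); winner is arbitrary
```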

Alternatively, one can use error-correcting output codes to improve the performance (Dietterich and Bakiri, 1995). In the example of three classes, the error codes can be generated from six binary classifiers (1 vs 2, 1 vs 3, 2 vs 3, 1 vs (2+3), 2 vs (1+3), 3 vs (1+2)). This approach avoids ambiguous cases and aggregates more information. There are also classifiers that are capable of multi-class classification directly, for example multiclass Fisher's discriminant, multiclass logistic regression (Bishop, 2006; Girolami and Rogers, 2006), and decision tree methods (Quinlan, 1993).

Fig. 4. Training, predicting, and matching phases of the proposed classifier. In the training phase, three regression machines are each trained on all training fMRI volumes (input features), with the canonical-HRF-convolved design vector of one condition (pleasant, neutral, or unpleasant stimuli) as the regression target. In the predicting phase, the testing features (the fMRI volumes in one block, 14 volumes) yield a predicted temporal profile from each trained machine. In the matching phase, each predicted profile is compared against the canonical profile by the matching function, and the best match (e.g. "most likely to be unpleasant") determines the class.

Multi-class classifier using kernel regression (MCKR)

We design a multi-class classifier that is very similar to one-vs-the-rest classification for fMRI experiments. This method utilizes the temporal information without compressing it into a reduced kernel. Our approach breaks the classification into three stages: 1. training K regression models; 2. predicting the temporal profiles for a testing block; and 3. matching the K predicted profiles with the canonical profile, which can be computed by convolving one block with the canonical HRF (Friston et al., 2007) (see Fig. 4). This approach was originally inspired by the PBAIC (Chu et al., 2011), so we took a similar approach in the training phase; that is, we only changed the target variable but used the same input features. In our case, the experiment had three conditions in the design. We trained three different regression machines, where each of the machines took the same kernel generated from the fMRI volumes as input features, but the target variables were the corresponding regressors (columns) in the design matrix. In the predicting phase, the temporal profiles of the test block (multiple fMRI volumes) were predicted by all three regression machines. To assign class membership, we compared all the predicted profiles with the canonical profile. We tried both covariance and correlation as the metric to measure similarity. Both measures ignore the constant offset; covariance considers the magnitude of the prediction, while correlation ignores the magnitude. In practice, the results using correlation and covariance were not statistically different. The class was assigned to the condition for which the machine achieved the highest similarity between the predicted profile and the canonical profile. Although this




technique is introduced as a multi-class classifier, it works the same way for binary classification. This method was applied to data from block designs (dataset 1 and dataset 2) and a fast event-related design (dataset 3).
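A minimal sketch of the three stages, using kernel ridge regression (Eq. (6) in the next subsection) as the per-condition regression machine and correlation as the matching function. The toy "canonical profile" and data below are invented placeholders, not the canonical HRF of the paper:

```python
import numpy as np

def krr_fit(K, t, lam=1e5):
    """Kernel ridge regression dual weights: a = (K + lam*I)^{-1} t (Eq. 6)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), t)

def mckr_classify(K_train, targets, K_test, canonical):
    """Train one regression machine per condition, predict the temporal profile
    of the test block, and match each profile to the canonical profile."""
    scores = []
    for t in targets:                       # one condition regressor per machine
        a = krr_fit(K_train, t)
        profile = K_test @ a                # predicted temporal profile (Eq. 1, b = 0)
        scores.append(np.corrcoef(profile, canonical)[0, 1])
    return int(np.argmax(scores))           # condition whose profile matches best

# Toy data: 3 conditions, each active in one 10-volume window of 30 volumes
rng = np.random.default_rng(5)
canonical = np.concatenate([np.linspace(0, 1, 5), np.linspace(1, 0, 5)])  # stand-in shape
targets, X_train = [], 0.1 * rng.standard_normal((30, 3))
for c in range(3):
    t = np.zeros(30)
    t[10 * c:10 * (c + 1)] = canonical
    targets.append(t)
    X_train[:, c] += t                      # voxel c responds to condition c
X_test = 0.1 * rng.standard_normal((10, 3))
X_test[:, 1] += canonical                   # test block elicited by condition 1
pred = mckr_classify(X_train @ X_train.T, targets, X_test @ X_train.T, canonical)
```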

Kernel ridge regression (KRR)

Kernel ridge regression is the kernel formulation of ridge regression. Ridge regression is standard least-squares regression with a regularization term that penalizes the sum of squares of the weights (parameters):

w = argmin_w Σ_{i=1}^{N} (w^T x_i − t_i)² + λ w^T w,

where λ ≥ 0 is the regularization parameter. It has been shown that kernel ridge regression yields the same solution as the standard (primal) ridge regression (Shawe-Taylor and Cristianini, 2004). The kernel formulation gains computational efficiency when the dimension of the feature vectors is greater than the number of training samples, which is true for most neuroimaging data. The formulation of kernel ridge regression is

a = (K + λI)^{−1} t   (6)

where K is the kernel matrix and I is the identity matrix. Notice there is no offset parameter (b) in kernel ridge regression. Predictions can be computed from Eq. (1). The regularization parameter, λ, was fixed to 1e5, based on our experience and empirical results from the PBAIC (Chu et al., 2011). In this study, the overall accuracies varied by less than 3% when λ was in the range 1e2–1e5 (i.e. the regularization parameter did not have to be very precise, as it yielded very similar results over a very wide range of values). Because of the matrix inversion, the computational complexity of KRR is O(N³), where N is the number of examples in the training data.
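Eq. (6) can be sketched directly in NumPy. The check below confirms that the dual solution gives the same training predictions as primal ridge regression when a linear kernel is used; the toy sizes are invented, chosen so that there are more "voxels" than samples, as in neuroimaging:

```python
import numpy as np

def krr_fit(K, t, lam=1e5):
    """Dual weights for kernel ridge regression: a = (K + lam*I)^{-1} t (Eq. 6)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), t)

rng = np.random.default_rng(6)
N, d = 15, 200                      # more features than samples
X = rng.standard_normal((N, d))
t = rng.standard_normal(N)
lam = 10.0
a = krr_fit(X @ X.T, t, lam)                                    # dual (kernel) solution
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)  # primal ridge solution
assert np.allclose(X @ X.T @ a, X @ w_primal)                   # identical predictions
```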

Relevance vector regression (RVR)

Relevance Vector Regression (RVR) is a sparse kernel method formulated in the Bayesian framework (Tipping, 2000, 2001). Strictly speaking, RVR is not a standard kernel method, because it does not have the interchangeable formulation between primal and dual. Instead, RVR treats the kernel matrix as a set of linear basis functions. This implies that the input of RVR does not need to be a Mercer kernel (i.e. symmetric and positive-definite). However, the standard RVR often uses the kernel appended with one column of ones as the input, Φ = [1, K], with K denoting the kernel matrix and 1 denoting a column of ones. The likelihood function of the observations is modeled by a Gaussian distribution, p(t \mid a, \sigma^2) = N(t \mid \Phi a, \sigma^2 I), where t is the vector of observed target values, a is the weight vector, and σ² is the variance of the noise. The priors on the weights, a, are also modeled by zero-mean Gaussians, p(a \mid \alpha) = \prod_{i=0}^{N} N(a_i \mid 0, \alpha_i^{-1}). The solution involves optimizing the marginal likelihood (type-II maximum likelihood):

p(t \mid \alpha, \sigma^2) = \int p(t \mid a, \sigma^2)\, p(a \mid \alpha)\, da = N(t \mid 0, C) (7)

where C = \sigma^2 I + \Phi \Omega^{-1} \Phi^T is the covariance of the marginal likelihood, and Ω = diag(α₀, α₁, …, α_N) is the diagonal matrix of the inverses of the variances of the weights. The objective of the optimization is to find the hyper-parameters, α and σ², which maximize the "evidence" of the data (MacKay, 1992; Tipping, 2001). This is closely related to restricted maximum likelihood (ReML) and the estimation of covariance components in the statistical literature (Friston et al., 2002; Harville, 1977). The covariance matrix that maximizes the marginal likelihood can be obtained by iterative re-estimation. After finding the hyper-parameters, the posterior weights can be estimated by a = (\Phi^T \Phi + \sigma^2 \Omega)^{-1} \Phi^T t.

When maximizing the marginal likelihood, some of the α will grow very large, effectively resulting in the posterior probability of those weights, a, sharply peaking at zero. This property allows irrelevant columns of the basis functions to be pruned out, and is known as automatic relevance determination (ARD) (MacKay, 1995). As with all other kernel methods, the prediction can be computed from Eq. (1). The computational complexity of RVR is O(Ñ³) (Tipping, 2001), where Ñ is the average training size. In practice, the computation time depends on the implementation (e.g. stopping criteria).
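The iterative re-estimation of the hyper-parameters can be sketched with MacKay-style fixed-point updates. This is a simplified illustration, not Tipping's reference implementation; the update rules follow the standard ARD literature, and the iteration count, pruning threshold, and toy data are arbitrary assumptions:

```python
import numpy as np

def rvr_fit(Phi, t, n_iter=200, alpha_max=1e9):
    # Evidence maximization for RVR via fixed-point (MacKay) updates.
    # Phi: (N, M) basis matrix, e.g. [1, K]; returns posterior mean weights.
    N, M = Phi.shape
    alpha = np.ones(M)                       # diagonal of Omega
    sigma2 = 0.1 * np.var(t) + 1e-6          # initial noise variance
    keep = np.arange(M)                      # surviving basis functions
    for _ in range(n_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[keep]))
        mu = Sigma @ P.T @ t / sigma2        # posterior mean of the weights
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)  # effective parameter counts
        alpha[keep] = gamma / (mu ** 2 + 1e-12)
        resid = t - P @ mu
        sigma2 = resid @ resid / max(N - gamma.sum(), 1e-6)
        keep = keep[alpha[keep] < alpha_max]  # ARD: prune irrelevant columns
    w = np.zeros(M)
    P = Phi[:, keep]
    Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha[keep]))
    w[keep] = Sigma @ P.T @ t / sigma2
    return w

# Illustrative sparse toy problem (assumption): only two relevant columns
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(40), rng.standard_normal((40, 30))])
w_true = np.zeros(31)
w_true[3], w_true[10] = 2.0, -1.5
t = Phi @ w_true + 0.01 * rng.standard_normal(40)
w = rvr_fit(Phi, t)
pred = Phi @ w
```

As α_i grows past the pruning threshold, column i is dropped, which is the source of RVR's sparsity.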

One thing to be aware of is that all these machine learning methods (SVC, RVR, KRR) assume that the data points are independent and identically distributed (i.i.d.), whereas the actual noise in fMRI data is not i.i.d. This does not mean those algorithms will fail completely, but it does imply that such mis-specified models may give sub-optimal results (Chu et al., 2011; Rasmussen and Williams, 2006).

Weight map

Because we used a linear kernel, it was possible to project the weights back into voxel space for both SVC and kernel regression. The weight map from one training of the regression machine, either KRR or RVR, was computed by w = (a^T R X)^T, where X is the original input matrix, R is either a detrending or residual-forming matrix, and a contains the kernel weights obtained from the training. Unlike SVC or other binary classifiers, which generate only one map for binary classification, the regression machines generate one map per class. In other words, the regression machines generate two maps for a binary classification.
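A sketch of the projection w = (a^T R X)^T, assuming a linear kernel computed on the filtered data RX; the mean-removal matrix used for R below is an illustrative stand-in for the detrending or residual-forming matrix:

```python
import numpy as np

def weight_map(a, X, R=None):
    # Project dual (kernel) weights back to voxel space: w = (a^T R X)^T.
    # X: (N, V) data (N scans, V voxels); R: detrending / residual-forming matrix.
    if R is None:
        R = np.eye(X.shape[0])
    return a @ R @ X                      # shape (V,)

# Illustrative check of dual/primal equivalence for a linear kernel on RX
rng = np.random.default_rng(0)
N, V = 10, 30
X = rng.standard_normal((N, V))
R = np.eye(N) - np.ones((N, N)) / N       # mean-removal stand-in for R
a = rng.standard_normal(N)                # kernel weights from some training run
w = weight_map(a, X, R)
K = (R @ X) @ (R @ X).T                   # linear kernel on the filtered data
```

The check (RX)w = Ka confirms that applying the voxel-space weights to the filtered data reproduces the kernel-space predictions.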

To generate the map for SVC using a compacted kernel matrix, we can simply add the compaction matrix and rewrite the equation as w = (a^T P R X)^T. For spatial–temporal SVC, P is the matrix that selects the specific fMRI volumes for each block. Each time, only one spatial map can be generated, and the temporally successive maps (i.e. the spatial maps at different TRs) can be computed by fixing the kernel weight, a, and changing the matrix P to select different volumes.

Materials

Dataset 1

The dataset used in this study is the same as described by Hardoon et al. (2007) and Mourao-Miranda et al. (2006, 2007), and involves functional MRI scans from 16 male right-handed healthy college students (age 20–25 years). The study was performed in accordance with the local Ethics Committee of the University of North Carolina. The data were collected on a 3 T Allegra head-only MRI system (Siemens, Erlangen, Germany). The fMRI runs were acquired using a T2* sequence with 43 axial slices (slice thickness, 3 mm; gap between slices, 0 mm; TR=3 s; TE=30 ms; FA=80°; FOV=192 mm×192 mm; matrix, 64×64; voxel dimensions, 3 mm×3 mm×3 mm). In each run, 254 functional volumes were acquired.

Experimental design
The experimental stimuli were presented in a standard block design. It was a passive experiment with visual stimuli. The visual stimuli were categorized into three different active conditions: viewing unpleasant (dermatological diseases), neutral (people) and pleasant images (girls in bikinis). Each active condition was followed by a resting condition (fixation) of equal duration. In each run, there were six blocks of the active condition (each consisting of seven image volumes) alternating with resting (fixation) over seven image volumes. Six blocks of each of the three stimuli were presented in random order. There was only one run for each subject.

Pre-processing
The data were pre-processed using SPM5 (Wellcome Trust Centre for Neuroimaging, London, UK). All scans were first realigned and resampled to remove effects due to subject motion. This had the effect of increasing the similarity among scans, and compacting the low-dimensional manifolds on which the images lay.


566 C. Chu et al. / NeuroImage 58 (2011) 560–571

To further increase the signal-to-noise ratio, those voxels that were, a priori, considered non-informative were removed. BOLD signal change is generally believed to occur mainly in gray matter, because its major cause should be the local neuronal activity (Logothetis et al., 2001). Empirical results from the Pittsburgh Brain Activity Interpretation Competition (PBAIC) 2007 also showed that masking out non-gray matter voxels improves classification performance (Chu et al., 2011). Masks defining gray matter were generated for each subject by segmenting the average fMRI scan of each subject using the unified segmentation implemented in SPM5 (Ashburner and Friston, 2005). This also accelerates the computation of the kernel matrices, as only about 20% of the whole image is used.

To perform multi-subject prediction, all scans were also non-linearly aligned to their population average using DARTEL (Ashburner, 2007), which is more accurate than conventional SPM spatial normalization (Bergouignan et al., 2009; Klein et al., 2009). After estimating the inter-subject alignment by matching tissue class images together, the warping parameters were used to transform each subject's fMRI volumes. An additional smoothing with an 8 mm FWHM Gaussian kernel was applied. The amount of smoothing was chosen to be the same as that in the previous studies (Mourao-Miranda et al., 2007; Mourao-Miranda et al., 2006), allowing the results to be compared (see Fig. 1). Although a recent study (Op de Beeck, 2010) suggested that spatial smoothing does not affect decoding performance, our empirical findings from PBAIC 2007 (Chu et al., 2011) showed that spatial smoothing can improve accuracies for some subjects. Temporal normalization was also applied to each voxel for each subject, which involved dividing the voxel time courses by their standard deviations. This procedure minimized the variation of temporal scale among different subjects.
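The temporal normalization step described above amounts to one division per voxel; a minimal sketch, in which the array shapes and the guard for constant voxels are assumptions:

```python
import numpy as np

def temporal_normalize(ts):
    # Divide each voxel's time course by its standard deviation (per subject).
    # ts: (T, V) array of T time points for V voxels (shapes are assumptions).
    sd = ts.std(axis=0)
    sd[sd == 0] = 1.0        # guard against constant (e.g. masked-out) voxels
    return ts / sd

# Illustrative data: 254 volumes, 100 voxels with heterogeneous temporal scales
rng = np.random.default_rng(0)
ts = rng.standard_normal((254, 100)) * rng.uniform(0.5, 5.0, 100)
z = temporal_normalize(ts)
```

After normalization every non-constant voxel has unit temporal standard deviation, equalizing the temporal scale across voxels and subjects.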

Cross-validation

To compare the accuracies of MCKR with the multi-class SVM in previous studies, we performed leave-one-block-out cross-validation (LOBOCV) within each subject separately. The fMRI volumes for LOBOCV were in the native space. In each LOBOCV trial, one block (active+rest) was removed from the dataset as the testing block, i.e. one volume for SVM after compression or 14 volumes for MCKR. The training used the remaining dataset, and the label of the testing block was predicted with the parameters obtained from the training. The averaged accuracy was then calculated by averaging the predicted accuracies from all subjects.
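A leave-one-block-out split can be generated as follows; the counts used in the demonstration (18 blocks of 14 volumes, matching the active+rest block structure described above) are illustrative:

```python
import numpy as np

def lobo_splits(n_blocks, vols_per_block):
    # Yield (train, test) volume indices for leave-one-block-out cross-validation.
    idx = np.arange(n_blocks * vols_per_block).reshape(n_blocks, vols_per_block)
    for b in range(n_blocks):
        yield np.delete(idx, b, axis=0).ravel(), idx[b]

# 18 active blocks (6 per condition x 3 conditions) of 14 volumes each
splits = list(lobo_splits(18, 14))
```

Each trial holds out all 14 volumes of one block, so the test block never contributes to training.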

We also performed leave-one-subject-out cross-validation (LOSOCV). The fMRI volumes for LOSOCV were in the population-averaged space (by DARTEL). In each LOSOCV trial, one subject was left out as the testing sample, and the data from the remaining 15 subjects were used to train the classifier. After the training, all 18 labels were predicted for the testing subject. The averaged accuracy was then calculated.

Dataset 2

Details of this dataset are described fully in the thesis of Chiu (2010). Briefly, fifteen neurologically intact, right-handed, healthy adults (four males, mean age=21.3 years) were recruited from the Johns Hopkins University community. All participants completed and signed an informed consent approved by the Johns Hopkins Medicine Institutional Review Board. Whole brain functional data were acquired with 40-slice echo-planar images (EPIs) in an ascending sequence, TR=2000 ms, TE=30 ms, flip angle=70°, scan time=314 s (one run), matrix=64×64, slice thickness=3 mm, SENSE factor=3, yielding 3-mm isotropic voxels.

Experimental design
The original experiment had 15 runs, but we only used the three runs of the sustained attention task, which are relevant to fMRI decoding. At the beginning of each sustained attention run, participants were instructed to start monitoring either the left or the right target stream of characters for a fixed block of 20 s (10 volumes), during which they responded to the appearance of "5" in the attended stream by pressing both buttons. The initial stream (left or right) was randomly selected for each run. At the end of each block, a shift cue "X" appeared in the currently attended stream, instructing participants to shift their attention to the other side (i.e., left to right and vice versa). There were a total of 14 shift cues, creating seven blocks of sustained attention to the left and seven to the right in each run, with an additional block of attention to the initially attended location. In our decoding task, the last block was omitted to obtain an even number of conditions.

In summary, there were 15 subjects, and three runs for each subject. There were two conditions (left or right) in the experiment, and each run had seven attend-left blocks and seven attend-right blocks. The length of each block was 10 volumes (20 s).

Pre-processing
The data were preprocessed using BrainVoyager QX v1.10 software (Brain Innovation, Maastricht, The Netherlands). Functional data were slice-time corrected (with cubic spline interpolation), motion corrected (rigid-body translation and rotation with trilinear interpolation), and then temporally high-pass filtered (3 cycles per run). No other spatial smoothing or normalization was performed. Simple intensity-based masks were used to exclude non-brain tissues.

Cross-validation

Although binary classification (left or right) was performed on this dataset, for consistency of naming we still call the proposed method MCKR for this dataset and dataset 3. Because there are multiple runs in this dataset, we performed leave-one-run-out (LORO) cross-validation, which is more conservative, for each subject. In each LOROCV trial, one run was left out as the testing sample, and the data from the remaining two runs were used to train the classifier. After the training, all 14 testing blocks in the test run were predicted. The averaged accuracy was then calculated by averaging over the predictive accuracies of all 15 subjects.

Dataset 3

This dataset was used and described by Kriegeskorte et al. (2008) and Misaki et al. (2010). Only a part of the original dataset was used to demonstrate the proposed method. Four healthy subjects participated in this event-related experiment. The data were collected on a 3 T GE HDx MRI scanner with a receive-only whole brain surface coil array (16 elements). Whole brain functional data were acquired with 25-slice echo-planar images (EPIs) with SENSE (acceleration factor=2), TR=2000 ms, TE=30 ms, scan time=544 s (one run), matrix=128×96, slice thickness=2 mm, SENSE (factor=3), yielding 1.95 mm×1.95 mm×3 mm voxels. Only the occipital and temporal lobes were covered.

Experimental design
There were a total of six runs in the experiment. In each run, 96 color pictures of objects on a gray background were shown. There were 48 inanimate objects and 48 animate objects. The duration of each stimulus was 300 ms. Each stimulus was shown only once in each run, in random order. The stimulus onset asynchrony was random (4 s, 8 s, 12 s, 16 s). The decoding task was to discriminate animate stimuli from inanimate stimuli.

Pre-processing
The data were slice-time corrected and realigned using BrainVoyager QX. The time series in each voxel were normalized to percent signal change. No other spatial smoothing or normalization was performed (Kriegeskorte et al., 2008). Linear detrending was applied.


Table 1
Within-subject classification accuracy (%) of cross-validation.

Multi-class classifiers     Average accuracy   Unpleasant   Neutral     Pleasant
MCKR with RVR               94.0 (1.8)         94.8 (2.4)   91.7 (3.3)  94.8 (2.4)
MCKR with KRR               93.8 (1.8)         93.8 (2.9)   92.7 (3.3)  94.8 (2.4)
SVC (average)               86.5 (2.8)         83.3 (2.8)   88.5 (2.8)  87.5 (3.0)
SVC (beta-map)              87.5 (2.2)         83.3 (4.2)   90.6 (3.0)  88.5 (2.8)
SVC (spatial–temporal)      66.3 (4.5)         66.7 (5.0)   62.5 (6.8)  69.8 (6.0)

Values inside the parentheses are the standard error (%).


A mask of the human inferior temporal cortex (hIT) with 1000 voxels was computed from the overlapping voxels of an anatomical mask and a thresholded statistical map from a separate experiment (i.e. an element-wise AND operation on the anatomical mask and the thresholded t-map).

Cross-validation

We also applied LOROCV to this dataset. We followed the procedure described in Misaki et al. (2010) to perform SVC classification. In each LOROCV trial, one run was left out as the testing run, and the data from the remaining five runs were used to train the classifier. The beta-map estimates were computed from the two sets independently.

For MCKR, although dataset 3 is a fast event-related experiment, we modeled each stimulus as a small block, and performed the procedure in the same way as for a block-design experiment. In the prediction phase, we used image volumes from the onset to 14 s after the onset (seven volumes) for temporal profile matching. We predicted the temporal profile within the 14 s window (based on the HRF profile) per stimulus, and then applied the matching function to make one classification for that temporal profile. In other words, we still made one prediction per stimulus per run using the proposed method, which gave exactly the same number of test estimates as the SVM.
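The per-stimulus decision step, matching each condition's predicted temporal profile against a canonical profile, might be sketched as below. The covariance matching function follows the paper's description; the half-sine stand-in for the canonical HRF-convolved profile and the toy profiles are purely illustrative:

```python
import numpy as np

def match_condition(profiles, canonical):
    # Covariance between each condition's predicted temporal profile and the
    # canonical profile; the predicted class is the condition with the largest score.
    scores = [np.cov(p, canonical)[0, 1] for p in profiles]
    return int(np.argmax(scores)), scores

# Illustrative 14-volume window; the half-sine stands in for an HRF template
t = np.arange(14)
canonical = np.sin(np.pi * t / 13.0)
profiles = [0.5 * canonical,      # condition 0: responds (scaled template)
            -canonical,           # condition 1: deactivation
            np.zeros(14)]         # condition 2: no response
label, scores = match_condition(profiles, canonical)
```

Because covariance is invariant to the profiles' means, the score reflects the shape of each predicted profile rather than its baseline.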

Results

Classification performance was compared between the different methods using paired t tests (Misaki et al., 2010). We found that the choice of covariance or correlation as the matching function made no significant difference. We therefore chose covariance as the matching function for all datasets and subjects because it is "simpler" (i.e. correlation is covariance normalized by the standard deviations, so more operations are needed to compute correlation than covariance). Following Occam's razor, when a simpler procedure yields equivalent performance, we should prefer the simpler one. Although SVC with beta maps yielded higher mean accuracies than SVC with averages, the predictive accuracies of the two methods were not significantly different from each other. MCKR using RVR performed slightly better than MCKR using KRR, but these two methods were also not significantly different. There were two variants

Fig. 5. Leave-one-block-out cross-validation accuracies (within-subject cross-validation) averaged over 16 subjects with five different classifiers for different stimulus conditions. Error bars show the standard error.

of MCKR and three variants of SVC, so to avoid multiple comparisons, we compared the worst-performing MCKR (MCKR-KRR) with the best-performing SVC (SVC beta map). Although this approach was very conservative, we found that MCKR performed significantly better than SVC in all three datasets (p=0.008, p=0.0006, p=0.005, respectively). The original cross-validation accuracies in each run and each subject are shown in the supplementary material.

Dataset 1

Within-subject classification
The averaged accuracy of LOBOCV was very high when the proposed method, MCKR, was applied. Perfect classification (100% accuracy) was obtained for 6 subjects (2 subjects by SVC beta map) and an average of 94% accuracy was achieved across the 16 subjects. We achieved slightly higher accuracies for SVC compared with the previously published work (Mourao-Miranda et al., 2006). The accuracies of "one vs the rest" multi-class SVC were around 3% higher than those of the "one vs one" multi-class SVM, so we only present the results from "one vs the rest" SVM in this paper, because we wanted to compare the best SVC method with the worst MCKR method. The best classification accuracy for SVC was 87.5% using the beta map; the average of volumes resulted in 86% accuracy, and spatial–temporal had the worst performance with 66% accuracy (Table 1 and Fig. 5). Despite using the best SVC results, the classification accuracy from multi-class SVC with beta maps was significantly lower than that of the worst-performing MCKR (p=0.008). For multi-class classification, it is not possible to compute the receiver operating characteristic (ROC). We therefore present the confusion matrix of MCKR with RVR, which is similar to the results from MCKR with KRR, and the confusion matrix of multi-class SVC compressed by "average of volume", which is similar to the result from using beta-maps.

In addition to performing the LOBOCV, we also performed cross-validation (CV) trials with reduced numbers of training blocks. In each of the CV trials, we randomly sampled M (a number ranging from one to five) blocks, which were used for training the classifiers. For each selected M, we repeated the sampling and CV 40 times. Surprisingly, even when we used only one training block, all methods still achieved classification accuracies above chance level (33%) (Fig. 6). The results also showed that MCKR always outperformed multi-class SVC, regardless of the number of training samples (from a training size of one block to five blocks, p=0.49, p=0.017, p=0.065, p=0.063, p=0.025, respectively). MCKR with only two training blocks could perform as well as SVC with four training blocks (p=0.3, no statistical difference).

Inter-subject classification
The accuracy of multi-class SVC improved greatly in LOSOCV, and the average accuracy was 95% regardless of compaction method. However, the averaged accuracy of MCKR decreased to 91% (Fig. 7 and Table 3). Despite the higher accuracy of SVC, paired t tests showed no significant difference between MCKR and SVC (p=0.55). There were two subjects that had extremely low accuracy with MCKR, whereas the



Fig. 6. Leave-one-block-out cross-validation accuracies (within-subject cross-validation) with different numbers of training blocks. Each accuracy was calculated by averaging 40 cross-validation trials with randomly selected subsets. Error bars show the standard error.

Table 2
Confusion matrix of MCKR with RVR and multi-class SVC compressed by "average of volume", within-subject (LOBOCV).

Predicted class (%), MCKR with RVR
                         Actual
Predicted      Unpleasant   Neutral   Pleasant
  Unpleasant   93.8         3.1       2.1
  Neutral      3.1          91.7      3.1
  Pleasant     3.1          5.2       94.8

Predicted class (%), multi-class SVC (average of volume)
                         Actual
Predicted      Unpleasant   Neutral   Pleasant
  Unpleasant   83.3         4.2       8.3
  Neutral      5.2          88.5      4.1
  Pleasant     11.5         7.3       87.5

Table 3
Inter-subject classification accuracy (%) of cross-validation.

Multi-class classifiers     Average accuracy   Unpleasant   Neutral     Pleasant
MCKR with RVR               91.3 (2.2)         94.8 (2.4)   85.4 (4.6)  93.8 (4.1)
MCKR with KRR               91.0 (2.2)         93.8 (4.1)   85.4 (4.6)  93.8 (2.0)
SVC (average)               95.5 (1.5)         93.8 (2.0)   95.8 (2.8)  95.8 (1.8)
SVC (beta-map)              95.5 (1.4)         93.8 (2.0)   96.9 (2.2)  95.8 (2.3)
SVC (spatial–temporal)      95.1 (1.3)         95.8 (2.3)   96.9 (1.6)  92.7 (2.5)

Values inside the parentheses are standard error (%).

Table 4
Confusion matrix of MCKR with RVR and multi-class SVC compressed by "average of volume", inter-subject (LOSOCV).

Predicted class (%), MCKR with RVR
                         Actual
Predicted      Unpleasant   Neutral   Pleasant
  Unpleasant   94.8         9.4       6.3
  Neutral      0            85.4      0
  Pleasant     5.2          5.2       93.7

Predicted class (%), multi-class SVC (average of volume)
                         Actual
Predicted      Unpleasant   Neutral   Pleasant
  Unpleasant   93.7         2.1       1.1
  Neutral      4.2          95.8      3.1
  Pleasant     2.1          3.1       95.8


other 14 subjects had similar results for both methods. Also, MCKR achieved equivalent performance to multi-class SVC for classifying pleasant and unpleasant stimuli. The reduction in averaged accuracy came from a significant drop in the ability of MCKR to classify neutral stimuli (85%), especially in those two subjects (see Supplementary material). The confusion matrix of RVR (Table 4) from LOSOCV shows a strong bias toward misclassifying a neutral stimulus as unpleasant.

We also performed the LOSOCV with reduced numbers of training subjects (Fig. 8). In each of the CV trials, we randomly sampled M (1, 2, 3, 4, 5, 7, 10, 13) training subjects. For each selected M, we repeated the sampling and CV 40 times. The accuracies of MCKR with KRR and SVC with beta-maps reached a plateau when the number of training subjects was seven or more. The accuracies of SVC with spatial–temporal compaction reached the plateau much later, when 13 training subjects were used. SVC was only significantly better than MCKR when training with 10 and 13 subjects (p<0.05).

The major clusters in the weight maps generated by training on all 16 subjects (Fig. 9) were very prominent. Maps generated from KRR and SVC had very high correlations for both pleasant and unpleasant stimuli (r=0.76 and r=0.80, respectively), but the correlation was lower (r=0.69) between the KRR and SVC maps for neutral stimuli. This may also explain the comparable performance of MCKR and SVC in classifying pleasant and unpleasant stimuli, but their different performance in classifying neutral stimuli.

Dataset 2

We only performed LOROCV for SVC using the beta map and MCKR with KRR. This experiment had a very high contrast-to-noise ratio

Fig. 7. Leave-one-subject-out cross-validation (inter-subject cross-validation) accuracies averaged over 16 subjects with five different classifiers for different stimulus conditions. Error bars show the standard error.

(CNR). The averaged accuracy of LOROCV was very high when the proposed method, MCKR, was applied, achieving an average accuracy of 96% (Fig. 10). Perfect classification (100% accuracy) was obtained for 32 runs out of 45 (15 subjects times 3 runs per subject), compared with 23 runs using the SVC beta map approach. SVC also performed well, achieving an average accuracy of 92%. However, paired t tests showed MCKR to be significantly better than SVC (p=0.006). There was no bias toward predicting one particular class.



Fig. 8. Leave-one-subject-out cross-validation accuracies with different numbers of training subjects. Each accuracy was calculated by averaging 40 cross-validation trials with randomly selected subsets. Error bars show the standard error.

Fig. 10. Leave-one-run-out cross-validation (within-subject cross-validation) accuracies averaged over 15 subjects (3 runs per subject) with MCKR (KRR) and SVC (beta map) for different directions of visual attention. Error bars show the standard error.

Fig. 11. Leave-one-run-out cross-validation (within-subject cross-validation) accuracies averaged over 4 subjects (6 runs per subject) with MCKR (KRR) and SVC (beta map) for different categories of visual objects. Error bars show the standard error.


Dataset 3

This dataset had a relatively lower CNR because of the fast event-related design. Only LOROCV for SVC using the beta-map approach and MCKR with KRR was employed. MCKR achieved 77% accuracy and SVC achieved 72% accuracy (Fig. 11). MCKR performed significantly better than SVC (p=0.0006), which was the lowest p-value among the three datasets. We also found no bias toward predicting one particular stimulus category.

Discussion

We developed a novel multi-class classifier using kernel regression methods. Our MCKR method uses all the information in the fMRI volumes during training instead of compacting them into a smaller set. The results demonstrated that our MCKR method was consistently better than multi-class or binary SVC under LOBOCV or LOROCV within each

Fig. 9. The weight maps (feature weights) from training on all volumes of all subjects. The top row shows the feature maps generated from SVC and the bottom row shows the feature maps generated from KRR. From left to right, the maps were generated from training pleasant stimuli vs the other two, neutral vs the other two, and unpleasant vs the other two, respectively.

individual subject. The similar performance of RVR and KRR implies that the superiority of our method comes mainly from the architecture of the procedure (Fig. 4) rather than the specific choice of regression algorithm. Other methods, such as support vector regression or the elastic net, should also perform similarly. The choice of matching function also made no statistical difference, but we used covariance in the matching function for the reasons described earlier. This implies that the predicted profile contains information mainly in the shape of the profile. We suggest that MCKR performs better than SVC because of the greater number of training samples and the ability to learn the subject's individual response to each condition. With more training samples, the solutions from the kernel machines (either regression or classification) would




have higher degrees of freedom, leading to more accurate estimates of the noise variance. From Fig. 6, we can see that as the number of training samples increases, the accuracy gap between MCKR and SVC with beta-maps or averages becomes smaller. Perhaps with more repeats and a longer experiment, SVC may achieve the same performance as MCKR.

MCKR did not show a strong prediction bias, but SVC seemed to have a strong bias toward mistaking the unpleasant condition for the pleasant one (Table 2). This result shows that the temporally compressed fMRI patterns under unpleasant stimuli were more similar to the patterns under pleasant stimuli than to those under neutral stimuli. Conversely, the unbiased result from MCKR implies that the patterns for the three stimuli are equally distinct when more temporal information is used.

After evaluating maps generated from each individual subject and maps from training on all subjects, we found that although the maps generated by the regression machines and SVC differ, the main clusters in the maps are very similar. The shapes and sizes of the clusters are different, but the locations are the same. In other words, one would find the clusters in the same brain regions using either map, and would probably make the same inferences about brain function using either method.

The improvement of SVC in LOSOCV (i.e. inter-subject cross-validation) can be explained by the increase in training samples (Fig. 9). We suspect that the reason the MCKR machine did not work as well between subjects, as opposed to within single subjects, was the inter-subject variability of activation extent (different numbers of activated and deactivated voxels) and strength. The MCKR approach is more sensitive to inter-subject variability than SVC with temporal compression, because the regression machines have least-squares cost functions, and inter-subject MCKR assumes that all subjects have the same magnitude of multivariate activation. Although we normalized every voxel of each subject, some inter-subject variance still remained. For example, some subjects had uniform activation across the three stimuli, whereas some had relatively low activation for the neutral stimuli. Also, some subjects had more activated voxels (bigger clusters) than other subjects. Consider a subject with slightly more activated voxels in one condition: the multivariate response to that condition for that subject would be stronger than for others. This stronger response would have less impact on the SVM, because such effects would only move the training samples of that class/condition further away from the decision boundary, so they would not be chosen as support vectors. Such confounds would, however, sabotage the training of MCKR with its least-squares objective function. If we ignored the neutral stimuli, MCKR and SVC would perform equally well. On the other hand, if we could apply feature selection to limit the number of voxels and regions, the prediction performance of MCKR might also be as good as that of SVC.

The experimental design of dataset 2 was very similar to that of dataset 1. The duration of the blocks in both experiments was around 20 s. The only difference was that the experiment in dataset 2 had no resting condition. Because we applied LOROCV, the prediction should not be confounded by potential temporal correlations in the training. The empirical results from dataset 2 further support the proposed method.

Although dataset 3 was from an event-related experiment, the proposed method still performed well. We applied LOROCV to prevent confounds due to overlapping events in the training. This dataset yielded the lowest p-value, probably because there was no ceiling effect. In experiments with high CNR (datasets 1 and 2), SVC could often achieve 100% classification.

In addition to introducing MCKR, we also presented an efficient method to perform temporal compaction and reformulation (Fig. 3). Such a formulation can bring a different perspective to the process. In our experimental design, because of the long blocks, the block operator for generating beta-maps is very similar to that for generating the average over volumes (Fig. 3). This is why both compressions resulted in similar accuracy. Both the beta-map and the average of volumes employ off-diagonal information from the kernel, whereas the spatial–temporal approach only uses the diagonal components. The inferior performance of the spatial–temporal approach with few training samples may imply that the temporal covariance between neighboring volumes contains useful information for discriminating the conditions. This kernel formulation of temporal manipulation can also be extended to frequency decomposition, for example by applying a discrete cosine transform matrix. The effect of temporal filtering on fMRI decoding efficiency can then be studied. Also, in a fast event-related experiment, it is often suggested to average over a few trials to create a beta-map (i.e. multiple trials in one column of the design matrix). The number of trials to average is often arbitrary, and when a new number is selected, the general linear model needs to be re-fitted. It is more computationally efficient to apply the matrix operation to the kernel matrix directly using the formulation we proposed.
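The equivalence of compacting the kernel directly and averaging the volumes before computing the kernel can be verified in a few lines. The block-averaging operator P used here is an illustrative instance of the compaction matrices discussed above, and the array sizes are assumptions:

```python
import numpy as np

def block_average_operator(n_blocks, vols_per_block):
    # Compaction matrix P: one row per block, averaging that block's volumes.
    P = np.zeros((n_blocks, n_blocks * vols_per_block))
    for b in range(n_blocks):
        P[b, b * vols_per_block:(b + 1) * vols_per_block] = 1.0 / vols_per_block
    return P

# 6 blocks of 7 volumes, 100 voxels (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.standard_normal((42, 100))
K = X @ X.T                          # full (uncompacted) linear kernel
P = block_average_operator(6, 7)
K_compact = P @ K @ P.T              # compact the kernel directly
K_direct = (P @ X) @ (P @ X).T       # average the volumes first, then the kernel
```

Because PKPᵀ = (PX)(PX)ᵀ for a linear kernel, changing the compaction scheme (e.g. the number of trials averaged) only requires re-applying P to the cached kernel, not re-fitting a general linear model on the full data.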

Regarding the weight maps, we make no attempt at inference about the significance of any region with high weights on a weight map, nor do we single out any particular region. This is because it is not the intent of the proposed method to draw conclusions about regionally specific effects. The proposed method was not designed to generate comparable maps, because the final classification also depends on the matching between the template (canonical profile) and the predicted temporal profile. If the template is changed, the same "trained regressors and maps" may yield different classification accuracies. The matching function makes the relationship between voxel weights and the classification results non-linear.

In summary, our proposed method, MCKR, performed significantly better than the conventional multiclass SVC and binary SVC when the number of training samples was low and the training and testing were performed within the same subject. Many fMRI studies have performed decoding only at the level of a single subject (Hassabis et al., 2009; Haynes et al., 2007). Our method is ideal for experiments with limited repeats of stimuli, which implies that one can design an experiment with fewer repeats and more conditions, or shorten the duration of experiments. Also, if the prediction system is to be trained online, e.g. in real-time fMRI decoding, MCKR should be the recommended option, especially when nearly-perfect classification is needed. Using the canonical HRF would result in sub-optimal solutions, and the commonly used classification approach with temporal compacting also suffers from the same disadvantage. One direction for future research is to try to learn the optimal HRF profile from the training set.

Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH, NIMH. JMM was funded by a Wellcome Trust Career Development Fellowship. JA is funded by the Wellcome Trust. The authors thank Prof. Michael Brammer for providing the first dataset. The authors thank Prof. Steven Yantis for providing the second dataset, which was supported by NIH grant R01-DA013165.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.neuroimage.2011.06.053.

References

Ashburner, J., 2007. A fast diffeomorphic image registration algorithm. Neuroimage 38, 95–113.

Ashburner, J., Friston, K.J., 2005. Unified segmentation. Neuroimage 26, 839–851.

Bergouignan, L., Chupin, M., Czechowska, Y., Kinkingnehun, S., Lemogne, C., Le Bastard, G., Lepage, M., Garnero, L., Colliot, O., Fossati, P., 2009. Can voxel based morphometry, manual segmentation and automated segmentation equally detect hippocampal volume differences in acute depression? Neuroimage 45, 29–37.

Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.

Chang, C.C., Lin, C.J., 2001. LIBSVM: a library for support vector machines.

Chapelle, O., 2007. Training a support vector machine in the primal. Neural Comput. 19, 1155–1178.

Chiu, Y.C., 2010. Moment-by-moment tracking of attentional fluctuation in the human brain. Department of Psychological & Brain Sciences, Johns Hopkins University.


C. Chu et al. / NeuroImage 58 (2011) 560–571

Chu, C., Bandettini, P., Ashburner, J., Marquand, A., Kloeppel, S., 2010a. Classification of neurodegenerative diseases using Gaussian process classification with automatic feature determination. IEEE 17–20.

Chu, C., Ni, Y., Tan, G., Saunders, C.J., Ashburner, J., 2011. Kernel regression for fMRI pattern prediction. Neuroimage 56, 662–673.

Cox, D.D., Savoy, R.L., 2003. Functional magnetic resonance imaging (fMRI) “brain reading”: detecting and classifying distributed patterns of fMRI activity in human visual cortex. Neuroimage 19, 261–270.

Craddock, R.C., Holtzheimer III, P.E., Hu, X.P., Mayberg, H.S., 2009. Disease state prediction from resting state functional connectivity. Magn. Reson. Med. 62, 1619–1628.

Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.

Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D., 2005. Classifying spatial patterns of brain activity with machine learning methods: application to lie detection. Neuroimage 28, 663–668.

De Martino, F., Valente, G., Staeren, N., Ashburner, J., Goebel, R., Formisano, E., 2008. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns. Neuroimage 43, 44–58.

Dietterich, T.G., Bakiri, G., 1995. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286.

Eger, E., Ashburner, J., Haynes, J.D., Dolan, R.J., Rees, G., 2008. fMRI activity patterns in human LOC carry information about object exemplars within category. J. Cogn. Neurosci. 20, 356–370.

Friston, K.J., Glaser, D.E., Henson, R.N., Kiebel, S., Phillips, C., Ashburner, J., 2002. Classical and Bayesian inference in neuroimaging: applications. Neuroimage 16, 484–512.

Friston, K.J., Ashburner, J., Kiebel, J.S., Nichols, T.E., Penny, W.D., 2007. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press.

Friston, K., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., Ashburner, J., 2008. Bayesian decoding of brain images. Neuroimage 39, 181–205.

Fu, C.H.Y., Mourao-Miranda, J., Costafreda, S.G., Khanna, A., Marquand, A.F., Williams, S.C.R., Brammer, M.J., 2008. Pattern classification of sad facial processing: toward the development of neurobiological markers in depression. Biol. Psychiatry 63, 656–662.

Girolami, M., Rogers, S., 2006. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput. 18, 1790–1817.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.

Hardoon, D.R., Mourao-Miranda, J., Brammer, M., Shawe-Taylor, J., 2007. Unsupervised analysis of fMRI data using kernel canonical correlation. Neuroimage 37, 1250–1259.

Harville, D.A., 1977. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 72, 320–338.

Hassabis, D., Chu, C., Rees, G., Weiskopf, N., Molyneux, P.D., Maguire, E.A., 2009. Decoding neuronal ensembles in the human hippocampus. Curr. Biol.

Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P., 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425–2430.

Haynes, J.D., Rees, G., 2006. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci. 7, 523–534.

Haynes, J.D., Sakai, K., Rees, G., Gilbert, S., Frith, C., Passingham, R.E., 2007. Reading hidden intentions in the human brain. Curr. Biol. 17, 323–328.

Klein, A., Andersson, J., Ardekani, B.A., Ashburner, J., Avants, B., Chiang, M.C., Christensen, G.E., Collins, D.L., Gee, J., Hellier, P., Song, J.H., Jenkinson, M., Lepage, C., Rueckert, D., Thompson, P., Vercauteren, T., Woods, R.P., Mann, J.J., Parsey, R.V., 2009. Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration. Neuroimage 46, 786–802.

Kriegeskorte, N., Mur, M., Ruff, D.A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., Bandettini, P.A., 2008. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60, 1126–1141.

LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X., 2005. Support vector machines for temporal classification of block design fMRI data. Neuroimage 26, 317–329.

Logothetis, N.K., Pauls, J., Augath, M., Trinath, T., Oeltermann, A., 2001. Neurophysiological investigation of the basis of the fMRI signal. Nature 412, 150–157.

Mackay, D.J.C., 1992. The evidence framework applied to classification networks. Neural Comput. 4, 720–736.

MacKay, D.J.C., 1995. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Netw.: Comput. Neural Syst. 6, 469–505.

Misaki, M., Kim, Y., Bandettini, P.A., Kriegeskorte, N., 2010. Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. Neuroimage 53, 103–118.

Mitchell, T.M., Hutchinson, R., Niculescu, R.S., Pereira, F., Wang, X., Just, M., Newman, S., 2004. Learning to decode cognitive states from brain images. Mach. Learn. 57, 145–175.

Mourao-Miranda, J., Bokde, A.L., Born, C., Hampel, H., Stetter, M., 2005. Classifying brain states and determining the discriminating activation patterns: Support Vector Machine on functional MRI data. Neuroimage 28, 980–995.

Mourao-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M., 2006. The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data. Neuroimage 33, 1055–1065.

Mourao-Miranda, J., Friston, K.J., Brammer, M., 2007. Dynamic discrimination analysis: a spatial–temporal SVM. Neuroimage 36, 88–99.

Op de Beeck, H.P., 2010. Against hyperacuity in brain reading: spatial smoothing does not hurt multivariate fMRI analyses? Neuroimage 49, 1943–1948.

Platt, J., 1999. Sequential minimal optimization: a fast algorithm for training support vector machines. Advances in Kernel Methods – Support Vector Learning 208.

Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA.

Rasmussen, C.E., Williams, C.K.I., 2006. Gaussian Processes for Machine Learning. The MIT Press.

Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Strother, S., La Conte, S., Kai Hansen, L., Anderson, J., Zhang, J., Pulapura, S., Rottenberg, D., 2004. Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis. Neuroimage 23 (Suppl. 1), S196–S207.

Tipping, M.E., 2000. The Relevance Vector Machine. Advances in Neural Information Processing Systems 12.

Tipping, M.E., 2001. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, NY.

Vapnik, V.N., 1998. Statistical Learning Theory. Wiley.

Weiskopf, N., Sitaram, R., Josephs, O., Veit, R., Scharnowski, F., Goebel, R., Birbaumer, N., Deichmann, R., Mathiak, K., 2007. Real-time functional magnetic resonance imaging: methods and applications. Magn. Reson. Imaging 25, 989–1003.