Multivariate Pattern Classification

Thomas Wolbers, Space and Aging Laboratory, Centre for Cognitive and Neural Systems

SPM Course Edinburgh 2010
Outline

- WHY PATTERN CLASSIFICATION?
- PROCESSING STREAM
- PREPROCESSING / FEATURE REDUCTION
- CLASSIFICATION
- EVALUATING RESULTS
- APPLICATIONS
Why pattern classification?

The GLM:  y = X β + ε   (data = design matrix × parameters + error)

GLM: separate model fitting for each voxel → mass-univariate analysis!
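The mass-univariate model can be sketched numerically: one ordinary-least-squares fit of the same design matrix to every voxel's time course independently. Everything below (design, noise level, array sizes) is an illustrative assumption, not data from the course:

```python
import numpy as np

rng = np.random.default_rng(9)
n_scans, n_voxels = 100, 6

# design matrix: two regressors of interest plus a constant term
X = np.c_[rng.standard_normal((n_scans, 2)), np.ones(n_scans)]
beta_true = rng.standard_normal((3, n_voxels))
Y = X @ beta_true + 0.1 * rng.standard_normal((n_scans, n_voxels))

# mass-univariate: each column of Y (each voxel) is fit independently
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_true, atol=0.1))
```

Because `lstsq` treats each column of `Y` separately, this single call is exactly the "separate model fit per voxel" the slide describes.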
Why pattern classification?
Key idea behind pattern classification

- GLM analysis relies exclusively on the information contained in the time course of individual voxels
- Multivariate analyses take advantage of the information contained in activity patterns across space, from multiple voxels
- Cognitive/sensorimotor states are expressed in the brain as distributed patterns of brain activity
Why pattern classification?
Advantages of multivariate pattern classification

- increase in sensitivity: weak information in single voxels is accumulated across many voxels
- multiple regions/voxels may only carry information about brain states when jointly analyzed
- can prevent information loss due to spatial smoothing (but see Op de Beeck, 2009; Kamitani & Sawahata, 2010)
- can preserve temporal resolution instead of characterizing the average response across many trials
Binocular rivalry

Can spontaneous changes in conscious experience be decoded from fMRI signals in early visual cortex?

Haynes & Rees (2005). Current Biology
Processing stream
1. Acquire fMRI data while the subject is viewing blue and red gratings
2. Preprocess the fMRI data
3. Select relevant features (i.e. voxels)
4. Convert each fMRI volume into a vector that reflects the pattern of activity across voxels at that point in time
5. Label fMRI patterns according to whether the subject was perceiving blue vs. red (adjusting for hemodynamic lag)
6. Train a classifier to discriminate between blue patterns and red patterns
7. Apply the trained classifier to new fMRI patterns (not presented during training)
8. Cross-validation
9. Statistical inference
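The steps above can be sketched end-to-end, e.g. with scikit-learn on synthetic data; the signal strength, array sizes and the choice of a linear SVM are illustrative assumptions, not details of the Haynes & Rees study:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Steps 1-4: in a real study these pattern vectors come from preprocessed
# fMRI volumes; here we simulate 80 scans x 50 voxels of noise.
n_scans, n_voxels = 80, 50
X = rng.standard_normal((n_scans, n_voxels))

# Step 5: label each pattern (0 = "red" percept, 1 = "blue" percept) and
# plant a weak class-dependent signal in the first 10 voxels.
y = np.repeat([0, 1], n_scans // 2)
X[y == 1, :10] += 0.8

# Steps 6-8: train a linear classifier and cross-validate it.
acc = cross_val_score(LinearSVC(), X, y, cv=5).mean()
print(f"mean cross-validated accuracy: {acc:.2f}")
```

Step 9 (statistical inference on the resulting accuracy) is covered later in the slides.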
Preprocessing
1. (Slice timing +) realignment (SPM, FSL, …)
2. High-pass filtering / detrending
   - remove linear (and quadratic) trends (i.e. scanner drift)
   - remove low-frequency artifacts (i.e. biosignals)
3. Z-scoring
   - remove baseline shifts between scanning runs
   - reduce the impact of outliers
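Z-scoring within runs can be sketched with NumPy; the two-run dataset below is synthetic and the baseline shift between runs is exaggerated for illustration:

```python
import numpy as np

def zscore_per_run(data, run_labels):
    """Z-score each voxel's time course within each scanning run:
    removes baseline shifts between runs, reduces the impact of outliers."""
    out = np.empty_like(data, dtype=float)
    for run in np.unique(run_labels):
        block = data[run_labels == run]
        out[run_labels == run] = (block - block.mean(axis=0)) / block.std(axis=0)
    return out

rng = np.random.default_rng(1)
# two voxels, 20 scans, with a large baseline shift between the two runs
data = np.r_[rng.normal(0.0, 1.0, (10, 2)), rng.normal(100.0, 1.0, (10, 2))]
runs = np.repeat([0, 1], 10)
z = zscore_per_run(data, runs)
```

After this step each run has mean 0 and standard deviation 1 per voxel, so the between-run offset no longer dominates the classifier.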
Feature reduction
The problem
- fMRI data are typically sparse, high-dimensional and noisy
- classification is sensitive to the information content in all voxels
- many uninformative voxels = poor classification (i.e. due to overfitting)

[Figure: classification performance as a function of the number of features]
Solution 1: Feature selection
- select the subset with the most informative features
- original features remain unchanged
Feature selection
'External' solutions
- anatomical regions of interest
- independent functional localizer (Haynes & Rees: retinotopic mapping to identify early visual areas)
- searchlight classification: define a region of interest (i.e. a sphere) and move it across the search volume → exploratory analysis

'Internal' univariate solutions
- activation vs. baseline (t-test)
- mean difference between conditions (ANOVA)
- single voxel classification accuracy
Feature selection
Peeking #1 (ANOVA and classification only)
- testing a trained classifier needs to be performed on independent test datasets
- if the entire dataset is used for feature selection, classification estimates become overly optimistic
- → nested cross-validation!
Pereira et al. (2009)
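One way to avoid peeking, sketched here with scikit-learn, is to place the selector inside a pipeline so that the ANOVA is refit on the training folds of each split; the data are pure synthetic noise and k = 20 is an arbitrary choice:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 500))   # few scans, many voxels, pure noise
y = np.repeat([0, 1], 30)

# Selecting voxels on ALL data and then cross-validating would leak test
# information into the selection step ("peeking"). Inside a pipeline, the
# ANOVA selector is refit on the training folds only.
clf = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC())
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"accuracy on pure noise: {acc:.2f}")
```

On noise data the nested estimate stays near chance; selecting the 20 "best" voxels on the full dataset first would instead inflate it.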
Feature extraction
Solution 1: Feature selection
- select a subset from all available features
- original features remain unchanged

Solution 2: Feature extraction
- create new features as a function of existing features
- linear functions (PCA, ICA, …)
- nonlinear functions during classification (i.e. hidden units in a neural network)
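Linear feature extraction with PCA can be sketched directly from an SVD; this is a minimal illustration on synthetic data, not a full implementation:

```python
import numpy as np

def pca_extract(X, n_components):
    """New features = linear functions of the original voxels:
    scores on the leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(8)
X = rng.standard_normal((60, 500))   # 60 scans, 500 voxels
Z = pca_extract(X, 10)
print(Z.shape)                       # (60, 10)
```

The 500 voxel features are replaced by 10 uncorrelated components, ordered by explained variance.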
Classification
Linear classification

[Figure: training data (volumes at t1, t2, t4, …, t32) and independent test data (volume at t25) plotted in voxel 1 × voxel 2 space, separated by a hyperplane]

Our task: find a hyperplane that separates both conditions.
Classification
Linear classification

Decision function:  y = f(x) = w1·x1 + w2·x2 + … + wn·xn + b

- if y < 0, predict red; if y > 0, predict blue
- prediction = linear function of features
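The decision function can be written out directly; the weights, offset and test patterns below are hypothetical numbers for a two-voxel example:

```python
import numpy as np

def predict(x, w, b):
    """Decision function y = f(x) = w1*x1 + w2*x2 + ... + wn*xn + b."""
    y = np.dot(w, x) + b
    return "blue" if y > 0 else "red"

# hypothetical weights and offset for a two-voxel pattern
w = np.array([0.45, 0.89])
b = -1.0
print(predict(np.array([2.0, 1.5]), w, b))   # blue  (y = 1.235 > 0)
print(predict(np.array([0.1, 0.2]), w, b))   # red   (y = -0.777 < 0)
```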
Classification
Linear classification
- project the data on a new axis that maximizes the class separability
- the hyperplane is orthogonal to the best projection axis
Classification
Simplest approach: Fisher Linear Discriminant (FLD)

FLD classifies by projecting the training set on the axis that is defined by the difference between the centres of mass of both classes, corrected by the within-class scatter.

Separation is maximised for:  w = (m1 − m2) / (cov_class1 + cov_class2)
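A sketch of the FLD weights on synthetic two-voxel data, reading the slide's formula in matrix form (dividing by the summed class covariances becomes solving against the within-class scatter):

```python
import numpy as np

def fld_weights(X1, X2):
    """Project on the axis through the class means, corrected by the
    within-class scatter (matrix form of w = (m1 - m2)/(cov1 + cov2))."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.solve(Sw, m1 - m2)

rng = np.random.default_rng(3)
X1 = rng.standard_normal((40, 2)) + [1.5, 0.0]   # class 1
X2 = rng.standard_normal((40, 2)) + [-1.5, 0.0]  # class 2
w = fld_weights(X1, X2)

# classify by projecting onto w and thresholding at the midpoint
threshold = w @ (X1.mean(axis=0) + X2.mean(axis=0)) / 2
acc = np.mean(np.r_[X1 @ w > threshold, X2 @ w <= threshold])
print(f"training accuracy: {acc:.2f}")
```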
Classification
Linear classification

[Figure: weight vector w, orthogonal to the hyperplane, in voxel 1 × voxel 2 space]

y = w·x + b: the hyperplane is defined by the weight vector w and the offset b
Classification
How to interpret the weight vector?

Weight vector (discriminating volume): w = [0.45 0.89]

The value of each voxel in the weight vector indicates its importance in discriminating between the two classes (i.e. cognitive states).
Classification
Support Vector Machine (SVM)
Which of the linear separators is the optimal one?
Classification
Support Vector Machine (SVM)
SVM = maximum margin classifier
[Figure: maximum margin and support vectors in voxel 1 × voxel 2 space]

If classes have overlapping distributions, SVMs are modified to account for misclassification errors by introducing additional slack variables.
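A soft-margin SVM sketch with scikit-learn on synthetic overlapping classes; the penalty parameter C sets the cost of the slack variables (C = 1.0 is an arbitrary choice here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.r_[rng.standard_normal((40, 2)) + 1.0,    # class "blue"
          rng.standard_normal((40, 2)) - 1.0]    # class "red", overlapping
y = np.repeat([1, 0], 40)

# C controls the slack penalty: a small C tolerates more misclassified
# training points in exchange for a wider margin.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", svm.n_support_.sum())
print("training accuracy:", svm.score(X, y))
```

Only the support vectors (the points on or inside the margin) determine the hyperplane; the remaining training points could be removed without changing it.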
Classification
Linear classifiers
- Fisher Linear Discriminant
- Support Vector Machine (SVM)
- Logistic Regression
- Gaussian Naive Bayes
- …

Nonlinear classifiers
- SVM with kernel
- Neural Networks
- …

How to choose the right classifier?
Classification
Situation 1: scans ↓, features ↑ (i.e. whole brain data)
- FLD unsuitable: depends on reliable estimation of the covariance matrix
- GNB inferior to SVM and LR: the latter come with regularisation terms that help weigh down the effects of noisy and highly correlated features
Cox & Savoy (2003). NeuroImage
Classification
Situation 2: scans ↓, features ↓ (i.e. feature selection or feature extraction)
- GNB, SVM and LR: often similar performance
- SVM originally designed for two-class problems only
- SVM for multiclass problems: multiple binary comparisons, voting scheme to identify classes
- accuracy of SVM increases faster than GNB when the number of scans increases
- see Mitchell et al. (2005) for further comparisons between different classifiers
Classification
Peeking #2
- classifier performance = unbiased estimate of classification accuracy: how well would the classifier label a new example randomly drawn from the same distribution?
- testing a trained classifier needs to be performed on a dataset the classifier has never seen before
- if the entire dataset is used for training a classifier, classification estimates become overly optimistic

Solution: leave-one-out cross-validation
Classification
Cross-validation
- standard approach: leave-one-out cross-validation
- split the dataset into n folds (i.e. runs)
- train the classifier on folds 1 to n−1
- test the trained classifier on fold n
- rerun training/testing while withholding a different fold
- repeat the procedure until each fold has been withheld once
- classification accuracy is usually computed as the mean accuracy across folds

[Figure: training set vs. test set assignment across folds]
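The fold loop can be sketched generically; the nearest-class-mean rule below is only a toy stand-in for the classifiers discussed above, and all data are synthetic:

```python
import numpy as np

def leave_one_run_out(X, y, runs, fit, predict):
    """Withhold each run once: train on the remaining runs, test on it."""
    accs = []
    for run in np.unique(runs):
        train, test = runs != run, runs == run
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return np.mean(accs)            # mean accuracy across folds

# toy stand-in classifier: nearest class mean
def fit(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = np.array(list(model))
    dists = np.stack([np.linalg.norm(X - m, axis=1) for m in model.values()])
    return classes[dists.argmin(axis=0)]

rng = np.random.default_rng(5)
runs = np.repeat(np.arange(4), 20)         # 4 runs x 20 scans
y = np.tile(np.repeat([0, 1], 10), 4)
X = rng.standard_normal((80, 30))
X[y == 1, :5] += 1.0                       # weak signal in 5 voxels
acc = leave_one_run_out(X, y, runs, fit, predict)
print(f"mean accuracy: {acc:.2f}")
```

Splitting by run (rather than by arbitrary scans) keeps the temporally correlated scans of one run together, so the test fold stays genuinely independent.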
Evaluating results
Independent test data
- classification accuracy = unbiased estimate of the true accuracy of the classifier
- question: what is the probability of obtaining 57% accuracy under the null hypothesis (no information about the variable of interest in my data)?
- binary classification: the p-value can be calculated under a binomial distribution with N trials (i.e. 100) and P probability of success (i.e. 0.5)
- Matlab: p = 1 − binocdf(X, N, P) = 0.067 (hmm…), where X = number of correctly labeled examples (i.e. 57)

Can I publish my data with 57% classification accuracy in Science or Nature?
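A stdlib Python equivalent of the Matlab expression above (1 − binocdf computes the upper binomial tail):

```python
from math import comb

def binomial_p(x, n, p=0.5):
    """P(more than x successes) under Binomial(n, p): the upper tail,
    i.e. Matlab's 1 - binocdf(x, n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1, n + 1))

# 57 of 100 test patterns correctly labeled, chance level 0.5
print(f"p = {binomial_p(57, 100):.3f}")    # p = 0.067
```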
Evaluating results
Nonparametric approaches

Permutation tests (i.e. Polyn et al., 2005):
- create a null distribution of performance values by repeatedly generating scrambled versions of the classifier output
- MVPA: wavelet-based scrambling technique (Bullmore et al., 2004)
- can accommodate non-independent data

Bootstrapping
- estimate the variance and distribution of a statistic (i.e. voxel weights)
- multiple iterations of data resampling by drawing with replacement from the dataset

Multiclass problems: accuracy can be painful
- average rank of the correct label
- average of all pairwise comparisons
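A minimal permutation-test sketch: the observed accuracy is compared against accuracies of the same predictions under shuffled labels. The 57%-correct "classifier output" is simulated, and plain shuffling assumes independent examples (fMRI time series would need e.g. the wavelet-based scrambling mentioned above):

```python
import numpy as np

rng = np.random.default_rng(6)
y_true = np.repeat([0, 1], 50)
y_pred = y_true.copy()
flip = rng.choice(100, size=43, replace=False)   # classifier gets 57% correct
y_pred[flip] = 1 - y_pred[flip]
observed = np.mean(y_pred == y_true)             # 0.57

# Null distribution: score the same predictions against repeatedly
# shuffled labels.
null = np.array([np.mean(y_pred == rng.permutation(y_true))
                 for _ in range(2000)])
p = (np.sum(null >= observed) + 1) / (len(null) + 1)
print(f"observed accuracy = {observed:.2f}, permutation p = {p:.3f}")
```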
Getting results
Design considerations
- acquire as many training examples as possible → the classifier needs to be able to "see through the noise"
- averaging consecutive TRs can help to reduce the impact of noise (but may also eliminate natural, informative variation)
- alternative to averaging: use beta weights from a GLM analysis (i.e. based on FIR or HRF) → requires many runs/trials
- avoid using consecutive scans for training a classifier → lots of highly similar datapoints do not give new information
- acquire as many test examples as possible → increases the power of the significance test
- balance conditions → if not, the classifier may tend to focus on the predominant condition
Applications
Pattern discrimination
Question 1: do the selected fMRI data contain information about a variable of interest (i.e. the conscious percept in Haynes & Rees)?

Pattern localization
Question 2: where in the brain is information about the variable of interest represented?
- the weight vector contains information on the importance of each voxel for differentiating between classes
Applications
Pattern localization - Space
Polyn et al. (2005), Science.
Applications
Pattern localization - Space

Searchlight analysis: classification/cross-validation is performed on a voxel and its (spherical) neighbourhood
- the classification accuracy is assigned to the centre voxel
- the searchlight is moved across the entire dataset to obtain accuracy estimates for each voxel
- can be used for feature selection or to generate a brain map of p-values

Hassabis et al. (2009), Current Biology.
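A deliberately simplified searchlight sketch over a 1-D voxel array with a split-half test; real implementations (e.g. Kriegeskorte et al., 2006) use spherical 3-D neighbourhoods and full cross-validation, and all numbers here are synthetic:

```python
import numpy as np

def searchlight_1d(X, y, radius=2):
    """Toy searchlight: a nearest-class-mean rule is trained on the first
    half of the scans and tested on the second, and the accuracy is
    assigned to the centre voxel of each window."""
    n_scans, n_vox = X.shape
    half = n_scans // 2
    acc = np.zeros(n_vox)
    for v in range(n_vox):
        sl = X[:, max(0, v - radius): v + radius + 1]   # neighbourhood
        m0 = sl[:half][y[:half] == 0].mean(axis=0)
        m1 = sl[:half][y[:half] == 1].mean(axis=0)
        d0 = np.linalg.norm(sl[half:] - m0, axis=1)
        d1 = np.linalg.norm(sl[half:] - m1, axis=1)
        acc[v] = np.mean((d1 < d0) == (y[half:] == 1))
    return acc

rng = np.random.default_rng(7)
y = np.tile([0, 1], 50)                  # interleaved classes, 100 scans
X = rng.standard_normal((100, 40))
X[y == 1, 18:22] += 1.2                  # informative cluster near voxel 20
acc = searchlight_1d(X, y)
print("most informative centre voxel:", int(acc.argmax()))
```

The accuracy map peaks at the centre of the informative cluster, which is exactly how the searchlight turns classification into a localization tool.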
Applications
Pattern localization - Time

Question 3: when does the brain represent information about different classes?

Motor intention:
Soon et al. (2008), Nature Neuroscience.
Applications
Pattern characterization
Question 4: how are stimulus classes represented in the brain?
- goal: characterizing the relationship between stimulus classes and BOLD patterns
- Kay et al. (2008): training of a receptive field model for each voxel in V1, V2 and V3, based on location, spatial frequency and orientation (1750 natural images)
- subsequent classification of completely new stimuli (120 natural images)
Useful literature

Haynes JD, Rees G (2006) Decoding mental states from brain activity in humans. Nat Rev Neurosci 7:523-534.
Formisano E, De Martino F, Valente G (2008) Multivariate analysis of fMRI time series: classification and regression of brain responses using machine learning. Magn Reson Imaging 26(7):921-34.
Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain mapping. Proc Natl Acad Sci U S A 103:3863-3868.
Mitchell TM, et al. (2004) Learning to Decode Cognitive States from Brain Images. Machine Learning 57:145-175.
Norman KA, Polyn SM, Detre GJ, Haxby JV (2006) Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn Sci 10:424-430.
O'Toole et al. (2007) Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. J Cogn Neurosci 19(11):1735-52.
Pereira F, Mitchell TM, Botvinick M (2009) Machine Learning Classifiers and fMRI: a tutorial overview. Neuroimage 45(1 Suppl):S199-209.