multivariate models for fmri data klaas enno stephan (with 90% of slides kindly contributed by kay...

Multivariate models for fMRI data

Klaas Enno Stephan(with 90% of slides kindly contributed by Kay H. Brodersen)

Translational Neuromodeling Unit (TNU)Institute for Biomedical EngineeringUniversity of Zurich & ETH Zurich

2

Univariate approaches are excellent for localizing activations in individual voxels.

Why multivariate?

v1 v2 v1 v2

reward no reward

*

n.s.

3

Multivariate approaches can be used to examine responses that are jointly encoded in multiple voxels.

Why multivariate?

v1 v2 v1 v2

n.s.

orange juice apple juice v1

v2

n.s.

4

Multivariate approaches can utilize ‘hidden’ quantities such as coupling strengths.

Why multivariate?

activity

observed BOLD signal

hidden underlying neural activity and coupling strengths

t

driving input modulatory input

t

activity

activity

signal

signal

signal

Friston, Harrison & Penny (2003) NeuroImage; Stephan & Friston (2007) Handbook of Brain Connectivity; Stephan et al. (2008) NeuroImage

5

Overview

1 Modelling principles

2 Classification

3 Multivariate Bayes

4 Generative embedding

6

Overview


2 Classification



7

Encoding vs. decoding

context (cause or consequence) BOLD signal

condition stimulusresponse prediction error

encoding model

decoding model

𝑔 : 𝑋 𝑡→𝑌 𝑡

h :𝑌 𝑡→ 𝑋𝑡

8

Regression vs. classification

Regression model

independentvariables(regressors)

continuousdependent variable

Classification model

independentvariables(features)

categoricaldependent variable(label)

𝑓

𝑓 vs.

9

Univariate vs. multivariate models

BOLD signal

context

A univariate model considers a single voxel at a time.

A multivariate model considers many voxels at once.

Spatial dependencies between voxels are only introduced afterwards, through random field theory.

BOLD signalcontext

Multivariate models enable inferences on distributed responses without requiring focal activations.

10

Prediction vs. inference

The goal of prediction is to find a generalisable encoding or decoding function.

The goal of inference is to decide between competing hypotheses.

predicting a cognitive state using a

brain-machine interface

predicting asubject-specific

diagnostic status

predictive density marginal likelihood (model evidence)

comparing a model that links distributed neuronal

activity to a cognitive state with a model that

does not

weighing the evidence for

sparse vs. distributed coding

11

Goodness of fit vs. complexity

Goodness of fit is the degree to which a model explains observed data.

Complexity is the flexibility of a model (including, but not limited to, its number of parameters).

4 parameters 9 parameters

Bishop (2007) PRML

1 parameter

𝑋

𝑌

We wish to find the model that optimally trades off goodness of fit and complexity.

underfitting overfittingoptimal

truthdatamodel

13

Overview


2 Classification



14

A principled way of designing a classifier would be to adopt a probabilistic approach:

Constructing a classifier

Generative classifiers

use Bayes’ rule to estimate

• Gaussian naïve Bayes• Linear discriminant

analysis

Discriminative classifiers

estimate directlywithout Bayes’ theorem

• Logistic regression• Relevance vector machine• Gaussian process classifier

Discriminant classifiers

estimate directly

• Fisher’s linear discriminant

• Support vector machine

𝑓𝑌 𝑡 that which maximizes

In practice, classifiers differ in terms of how strictly they implement this principle.

15

Searchlight approach

A sphere is passed across the brain. At each location, the classifier is evaluated using only the voxels in the current sphere map of t-scores.

Whole-brain approach

A constrained classifier is trained on whole-brain data. Its voxel weights are related to their empirical null distributions using a permutation test map of t-scores.

Common types of fMRI classification studies

Nandy & Cordes (2003) MRMKriegeskorte et al. (2006) PNAS

Mourao-Miranda et al. (2005) NeuroImage

16

Support vector machine (SVM)

Vapnik (1999) Springer; Schölkopf et al. (2002) MIT Press

Nonlinear SVMLinear SVM

v1

v2

17

Stages in a classification analysis

Feature extraction

Classification using cross-validation

Performance evaluation

Bayesian mixed-effects inference

mixed effects

18

We can obtain trial-wise estimates of neural activity by filtering the data with a GLM.

Feature extraction for trial-by-trial classification

¿

data design matrix

𝛽1𝛽2⋮𝛽𝑝

× +𝑒

coefficients

Boxcar regressor for trial 2

Estimate of this coefficient reflects activity on trial 2

19

The generalization ability of a classifier can be estimated using a resampling procedure known as cross-validation. One example is 2-fold cross-validation:

Cross-validation

examples123

99100

?training example

test examples

folds

??

1

...

???

2

...

performance evaluation

20

A more commonly used variant is leave-one-out cross-validation.

Cross-validation

examples123

99100

?training example

test example?...

98

?

...99

?

...

100

...

folds

?

1

...

?

2

...

performance evaluation

21

Single-subject study with trials

The most common approach is to assess how likely the obtained number of correctly classified trials could have occurred by chance.


number of correctly classified trialstotal number of trialschance level (typically 0.5)binomial cumulative density function

Binomial test

In MATLAB:p = 1 - binocdf(k,n,pi_0)

subject

-+--+

trial

trial 1… [0111

0]

22


subject 1

-+--+

trial

trial 1…

population

[01110]

subject 2

[11011]

subject 3

[01100]

subject 4

[11111]

subject

…

… [01110]

23

Group study with subjects, trials each

In a group setting, we must account for both within-subjects (fixed-effects) and between-subjects (random-effects) variance components.


sample mean of sample accuracies chance level (typically 0.5)sample standard deviation cumulative Student’s -distribution

t-test onsummary statistics

Binomial test on concatenated data

Binomial test on averaged data

Bayesian mixed-effects inference

fixed effects random effectsfixed effects mixed effects

Brodersen, Mathys, Chumbley, Daunizeau, Ong, Buhmann, Stephan (2012) JMLRBrodersen, Daunizeau, Mathys, Chumbley, Buhmann, Stephan (2013) NeuroImage

available for MATLAB and R

24

Research questions for classification

Temporal evolution of discriminability Model-based classification

accuracy

50 %

100 %

within-trial time

Accuracy rises above chance

Participant indicates decision

Overall classification accuracy Spatial deployment of discriminative regions

80%

55%

accuracy

50 %

100 %

classification task

Truthor

lie?

Left or right button?

Healthy or ill?

Pereira et al. (2009) NeuroImage, Brodersen et al. (2009) The New Collection

{ group 1, group 2 }

25

Multivariate classification studies conduct group tests on single-subject summary statistics that discard the sign or

direction of underlying effects

do not necessarily take into account confounding effects (e.g. correlation of task conditions with difficulty etc.)

Therefore, in some analyses confounds rather than distributed representations may have produced positive results.

Potential problems

Todd et al. 2013, NeuroImage

26

Simulation: Experiment condition (rule A vs. rule B) does not affect voxel activity, but difficulty does. Moreover, experiment condition and difficulty are confounded at the individual-subject level in random directions across 500 subjects.

Potential problems


27

Empirical example: searchlight analysis: fMRI study on rule representations (flexible stimulus–response mappings

Standard MVPA: rule representations in prefrontal regions.

GLM: no significant results. Controlling for a variable

that is confounded with rule at the individual-subject level but not the group level (reaction time differences across rules) eliminates the MVPA results.

Potential problems


28

Overview


2 Classification



29

Multivariate Bayes

Mike West

SPM brings multivariate analyses into the conventional inference framework of hierarchical Bayesian models and their inversion.

30

Multivariate Bayes

some cause or consequence decoding model

Multivariate analyses in SPM rest on the central notion that inferences about how the brain represents things can be reduced to model comparison.

vs.

sparse coding in orbitofrontal cortex

distributed coding in prefrontal cortex

31

From encoding to decoding

Encoding model: GLM Decoding model: MVB

𝑌

¿ 𝑋 𝛽

¿𝑇𝐴+𝐺𝛾+𝜀𝛾𝜀

𝛽

𝛽𝑋

𝐴

𝑌

¿ 𝐴 𝛽

¿𝑇𝐴+𝐺𝛾+𝜀

𝑋

𝐴

𝑌=𝑇𝑋 𝛽+𝐺𝛾+𝜀 𝑇𝑋=𝑌 𝛽−𝐺𝛾𝛽−𝜀𝛽

𝛾𝜀

32

Multivariate Bayes

Friston et al. 2008 NeuroImage

33

Encoding model: neuronal activity = linear mixture of causes:

A=X𝛽 BOLD data matrix is a temporal convolution of underlying neuronal activity:

𝑌=𝑇A+𝐺𝛾+𝜀 Decoding model:

mental state is a linear mixture of voxel-wise activity: 𝑋=A𝛽 thus A=X𝛽-1

substitution into gives:𝑌 Y =𝑇X𝛽-1 +𝐺𝛾+𝜀 ⇒ Y𝛽 =𝑇X+𝐺𝛾𝛽+𝜀𝛽 ⇒ 𝑇X= Y𝛽-𝐺𝛾𝛽-𝜀𝛽

Importantly, we can remove the influence of confounds by premultiplying with the appropriate residual forming matrix: RTX= RY𝛽+𝜍

MVB: details

34

To make inversion tractable: imposing priors on 𝛽 (i.e., assumption about how mental states are

represented by voxel-wise activity)

Scientific inquiry: model selection: comparing different priors to identify which neuronal

representation of mental states is most plausible

MVB: details

35

Specifying the prior for MVB

To make the ill-posed regression problem tractable, MVB uses a prior on voxel weights. Different priors reflect different anatomical and/or coding hypotheses.

For example:

voxels

patterns

Voxel 3 is allowed to play a role, but only if its neighbours play similar roles.

Voxel 2 is allowed to play a role.

Friston et al. 2008 NeuroImage

36

MVB can be illustrated using SPM’s attention-to-motion example dataset.

This dataset is based on a simple block design. There are three experimental factors:

photic – display shows random dots

motion – dots are moving attention – subjects asked to pay

attention

Example: decoding motion from visual cortex

scan

s

photic motion attention const

Büchel & Friston 1999 Cerebral CortexFriston et al. 2008 NeuroImage

During these scans, for example, subjects were passively viewing moving dots.

37

Multivariate Bayes in SPM

Step 1After having specified and estimated a model, use the Results button.

Step 2Select the contrast to be decoded.

38


Step 3Pick a region of interest.

39


Step 4Multivariate Bayes can be invoked from within the Multivariate section.

Step 5Here, the region of interest is specified as a sphere around the cursor. The spatial prior implements a sparse coding hypothesis.

anatomical hypothesis

coding hypothesis

40


Step 6Results can be displayed using the BMS button.

41

Model evidence and voxel weights

log BF = 3

42

Summary: research questions for MVB

How does the brain represent things?Evaluating competing coding hypotheses

Where does the brain represent things?Evaluating competing anatomical hypotheses

43

Overview


2 Classification



44

Model-based classification

Brodersen, Haiss, Ong, Jung, Tittgemeyer, Buhmann, Weber, Stephan (2011) NeuroImageBrodersen, Schofield, Leff, Ong, Lomakina, Buhmann, Stephan (2011) PLoS Comput Biol

step 2 —embedding

step 1 —modelling

measurements from an individual subject

subject-specificgenerative model

subject representation in model-based feature space

A → BA → CB → BB → C

A

CB

step 3 —classification

A

CB

jointly discriminativeconnection strengths?

step 5 —interpretation

classification model

1

0discriminability of

groups?

accu

racy step 4 —

evaluation

45

Model-based classification: model specification

medialgeniculate

body

planumtemporale

Heschl’sgyrus(A1)

medialgeniculate

body

planumtemporale

Heschl’sgyrus(A1)

stimulus input

anatomicalregions of interest

y = –26 mm

L R

46

Model-based classification

patientscontrols 16218917734729133230781 243389360

50

60

70

80

90

100

bal

anc

ed a

ccu

racy

activation-based

correlation-based

model-based

a c s p m e z o f l r

bala

nced

acc

urac

y

100%

50%

90%

80%

70%

60%

n.s. n.s.

*

47

Generative embedding

-10

0

10

-0.5

0

0.5

-0.1

0

0.1

0.2

0.3

0.4

-0.4-0.2

0 -0.5

0

0.5-0.4

-0.35

-0.3

-0.25

-0.2

-0.15

-10

0

10

-0.5

0

0.5

-0.1

0

0.1

0.2

0.3

0.4

-0.4-0.2

0 -0.5

0

0.5-0.4

-0.35

-0.3

-0.25

-0.2

-0.15

generative em

bedding

Para

met

er 3

Voxe

l 3

Parameter 1Voxel 1Voxel 2 Parameter 2

controlspatients

Voxel-based activity space Model-based parameter space

classification accuracy

98%classification accuracy

75%

48

Model-based classification: interpretation

MGB

PT

HG(A1)

MGB

PT

HG(A1)

stimulus input

L R

49

Model-based classification: interpretation

MGB

PT

HG(A1)

MGB

PT

HG(A1)

stimulus input

L R

highly discriminativesomewhat discriminativenot discriminative

50

Model-based clustering

Deserno, Sterzer, Wüstenberg, Heinz, & Schlagenhauf (2012) J Neurosci

PC dLPFC

VC

WM

stimulus

fMRI data acquired during working-memory task & modelled using a three-region DCM

42 patients diagnosedwith schizophrenia

41 healthy controls

51


model selection interpretation validation

Brodersen et al. 2014, NeuroImage: Clinical

52

Summary

Classification• to assess whether a cognitive state is linked

to patterns of activity• to visualize the spatial deployment of

discriminative activity

Multivariate Bayes• to evaluate competing anatomical

hypotheses• to evaluate competing coding hypotheses

Generative embedding• to assess whether groups differ in terms of

model parameter estimates (connectivity)• to generate mechanistic subgroup

hypotheses

53

Thank you

55

Classification Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial

overview. NeuroImage, 45(1, Supplement 1), S199-S209. O'Toole, A. J., Jiang, F., Abdi, H., Penard, N., Dunlop, J. P., & Parent, M. A. (2007). Theoretical,

Statistical, and Practical Perspectives on Pattern-based Classification Approaches to the Analysis of Functional Neuroimaging Data. Journal of Cognitive Neuroscience, 19(11), 1735-1752.

Haynes, J., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 7(7), 523-534.

Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9), 424-30.

Brodersen, K. H., Haiss, F., Ong, C., Jung, F., Tittgemeyer, M., Buhmann, J., Weber, B., et al. (2010). Model-based feature construction for multivariate decoding. NeuroImage (2010).

Brodersen, K. H., Ong, C., Stephan, K. E., Buhmann, J. (2010). The balanced accuracy and its posterior distribution. ICPR, 3121-3124.

Multivariate Bayes Friston, K., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., et al. (2008). Bayesian

decoding of brain images. NeuroImage, 39(1), 181-205.

Further reading

56

Multivariate approaches can exploit a sampling bias in voxelized imagesto reveal interesting activity on a subvoxel scale.

Why multivariate?

Boynton (2005) Nature Neuroscience

¿

57

The last 10 years have seen a notable increase in decoding analyses in neuroimaging.

Why multivariate?

Haxby et al. (2001) ScienceLautrup et al. (1994) Supercomputing in Brain Research

PET

prediction

58

Searchlight approach

A sphere is passed across the brain. At each location, the classifier is evaluated using only the voxels in the current sphere map of t-scores.

Whole-brain approach

A constrained classifier is trained on whole-brain data. Its voxel weights are related to their empirical null distributions using a permutation test map of t-scores.

Which brain regions are jointly informative of a cognitive state of interest?

Spatial deployment of informative regions

Nandy & Cordes (2003) MRMKriegeskorte et al. (2006) PNAS

Mourao-Miranda et al. (2005) NeuroImage

59

Classification induces constraints on the experimental design. When estimating trial-wise Beta values, we need longer ITIs (typically 8 – 15

s). At the same time, we need many trials (typically 100+). Classes should be balanced. If they are imbalanced, we can resample the

training set, constrain the classifier, or report the balanced accuracy. Construction of examples

Estimation of Beta images is the preferred approach. Covariates should be included in the trial-by-trial design matrix.

Temporal autocorrelation In trial-by-trial classification, exclude trials around the test trial from the

training set. Avoiding double-dipping

Any feature selection and tuning of classifier settings should be carried out on the training set only.

Performance evaluation Do random-effects or mixed-effects inference. Correct for multiple tests.

Issues to be aware of (as researcher or reviewer)

60

Evaluating the performance of a classification algorithm critically requires a measure of the degree to which unseen examples have been identified with their correct class labels.

The procedure of averaging across accuracies obtained on individual cross-validation folds is flawed in two ways. First, it does not allow for the derivation of a meaningful confidence interval. Second,it leads to an optimistic estimate when a biased classifier is tested on an imbalanced dataset.

Both problems can be overcome by replacing the conventional point estimate of accuracy by an estimate of the posterior distribution of the balanced accuracy.


Brodersen, Ong, Buhmann, Stephan (2010) ICPR

61

Pattern characterization

Example – decoding the identity of the person speaking to the subject in the scanner

Formisano et al. (2008) Science

voxe

l 1

voxe

l 2

...

fingerprint plot(one plot per class)

62

Is there a link between and ?

To test for a statistical dependency between a contextual variable and the BOLD signal , we compare : there is no dependency : there is some dependency

Which statistical test?

1. define a test size (the probability of falsely rejecting , i.e., specificity),

2. choose the test with the highest power (the probability of correctly rejecting , i.e., sensitivity).

Lessons from the Neyman-Pearson lemma

The Neyman-Pearson lemma

The most powerful test of size is:to reject when the likelihood ratio exceeds a criticial value ,

with chosen such that

The null distribution of the likelihood ratio can be determined non-parametrically or under parametric assumptions.

This lemma underlies both classical statistics and Bayesian statistics (where is known as a Bayes factor).

Neyman & Person (1933) Phil Trans Roy Soc London

63

Lessons from the Neyman-Pearson lemma

In summary

1. Inference about how the brain represents things reduces to model comparison.

2. To establish that a link exists between some context and activity , the direction of the mapping is not important.

3. Testing the accuracy of a classifier is not based on and is therefore suboptimal.

Neyman & Person (1933) Phil Trans Roy Soc LondonKass & Raftery (1995) J Am Stat AssocFriston et al. (2009) NeuroImage

64

Temporal evolution of discriminability

Soon et al. (2008) Nature Neuroscience

Example – decoding which button the subject pressed

motor cortex

frontopolar cortex

classificationaccuracy

decision response

65

Approach1. estimation of an encoding model2. nearest-neighbour classification or voting

Identification / inferring a representational space

Mitchell et al. (2008) Science

66

Approach1. estimation of an encoding model2. model inversion

Reconstruction / optimal decoding

Paninski et al. (2007) Progr Brain Res Pillow et al. (2008) Nature Miyawaki et al. (2009) Neuron

67

Recent MVB studies

68

Specifying the prior for MVB

1st level – spatial coding hypothesis

2nd level – pattern covariance structure

𝜂×voxels

patterns

Thus: and

𝑈 𝑈 𝑈Voxel 3 is allowed to play a role, but only if its neighbours play weaker roles.

Voxel 2 is allowed to play a role.

69

Inverting the model

Partition #1

Partition #2

Partition #3(optimal)

Σ=𝜆1×

Σ=𝜆1×

Σ=𝜆1×

+𝜆2×

+𝜆2× +𝜆3×

Model inversion involves finding the posterior distribution over voxel weights .

In MVB, this includes a greedy search for the optimal covariance structure that governs the prior over .

subset subset

subset subset subset

subset

Pattern 8 is allowed to make some contribution, independently of the contributions of other patterns.

Pattern 3 makes an important positive contribution / is activated.

70

Observations vs. predictions

𝑹𝑿 𝑐

(moti

on)

71

MVB may outperform conventional point classifiers when using a more appropriate coding hypothesis.

Using MVB for point classification

Support Vector Machine

72

Model-based analyses by data representation

Model-basedanalyses

How do patterns ofhidden quantities (e.g., connectivity among brain regions) differ between groups?

Structure-basedanalyses

Which anatomical structures allow us to separate patients and healthy controls?

Activation-basedanalyses

Which functionaldifferences allow us toseparate groups?

73

0

0.2

0.4

0.6

0.8

1

bala

nced

acc

urac

y

1 2 3 4 5 6 7 80

20

40

60

80

100

120

log

mod

el e

vide

nce

1 2 3 4 5 6 7 80

0.2

0.4

0.6

0.8

1

bala

nced

pur

ity


0

0.2

0.4

0.6

0.8

1

bala

nced

acc

urac

y

1 2 3 4 5 6 7 80

20

40

60

80

100

120

log

mod

el e

vide

nce

1 2 3 4 5 6 7 80

0.2

0.4

0.6

0.8

1

bala

nced

pur

ity

0

0.2

0.4

0.6

0.8

1

bala

nced

acc

urac

y

1 2 3 4 5 6 7 80

20

40

60

80

100

120

log

mod

el e

vide

nce

1 2 3 4 5 6 7 80

0.2

0.4

0.6

0.8

1

bala

nced

pur

ity

functional connectivity

effective connectivity

78%71%

(model selection) (model validation)

best model

71%

supervised learning:SVM classification

unsupervised learning:GMM clustering (using effective connectivity)

62%

78%

55%

regionalactivity

Brodersen, Deserno, Schlagenhauf, Penny, Lin, Gupta, Buhmann, Stephan (in preparation)

74

Question 1 – What do the data tell us about hidden processes in the brain? compute the posterior

Generative embedding and DCM

?

?

Question 2 – Which model is best w.r.t. the observed fMRI data? compute the model evidence

Question 3 – Which model is best w.r.t. an external criterion? compute the classification accuracy

{ patient, control }

75

Model-based classification using DCM

structure-basedclassification

activation-basedclassification

model-basedclassification

model selection

inference onmodelparameters

vs.

?

{ group 1, group 2 }

multivariate models for fmri data klaas enno stephan (with 90% of slides kindly contributed by kay...

Documents

neuroimage slide

univariate model

multivariate bayes

multivariate models

multivariate approaches

generative embedding

distributed coding slide

decoding encoding model