MSC IN NEUROIMAGING – STATE OF THE ART ESSAY

Multivoxel Pattern Analysis (MVPA) in fMRI settings: Fundamentals & Case Study

Mario B. Pérez

12/12/2013

Introduction

The rise of MVPA as an analysis technique for fMRI BOLD data is yet to come. Authors such as Haxby (2012) have pointed out the initial complexity and uniqueness of the MVPA perspective on brain responses, and the slow adaptation of the research community to this new way of thinking, which draws on machine learning methods.

Unlike univariate techniques, which normally address where a cognitive process is localized, MVPA (which has many synonyms: multivariate pattern analysis, information-pattern analysis...) can additionally answer how it is coded. A further interesting feature is its ability to clarify the common situation in which two different processes overlap in their use of brain areas, sharing the same resources for divergent purposes (Peelen and Downing, 2006).

The main aspect that makes MVPA a qualitative leap in fMRI BOLD processing is that it accounts for the interactions between individual voxels and, as its name announces, detects and refines these interactions into patterns of activation. These activation patterns can be elicited by any given process in the brain, and can then be labelled and recognised when they appear again (Tong and Pratte, 2012). Although, as we will see, there are many possible flaws in this process, many impressive and eccentric applications have gradually flourished on this basis, from the famous “brain reading” or “brain decoding” (Reddy et al., 2010) to lie detection (Davatzikos et al., 2005) or even the decoding of natural scenes (Nishimoto et al., 2011).

Given that the use of MVPA normally entails the selection of regions of interest (ROIs), the visual system has been the main target of the studies undertaken to date, due to its relatively well-known functional structure.

In the literature on this topic, the pioneering work of Haxby (2001) on visual category recognition and Kamitani and Tong's (2005) study on the prediction of grating orientation are cited very often, and are considered responsible for the spread of MVPA. The remarkable work of Kay et al. (2008) on image identification is also influential.

Although it has been applied to other sources of data such as EEG (Rosenberg et al., 2012), MVPA has mainly been applied to fMRI data. Its novelty and distinction from conventional analyses, together with the promising prospects of its use and its striking preliminary applications, make MVPA the main topic of this essay. In the following sections we try to set out the basis that underlies MVPA, take a look at the most common flaws in its use, and finally review a case study in which this methodology was used.

Characterizing Multi-voxel Pattern Analysis

Probably the best way of introducing a new technique is to compare it with the existing ones. Traditionally, when a researcher wants to know which areas are involved in a particular task or setting, the analysis of the fMRI data considers each voxel individually, although the signal is acquired from the whole brain at once (Haynes and Rees, 2006). Thus, the time course of each voxel is treated as unrelated to the others, and its analysis is carried out without considering the possibility that other active or inactive voxels may be relevant to it. The behaviour of other voxels may give meaning to that voxel's activation, and therefore contribute to drawing the whole picture (Haxby, 2012). Although praised and recognised as extremely useful (Norman et al., 2006), this mass-univariate analysis has shown its limitations (O'Toole et al., 2007), as there are limits to what can be learned by examining voxels in isolation.

By contrast, MVPA performs a multivariate analysis that takes into account the relations and differences in activity between voxels that arise from complex stimulation settings. MVPA is therefore not exclusively aimed at determining which voxels are active, but at how the activation of different voxels is related: the so-called activity patterns. These activity patterns carry valuable information about how a stimulus or percept is coded, and give fMRI analysis an enhanced sensitivity to cognitive processes (Tong and Pratte, 2012). MVPA is dedicated to this pattern-recognition activity.

This enhanced sensitivity is obtained by avoiding certain steps that are part of conventional fMRI analyses (e.g. of block designs), such as spatial smoothing and averaging to intensify the differences between experimental conditions (Norman et al., 2006). Standard studies try to show that the average activity during one condition is significantly different from that during another condition across all time points, and therefore the information about brain activity at a specific time point is lost (Haynes and Rees, 2006). This is especially relevant in experiments that use complex stimulation, such as natural scenes, because averaging discards the fine-grained activity that, along with a certain amount of noise, might carry valuable information about how the stimulus is processed (Spiers and Maguire, 2007). Other processing steps such as spatial smoothing are also responsible for this “blurring effect”, making fine-grained activity unavailable (Mur et al., 2009). However, as Etzel et al. (2009) point out, spatial smoothing can be useful in between-subjects analysis, where MVPA has shown difficulties due to the high specificity of the within-subject signal (Cox and Savoy, 2003), which entails generalization issues.

Haynes and Rees (2006) argued that the signal lost through MVPA-unfriendly processing steps could carry important features, because voxels that show a weak or inconsistent response might carry vital information when analysed together rather than separately. This raises an interesting idea about how some cognitive processes could work: the weak, collective activation of many voxels might be as useful as a strong (significant) activation of an individual voxel.

As an example of this, Kamitani and Tong (2005), in their famous study on orientation detection, revealed that many voxels show patterns of weak activation consistently across repetitions of the same condition, providing a reasonable basis for this argument. More importantly, the remarkable work on category recognition carried out by Haxby (2001) showed that weaker or “submaximal” voxels are representative of each category (in this case shoes or bottles). This study also showed that even if the voxels with the highest consistent activation are removed, the categorical “fingerprint” can be identified above chance. This accuracy in recognition when the key category voxels were set aside implies that areas showing lower levels of activation can be used to discriminate between two categories. Because these categories share some high-activity areas, the overlap of two functions could be resolved by selecting low-level features (Peelen and Downing, 2006). Downing et al. (2005) also provide a list of overlapping category areas in which MVPA could be useful. Although distinguishing between two activity patterns based on activation intensity is possible (Hanson et al., 2004), research on overlap issues needs to continue.

MVPA has also been described as a major advance in information extraction from the fMRI signal (Norman et al., 2006) and a necessary tool to avoid wasting neuroimaging data, which are normally expensive and difficult to acquire (O'Toole et al., 2007). Thus, as we said previously, pattern analysis does not use processing steps that “rule out” potentially crucial information. Instead, MVPA tries to make the most of that “fine-grained”1 activity by defining the activation pattern of a voxel ensemble in a given example. Examples are presentations of our stimuli that provide activation patterns to our classifier algorithm. Once our classifier has been trained with several examples, it will in theory be ready to recognise which example has been presented to it. In a sense, the classifier holds a “weighted model” of the activation pattern characteristic of, for example, an object category.

All this information might be a little unclear when compressed into such a brief description, but in the next sections we will address what a classifier is and how the process takes place.

MVPA basic procedure

The graph below summarizes the procedure for carrying out an MVPA experiment.

There are a few remarks that have to be addressed before going into detail, first and foremost about training data, testing data and feature selection. Preprocessing steps, as well as scanning details, are not taken into account in this essay2.

1 The exploitation of this “fine-grained” activity represents one of the central features of MVPA, as low-level activity or activity that does not reach significance would be lost if this fine-grained activity were disregarded.

2 Brief note on data preprocessing: as Etzel et al. (2009) indicate, many steps that are used in univariate analysis are part of MVPA as well. Motion correction and normalization are typically used, while voxel-wise detrending (to correct scanner drift, for example) might be controversial due to the delicate nature of the data needed for MVPA. The reader will find a great review of how to undertake a classification analysis of fMRI data in the cited article.

The vast majority of reviews and articles consulted on technical aspects of MVPA make particularly clear the necessity of splitting the data into training and testing sets as the first step (Pereira et al., 2009; Kriegeskorte et al., 2009; O'Toole et al., 2007; Mur et al., 2009). It is also important not to use the testing data as part of feature selection. The reasons for these precautions will be addressed later on; however, they must be stated here, as the illustration fails to make this particular aspect evident.

[Illustration of the MVPA procedure]3

3 Illustration from Norman et al., 2006. It includes a fourth step (“b”, or pattern assembly) that is rarely mentioned. It does not appear in other reviews such as Mur et al. (2009), and involves the “labelling” of activity patterns. Since patterns are caused by discrete stimuli and we have set those stimuli, there is no need to label the patterns. It may be pertinent in exploratory analysis, where the source of the pattern may be unknown. Believe it or not, this is the best illustration of the process available, and it is used repeatedly.

Overview

The process outlined in the graph above includes three main steps that will be described next (remember that data splitting has already taken place):

1. Feature selection (or voxel selection): This first step tries to delimit the set of voxels that will be used further on. It can be done with different techniques. In this case, bottles and shoes were shown to select the pertinent voxels.

2. Classifier training: In this stage the training data are used to train our classifier algorithm, so that it establishes a function between our examples (or stimuli) and their characteristic activity patterns. Once training has finished, the classifier generates a decision plane.

3. Classifier or generalization testing: The classifier is exposed to new data (the testing data set) that belong to the same categories. In our example, these will be images of shoes and bottles not presented previously. The activation patterns are submitted to the classifier, which assigns them a position on the decision plane. Based on where they “fall” and the identity of the example, the classification is successful or not.
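The three steps above presuppose that the data have already been split. As a minimal sketch of one way to do this, the code below holds out one scanning run at a time for testing and keeps the remaining runs for training. The leave-one-run-out scheme, the function name `leave_one_run_out` and the toy run labels are assumptions made for illustration; they are not taken from the reviews cited here.

```python
def leave_one_run_out(run_ids):
    """Yield (held_out_run, train_indices, test_indices), one fold per run."""
    for held_out in sorted(set(run_ids)):
        train = [i for i, r in enumerate(run_ids) if r != held_out]
        test = [i for i, r in enumerate(run_ids) if r == held_out]
        yield held_out, train, test


# Hypothetical design: 3 scanning runs with 4 examples each.
run_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
folds = list(leave_one_run_out(run_ids))
for held_out, train, test in folds:
    # Feature selection and classifier training would see only `train`;
    # generalization testing would use only `test`.
    print(f"run {held_out}: train on {len(train)} examples, test on {len(test)}")
```

Splitting by run rather than by single trial is a design choice of this sketch: it keeps temporally adjacent examples on the same side of the split.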

Feature Selection

Also called “voxel selection”, this is a crucial step in MVPA, because it defines the framework and extent of the analysis. It is also one of the steps that presents many pitfalls, as we will see in the section on limitations.

First of all, why is feature selection necessary?

Many articles disregard this fundamental question, which arises easily. The mere presence of voxel selection appears to contradict the foundation of MVPA. If the goal is to take into account the interactions between voxels, narrowing down the number of voxels included in the analysis seems nonsensical. Nonetheless, as Cox and Savoy (2003) point out, many classifiers experience an inherent loss of accuracy when the number of voxels included in the analysis is very high. While MVPA's power mainly resides in taking into account voxels whose activity is not necessarily significant, adding irrelevant voxels whose activity is very low or mainly reflects noise significantly affects the performance of the classifier. Nevertheless, methods and applications that allow the use of whole-brain activity are described in Tong and Pratte (2012).

Normally, these whole-brain studies deal with high-level cognitive processes, which are not easy to narrow down to a specific set of ROIs. Therefore, these researchers use Independent Component Analysis (ICA) or Principal Component Analysis (PCA) to reduce the number of dimensions.

Put simply, these are methods that reduce the number of variables to deal with by grouping them into linear combinations that are as unrelated to each other as possible.

So, given this necessity, feature selection provides us with the voxels we are going to include in our analysis. First of all, it is very common to select a region of interest (ROI) relevant to our study, within which feature selection takes place (Haxby et al., 2001; Mur et al., 2009; Chadwick et al., 2010). Following Pereira et al. (2008), we can distinguish between filter and wrapper feature selection methods. Wrapper methods add and remove voxels according to the impact they have on the classifier's performance; however, these methods involve combinatorial issues that make computation complicated (Norman et al., 2006), and filter methods are normally preferred. Filtering involves ranking voxels according to a specific criterion. We can rank voxels based on how active they are, how well they discriminate between conditions, their prediction accuracy, the consistency of their activation, and so on.
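A filter method can be sketched in a few lines. The example below ranks voxels by one possible criterion, the difference between class means scaled by the overall spread, and keeps the best ones. The score, the function names and the toy values are assumptions chosen for illustration, and only training examples are passed in, in line with the precaution about testing data mentioned earlier.

```python
from statistics import mean, pstdev


def discrimination_score(values_a, values_b):
    """Absolute difference of class means scaled by the overall spread."""
    spread = pstdev(values_a + values_b) or 1.0  # avoid division by zero
    return abs(mean(values_a) - mean(values_b)) / spread


def select_voxels(patterns, labels, n_keep):
    """Rank voxels one at a time (a filter method) and keep the best n_keep.

    patterns: list of examples, each a list of voxel activity values.
    labels:   one class label ("A" or "B" in this sketch) per example.
    """
    n_voxels = len(patterns[0])
    scores = []
    for v in range(n_voxels):
        a = [p[v] for p, y in zip(patterns, labels) if y == "A"]
        b = [p[v] for p, y in zip(patterns, labels) if y == "B"]
        scores.append(discrimination_score(a, b))
    ranked = sorted(range(n_voxels), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:n_keep])


# Toy training set: 4 examples x 3 voxels; voxel 1 separates the classes best.
train_patterns = [[0.1, 2.0, 0.5], [0.2, 2.2, 0.4],
                  [0.1, 0.1, 0.5], [0.2, 0.0, 0.6]]
train_labels = ["A", "A", "B", "B"]
selected = select_voxels(train_patterns, train_labels, n_keep=1)
print(selected)
```

Each voxel is scored on its own here, so the ranking itself takes no account of interactions between voxels.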

It is important nonetheless to realize that by filtering we are again considering voxels as separate entities (so we are performing a univariate analysis). A popular option is to focus on voxels that show maximum activity and good discriminative power (Polyn et al., 2005, as cited in Norman et al., 2006). With certain classifiers, a multivariate feature selection method called “searchlight accuracy” can be used (Kriegeskorte et al., 2006). This method tries to add the information from the voxel's surroundings (neighbouring voxels) by defining a spherical cluster, a “ball of voxels” of radius “x”. The testing data is used repeatedly so the useful voxels can be detected within the radius of the spherical cluster.
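The “ball of voxels” can be made concrete with a short sketch that lists the voxel coordinates lying within a given radius of a centre voxel. The function name `searchlight_sphere`, the radius measured in voxel units and the 10 x 10 x 10 volume are assumptions for illustration; a full searchlight analysis would repeat this for every voxel and score each sphere, which is not shown here.

```python
from itertools import product


def searchlight_sphere(center, radius, shape):
    """Return voxel coordinates within `radius` (in voxels) of `center`,
    clipped to a volume with the given (x, y, z) shape."""
    cx, cy, cz = center
    r = int(radius)
    sphere = []
    for dx, dy, dz in product(range(-r, r + 1), repeat=3):
        if dx * dx + dy * dy + dz * dz > radius * radius:
            continue  # this offset falls outside the ball
        x, y, z = cx + dx, cy + dy, cz + dz
        if 0 <= x < shape[0] and 0 <= y < shape[1] and 0 <= z < shape[2]:
            sphere.append((x, y, z))
    return sphere


# Hypothetical 10 x 10 x 10 volume.
neighbours = searchlight_sphere(center=(5, 5, 5), radius=1, shape=(10, 10, 10))
corner = searchlight_sphere(center=(0, 0, 0), radius=1, shape=(10, 10, 10))
print(len(neighbours), len(corner))
```

Spheres centred near the edge of the volume contain fewer voxels, which is worth keeping in mind when scores from different spheres are compared.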

Classifier Training

This step involves using the trials that we set aside for training to supply examples of the activation patterns (characteristic of our experimental conditions) to a multivariate algorithm. This algorithm will learn the statistically representative features of our stimuli, and will generate a decision function (1c in the previous graph) that will be used to make a call each time a stimulus is presented in our testing phase.

The choice of algorithm is one of the decisive steps in MVPA. There is a great variety of classifiers available, and even a non-exhaustive discussion of them would take up the whole length of this essay. However, we cannot proceed without at least a brief discussion of this point.

What is a classifier algorithm?

The easiest way to introduce this notion is to describe the task that classifiers perform. Classifiers have to identify the relationship between voxel activity and the appearance of a stimulus, and be able to recognize that relationship in previously unpresented stimuli of the same category.

Thus, classifiers obtain a parametric profile of the activity pattern elicited by the stimulus or example. These parameters are acquired during the training phase with the data reserved for that purpose. When the training is finished, the classifier is supposed to be able to give us a prediction (or identification, or discrimination; that depends on the researcher's assumptions4) of which stimulus has been presented to a given subject. This prediction must be based on a different set of examples from the one used for training; otherwise, there would be a problem of overfitting (see the section on limitations).

Classifiers differ in the type of function they learn (Pereira et al., 2008). Primarily, algorithms can be divided into linear and non-linear ones. The vast majority of MVPA studies have used linear classifiers due to their success (Mur et al., 2009). Additionally, according to the same authors, non-linear classifiers have not consistently demonstrated superior performance to date, and the solutions they offer are difficult to interpret. Sheng (2011) suggests that one of the key features of linear classifiers is their simplicity and their ability to balance the influence of specific voxels across examples or stimuli. All linear classifiers build a weighted model that reflects the importance of the different voxel activity values. In the illustration below each voxel (represented by x) is assigned a specific weight (w). In a hypothetical situation, the category “shoe” could be defined by xw>0 and the category “bottle” by xw<0.

[Illustration of a linear classifier's weighted model]5

Aspects to consider when choosing a classifier are the number of features or voxels and the number of categories, among other factors not described here. Popular linear classifiers are Gaussian Naive Bayes (GNB), logistic regression (LR) and support vector machines (SVM). Comparing them, GNB tends to perform poorly in settings with many voxels, and LR gives better results than SVM in situations with more than two conditions or categories (Pereira et al., 2008). SVM requires additional modifications to work with more than two categories (our case study uses one of these classifiers).

Decision-making threshold

Taking the simplest situation, in which we have two features or voxels, we can observe how the decision-making process takes place.

The activity values of these voxels (there is a slightly similar example in Mur et al., 2009) can work as coordinates that define a point in a plane, and we would only need to draw a line to separate our conditions.

4 Note that, although the choice is left to the researcher's intention, identifying and discriminating are not the same thing at all. Identifying means positively specifying which category an activity pattern belongs to without it having been presented to the classifier before (Kay et al., 2008), while discriminating could mean just distinguishing between two examples (Chadwick et al., 2010).

5 Illustration from Pereira et al. (2008).

This line is constructed from the feedback that the classifier receives while being trained with the training examples. Thus, given an example during the testing phase, its “xw” value is submitted to the decision function that was constructed during the training phase, which works as our linear threshold to determine which category has been presented.
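To illustrate how such a line can be built from feedback, the sketch below trains a perceptron on two voxels. The perceptron is used here only because it is one of the simplest linear classifiers that learns from its own errors; it is an assumption of this example and not one of the classifiers (GNB, LR, SVM) discussed above. The activity values, the learning rate and the added bias term `b` are likewise invented for illustration.

```python
def predict(x, w, b):
    """Linear decision rule: the sign of the weighted sum x.w + b."""
    activation = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if activation > 0 else -1


def train_perceptron(examples, labels, epochs=100, rate=0.1):
    """Adjust the weights only when a training example is misclassified."""
    w, b = [0.0] * len(examples[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(examples, labels):
            if predict(x, w, b) != y:
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
                errors += 1
        if errors == 0:
            break  # every training example is on the correct side of the line
    return w, b


# Toy two-voxel data: +1 = "shoe", -1 = "bottle" (values are invented).
train_x = [[2.0, 0.5], [1.8, 0.7], [2.2, 0.4],
           [0.4, 1.9], [0.6, 2.1], [0.5, 1.7]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(train_x, train_y)

test_x = [[2.1, 0.6], [0.3, 2.0]]  # examples not seen during training
predictions = [predict(x, w, b) for x in test_x]
print(w, b, predictions)
```

The learned weights play the role of “w” in the text: a positive weight on the first voxel and a negative weight on the second mean that the two voxels push the decision in opposite directions.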

[Illustration of three two-voxel classification situations: (a), (b) and (c)]6

The illustration above gives us a chance to explain part of the potential of MVPA. In the first situation (a), the two distributions are completely segregated in a rather simple way, as voxel “X1” (let's say blue) and voxel “X2” (let's say red) have opposite activations. When condition “A” is presented, voxel “X1” shows activation while “X2” is inactive. In this situation univariate analysis would yield optimal results. There is no overlap between conditions.

However, the situation on the right (b) is more complex. It can nonetheless be approached by using MVPA with a linear classifier, which, by assigning a weight to each voxel, is able to code the influence of each. Then, given a specific point on the plane, the decision threshold allows us to determine which condition or example is more likely to have occurred.

The last of the three situations in our illustration (c) can be tackled only with the help of a non-linear classifier. The idea is the same as with linear ones, but in this case the decision threshold is more complex.
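A toy case shows why a straight line can fail. The sketch below assumes an XOR-like layout, in which the category depends on whether the two voxels agree; this layout is an assumption for illustration and may differ from situation (c) in the original figure. A rule based on the product of the two activities separates the classes, whereas no line tried from a coarse grid does.

```python
def linear_rule(x, w, b):
    """Linear decision: the sign of x.w + b."""
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b > 0 else -1


def nonlinear_rule(x):
    """Decide on the product of the two (mean-centred) voxel activities."""
    return 1 if x[0] * x[1] > 0 else -1


# XOR-like toy layout: class +1 when both voxels agree, -1 otherwise.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

nonlinear_ok = all(nonlinear_rule(p) == y for p, y in zip(points, labels))

# Try a coarse grid of lines: none of them classifies all four points.
grid = [i / 2 for i in range(-4, 5)]
linear_ok = any(
    all(linear_rule(p, (w1, w2), b) == y for p, y in zip(points, labels))
    for w1 in grid for w2 in grid for b in grid
)
print(nonlinear_ok, linear_ok)
```

The grid search is only a demonstration; for this layout it can be shown algebraically that no choice of weights and threshold works.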

Although non-linear classifiers might be more powerful, most texts are not very enthusiastic about their use (Kamitani and Tong, 2005; Pereira et al., 2008; O'Toole et al., 2007; Norman et al., 2006; and others).

6 Illustration from Cox and Savoy (2003).

As we said, these methods are considered to yield results that are very difficult to interpret, and the gain in performance from using non-linear classifiers is unclear.

Classification by Nearest-neighbour

This method is one of the simplest, as it does not involve learning a function, properly speaking. The example presented is compared with those already seen in the training stage, and a decision is made based on the similarity between the training and the testing examples. The performance of nearest-neighbour classification can be improved by averaging the patterns of the testing examples, but again this removes variability that might be valuable for making a decision. According to Pereira et al. (2008), nearest neighbour works well as long as the number of voxels remains relatively low. This classification method was used in Haxby (2001).
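A minimal sketch of nearest-neighbour classification follows. Similarity is measured with the Pearson correlation between patterns, which is an assumption of this example (other distance measures are possible), and the four-voxel patterns and function names are invented for illustration.

```python
from statistics import mean


def correlation(a, b):
    """Pearson correlation between two activity patterns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / norm if norm else 0.0


def nearest_neighbour(test_pattern, train_patterns, train_labels):
    """Return the label of the most similar stored training pattern."""
    similarities = [correlation(test_pattern, p) for p in train_patterns]
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return train_labels[best]


# Stored training examples (4 voxels each); values are invented.
train_patterns = [[1.0, 0.2, 0.1, 0.8], [0.9, 0.3, 0.2, 0.7],
                  [0.1, 0.9, 0.8, 0.2], [0.2, 1.0, 0.7, 0.1]]
train_labels = ["shoe", "shoe", "bottle", "bottle"]

guess = nearest_neighbour([0.8, 0.1, 0.3, 0.9], train_patterns, train_labels)
print(guess)
```

No weights are learned: the training examples themselves are the model, which is why the method has no training step in the usual sense.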

Generalization Testing

At this point, the last step is simply to test the classifier by exposing it to new, previously unpresented data. The classifier makes a judgement in each case about which of the conditions has been presented, and the comparison between the actual presentation sequence and the predictions of our classifier yields an accuracy percentage. If the accuracy is above chance, training has been successful.
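Whether an accuracy is “above chance” can be checked with a simple calculation. The sketch below computes the accuracy and the probability of obtaining at least that many correct guesses by chance, using a one-sided binomial test. The test assumes independent trials, which fMRI data may violate, and the 20 trials and 5 errors are invented numbers; none of this is taken from the studies reviewed here.

```python
from math import comb


def accuracy(true_labels, predicted_labels):
    """Proportion of test examples whose predicted label is correct."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def p_value_above_chance(n_correct, n_trials, chance):
    """One-sided binomial probability of >= n_correct hits by guessing."""
    return sum(
        comb(n_trials, k) * chance ** k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


true_labels = ["shoe", "bottle"] * 10   # 20 hypothetical test trials
predicted = list(true_labels)
for i in (1, 4, 9, 12, 17):             # make 5 of the predictions wrong
    predicted[i] = "shoe" if true_labels[i] == "bottle" else "bottle"

acc = accuracy(true_labels, predicted)  # 15 of 20 correct
p = p_value_above_chance(15, 20, chance=0.5)
print(f"accuracy = {acc:.2f}, p = {p:.4f}")
```

With two equally frequent categories the chance level is 0.5; with more categories, or unbalanced ones, the `chance` argument would need to change accordingly.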

Limitations and Pitfalls using MVPA

Like every method, MVPA has several weaknesses, some of which are more avoidable than others. Technical limitations due to spatial or temporal resolution are difficult to avoid (the temporal resolution of MVPA is inevitably limited by the dispersion of the hemodynamic response; Norman et al., 2006), whereas others, such as those related to feature selection or classifier choice, can be controlled with a good decision-making process.

Capacity to deal with overlapping states

For example, as we have seen, one of the strengths of MVPA is its ability to disentangle the activation patterns (spatial patterns) produced by two different stimuli or mental states that have taken place (Cox and Savoy, 2003). By contrast, as Haynes and Rees (2006) point out, there is currently no evidence that MVPA can distinguish between two stimuli that occur at the same time and whose spatial representations share the same set of neurons. It can be argued that this limitation might be overcome with higher spatial resolution, but as Haxby (2012) states, there is a necessary limit on the number of modules that can support only one kind of processing.

A limited number of categories for an unlimited world

As a logical extension of this reasoning, Haynes and Rees (2006) claim that while percepts or forms of stimulation are virtually infinite, the number of training categories is necessarily limited. Hence, our classifier will always be limited to a certain number of discriminations. Attempts to address this issue could come from studies that deal with the generalization problems of MVPA, like the one carried out by Kay et al. (2008). The classification in that report shows remarkable generalization when exposed to numerous unpresented images, reaching high prediction rates based on a training set of 1750 images.

The presence of previous knowledge

Although, as we said, whole-brain analysis is possible (Polyn et al., 2005, as cited in Haynes and Rees, 2006), it certainly involves many challenges that are difficult to resolve (combinatorial limitations, overfitting...). A plausible alternative could be the use of searchlight feature analysis, which is supposed to alleviate the potential computational issues (Tong and Pratte, 2012). Thus, using MVPA implies having a reasonable knowledge of the features to study and at least some guidance on where to find them. As Pereira et al. (2008) mention, the definition of ROIs is a common step in the overwhelming majority of MVPA studies. This issue is supposed to have a lower impact in systems whose functional architecture is relatively well known (the visual system, according to Haynes and Rees, 2006), but remains a notable issue for other cognitive functions whose functional basis has not yet been properly described. Feature selection stands as one of the biggest sources of problems in MVPA studies. According to Tong and Pratte (2012), studies on higher-level cognition have difficulty defining a coherent set of regions of interest, and therefore correctly targeting the relevant voxel arrays.

Generalization issues

This topic is at the same time related to the strengths of MVPA. Pattern recognition involves the exploitation of the so-called fine-grained activity (Norman et al., 2006). Consequently, response patterns are highly idiosyncratic and difficult to extrapolate to other subjects. The pattern elicited by stimulus “X” in subject “1” should, in an ideal situation, be the same as the one elicited by the same stimulus in subject “2”. Currently, although some extrapolations have been successful (Haxby, 2011, developed “hyperalignment”, which includes tuning functions, as cited in Haxby, 2012), this is an unresolved problem.

Studies normally conduct MVPA on a within-individual basis. However, generalization problems do not end there. As Haynes and Rees (2006) indicate, generalization across different contexts is even more complicated. That is, in our toy example, subjects “1” and “2” receive the same stimulation in a similar setting, but what would happen if the context surrounding those presentations were not the same?

Although classification accuracy does not collapse, it worsens severely even when the setting is the same but the scans are carried out on different days (Cox and Savoy, 2003). Generalization across different stimuli was nonetheless achieved in a working memory study (Harrison and Tong, 2009, as cited in Mur et al., 2009) and between subjects in an auditory perception study (Formisano et al., 2008, as cited in Tong and Pratte, 2012). Once again the study of Kay et al. (2008) stands as an example, as it demonstrated successful generalization across time. Haynes and Rees (2006) point out that improving generalization normally comes at the cost of individual discriminatory power.

Finally, it is important to interpret mere differential activity carefully. Poldrack (2009) found that when subjects carry out cognitive tasks that differ enough, almost all of the cortex can show discriminative power. Differential activity can arise for a myriad of reasons, such as slight differences in memory processing load, difficulty, processing time or language requirements (Tong and Pratte, 2012). An important future direction is the development of calibration and adjustment protocols between subjects and situations.

Circular Analysis or Overfitting

The danger of circular analysis is a common cause of concern in MVPA studies. Reviewing the recent literature, it is evident that most researchers are aware of this issue, at least in its simplest form, which we explain now.

Most MVPA studies have as their main goal to train a classifier able to identify a specific activity pattern and distinguish it from others of the same or other categories (examples are Haxby, 2001; Kay et al., 2008; Chadwick et al., 2010). To achieve this, as we have stated, separate training and test data must exist, the former to train the classifier and the latter to test the classifier's performance in terms of the proportion of correct guesses. The independence of these two data sets is crucial, and they must be split before proceeding to the feature selection phase, as the first step of the whole process (Pereira et al., 2009).

O'Toole et al. (2009) explain why the training data cannot be used to test the classifier. In fMRI studies, we normally have a large number of parameters (the voxels we take into our analysis) compared to the number of examples (stimulus presentations). The fact that the voxels contained in our ROIs greatly outnumber our examples is relevant because the number of parameters that characterize each activation pattern will be enormous, and some of these parameters will contain noise (from systematic and unsystematic sources of error).

This overfitting, or characterization with a large number of parameters, leads to a situation in which a perfect classification of the training set is possible. By contrast, the same classifier will obtain poor results when tested with new data, for the same reason.

Overfitting leads to lower accuracy on the test set and therefore to poorer generalization by the classifier (Mur et al., 2009).

Overfitting can also lead to an overestimation of the classifier if the training data are used to assess the classifier's accuracy during the testing stage (Kriegeskorte et al., 2009). In this situation, a classifier built with many parameters can fit and identify a significant part of the testing material, regardless of the algorithm's ability to classify other stimuli of the same category. Put simply, the classifier will succeed in separating the stimuli that have already been presented to it, but there is no guarantee that it will succeed with other stimuli, even if they are of the same category (Mitchell, 2010).

Overfitting is therefore related to voxel selection, because the more voxels we select for our analysis, the larger the number of parameters, and the higher the chances that low-level parameters play a leading part in our classification performance. There is a second variant of overfitting, described by Pereira et al. (2009), that operates in a more subtle way. As we have established, feature selection can be carried out in different ways, normally involving a set of data that helps us to delimit our voxel selection, probably within a previously selected ROI. The illustration below (from Kriegeskorte et al., 2009) shows the prediction results when only the training set or all the data are used to delimit the voxel selection (ignore the dark bars labelled “task”). When all the data were used, classification was almost at 100%, whereas using only the training data lowered this percentage to roughly 75%:

The same illustration provides the comparison with random data, when all the data are used to

delimit our voxel selection and when only the training data are used. Although still above chance,

the results have changed significantly. The reason is that selecting voxels with all the data is a subtle way of letting the

classifier learn about our testing data. Some of the voxel time series appear consistent

between the training and the testing sets because they belong to a common

category. In a way, it can be said that the training and the testing data sets are no longer

independent (Kriegeskorte et al., 2009).
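
The loss of independence described above can be reproduced on random data with a short sketch. It does not reimplement the analysis of Kriegeskorte et al. (2009): the nearest-class-mean classifier, the voxel score (difference between class means), the sample sizes and all function names are assumptions made for illustration only.

```python
import random

def class_mean(examples, label, voxel):
    """Mean value of one voxel over the examples of one class."""
    values = [pattern[voxel] for pattern, lab in examples if lab == label]
    return sum(values) / len(values)

def select_voxels(examples, n_keep):
    """Indices of the n_keep voxels with the largest difference between class means."""
    def score(voxel):
        return abs(class_mean(examples, 0, voxel) - class_mean(examples, 1, voxel))
    return sorted(range(len(examples[0][0])), key=score, reverse=True)[:n_keep]

def predict_nearest_mean(training, pattern, voxels):
    """Assign `pattern` to the class whose mean pattern is closest on `voxels`."""
    def sq_dist(label):
        return sum((pattern[v] - class_mean(training, label, v)) ** 2 for v in voxels)
    return min((0, 1), key=sq_dist)

def loo_accuracy(examples, n_keep, select_on_all_data):
    """Leave-one-out accuracy; voxels are chosen on all data or on the training fold only."""
    hits = 0
    for i, (pattern, label) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]
        selection_data = examples if select_on_all_data else training
        voxels = select_voxels(selection_data, n_keep)
        hits += predict_nearest_mean(training, pattern, voxels) == label
    return hits / len(examples)

def average_accuracy(select_on_all_data, n_runs=5, seed=0):
    """Average over several random data sets (20 examples, 300 noise voxels each)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        data = [([rng.gauss(0.0, 1.0) for _ in range(300)], i % 2) for i in range(20)]
        total += loo_accuracy(data, 10, select_on_all_data)
    return total / n_runs

biased = average_accuracy(select_on_all_data=True)     # test example leaks into selection
unbiased = average_accuracy(select_on_all_data=False)  # selection sees training data only
print(biased, unbiased)
```

Although the data contain no signal at all, selecting voxels with every example should yield an accuracy well above 0.5, whereas restricting the selection to the training fold should leave the classifier no better than chance. The only difference between the two runs is whether the test example took part in the voxel selection.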

Case study: Chadwick, M. J., Hassabis, D., Weiskopf, N., & Maguire, E. A.

(2010). Decoding individual episodic memory traces in the human hippocampus. Current

Biology, 20(6), 544-547.

Overview

The following is only a brief summary intended to orient the reader. The article

itself is easy to find and I encourage reading it (including the Supplementary Information).

Reading it may help in understanding the following comments and remarks, as space

limitations rule out a fully detailed description, which could have been biased anyway.

The theoretical background of this article lies in the principles of how the

hippocampus (HC) stores representations of episodic memories. More specifically, the authors

mention that the HC is thought to store an “index” (Marr et al., 1971, as quoted) of the episodic

memory, which would contain the guidelines to reconstruct a complex, multimodal memory.

Throughout the article, and from the claims made, it is evident that the authors equate

the activity pattern evoked when the memory is recalled with the “index” mentioned

above.

The experimental task consisted primarily in determining whether the representations (that

is, activation patterns) of different memories from three different videos can be predicted and

therefore distinguished by a trained classifier. Each video shows a different woman

performing basic actions (posting a letter, throwing a can into a bin…), and each was viewed 15 times

by every subject prior to scanning.

Basically, the subjects recalled the three memories under two conditions, with measurements taken from

three regions of interest (ROIs): HC, entorhinal cortex (EC), and parahippocampal gyrus

(PHG). In the first condition, they were told which of the three memories they were

supposed to recall, while in the second the recall was free. Subjects were instructed to

choose randomly among the three memories.

In both conditions the multivariate analysis yielded significant decoding rates, with no

statistical differences between conditions, and the data from both conditions were eventually

collapsed for further analysis. The illustration below (taken from Chadwick et al., 2010)

summarizes the decoding rates for each area [hippocampus accuracy of 44% (p = 0.000001;

chance level = 33%), mean entorhinal cortex accuracy of 38.5% (p = 0.009), and mean

parahippocampal gyrus accuracy of 41% (p = 0.0004)].
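
To give a feeling for how an accuracy of 44% relates to a chance level of 33%, the following sketch computes the probability that a classifier guessing among three memories reaches at least a given number of correct predictions. This is not the statistical test used by Chadwick et al. (2010), and it does not reproduce their p-values: the numbers of test examples (30 and 300) and the function name binomial_tail are hypothetical, chosen only to show that the same percentage can be weak or strong evidence depending on how many predictions it rests on.

```python
from math import comb

def binomial_tail(n_trials, n_correct, chance):
    """P(at least n_correct hits in n_trials) for a classifier guessing at `chance`."""
    return sum(comb(n_trials, k) * chance ** k * (1 - chance) ** (n_trials - k)
               for k in range(n_correct, n_trials + 1))

chance = 1 / 3                      # three memories: a guess is right one time in three
for n_trials in (30, 300):          # hypothetical numbers of test examples
    n_correct = round(0.44 * n_trials)
    print(n_trials, n_correct, binomial_tail(n_trials, n_correct, chance))
```

With few test examples, 44% correct is easily reached by guessing; with many, the same percentage becomes very unlikely under chance.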

Multivariate Classification Procedure

The above graph (taken from Chadwick et al., 2010, Supplementary Information) summarizes the

multivariate classification procedure. In the illustration only two videos are shown “for the sake

of simplicity”, as the authors put it. Panel “A” shows two frame captures from two of the videos

(each video lasts 7 s); “B” shows a stimulus template. It is

important to bear in mind that the stimuli are not the videos themselves, but the recall of them.

Thus, ABBBAA… means recall of video A, recall of video B, and so on. “C” describes the

process of feature selection using the searchlight multivariate method for each ROI (described

in the feature selection section of this essay). The data were split into a training set

and a testing set, and only the training set was used for voxel selection. The authors used a

k-fold cross-validation strategy, in which features are selected anew by searchlight

in each fold (in each fold one example is left out for testing and the rest are used for training), that is,

as the authors say, with different training data. In “D”, once the voxel selection is completed, the linear SVM classifier is

trained on the training examples and afterwards tested on the example set aside for testing.

Finally, “E” shows the test data (every example is used as test data once under the

k-fold testing regime), which are used to determine the classifier's accuracy.

Predictions are then compared with the video that was actually recalled to establish the accuracy

percentages.

Criticism and Remarks

If I may, let me first point out that, as far as I know, this is the first critique made of

this article. I wanted to do my best to apply the knowledge that I have gained while

writing this essay, which I have enjoyed as much as I have suffered. Thus, the aim is to

go a bit further than merely introducing the technique, and also to take a look at how it is

applied outside the strictly machine-learning environment. Finally, I apologize for any

reckless criticism that I may make.

Multivariate Classification Procedure & Claims of functional differentiation

This experiment is particularly difficult to picture and it is important to keep a set of

assumptions in mind. Firstly, the examples used to train and test the classifier are activity patterns

that arise from the recall of the three videos.

Therefore, when the data set was split into testing and training data in each fold,

activity patterns from the three videos were separated into testing data (only one

activity pattern) and training data (all the rest). All of the activity patterns can be regarded

as “different”, because they are evoked at different times and because of the reconstructive nature that

characterizes memory recall. This characteristic of memory recall makes the training data

(which in the illustration above comprised only 9 examples, so that we might expect about 14

with three videos) clearly insufficient, given the high similarity between the videos.

The limited number of examples per stimulus, together with the similarity between them,

probably led the linear classifier to acquire a small number of strong, overlapping parameters,

while adding a great number of voxels whose activity was barely consistent from trial to trial. That

barely consistent activity contained the fine details that could have distinguished between the

three stimuli, and may have helped to raise the classification accuracy by periodically

saturating a series of parameters. It is unclear why the authors selected three “memories”

so similar to each other if their intention was to distinguish memory traces (all of them show a

different woman who performs a similar action and then walks away). For all the reasons

given above, a particular variety of overfitting was at work while accuracy barely surpassed

chance levels.

A possible additional reason for this relatively poor performance is the unequal number of

examples. When one example is reserved for testing, the other two categories each have one example more.

While this might look trivial, Pereira et al. (2009) point out that the classifier may tend to favour

the category with more examples, and therefore predict it more frequently.
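
An extreme case makes this bias easy to see. The sketch below uses a deliberately naive classifier that ignores the brain data altogether and always predicts the most frequent label in the training fold. It is not the linear SVM used in the article, and the 15 examples (5 per video) and the function names are assumptions made for illustration.

```python
from collections import Counter

def predict_most_frequent(training_labels):
    """A deliberately naive classifier: always predict a most frequent training label."""
    return Counter(training_labels).most_common(1)[0][0]

def leave_one_out_accuracy(labels):
    """Hold out each example in turn and score the naive prediction against it."""
    hits = 0
    for i, held_out in enumerate(labels):
        training_labels = labels[:i] + labels[i + 1:]
        hits += predict_most_frequent(training_labels) == held_out
    return hits / len(labels)

labels = ["A", "B", "C"] * 5        # 15 hypothetical examples, 5 per video
loo_accuracy = leave_one_out_accuracy(labels)
print(loo_accuracy)                 # 0.0: the held-out category is never the most frequent
```

Since the held-out example always belongs to the category with one example fewer in the training fold, any tendency to favour the better represented categories works against the correct answer, and can push accuracy below the 33% chance level rather than above it.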

Our second issue concerns the following statement: “Our data provide further evidence

for functional differentiation within the medial temporal lobe, in that we show the hippocampus

contains significantly more episodic information than adjacent structures”. The authors attach

functional claims to the classification percentages, which show slightly better performance

in the hippocampal area. There is no doubt that when memories are evoked some neurons in

the hippocampus show activation. Going even further, there is weak evidence that their activity can

discriminate between memories when a trained classifier is used, but there is no evidence

that those neurons carry episodic information. This is an example of reverse

inference (Poldrack, 2006).

Finally, a similar and more suggestive design could have used a classifier

to try to distinguish between memories that were never presented to it during training. Such an experiment would have entailed the

memorization of those examples prior to feature selection, as they would be used only for

testing purposes. The feature selection and training data could be similar to them, to ensure that the voxel selection

contains pertinent voxels.

References

Chadwick, M. J., Hassabis, D., Weiskopf, N., & Maguire, E. A. (2010). Decoding

individual episodic memory traces in the human hippocampus. Current Biology, 20(6),

544-547.

Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fMRI) “brain

reading”: detecting and classifying distributed patterns of fMRI activity in human visual

cortex. Neuroimage, 19(2), 261-270.

Davatzikos, C., Ruparel, K., Fan, Y., Shen, D. G., Acharyya, M., Loughead, J. W., ... &

Langleben, D. D. (2005). Classifying spatial patterns of brain activity with machine

learning methods: application to lie detection. Neuroimage, 28(3), 663-668.

Downing, P. E., Wiggett, A. J., & Peelen, M. V. (2007). Functional magnetic resonance

imaging investigation of overlapping lateral occipitotemporal activations using multi-voxel

pattern analysis. The Journal of neuroscience, 27(1), 226-233.

Downing, P. E., Chan, A. Y., Peelen, M. V., Dodds, C. M., & Kanwisher, N. (2006).

Domain specificity in visual cortex. Cerebral cortex, 16(10), 1453-1461.

Etzel, J. A., Gazzola, V., & Keysers, C. (2009). An introduction to anatomical ROI-based

fMRI classification analysis. Brain Research, 1282, 114-125.

Hanson, S. J., Matsuka, T., & Haxby, J. V. (2004). Combinatorial codes in ventral

temporal lobe for object recognition: Haxby (2001) revisited: is there a “face”

area?. Neuroimage, 23(1), 156-166.

Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001).

Distributed and overlapping representations of faces and objects in ventral temporal

cortex. Science, 293(5539), 2425-2430.

Haxby, J. V. (2012). Multivariate pattern analysis of fMRI: The early

beginnings. NeuroImage, 62(2), 852-855.

Haynes, J. D., & Rees, G. (2006). Decoding mental states from brain activity in humans.

Nature Reviews Neuroscience, 7(7), 523-534.

Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the

human brain. Nature neuroscience, 8(5), 679-685.

Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural

images from human brain activity. Nature, 452(7185), 352-355.

Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S., & Baker, C. I. (2009). Circular

analysis in systems neuroscience: the dangers of double dipping. Nature neuroscience,

12(5), 535-540.

Kriegeskorte, N., Goebel, R., & Bandettini, P. (2006). Information-based functional brain

mapping. Proceedings of the National Academy of Sciences of the United States of

America, 103(10), 3863-3868.

Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading:

multi-voxel pattern analysis of fMRI data. Trends in cognitive sciences, 10(9), 424-430.

Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011).

Reconstructing visual experiences from brain activity evoked by natural movies. Current

Biology, 21(19), 1641-1646.

Mur, M., Bandettini, P. A., & Kriegeskorte, N. (2009). Revealing representational content

with pattern-information fMRI—an introductory guide. Social cognitive and affective

neuroscience, 4(1), 101-109.

Mitchell, T. M. (2008, January). Computational models of neural representations in the

human brain. In Discovery Science (pp. 26-27). Springer Berlin Heidelberg.

O'Toole, A. J., Jiang, F., Abdi, H., Pénard, N., Dunlop, J. P., & Parent, M. A. (2007).

Theoretical, statistical, and practical perspectives on pattern-based classification

approaches to the analysis of functional neuroimaging data. Journal of cognitive

neuroscience, 19(11), 1735-1752.

Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: a

tutorial overview. Neuroimage, 45(1), S199-S209.

Reddy, L., Tsuchiya, N., & Serre, T. (2010). Reading the mind's eye: decoding category

information during mental imagery. Neuroimage, 50(2), 818-825.

Spiers, H. J., & Maguire, E. A. (2007). Decoding human brain activity during real-world

experiences. Trends in cognitive sciences, 11(8), 356-365.

Poldrack, R. A., Halchenko, Y. O., & Hanson, S. J. (2009). Decoding the large-scale

structure of brain function by classifying mental states across individuals. Psychological

Science, 20(11), 1364-1372.

Polyn, S. M., Kragel, J. E., Morton, N. W., McCluey, J. D., & Cohen, Z. D. (2012). The

neural dynamics of task context in free recall. Neuropsychologia, 50(4), 447-457.

Rosenberg, M., List, A., Sherman, A., Grabowecky, M., Suzuki, S., & Esterman, M.

(2012). Decoding EEG data reveals dynamic spatiotemporal patterns in perceptual

processing. Journal of Vision, 12(9), 1173-1173.

Sheng, L. I. (2011). Multivariate pattern analysis in functional brain imaging. Acta

Physiologica Sinica, 63(5), 472-476.

Tong, F., & Pratte, M. S. (2012). Decoding patterns of human brain activity. Annual

review of psychology, 63, 483-509.
